TorchSpec: New Framework Enables Efficient Speculative Decoding Training at Scale
Key Takeaways
- TorchSpec disaggregates inference and training pipelines, streaming hidden states via RDMA/TCP instead of relying on disk storage or co-location
- Successfully trained draft models at scale for Kimi K2.5 with 600k samples and 6 billion tokens, delivering 26-60% throughput gains
- Enables independent scaling of inference and training resources, overcoming limitations of existing co-located and offline hidden-state approaches
Summary
Researchers have introduced TorchSpec, a torch-native framework designed to solve a critical bottleneck in training draft models for speculative decoding—a key technique for accelerating large language model inference. As frontier LLMs grow to hundreds of billions of parameters with million-token context windows, the volume of hidden states needed during draft model training has become prohibitively large, creating storage and computational challenges.
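To make the technique concrete: in speculative decoding, a small draft model proposes several tokens cheaply, and the large target model verifies them in a single pass, accepting the longest matching prefix. The sketch below illustrates only this accept/reject loop with toy stand-in functions (`draft_next` and `target_next` are hypothetical, not TorchSpec APIs):

```python
# Minimal sketch of speculative decoding's draft-then-verify loop.
# `draft_next` and `target_next` are toy stand-ins, NOT real model calls.

def draft_next(ctx):
    # Toy "draft model": a cheap heuristic guess for the next token.
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Toy "target model": the authoritative next token. It disagrees
    # with the draft after token 4, to exercise the rejection path.
    return (ctx[-1] + 1) % 10 if ctx[-1] != 4 else 7

def speculative_step(ctx, k=4):
    """Propose k draft tokens, then accept the longest prefix the
    target model agrees with, plus one corrected token on mismatch."""
    # Draft phase: k cheap sequential guesses.
    proposal, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        proposal.append(t)
        tmp.append(t)
    # Verify phase: in a real system this is ONE target forward pass.
    accepted, tmp = [], list(ctx)
    for t in proposal:
        expected = target_next(tmp)
        if t == expected:
            accepted.append(t)
            tmp.append(t)
        else:
            accepted.append(expected)  # target's correction ends the step
            break
    return accepted
```

When the draft agrees with the target, one verification pass yields several tokens (e.g. `speculative_step([0], k=3)` accepts all three proposals); the speedup comes from amortizing the expensive target pass over those accepted tokens.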
TorchSpec addresses this by disaggregating the inference and training systems, streaming hidden states directly from inference engines to training workers via RDMA or TCP through a central Mooncake store, rather than storing them on disk or co-locating inference and training on shared GPUs. This design allows inference and training resources to scale independently while eliminating disk I/O bottlenecks. The framework successfully trained an EAGLE-3 draft model for Kimi K2.5 using 1500 H200 GPU hours across 600k training samples and 6 billion tokens, achieving over 60% throughput improvement at batch size 1 and 26-30% improvements at larger batch sizes.
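The disaggregated design described above is essentially a producer/consumer stream: inference workers push `(input_ids, hidden_state)` records into a shared store, and training workers drain them, so neither side touches disk and the two GPU pools scale independently. The following is a hedged, in-process sketch of that data flow; `HiddenStateStore` is a toy queue standing in for the central Mooncake store, whereas the real system moves tensors between machines over RDMA or TCP:

```python
# Hedged sketch of the disaggregated hidden-state pipeline.
# `HiddenStateStore` is a hypothetical in-process stand-in for the
# central Mooncake store, not its actual API.
import queue
import threading

class HiddenStateStore:
    """Toy stand-in for a network-backed streaming store."""
    def __init__(self, capacity=64):
        # Bounded queue: a full buffer applies backpressure to inference.
        self._q = queue.Queue(maxsize=capacity)

    def put(self, sample):
        self._q.put(sample)

    def get(self):
        return self._q.get()

def inference_worker(store, num_samples):
    # Pretend each "hidden state" is a small vector from a forward pass;
    # the real system streams full target-model activations.
    for i in range(num_samples):
        store.put({"input_ids": [i], "hidden": [float(i)] * 4})
    store.put(None)  # end-of-stream sentinel

def training_worker(store):
    consumed = []
    while (sample := store.get()) is not None:
        consumed.append(sample)  # real code: run a draft-model train step
    return consumed

store = HiddenStateStore()
producer = threading.Thread(target=inference_worker, args=(store, 8))
producer.start()
samples = training_worker(store)  # drains all 8 streamed records
producer.join()
```

The bounded queue captures the key systems property: hidden states flow directly from producer to consumer with backpressure, and the number of producer and consumer workers can be tuned separately.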
Editorial Opinion
TorchSpec represents a significant engineering contribution to making frontier LLM deployment more practical and cost-effective. By solving the hidden-state transfer bottleneck through disaggregated infrastructure, the framework could accelerate adoption of speculative decoding across the industry and unlock new possibilities for efficient inference at scale. The independent scalability of inference and training resources is particularly valuable for organizations managing heterogeneous compute clusters.