TorchSpec: New Framework Enables Efficient Speculative Decoding Training at Scale
Key Takeaways
- TorchSpec disaggregates inference and training pipelines, streaming hidden states via RDMA/TCP instead of relying on disk storage or co-location
- Successfully trained draft models at scale for Kimi K2.5 with 600k samples and 6 billion tokens, delivering 26-60% throughput gains
- Enables independent scaling of inference and training resources, overcoming limitations of existing co-located and offline hidden-state approaches
Summary
Researchers have introduced TorchSpec, a torch-native framework designed to solve a critical bottleneck in training draft models for speculative decoding—a key technique for accelerating large language model inference. As frontier LLMs grow to hundreds of billions of parameters with million-token context windows, the volume of hidden states needed during draft model training has become prohibitively large, creating storage and computational challenges.
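To make the technique concrete: in speculative decoding, a small draft model proposes several tokens cheaply, and the large target model verifies them in a single pass, accepting the longest matching prefix. The sketch below illustrates only this accept/reject loop with toy stand-in functions (`draft_next` and `target_next` are hypothetical, not TorchSpec APIs):

```python
# Minimal sketch of speculative decoding's draft-then-verify loop.
# `draft_next` and `target_next` are toy stand-ins, NOT real model calls.

def draft_next(ctx):
    # Toy "draft model": a cheap heuristic guess for the next token.
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Toy "target model": the authoritative next token. It disagrees
    # with the draft after token 4, to exercise the rejection path.
    return (ctx[-1] + 1) % 10 if ctx[-1] != 4 else 7

def speculative_step(ctx, k=4):
    """Propose k draft tokens, then accept the longest prefix the
    target model agrees with, plus one corrected token on mismatch."""
    # Draft phase: k cheap sequential guesses.
    proposal, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        proposal.append(t)
        tmp.append(t)
    # Verify phase: in a real system this is ONE target forward pass.
    accepted, tmp = [], list(ctx)
    for t in proposal:
        expected = target_next(tmp)
        if t == expected:
            accepted.append(t)
            tmp.append(t)
        else:
            accepted.append(expected)  # target's correction ends the step
            break
    return accepted
```

When the draft agrees with the target, one verification pass yields several tokens (e.g. `speculative_step([0], k=3)` accepts all three proposals); the speedup comes from amortizing the expensive target pass over those accepted tokens.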
TorchSpec addresses this by disaggregating the inference and training systems, streaming hidden states directly from inference engines to training workers via RDMA or TCP through a central Mooncake store, rather than storing them on disk or co-locating inference and training on shared GPUs. This design allows inference and training resources to scale independently while eliminating disk I/O bottlenecks. The framework successfully trained an EAGLE-3 draft model for Kimi K2.5 using 1500 H200 GPU hours across 600k training samples and 6 billion tokens, achieving over 60% throughput improvement at batch size 1 and 26-30% improvements at larger batch sizes.
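The disaggregated design described above is essentially a producer/consumer stream: inference workers push `(input_ids, hidden_state)` records into a shared store, and training workers drain them, so neither side touches disk and the two GPU pools scale independently. The following is a hedged, in-process sketch of that data flow; `HiddenStateStore` is a toy queue standing in for the central Mooncake store, whereas the real system moves tensors between machines over RDMA or TCP:

```python
# Hedged sketch of the disaggregated hidden-state pipeline.
# `HiddenStateStore` is a hypothetical in-process stand-in for the
# central Mooncake store, not its actual API.
import queue
import threading

class HiddenStateStore:
    """Toy stand-in for a network-backed streaming store."""
    def __init__(self, capacity=64):
        # Bounded queue: a full buffer applies backpressure to inference.
        self._q = queue.Queue(maxsize=capacity)

    def put(self, sample):
        self._q.put(sample)

    def get(self):
        return self._q.get()

def inference_worker(store, num_samples):
    # Pretend each "hidden state" is a small vector from a forward pass;
    # the real system streams full target-model activations.
    for i in range(num_samples):
        store.put({"input_ids": [i], "hidden": [float(i)] * 4})
    store.put(None)  # end-of-stream sentinel

def training_worker(store):
    consumed = []
    while (sample := store.get()) is not None:
        consumed.append(sample)  # real code: run a draft-model train step
    return consumed

store = HiddenStateStore()
producer = threading.Thread(target=inference_worker, args=(store, 8))
producer.start()
samples = training_worker(store)  # drains all 8 streamed records
producer.join()
```

The bounded queue captures the key systems property: hidden states flow directly from producer to consumer with backpressure, and the number of producer and consumer workers can be tuned separately.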
Editorial Opinion
TorchSpec represents a significant engineering contribution to making frontier LLM deployment more practical and cost-effective. By solving the hidden-state transfer bottleneck through disaggregated infrastructure, the framework could accelerate adoption of speculative decoding across the industry and unlock new possibilities for efficient inference at scale. The independent scalability of inference and training resources is particularly valuable for organizations managing heterogeneous compute clusters.