Stateful Transformers Enable 5.9x Faster Streaming Inference
Key Takeaways
- Stateful sessions with persistent KV caches reduce query latency to O(|q|), independent of context size—a major breakthrough for streaming applications
- Flash Queries enable speculative execution, returning answers before users ask by leveraging idle GPU cycles—structurally impossible in stateless designs
- Achieves up to 5.9x speedup over vLLM, SGLang, and TensorRT-LLM on market-data workloads while supporting dozens of concurrent stateful sessions per GPU
Summary
A new research paper published on arXiv demonstrates a novel approach to transformer inference that dramatically improves performance for streaming workloads. The work introduces 'stateful sessions': each session maintains a persistent KV cache that is updated incrementally as new data arrives, eliminating the expensive O(n) prefill cost of conventional request-driven inference engines. This architectural shift makes query latency O(|q|), dependent only on the query's length, rather than O(n), dependent on the full accumulated context.
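The cost difference can be illustrated with a minimal sketch. This is not the paper's implementation—class and method names here are hypothetical—but it captures the accounting: a stateless engine reprocesses the whole context on every request, while a stateful session pays for each token once, at arrival.

```python
class StatelessEngine:
    """Conventional request-driven engine: recomputes the full context
    on every request, so per-query cost grows with context size, O(n)."""
    def query(self, context_tokens, query_tokens):
        # Full prefill of context plus the query, every single time.
        return len(context_tokens) + len(query_tokens)

class StatefulSession:
    """Illustrative stateful session: a persistent KV cache is extended
    incrementally as data streams in, so queries cost only O(|q|)."""
    def __init__(self):
        self.kv_cache_len = 0  # stands in for cached keys/values per layer

    def append(self, new_tokens):
        # Incremental prefill: only the newly arrived tokens are processed.
        self.kv_cache_len += len(new_tokens)

    def query(self, query_tokens):
        # The query attends over the already-cached context;
        # per-request work is proportional to the query alone.
        return len(query_tokens)

# Stream 1,000 context tokens in ten chunks, then ask a 10-token question.
session = StatefulSession()
for _ in range(10):
    session.append(list(range(100)))

stateful_cost = session.query(list(range(10)))     # 10: independent of context
stateless_cost = StatelessEngine().query(list(range(1000)), list(range(10)))  # 1010
```

Doubling the streamed context doubles `stateless_cost` but leaves `stateful_cost` unchanged—the latency-flat behavior the paper reports as context grows.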
The researchers further introduce 'Flash Queries,' a technique that reclaims idle GPU cycles between data arrivals to pre-evaluate registered questions and return cached answers before users even ask. This pattern is structurally impossible in stateless inference engines, which discard intermediate state between requests. The proposed system employs a multi-tenant continuous-batching scheduler with cell-budget admission and prefix-aware grouped prefill, enabling dozens of stateful sessions to coexist on a single GPU without approximating the full quadratic self-attention.
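The control flow behind Flash Queries can be sketched as follows. This is a toy model under assumed semantics, not the paper's API: `answer_fn` stands in for a decode pass over the session's KV cache, and `idle_tick` represents work scheduled into otherwise-idle GPU time between data arrivals.

```python
class FlashQuerySession:
    """Illustrative sketch: pre-evaluate registered questions during idle
    time so a later ask can be served from a cache of fresh answers."""
    def __init__(self, answer_fn):
        self.answer_fn = answer_fn   # hypothetical: decode over the KV cache
        self.registered = []         # questions users registered in advance
        self.cache = {}              # question -> precomputed answer
        self.context = []

    def register(self, question):
        self.registered.append(question)

    def append(self, new_tokens):
        self.context.extend(new_tokens)
        self.cache.clear()           # context changed; cached answers are stale

    def idle_tick(self):
        # Runs when the GPU would otherwise sit idle between arrivals.
        for q in self.registered:
            if q not in self.cache:
                self.cache[q] = self.answer_fn(self.context, q)

    def ask(self, question):
        # A cache hit returns immediately; a miss falls back to normal decode.
        if question in self.cache:
            return self.cache[question], "cached"
        return self.answer_fn(self.context, question), "computed"

sess = FlashQuerySession(lambda ctx, q: f"{q} -> seen {len(ctx)} tokens")
sess.register("latest price?")
sess.append([1, 2, 3])
sess.idle_tick()                     # speculative pre-evaluation while idle
answer, source = sess.ask("latest price?")
```

Here `source` comes back as `"cached"`: the answer was computed before the user asked. A stateless engine has nowhere to keep either the context or the precomputed answers between requests, which is why the pattern is confined to stateful designs.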
On streaming market-data benchmarks, the reference implementation achieves up to 5.9x speedup compared to state-of-the-art engines including vLLM, SGLang, and TensorRT-LLM, while holding query latency constant as context grows. This work challenges the fundamental architecture of current production inference systems and suggests a path toward more efficient real-time AI applications.
Editorial Opinion
This research represents a fundamental rethinking of transformer inference architecture, moving away from the stateless, request-driven paradigm that has dominated since the Transformer was introduced. The combination of stateful sessions and Flash Queries is elegant and addresses a real pain point: the growing cost of prefilling large contexts in streaming applications. If these results hold up in broader deployments, this could catalyze a wave of architectural changes across inference engines, particularly for real-time applications such as financial markets, live translation, and continuous monitoring systems. The 5.9x speedup is significant, but the conceptual shift—treating inference as data-driven and stateful rather than query-driven and stateless—may prove to be the more important contribution.