VeriCache: New Framework Enables Lossless Compression for KV Cache in LLM Inference
Key Takeaways
- VeriCache guarantees outputs identical to full-KV-cache decoding while preserving the throughput benefits of KV cache compression.
- It avoids holding the full KV cache in GPU memory by overlapping two operations with different bottlenecks: compressed-KV decoding (HBM-bandwidth-bound) and full-KV swapping between GPU and system memory (PCIe/network-bound).
- It achieves up to 4x higher throughput than conventional full-KV inference with no output degradation.
- It supports a broad family of compression methods and composes with existing speculative decoding approaches.
Summary
VeriCache is a new inference framework that addresses a fundamental limitation of KV cache compression in large language models. Popular compression techniques such as token dropping and quantization reduce memory overhead, but they are inherently lossy: outputs diverge further from full-KV-cache inference as more tokens are generated, which can cause catastrophic failures in code generation and tool use. VeriCache instead drafts tokens using the compressed KV cache and verifies them against the full KV cache, so the final output is identical to full-KV decoding while throughput stays high. The main system challenge is keeping the full KV cache out of GPU memory; the framework handles this by running compressed-KV decoding (HBM-bandwidth-bound) in parallel with full-KV swapping (PCIe/network-bound). Reported experiments show up to 4x higher throughput than full-KV inference without sacrificing output fidelity. The approach applies to long-context decoding and remote prefix caching, and it is compatible with both token-dropping and quantization methods.
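The draft-and-verify idea described in the summary can be illustrated with a toy sketch. This is not VeriCache's code or API: `full_next`, `draft_next`, `greedy_decode`, and `decode_verified` are hypothetical names, the "models" are simple arithmetic stand-ins, and greedy decoding is assumed. In a real system the verification step would presumably be one batched pass against the full KV cache rather than the sequential loop used here.

```python
from typing import Callable, List, Tuple

VOCAB = 50  # hypothetical toy vocabulary size


def full_next(tokens: List[int]) -> int:
    """Toy stand-in for greedy decoding with the full KV cache.

    Occasionally depends on the very first token (a long-range dependency).
    """
    long_range = tokens[0] if len(tokens) % 5 == 0 else 0
    return (3 * tokens[-1] + tokens[-2] + long_range) % VOCAB


def draft_next(tokens: List[int]) -> int:
    """Toy stand-in for decoding with a compressed (token-dropped) KV cache.

    Only the two most recent tokens are kept, so the long-range term is lost.
    """
    recent = tokens[-2:]
    return (3 * recent[-1] + recent[-2]) % VOCAB


def greedy_decode(prompt: List[int], n: int,
                  next_fn: Callable[[List[int]], int]) -> List[int]:
    """Plain autoregressive decoding of n tokens with a single next-token function."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_fn(out))
    return out[len(prompt):]


def decode_verified(prompt: List[int], n: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft up to k tokens with the lossy function, then verify with the full one.

    Returns the generated tokens and how many drafted tokens were accepted.
    """
    out = list(prompt)
    target = len(prompt) + n
    accepted = 0
    while len(out) < target:
        # Draft phase: cheap, lossy decoding on the compressed context.
        ctx = list(out)
        draft = []
        for _ in range(min(k, target - len(out))):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # Verify phase: every emitted token comes from the full function.
        for tok in draft:
            correct = full_next(out)
            out.append(correct)
            if tok != correct:
                break  # discard the remaining drafts; they used a wrong prefix
            accepted += 1
    return out[len(prompt):], accepted


if __name__ == "__main__":
    prompt = [7, 11, 13]
    tokens, accepted = decode_verified(prompt, 40)
    print("matches full decoding:", tokens == greedy_decode(prompt, 40, full_next))
    print("accepted draft tokens:", accepted, "of", len(tokens))
```

Because every token appended to the output is the one the full function would have produced, the result matches full decoding regardless of how often the draft is wrong. Draft quality only affects how many tokens are accepted per verification round, which is where the throughput benefit would come from.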
Editorial Opinion
VeriCache represents an important breakthrough in making KV cache compression production-ready. By decoupling the throughput benefits of compression from output correctness, this work addresses one of the most critical pain points in serving long-context LLMs at scale. The elegant insight of overlapping HBM-bandwidth-bound decoding with I/O-bound cache swapping could make this a foundational technique across LLM serving infrastructure. If the implementation generalizes as claimed, it could fundamentally change how companies approach the cost-quality tradeoff when scaling context length.



