TurboQuant Plus Achieves 22% Decode Speedup Through Sparse V Dequantization, Maintains q8_0 Performance at 4.6x Compression
Key Takeaways
- Sparse V dequantization skips the 90%+ of attention weight contributions that are negligible at long context, delivering a 22% decode speedup with negligible perplexity impact
- 4.6x KV cache compression maintained at speed parity with the q8_0 baseline (2747 vs 2694 tok/s prefill on M5 Max)
- Fully integrated into llama.cpp with Metal GPU kernels; community-tested by 10+ testers on diverse Apple Silicon and discrete GPU hardware
Summary
TurboQuant Plus, an advanced KV cache compression technique for local LLM inference, has achieved significant performance improvements through a sparse V dequantization optimization. The method compresses the transformer KV cache 4.6x while maintaining speed parity with standard q8_0 quantization on Apple Silicon. A novel sparse dequantization step skips attention weight calculations below a 1e-6 threshold, yielding a 22.8% decode speedup at 32K context length. The implementation, now fully integrated into llama.cpp with Metal GPU kernels, shows negligible quality loss (a perplexity delta of only 1%) and achieves 100% accuracy on NIAH retrieval benchmarks.
The sparse V dequantization optimization, a three-line kernel change, proves particularly effective at long context lengths, where 90%+ of attention weights are negligible, saving roughly half the total dequantization cost. Testing across diverse hardware (Apple M1-M5, RTX 30/40/50 series, AMD 6800/9070) and comprehensive evaluation (511+ Python tests, 14 decode approaches benchmarked) confirm the technique's robustness. The "Plus" designation signals planned post-v1 improvements, including adaptive bit allocation, temporal decay compression, and expert-aware MoE optimizations.
Notably, the sparse V optimization is hardware-agnostic: a 5% speedup was confirmed even on standard q8_0, indicating that attention-aware optimization has value beyond TurboQuant itself.
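The skip described above can be sketched in a few lines. This is an illustrative CPU-side analogue, not the actual llama.cpp Metal kernel: the `QuantRow` layout, the function name, and the toy q8_0-style scale format are assumptions; only the idea of bypassing dequantization for attention weights below the reported 1e-6 threshold comes from the article.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch of sparse V dequantization. During the attention
// output accumulation out = sum_i w_i * dequant(V_i), rows whose softmax
// weight w_i falls below a small threshold contribute essentially nothing,
// so their dequantization cost is never paid. Threshold per the article;
// all names and the row format are hypothetical.
constexpr float kSkipThreshold = 1e-6f;

struct QuantRow {            // toy q8_0-style row: int8 values + one scale
    std::vector<int8_t> q;
    float scale;
};

// Accumulates attn_weights[i] * dequant(v_rows[i]) into `out`, skipping
// negligible rows. Returns how many rows were actually dequantized.
size_t sparse_weighted_sum(const std::vector<float>& attn_weights,
                           const std::vector<QuantRow>& v_rows,
                           std::vector<float>& out) {
    size_t rows_used = 0;
    for (size_t i = 0; i < v_rows.size(); ++i) {
        const float w = attn_weights[i];
        if (w < kSkipThreshold) continue;   // the cheap early-out
        ++rows_used;
        const QuantRow& r = v_rows[i];
        for (size_t d = 0; d < out.size(); ++d)
            out[d] += w * (static_cast<float>(r.q[d]) * r.scale);
    }
    return rows_used;
}
```

At long context, where the article reports 90%+ of weights fall below the threshold, the inner dequantize-and-accumulate loop runs for only a small fraction of rows, which is where the claimed savings come from.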
Editorial Opinion
TurboQuant Plus represents a pragmatic engineering advance in local LLM inference, solving the real problem of context-dependent performance degradation through elegant sparse computation rather than algorithmic complexity. The 22% decode improvement at extreme context lengths (32K tokens) addresses a genuine pain point for document processing workflows, and the three-line kernel implementation suggests the optimization could be readily adopted by other quantization schemes. However, the 10% decode penalty at short context and the modest 1% perplexity cost mean this remains a compression-speed tradeoff rather than a pure win; practitioners must validate against their specific use cases.