TurboQuant Plus Achieves 22% Decode Speedup Through Sparse V Dequantization, Maintains q8_0 Performance at 4.6x Compression
Key Takeaways
- Sparse V dequantization skips the 90%+ of attention weight contributions that are negligible at long context, delivering a 22% decode speedup with negligible perplexity impact
- 4.6x KV cache compression maintained at speed parity with the q8_0 baseline (2747 vs 2694 tok/s prefill on M5 Max)
- Fully integrated into llama.cpp with Metal GPU kernels; community-tested by 10+ testers on diverse Apple Silicon and discrete GPU hardware
Summary
TurboQuant Plus, an advanced KV cache compression technique for local LLM inference, has achieved significant performance improvements through a sparse V dequantization optimization. The method compresses the transformer KV cache 4.6x while maintaining speed parity with standard q8_0 quantization on Apple Silicon. A novel sparse dequantization step skips attention weight calculations below a 1e-6 threshold, yielding a 22.8% decode speedup at 32K context length. The implementation, now fully integrated into llama.cpp with Metal GPU kernels, shows negligible quality loss (a perplexity delta of only 1%) and achieves 100% accuracy on NIAH retrieval benchmarks.
The sparse V dequantization optimization, a three-line kernel change, proves particularly effective at long context lengths, where 90%+ of attention weights are negligible, saving roughly half the total dequantization cost. Testing across diverse hardware (Apple M1-M5, RTX 30/40/50 series, AMD 6800/9070) and comprehensive evaluation (511+ Python tests, 14 decode approaches benchmarked) confirm the technique's robustness. The "Plus" designation signals planned post-v1 improvements, including adaptive bit allocation, temporal decay compression, and expert-aware MoE optimizations.
Notably, the sparse V optimization is hardware-agnostic: a 5% speedup was confirmed even on standard q8_0, indicating that attention-aware optimization has value beyond TurboQuant itself.
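The skip described above can be sketched in a few lines. This is an illustrative CPU-side analogue, not the actual llama.cpp Metal kernel: the `QuantRow` layout, the function name, and the toy q8_0-style scale format are assumptions; only the idea of bypassing dequantization for attention weights below the reported 1e-6 threshold comes from the article.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch of sparse V dequantization. During the attention
// output accumulation out = sum_i w_i * dequant(V_i), rows whose softmax
// weight w_i falls below a small threshold contribute essentially nothing,
// so their dequantization cost is never paid. Threshold per the article;
// all names and the row format are hypothetical.
constexpr float kSkipThreshold = 1e-6f;

struct QuantRow {            // toy q8_0-style row: int8 values + one scale
    std::vector<int8_t> q;
    float scale;
};

// Accumulates attn_weights[i] * dequant(v_rows[i]) into `out`, skipping
// negligible rows. Returns how many rows were actually dequantized.
size_t sparse_weighted_sum(const std::vector<float>& attn_weights,
                           const std::vector<QuantRow>& v_rows,
                           std::vector<float>& out) {
    size_t rows_used = 0;
    for (size_t i = 0; i < v_rows.size(); ++i) {
        const float w = attn_weights[i];
        if (w < kSkipThreshold) continue;   // the cheap early-out
        ++rows_used;
        const QuantRow& r = v_rows[i];
        for (size_t d = 0; d < out.size(); ++d)
            out[d] += w * (static_cast<float>(r.q[d]) * r.scale);
    }
    return rows_used;
}
```

At long context, where the article reports 90%+ of weights fall below the threshold, the inner dequantize-and-accumulate loop runs for only a small fraction of rows, which is where the claimed savings come from.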
Editorial Opinion
TurboQuant Plus represents a pragmatic engineering advance in local LLM inference, solving the real problem of context-dependent performance degradation through elegant sparse computation rather than algorithmic complexity. The 22% decode improvement at extreme context lengths (32K tokens) addresses a genuine pain point for document processing workflows, and the three-line kernel implementation suggests the optimization could be readily adopted by other quantization schemes. However, the 10% decode penalty at short context and the modest 1% perplexity cost mean this remains a compression-speed tradeoff rather than a pure win; practitioners must validate against their specific use cases.