BotBeat

Independent Developer · RESEARCH · 2026-03-27

TurboQuant Plus Achieves 22% Decode Speedup Through Sparse V Dequantization, Maintains q8_0 Performance at 4.6x Compression

Key Takeaways

  • Sparse V dequantization skips 90%+ of unnecessary attention weight calculations at long context, delivering a 22% decode speedup with zero perplexity impact
  • 4.6x KV cache compression maintained with speed parity to the q8_0 baseline (2747 vs 2694 tok/s prefill on M5 Max)
  • Fully integrated into llama.cpp with Metal GPU kernels; community-tested by 10+ testers on diverse Apple Silicon and discrete GPU hardware
Source: Hacker News (https://github.com/TheTom/turboquant_plus)

Summary

TurboQuant Plus, an advanced KV cache compression technique for local LLM inference, has achieved significant performance improvements through a sparse V dequantization optimization. The method compresses the transformer KV cache 4.6x while maintaining speed parity with standard q8_0 quantization on Apple Silicon. Its novel sparse dequantization approach skips attention weight calculations below a 1e-6 threshold, yielding a 22.8% decode speedup at 32K context length. The implementation, now fully integrated into llama.cpp with Metal GPU kernels, shows negligible quality loss (a perplexity delta of only 1%) and achieves 100% accuracy on NIAH retrieval benchmarks.

The sparse V dequantization optimization—a three-line kernel change—proves particularly effective at long context lengths where 90%+ of attention weights are negligible, saving approximately half the total dequantization cost. Testing across diverse hardware (Apple M1-M5, RTX 30/40/50 series, AMD 6800/9070) and comprehensive evaluation (511+ Python tests, 14 decode approaches benchmarked) confirm the technique's robustness. The "Plus" designation signals planned post-v1 improvements including adaptive bit allocation, temporal decay compression, and expert-aware MoE optimizations.

  • The sparse V optimization is hardware-agnostic; a 5% speedup was confirmed even on standard q8_0, indicating that attention-aware optimization has general value beyond TurboQuant
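The article describes the sparse V step only at a high level. The following NumPy sketch (hypothetical names and shapes, not the actual llama.cpp/Metal kernel) illustrates the idea for a single decode step: compute the softmax attention weights first, then dequantize only the cached V rows whose weight exceeds the reported 1e-6 threshold.

```python
# Illustrative sketch of sparse V dequantization in single-query attention.
# Not the actual llama.cpp/Metal kernel; names and the int8 V layout are
# assumptions for the example.
import numpy as np

def sparse_v_attention(q, k, v_quant, v_scale, threshold=1e-6):
    """Attend over a cached context, skipping dequantization of V rows
    whose softmax attention weight falls below `threshold`."""
    d = q.shape[-1]
    scores = (k @ q) / np.sqrt(d)   # (ctx,) raw attention scores
    scores -= scores.max()          # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()        # softmax over the context

    # Sparse step: at long context, the article reports 90%+ of weights
    # fall below the threshold, so most V rows are never dequantized.
    keep = weights > threshold
    v_dequant = v_quant[keep].astype(np.float32) * v_scale[keep, None]
    return weights[keep] @ v_dequant  # (d,) attention output
```

Because the skipped weights each contribute less than 1e-6 of the output, dropping them leaves the result numerically indistinguishable from the dense computation, which is consistent with the zero perplexity impact the article reports for this step.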

Editorial Opinion

TurboQuant Plus represents a pragmatic engineering advancement in local LLM inference, solving the real problem of context-dependent performance degradation through elegant sparse computation rather than algorithmic complexity. The 22% decode improvement at extreme context lengths (32K tokens) addresses a genuine pain point for document processing workflows, and the three-line kernel implementation suggests the optimization could be readily adopted by other quantization schemes. However, the 10% decode penalty at short context and modest 1% perplexity cost indicate this remains a compression-speed tradeoff rather than a pure win—practitioners must validate against their specific use cases.

Large Language Models (LLMs) · Machine Learning · MLOps & Infrastructure · AI Hardware · Open Source
