BotBeat
RESEARCH | Independent Developer | 2026-03-27

TurboQuant Plus Achieves 22% Decode Speedup Through Sparse V Dequantization, Maintains q8_0 Performance at 4.6x Compression

Key Takeaways

  • Sparse V dequantization skips dequantizing V for the 90%+ of positions whose attention weights are negligible at long context, delivering a 22% decode speedup with no added perplexity impact
  • 4.6x KV cache compression is maintained at speed parity with the q8_0 baseline (2747 vs 2694 tok/s prefill on M5 Max)
  • Fully integrated into llama.cpp with Metal GPU kernels; tested by 10+ community members on diverse Apple Silicon and discrete GPU hardware
Source: Hacker News, https://github.com/TheTom/turboquant_plus

Summary

TurboQuant Plus, a KV cache compression technique for local LLM inference, has gained a significant decode speedup from a sparse V dequantization optimization. The method compresses the transformer KV cache 4.6x while maintaining speed parity with standard q8_0 quantization on Apple Silicon. The sparse dequantization step skips V entries whose attention weights fall below a 1e-6 threshold, yielding a 22.8% decode speedup at 32K context length. The implementation, now fully integrated into llama.cpp with Metal GPU kernels, shows minimal quality loss (a perplexity delta of about 1%) and achieves 100% accuracy on NIAH (needle-in-a-haystack) retrieval benchmarks.
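The skip rule can be illustrated with a minimal Python sketch. This is not the project's Metal kernel: the function names, the toy data, and the per-row integer-times-scale format in `dequantize_row` are hypothetical stand-ins, and only the 1e-6 threshold comes from the article.

```python
import math

THRESHOLD = 1e-6  # cutoff quoted in the summary


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dequantize_row(q_row, scale):
    """Hypothetical stand-in format: integer codes times one per-row scale."""
    return [scale * q for q in q_row]


def attend_v(scores, v_quant, v_scales, threshold=THRESHOLD):
    """Weighted sum over V rows, skipping rows whose weight is below threshold.

    Returns the output vector and the number of rows never dequantized.
    """
    weights = softmax(scores)
    out = [0.0] * len(v_quant[0])
    skipped = 0
    for w, q_row, scale in zip(weights, v_quant, v_scales):
        if w < threshold:
            skipped += 1  # negligible contribution: no dequantization at all
            continue
        for i, x in enumerate(dequantize_row(q_row, scale)):
            out[i] += w * x
    return out, skipped


# Toy data (assumed): two dominant positions and two negligible ones.
scores = [10.0, 9.0, -20.0, -25.0]
v_quant = [[3, -1], [2, 4], [127, -128], [-90, 55]]
v_scales = [0.1, 0.2, 0.05, 0.05]

sparse_out, skipped = attend_v(scores, v_quant, v_scales)
dense_out, _ = attend_v(scores, v_quant, v_scales, threshold=0.0)
max_err = max(abs(a - b) for a, b in zip(sparse_out, dense_out))
```

On this toy input, two of the four rows are skipped and the sparse output matches the dense one to within floating-point noise, which is the intuition behind a speedup that does not change the result in any meaningful way.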

The sparse V dequantization optimization, a three-line kernel change, proves particularly effective at long context lengths, where 90%+ of attention weights are negligible, and saves approximately half of the total dequantization cost. Testing across diverse hardware (Apple M1 through M5, RTX 30/40/50 series, AMD 6800/9070) and comprehensive evaluation (511+ Python tests, 14 decode approaches benchmarked) confirm the technique's robustness. The "Plus" designation signals planned post-v1 improvements, including adaptive bit allocation, temporal decay compression, and expert-aware MoE optimizations.
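One way to read the "approximately half" figure is sketched below. It assumes a simple cost model that the article does not state: K must still be dequantized for every position to compute the attention scores, V accounts for about half of the total dequantization work (`v_share=0.5`), and only V rows with negligible weights are skipped. The function names and sample weights are illustrative.

```python
THRESHOLD = 1e-6  # cutoff quoted in the summary


def skip_rate(weights, threshold=THRESHOLD):
    """Fraction of attention weights that fall below the threshold."""
    below = sum(1 for w in weights if w < threshold)
    return below / len(weights)


def dequant_savings(v_skip_rate, v_share=0.5):
    """Fraction of total (K + V) dequantization work avoided.

    v_share is the assumed share of the work spent on V; K is never skipped.
    """
    if not 0.0 <= v_skip_rate <= 1.0:
        raise ValueError("v_skip_rate must be between 0 and 1")
    return v_share * v_skip_rate


# Assumed example: one dominant weight and nine negligible ones.
weights = [0.9999991] + [1e-7] * 9
rate = skip_rate(weights)          # 9 of 10 weights are below 1e-6
saved = dequant_savings(rate)      # 0.5 * 0.9 under the assumed cost model
```

Under these assumptions a 90% skip rate on V removes about 45% of the combined dequantization work, which is consistent with the article's "approximately half".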

  • The sparse V optimization is hardware-agnostic; a 5% speedup was also confirmed on standard q8_0, indicating that the attention-aware optimization has value beyond TurboQuant
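To put the 4.6x compression figure in memory terms, the following back-of-the-envelope sketch estimates KV cache size at 32K context. The model shape (32 layers, 8 KV heads, head dimension 128) is hypothetical, and treating fp16 as the baseline for the 4.6x ratio is an assumption; the 8.5 bits per value for q8_0 follows from its block layout of 32 int8 values plus one fp16 scale.

```python
FP16_BITS = 16.0
Q8_0_BITS = 8.5            # 34 bytes per block of 32 values
COMPRESSION = 4.6          # ratio quoted in the article, assumed relative to fp16
COMPRESSED_BITS = FP16_BITS / COMPRESSION


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Bytes needed to store K and V for every layer, head, and position."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # factor 2: K and V
    return values * bits_per_value / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)

fp16_bytes = kv_cache_bytes(bits_per_value=FP16_BITS, **shape)
q8_0_bytes = kv_cache_bytes(bits_per_value=Q8_0_BITS, **shape)
compressed_bytes = kv_cache_bytes(bits_per_value=COMPRESSED_BITS, **shape)

GIB = 2 ** 30
for name, size in [("fp16", fp16_bytes), ("q8_0", q8_0_bytes),
                   ("4.6x compressed", compressed_bytes)]:
    print(f"{name:>16}: {size / GIB:.2f} GiB")
```

For this assumed shape the fp16 cache is 4 GiB, so a 4.6x reduction brings it under 1 GiB, compared with roughly 2.1 GiB for q8_0.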

Editorial Opinion

TurboQuant Plus represents a pragmatic engineering advance in local LLM inference, addressing context-dependent performance degradation through simple sparse computation rather than added algorithmic complexity. The 22% decode improvement at long context lengths (32K tokens) addresses a genuine pain point for document processing workflows, and the three-line kernel implementation suggests the optimization could be readily adopted by other quantization schemes. However, the 10% decode penalty at short context and the modest 1% perplexity cost indicate that this remains a compression-speed tradeoff rather than a pure win; practitioners must validate it against their specific use cases.

Large Language Models (LLMs), Machine Learning, MLOps & Infrastructure, AI Hardware, Open Source
