BotBeat
OPEN SOURCE | Huawei | 2026-06-04

Huawei Releases KVarN: Open-Source KV-Cache Quantization Accelerator for vLLM

Key Takeaways

  • KVarN provides 3-5× more KV-cache capacity while maintaining FP16-level accuracy, addressing the accuracy-throughput tradeoff that held back previous quantization methods
  • Calibration-free, plug-and-play integration: add one vLLM flag (--kv-cache-dtype kvarn_k4v2_g128), with no model changes or calibration required
  • FP16-level or better throughput: roughly 1.3× the throughput of FP16 while providing several times the cache capacity, a combination that keeps inference fast
Source: Hacker News, https://github.com/huawei-csl/KVarN

Summary

Huawei has open-sourced KVarN, a native vLLM backend that dramatically improves KV-cache efficiency for large language models through variance-normalized quantization. The tool delivers 3-5× more KV-cache capacity while maintaining FP16-level accuracy and throughput, solving a long-standing production challenge: previous KV-cache quantization methods traded throughput for capacity, making them impractical for real-world deployments.
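The 3-5× capacity figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes that the flag value kvarn_k4v2_g128 denotes 4-bit keys, 2-bit values, and quantization groups of 128 elements. It further assumes, which the source does not state, that each group stores an FP16 scale and an FP16 zero point (32 bits of metadata). The function names are illustrative, and any extra storage KVarN may need for its normalization factors is ignored.

```python
def bits_per_element(payload_bits, group_size, metadata_bits=32):
    """Average storage per cached element, including per-group metadata.

    metadata_bits=32 is an ASSUMPTION (FP16 scale + FP16 zero point per group).
    """
    return payload_bits + metadata_bits / group_size


def capacity_gain(key_bits=4, value_bits=2, group_size=128,
                  baseline_bits=16, metadata_bits=32):
    """Ratio of FP16 KV-cache size to quantized KV-cache size per token."""
    quantized = (bits_per_element(key_bits, group_size, metadata_bits)
                 + bits_per_element(value_bits, group_size, metadata_bits))
    baseline = 2 * baseline_bits  # one key element plus one value element
    return baseline / quantized


ideal = capacity_gain(metadata_bits=0)   # payload only, no scales or zero points
with_metadata = capacity_gain()          # assumed FP16 scale + zero point per group
print(f"ideal: {ideal:.2f}x, with assumed metadata: {with_metadata:.2f}x")
```

Under these assumptions the payload alone would allow about 5.3× more tokens in the same memory, and per-group metadata brings that to about 4.9×, which is in line with the reported 3-5× range.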

KVarN works by applying a sequence of quantization stages to fixed-size token tiles: a Hadamard rotation mixes channels to spread outliers, iterative variance normalization equalizes tile variance in log-space, and asymmetric low-bit quantization compresses keys to 4 bits and values to 2 bits. The technique is calibration-free and requires only a single vLLM flag to enable, with no model changes or complex setup.
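As a rough illustration of the stages named above, here is a minimal pure-Python sketch. It is not KVarN's implementation, which the project describes as Triton kernels. The alternating row and column rescaling is only one plausible reading of "iterative variance normalization in log-space", the quantizer uses one min/max range per token row rather than whatever grouping KVarN actually uses, and every function name is hypothetical.

```python
import math


def hadamard_rotate(vec):
    """Normalized fast Walsh-Hadamard transform (orthogonal and self-inverse)."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [x / norm for x in out]


def normalize_tile(tile, iterations=4, eps=1e-8):
    """ASSUMED scheme: alternately rescale rows and columns toward unit RMS,
    accumulating the applied scales as logarithms."""
    rows, cols = len(tile), len(tile[0])
    work = [list(row) for row in tile]
    log_row, log_col = [0.0] * rows, [0.0] * cols
    for _ in range(iterations):
        for r in range(rows):
            rms = math.sqrt(sum(x * x for x in work[r]) / cols) + eps
            log_row[r] += math.log(rms)
            work[r] = [x / rms for x in work[r]]
        for c in range(cols):
            rms = math.sqrt(sum(work[r][c] ** 2 for r in range(rows)) / rows) + eps
            log_col[c] += math.log(rms)
            for r in range(rows):
                work[r][c] /= rms
    return work, log_row, log_col


def quantize_asymmetric(values, bits):
    """Map values onto integer codes in [0, 2**bits - 1] using a min/max range."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant input
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    return [c * scale + lo for c in codes]


def roundtrip_tile(tile, bits):
    """Rotate, normalize, quantize, then invert each step to measure the error."""
    rotated = [hadamard_rotate(row) for row in tile]
    normed, log_row, log_col = normalize_tile(rotated)
    restored = []
    for r, row in enumerate(normed):
        approx = dequantize(*quantize_asymmetric(row, bits))
        rescaled = [x * math.exp(log_row[r] + log_col[c]) for c, x in enumerate(approx)]
        restored.append(hadamard_rotate(rescaled))  # self-inverse transform
    return restored


# Toy tile: 4 tokens x 8 channels with one outlier value.
tile = [[math.sin(1.7 * t + 0.9 * c) for c in range(8)] for t in range(4)]
tile[2][5] = 9.0
approx = roundtrip_tile(tile, bits=4)
max_err = max(abs(a - b) for ra, rb in zip(approx, tile) for a, b in zip(ra, rb))
print(f"max abs reconstruction error at 4 bits: {max_err:.4f}")
```

The rotation matters because a single large channel would otherwise stretch the min/max range of its group and waste most of the few available quantization levels. After the rotation, the outlier's energy is spread across all channels of that token.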

The release addresses growing demand for longer-context and agentic workloads, which require massive KV-cache allocations. Benchmarked on Qwen3-32B, KVarN achieves FP16-level accuracy while delivering ~4× the cache capacity and throughput equal to or better than the unquantized FP16 baseline. The implementation ships as a vLLM fork and can be installed via pip, with serving enabled by a single command-line flag.
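A minimal sketch of assembling the serve command described above. `vllm serve` and `--kv-cache-dtype` are standard vLLM CLI elements, while the value kvarn_k4v2_g128 is taken from this article and would presumably only be accepted by the KVarN fork. The model identifier Qwen/Qwen3-32B is an assumed Hugging Face ID for the benchmarked model, and the helper name is hypothetical. The pip install command is not given in the source, so it is not shown.

```python
import shlex


def build_serve_command(model, kv_cache_dtype="kvarn_k4v2_g128"):
    """Return the argv list for launching a vLLM server with a quantized KV cache.

    The default dtype value comes from the article; stock vLLM may reject it.
    """
    return ["vllm", "serve", model, "--kv-cache-dtype", kv_cache_dtype]


cmd = build_serve_command("Qwen/Qwen3-32B")  # model ID is an assumption
print(shlex.join(cmd))
# To launch for real, with the KVarN fork installed:
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.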

KVarN is accompanied by a research paper (arXiv:2606.03458) and is available on GitHub at https://github.com/huawei-csl/KVarN. The project targets production deployments of long-context and reasoning-heavy applications.

  • 2.4× faster than TurboQuant (the prior state of the art) while matching its cache capacity and delivering higher accuracy
  • Open-sourced on GitHub with Triton JIT-compiled kernels, designed for agentic and long-context production workloads
Tags: Large Language Models (LLMs), Machine Learning, MLOps & Infrastructure, Open Source
