Huawei Releases KVarN: Open-Source KV-Cache Quantization Accelerator for vLLM
Key Takeaways
- ▸KVarN provides 3-5× more KV-cache capacity while maintaining FP16-level accuracy, addressing the accuracy-throughput tradeoff that held back previous quantization methods
- ▸Calibration-free, plug-and-play integration: add one vLLM flag (--kv-cache-dtype kvarn_k4v2_g128), with no model changes or calibration required
- ▸FP16-level or better throughput: roughly 1.3× the throughput of FP16 while providing several times the cache capacity, so the added capacity does not come at the cost of inference speed
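To make the single-flag integration concrete, the following is a minimal sketch that assembles a `vllm serve` command line with the flag value quoted above. It assumes the KVarN build of vLLM is already installed (a stock vLLM install may not accept this dtype value), and the model ID `Qwen/Qwen3-32B` is a hypothetical choice made only to match the benchmark model named in this article.

```python
import shlex

# Hypothetical model ID, picked to match the Qwen3-32B benchmark in the article.
MODEL = "Qwen/Qwen3-32B"


def build_serve_command(model, kv_cache_dtype="kvarn_k4v2_g128"):
    """Build the argv list for `vllm serve` with the KV-cache dtype flag set."""
    return ["vllm", "serve", model, "--kv-cache-dtype", kv_cache_dtype]


command = build_serve_command(MODEL)

# Print a copy-pasteable shell command; pass `command` to subprocess.run to launch it.
print(shlex.join(command))
# -> vllm serve Qwen/Qwen3-32B --kv-cache-dtype kvarn_k4v2_g128
```

Building the command as a list rather than a string keeps each argument separate, which avoids shell-quoting mistakes if the model path contains spaces.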
Summary
Huawei has open-sourced KVarN, a native vLLM backend that improves KV-cache efficiency for large language models through variance-normalized quantization. The tool delivers 3-5× more KV-cache capacity while maintaining FP16-level accuracy and throughput. This addresses a long-standing production problem: previous KV-cache quantization methods gained capacity at the expense of throughput, which made them impractical for real-world deployments.
KVarN applies a four-stage quantization pipeline to fixed-size token tiles. The stages described here are Hadamard rotation, which mixes channels to spread outliers; iterative variance normalization, which equalizes tile variance in log-space; and asymmetric low-bit quantization, which compresses keys to 4 bits and values to 2 bits. The technique is calibration-free and is enabled in vLLM with a single flag, with no model changes or complex setup.
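The stages named above can be illustrated with a small, dependency-free sketch. This is not KVarN's implementation: the alternating row/column RMS rescaling stands in for the iterative variance normalization (its exact form is an assumption), quantization is done per row rather than per 128-element group, and all function names are hypothetical.

```python
import math


def hadamard_rotate(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]  # normalized, so it is its own inverse


def normalize_variance(tile, iters=4, eps=1e-8):
    """Alternately rescale rows and columns toward unit RMS (assumed scheme).

    Scale factors are accumulated as logarithms so they can be undone exactly.
    """
    rows, cols = len(tile), len(tile[0])
    work = [list(r) for r in tile]
    log_r, log_c = [0.0] * rows, [0.0] * cols
    for _ in range(iters):
        for i in range(rows):
            rms = math.sqrt(sum(x * x for x in work[i]) / cols) + eps
            log_r[i] += math.log(rms)
            work[i] = [x / rms for x in work[i]]
        for j in range(cols):
            rms = math.sqrt(sum(work[i][j] ** 2 for i in range(rows)) / rows) + eps
            log_c[j] += math.log(rms)
            for i in range(rows):
                work[i][j] /= rms
    return work, log_r, log_c


def quantize_asymmetric(values, bits):
    """Map values to integer codes in [0, 2**bits - 1] with a scale and offset."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / ((1 << bits) - 1) or 1.0  # avoid zero scale on flat input
    return [round((v - lo) / scale) for v in values], scale, lo


def dequantize(codes, scale, lo):
    return [c * scale + lo for c in codes]


# Toy 4-token x 4-channel tile in which the third channel carries outliers.
tile = [
    [0.5, -1.0, 12.0, 0.25],
    [0.1, 0.3, -9.0, -0.2],
    [-0.4, 0.8, 10.5, 0.05],
    [0.2, -0.6, -11.0, 0.4],
]

rotated = [hadamard_rotate(row) for row in tile]
normed, log_r, log_c = normalize_variance(rotated)
packed = [quantize_asymmetric(row, bits=4) for row in normed]  # 4-bit, as for keys

# Reverse the pipeline: dequantize, undo the scaling, rotate back.
restored = []
for i, (codes, scale, lo) in enumerate(packed):
    row = dequantize(codes, scale, lo)
    row = [x * math.exp(log_r[i] + log_c[j]) for j, x in enumerate(row)]
    restored.append(hadamard_rotate(row))

max_err = max(abs(a - b) for r0, r1 in zip(tile, restored) for a, b in zip(r0, r1))
print(f"max abs reconstruction error: {max_err:.3f}")
```

The rotation turns one large channel into several moderate ones, so the min/max range that the quantizer has to cover is spread more evenly across the row. Passing `bits=2` to `quantize_asymmetric` gives the coarser setting the article describes for values.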
The release addresses growing demand for longer-context and agentic workloads, which require very large KV-cache allocations. Benchmarked on Qwen3-32B, KVarN achieves FP16-level accuracy while delivering ~4× the cache capacity and throughput equal to or better than the unquantized FP16 baseline. The implementation ships as a vLLM fork that can be installed via pip, and serving is enabled with a single command-line flag.
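As a rough sanity check on the capacity figures, the arithmetic below estimates the storage ratio of 4-bit keys and 2-bit values against an FP16 cache. It is a back-of-the-envelope sketch under assumptions that do not come from the release: `g128` in the flag name is read as a group size of 128, and each group is assumed to store an FP16 scale and an FP16 offset (32 bits of metadata).

```python
def bits_per_element(bits, group_size, metadata_bits):
    """Average stored bits per element, including per-group metadata."""
    return bits + metadata_bits / group_size


def capacity_gain(key_bits=4, value_bits=2, group_size=128,
                  metadata_bits=32, baseline_bits=16):
    """Ratio of FP16 K+V storage to quantized K+V storage per cached element pair."""
    k = bits_per_element(key_bits, group_size, metadata_bits)
    v = bits_per_element(value_bits, group_size, metadata_bits)
    return (2 * baseline_bits) / (k + v)


print(f"{capacity_gain():.2f}x")                 # -> 4.92x (with assumed metadata)
print(f"{capacity_gain(metadata_bits=0):.2f}x")  # -> 5.33x (payload bits only)
```

These idealized numbers sit at the upper end of the 3-5× range quoted in the article. The sketch counts storage bits only and ignores any other overhead, so it should be read as an upper bound rather than a prediction of the benchmarked ~4× result.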
KVarN is described in an accompanying research paper (arXiv:2606.03458) and is available on GitHub at https://github.com/huawei-csl/KVarN. The project targets production deployments of long-context and reasoning-heavy applications.
- 2.4× faster than TurboQuant (the prior state of the art) at the same cache capacity, with higher accuracy
- Open-sourced on GitHub with Triton JIT-compiled kernels and designed for agentic and long-context production workloads


