Research Challenges Core Transformer Design: Study Shows Three QKV Projections May Be Unnecessary
Key Takeaways
- The standard three-projection architecture appears over-parameterized; sharing projections achieves comparable performance with a substantially smaller memory footprint
- Q-K=V (separate queries, one shared projection for keys and values) cuts the KV cache by 50% at a cost of 3.1% perplexity degradation, which makes it practical for memory-constrained devices
- Combined with Multi-Query Attention (MQA), Q-K=V reaches a 96.9% cache reduction, making on-device inference with large language models more feasible
Summary
A systematic study questions a foundational design choice in the transformer architecture, arguing that the standard approach of three separate Query, Key, and Value (QKV) projections may be over-engineered. The researchers tested variants that share projections, including one that uses a single projection for all three roles, and report performance comparable or superior to standard transformers across vision and language tasks.
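The projection-sharing variants described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the study's code: the variant names (`standard`, `kv_shared`, `all_shared`), the helper functions, and the toy sizes are all assumptions made for the example. It uses a single unmasked attention head and counts a shared matrix only once.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def make_projections(d_model, variant, rng):
    """Return (W_q, W_k, W_v); shared variants reuse the same matrix object."""
    w_q = random_matrix(d_model, d_model, rng)
    if variant == "standard":    # three independent projections
        return w_q, random_matrix(d_model, d_model, rng), random_matrix(d_model, d_model, rng)
    if variant == "kv_shared":   # Q-K=V: one matrix produces both keys and values
        w_kv = random_matrix(d_model, d_model, rng)
        return w_q, w_kv, w_kv
    if variant == "all_shared":  # Q=K=V: one matrix for all three roles
        return w_q, w_q, w_q
    raise ValueError(f"unknown variant: {variant}")


def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention without a mask."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = 1.0 / math.sqrt(len(w_q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)


def count_unique_parameters(*mats):
    """Count projection parameters, counting a shared matrix once."""
    unique = {id(m): m for m in mats}
    return sum(len(m) * len(m[0]) for m in unique.values())


rng = random.Random(0)
x = random_matrix(4, 8, rng)  # 4 tokens, d_model = 8 (toy sizes)
for variant in ("standard", "kv_shared", "all_shared"):
    w_q, w_k, w_v = make_projections(8, variant, rng)
    out = attention(x, w_q, w_k, w_v)
    print(variant, count_unique_parameters(w_q, w_k, w_v), f"{len(out)}x{len(out[0])}")
```

With these toy sizes the projection parameter count drops from 192 to 128 to 64 while the output shape stays 4x8, which is the sense in which projection sharing acts as weight tying.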
Key results show that sharing the key and value projections (the Q-K=V variant) achieves a 50% KV cache reduction with only 3.1% perplexity degradation in language models. The savings compound when combined with existing optimization techniques: Q-K=V paired with Multi-Query Attention (MQA) achieves a 96.9% cache reduction, bringing on-device inference for large language models closer to practical. Experiments spanning synthetic tasks, vision benchmarks (MNIST, CIFAR, TinyImageNet, anomaly detection), and language modeling (300M- and 1.2B-parameter models on 10B tokens) suggest that transformers operate in low-rank regimes where keys and values effectively share representational space.
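The cache figures follow from simple arithmetic, sketched below under stated assumptions. The configuration (24 layers, 16 heads, head dimension 64, 4096 tokens, 2 bytes per value) is hypothetical and not taken from the study; 16 heads is chosen here only because it reproduces the reported 96.9%, since sharing K and V halves the cache and MQA divides the number of cached heads by the head count, giving a reduction of 1 - 1/(2 x 16) = 96.875%. The function names are likewise illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, share_kv=False):
    """Bytes needed to cache keys and values for one sequence."""
    # Standard attention caches two tensors per layer (K and V);
    # with a shared K=V projection a single tensor serves both roles.
    tensors_per_layer = 1 if share_kv else 2
    return n_layers * tensors_per_layer * n_kv_heads * head_dim * seq_len * bytes_per_value


def reduction(baseline_bytes, variant_bytes):
    """Fraction of the baseline cache that the variant saves."""
    return 1.0 - variant_bytes / baseline_bytes


# Hypothetical configuration, not taken from the study.
cfg = dict(n_layers=24, head_dim=64, seq_len=4096)
n_heads = 16

baseline = kv_cache_bytes(n_kv_heads=n_heads, **cfg)               # multi-head, separate K and V
shared = kv_cache_bytes(n_kv_heads=n_heads, share_kv=True, **cfg)  # Q-K=V
shared_mqa = kv_cache_bytes(n_kv_heads=1, share_kv=True, **cfg)    # Q-K=V plus MQA

print(f"K=V sharing alone: {reduction(baseline, shared):.1%}")      # 50.0%
print(f"K=V sharing + MQA: {reduction(baseline, shared_mqa):.1%}")  # 96.9%
```

The 50% figure is independent of model size, while the combined figure depends on the number of attention heads in the baseline.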
The research characterizes projection sharing as an underexplored weight-tying technique with direct memory benefits. Importantly, not all simplifications work equally well. The single-projection variant (Q=K=V) breaks attention directionality: when queries and keys come from the same projection, the score between tokens i and j equals the score between j and i, so the attention scores are symmetric. Asymmetric attention via 2D positional encodings can recover performance. Code is publicly available, giving practitioners immediate access to these optimization strategies.
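The loss of directionality in the single-projection case can be checked numerically. The sketch below is an illustration under assumed toy sizes and random weights, with hypothetical helper names; it compares pre-softmax score matrices only and does not attempt to reproduce the study's 2D positional-encoding fix.

```python
import random


def project(x, w):
    """Apply a linear projection: each row of x times matrix w."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def scores(q, k):
    """Pre-softmax attention scores: dot product of every query with every key."""
    return [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]


def is_symmetric(m, tol=1e-9):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))


rng = random.Random(0)
d = 4
x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]         # 3 tokens
w_shared = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
w_other = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]

# Q and K from the same projection: score(i, j) == score(j, i).
tied = scores(project(x, w_shared), project(x, w_shared))
# Q and K from different projections: no such constraint.
untied = scores(project(x, w_shared), project(x, w_other))

print("tied projections symmetric:  ", is_symmetric(tied))
print("untied projections symmetric:", is_symmetric(untied))
```

A symmetric score matrix means token i attends to token j exactly as strongly as j attends to i before normalization, which is why some additional source of asymmetry is needed when queries and keys are tied.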
- Transformers operate in low-rank regimes where keys and values occupy similar representational spaces, explaining projection sharing's effectiveness
Editorial Opinion
This research points to inefficiencies in production transformer designs that have gone largely unexamined. The systematic evaluation is refreshing, since architectural choices too often harden into dogma, and the practical benefits for edge deployment are substantial. Whether these optimizations are already standard practice at major labs remains unclear given the work's anonymous attribution, but the simplicity and elegance of the approach suggest it may have been independently discovered multiple times.



