Research Reveals High-Entropy Tokens Are Key to Efficient Reasoning in Alibaba's Qwen Models
Key Takeaways
- High-entropy "forking tokens" determine reasoning directions; only this minority of tokens needs optimization for effective RL training
- Restricting RL updates to the top 20% of high-entropy tokens outperforms full-gradient training on Qwen3-32B (+11.04 on AIME '25), with the gap widening at larger model scales
- Training exclusively on the lowest-entropy tokens actively degrades performance, showing that token selection is critical: not all tokens contribute equally to reasoning
Summary
A new research paper examining reinforcement learning for large language model reasoning has uncovered a surprising efficiency principle: restricting training updates to just 20% of tokens, specifically the high-entropy "forking tokens", can match or exceed the performance of training on all tokens. The study, which tested this approach on Alibaba's Qwen3 models, found that these minority tokens act as critical decision points steering the model toward diverse reasoning pathways. On Qwen3-32B, the approach gained +11.04 points on the AIME '25 benchmark over full-gradient training, challenging the assumption that effective RL training requires gradients on every token.
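To make the idea concrete, here is a minimal PyTorch sketch of how such a gradient mask could work. The function name, tensor shapes, and the plain REINFORCE-style objective are all illustrative assumptions on our part; the paper itself trains with a clipped DAPO-style objective, not this simplified loss.

```python
import torch
import torch.nn.functional as F

def entropy_masked_pg_loss(logits, actions, advantages, top_frac=0.2):
    """Policy-gradient loss restricted to high-entropy 'forking' positions.

    logits:     (batch, seq_len, vocab) raw model outputs
    actions:    (batch, seq_len) sampled token ids
    advantages: (batch, seq_len) per-token advantage estimates
    top_frac:   fraction of highest-entropy tokens that keep gradients
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Shannon entropy of the next-token distribution at every position.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (batch, seq_len)

    # Keep gradients only for the top `top_frac` highest-entropy positions.
    k = max(1, int(top_frac * entropy.numel()))
    threshold = entropy.flatten().topk(k).values.min()
    mask = (entropy >= threshold).float()                  # no grad flows through mask

    # REINFORCE-style objective, zeroed outside the high-entropy mask.
    chosen_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return -(mask * advantages * chosen_logp).sum() / mask.sum().clamp(min=1.0)
```

The key design point is that the mask only gates which positions contribute gradient; the forward pass and the sampled rollout are unchanged, which is why the method saves optimization effort without altering generation.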
The research adopts a novel perspective on how reinforcement learning with verifiable rewards (RLVR) works by analyzing token entropy patterns in chain-of-thought reasoning. Crucially, the authors found that training exclusively on the 80% of lowest-entropy tokens actually degrades performance, indicating that token selection is fundamental to RL effectiveness. The insight also scales: larger Qwen3 models show stronger improvements than smaller ones when high-entropy tokens are prioritized.
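As a rough illustration of that entropy analysis, the sketch below (assuming the Hugging Face transformers API and an invented example trace, not the paper's actual pipeline) scores each position of a reasoning string by the Shannon entropy of the model's next-token distribution and prints the highest-entropy positions, i.e. the candidate forking tokens.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works for this demonstration; Qwen3-32B matches the paper,
# but a much smaller checkpoint is more practical to load locally.
name = "Qwen/Qwen3-32B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

trace = "To solve the equation, first move every term to one side, then factor."
ids = tok(trace, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0, :-1]        # distribution predicting each next token

log_p = F.log_softmax(logits, dim=-1)
entropy = -(log_p.exp() * log_p).sum(dim=-1)  # Shannon entropy per position

# The handful of highest-entropy positions are the candidate forking tokens.
for pos in entropy.topk(5).indices.tolist():
    print(repr(tok.decode(ids[0, pos + 1])), round(float(entropy[pos]), 3))
```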
The findings have broad implications for LLM training efficiency and reasoning improvement. By identifying which tokens matter most for reasoning performance, researchers can potentially reduce computational costs while improving outcomes—a significant consideration as models scale to larger sizes.
Token entropy patterns thus provide a new framework for understanding why RL improves LLM reasoning and for designing more efficient future training approaches.
Editorial Opinion
This research cuts through the black box of reinforcement learning for reasoning by identifying which tokens actually matter for performance improvement. The finding that 80% of tokens can be safely ignored while performance improves is both counterintuitive and practically valuable. If these insights generalize beyond Qwen, they could significantly reduce the computational burden of reasoning-focused LLM training while improving results, a meaningful step toward more efficient AI systems.