BotBeat

Alibaba (Cloud)
RESEARCH · 2026-04-30

Research Reveals High-Entropy Tokens Are Key to Efficient Reasoning in Alibaba's Qwen Models

Key Takeaways

  • High-entropy "forking tokens" determine reasoning directions; only this minority of tokens needs optimization for effective RL training
  • Focusing RL updates on the top 20% of high-entropy tokens outperforms full-gradient training on Qwen3-32B (+11.04 on AIME '25) and larger models, showing a strong scaling trend
  • Training exclusively on the lowest-entropy tokens actively harms performance, showing that token selection is critical; not all tokens contribute equally to reasoning
Source: Hacker News — https://arxiv.org/abs/2506.01939

Summary

A new research paper examining reinforcement learning for large language model reasoning has uncovered a surprising efficiency principle: focusing training updates on just 20% of tokens, specifically high-entropy "forking tokens," can match or exceed the performance of training on all tokens. The study, which tested this approach on Alibaba's Qwen3 models, found that these minority tokens act as critical decision points steering models toward diverse reasoning pathways. On Qwen3-32B, the approach gained +11.04 points on the AIME '25 benchmark compared to full-gradient training, defying conventional wisdom about training efficiency.
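To make the "forking token" idea concrete, here is a minimal sketch of how per-token entropy can be computed from a model's next-token logits. The function name and the toy logits are illustrative, not taken from the paper; the entropy formula itself is the standard Shannon entropy over the softmax distribution.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy of each next-token distribution.

    logits: (seq_len, vocab_size) array of raw model logits.
    Returns: (seq_len,) array of per-token entropies in nats.
    """
    # Softmax with a max-shift for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # H = -sum_v p_v * log p_v (clip to avoid log(0)).
    return -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=-1)

# A near-uniform distribution (a plausible "forking" point) has high
# entropy; a sharply peaked one (a forced continuation) has low entropy.
peaked = np.array([[10.0, 0.0, 0.0, 0.0]])
uniform = np.array([[1.0, 1.0, 1.0, 1.0]])
print(token_entropy(peaked))   # close to 0
print(token_entropy(uniform))  # close to log(4) ≈ 1.386
```

Tokens whose entropy is high are the candidate decision points the paper targets for RL updates.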

The research adopts a novel perspective on how reinforcement learning with verifiable rewards (RLVR) works by analyzing token entropy patterns in chain-of-thought reasoning. Crucially, the authors found that training exclusively on the 80% of lowest-entropy tokens actually degrades performance, indicating that token selection is fundamental to RL effectiveness. This insight scales across model sizes: larger Qwen3 models show stronger improvements than smaller ones when high-entropy tokens are prioritized.
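The token-selection step described above can be sketched as a simple mask: keep only the top fraction of tokens by entropy and zero out the per-token loss everywhere else. The 0.2 fraction and the thresholding rule below are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def top_entropy_mask(entropies, fraction=0.2):
    """Boolean mask selecting the top `fraction` of tokens by entropy.

    entropies: (seq_len,) per-token entropy values.
    Returns a mask that is True only for high-entropy "forking" tokens.
    """
    k = max(1, int(round(fraction * len(entropies))))
    # Entropy of the k-th highest token serves as the cutoff.
    threshold = np.sort(entropies)[-k]
    return entropies >= threshold

# Toy entropies for a 10-token reasoning chain.
entropies = np.array([0.1, 2.3, 0.05, 1.8, 0.2, 0.4, 0.3, 1.1, 0.15, 0.02])
mask = top_entropy_mask(entropies)

# Per-token policy-gradient losses outside the mask are zeroed, so only
# the forking tokens contribute to the RL update.
per_token_loss = np.ones_like(entropies)  # stand-in for real RL losses
masked_loss = per_token_loss * mask
print(int(mask.sum()))  # 2 of 10 tokens (20%) receive gradient
```

Ties at the threshold can admit slightly more than the target fraction; a production implementation would resolve them explicitly.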

The findings have broad implications for LLM training efficiency and reasoning improvement. By identifying which tokens matter most for reasoning performance, researchers can potentially reduce computational costs while improving outcomes—a significant consideration as models scale to larger sizes.

  • Token entropy patterns provide a new framework for understanding why RL improves LLM reasoning and how to optimize future training approaches

Editorial Opinion

This research cuts through the black box of reinforcement learning for reasoning by identifying which tokens actually matter for performance improvement. The finding that 80% of tokens can be safely ignored while training efficiency increases is both counterintuitive and practically valuable. If these insights generalize beyond Qwen, they could significantly reduce the computational burden of reasoning-focused LLM training while improving results—a meaningful step toward more efficient AI systems.

Large Language Models (LLMs) · Reinforcement Learning · Machine Learning · Deep Learning


© 2026 BotBeat