BotBeat
RESEARCH · Alibaba (Cloud) · 2026-04-30

Research Reveals High-Entropy Tokens Are Key to Efficient Reasoning in Alibaba's Qwen Models

Key Takeaways

  • High-entropy 'forking tokens' determine the direction a reasoning chain takes; restricting RL updates to this minority of tokens is enough for effective training
  • Restricting RL updates to the 20% highest-entropy tokens outperforms full-gradient training on Qwen3-32B (+11.04 on AIME '25), and the gains grow with model size
  • Training exclusively on the 80% lowest-entropy tokens degrades performance, which indicates that token selection is critical: not all tokens contribute equally to reasoning

Source: Hacker News, https://arxiv.org/abs/2506.01939

Summary

A new research paper on reinforcement learning for large language model reasoning reports a surprising efficiency result: restricting training updates to just 20% of tokens, specifically high-entropy "forking tokens", can match or exceed the performance of training on all tokens. The study, which tested the approach on Alibaba's Qwen3 models, found that these minority tokens act as critical decision points that steer the model toward different reasoning pathways. On Qwen3-32B, the approach scored 11.04 points higher on the AIME '25 benchmark than full-gradient training, which challenges the assumption that every token should contribute to the update.
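To make the selection step concrete, here is a minimal sketch of how per-token entropy could be computed from a model's output logits and how the top 20% of positions might be marked. It is an illustration under stated assumptions, not the authors' implementation: the function names `token_entropy` and `high_entropy_mask` are hypothetical, the toy logits are invented, and the sketch ranks tokens within a single sequence, whereas a real training setup may pick its threshold differently (for example across a whole batch).

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one position's logits."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_mask(entropies, keep_fraction=0.2):
    """Return a 0/1 mask marking the top `keep_fraction` of positions by entropy.

    Assumption: ranking is done within one sequence; at least one position is kept.
    """
    if not entropies:
        return []
    k = max(1, math.ceil(keep_fraction * len(entropies)))
    # Indices sorted by entropy, highest first; ties are broken by position.
    ranked = sorted(range(len(entropies)), key=lambda i: (-entropies[i], i))
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(entropies))]

# Toy example: five positions over a three-word vocabulary (invented values).
toy_logits = [
    [10.0, 0.0, 0.0],  # model is nearly certain: low entropy
    [0.0, 0.0, 0.0],   # uniform distribution: maximum entropy, a "fork"
    [5.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [8.0, 0.0, 0.0],
]
entropies = [token_entropy(row) for row in toy_logits]
mask = high_entropy_mask(entropies, keep_fraction=0.2)
```

In this toy input only the second position, where the model is maximally uncertain, ends up in the mask; the other four positions would be left out of the update.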

The research takes a new view of how reinforcement learning with verifiable rewards (RLVR) works by analyzing token entropy patterns in chain-of-thought reasoning. Crucially, the authors found that training exclusively on the 80% of tokens with the lowest entropy degrades performance, indicating that token selection is fundamental to RL effectiveness. The effect scales with model size: larger Qwen3 models show stronger improvements than smaller ones when updates are restricted to high-entropy tokens.
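One way to picture "training only on selected tokens" is a loss in which unselected positions simply contribute nothing. The sketch below uses a simplified REINFORCE-style surrogate (negative advantage times log-probability) purely for illustration; it is not the objective used in the paper, and the function name `masked_policy_gradient_loss` and all numbers are hypothetical.

```python
def masked_policy_gradient_loss(logprobs, advantages, mask):
    """Mean of -advantage * logprob over positions where mask is 1.

    Assumption: a simplified REINFORCE-style surrogate, shown only to
    illustrate masking. Positions with mask == 0 are dropped entirely,
    so they would receive no gradient in a real autograd framework.
    """
    selected = [-(a * lp) for lp, a, m in zip(logprobs, advantages, mask) if m]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)

# Invented per-token log-probabilities and advantages for a three-token response.
logprobs = [-0.1, -2.0, -0.5]
advantages = [1.0, 1.0, 1.0]

# Update restricted to the single high-entropy position (index 1).
forking_only = masked_policy_gradient_loss(logprobs, advantages, [0, 1, 0])

# Full-gradient baseline: every token contributes.
all_tokens = masked_policy_gradient_loss(logprobs, advantages, [1, 1, 1])

# The reverse ablation: only the low-entropy positions contribute.
low_entropy_only = masked_policy_gradient_loss(logprobs, advantages, [1, 0, 1])
```

The three calls correspond to the three regimes the article describes: high-entropy tokens only, all tokens, and low-entropy tokens only. The sketch shows only how the mask changes which positions enter the loss; it says nothing about which regime trains better.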

The findings have broad implications for LLM training efficiency and reasoning improvement. By identifying which tokens matter most for reasoning performance, researchers can potentially reduce computational costs while improving outcomes, a significant consideration as models grow larger.

  • Token entropy patterns provide a new framework for understanding why RL improves LLM reasoning and how to optimize future training approaches

Editorial Opinion

This research opens up the black box of reinforcement learning for reasoning by identifying which tokens actually matter for performance improvement. The finding that 80% of tokens can be excluded from gradient updates without hurting performance, and with gains on larger models, is both counterintuitive and practically valuable. If these insights generalize beyond Qwen, they could significantly reduce the computational burden of reasoning-focused LLM training while improving results, a meaningful step toward more efficient AI systems.

Large Language Models (LLMs) · Reinforcement Learning · Machine Learning · Deep Learning
