Analysis Reveals Reinforcement Learning Scaling Requires 10,000x More Training Compute to Match the Gains of 100x Inference Scaling
Key Takeaways
- RL training compute improves performance at roughly half the slope of inference compute on logarithmic axes: the gain that 100x inference scaling delivers would require roughly 10,000x (100 squared) more RL training compute
- Most performance gains in OpenAI's o1 model came from enabling longer chain-of-thought reasoning (inference-scaling) rather than from RL training itself
- Deployment costs multiply directly with inference compute requirements (30x longer thinking time means 30x higher cost per query), creating significant economic pressure
Summary
Philosopher and AI researcher Toby Ord has published a detailed analysis examining how reinforcement learning (RL) scales in modern AI systems, with significant implications for the cost and development of reasoning models. His analysis of OpenAI's o1 model charts reveals that RL-scaling (increasing compute during training) has approximately half the slope of inference-scaling on logarithmic axes. Because performance grows with the logarithm of compute, halving the slope squares the compute requirement: a performance improvement that longer inference (chain-of-thought reasoning) buys with a 100x compute increase would take a 10,000x increase in RL training compute.
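The half-slope relationship can be sketched numerically. This is a minimal illustration, assuming performance rises linearly with log10(compute) on both axes and that the RL slope is exactly half the inference slope; the function name and slope ratio are illustrative choices, not taken from Ord's analysis:

```python
import math

def rl_compute_multiplier(inference_multiplier: float,
                          slope_ratio: float = 0.5) -> float:
    """RL-training compute multiplier needed to match the performance
    gain of a given inference-compute multiplier.

    With performance linear in log10(compute) and the RL slope equal to
    slope_ratio times the inference slope, equal gains require
    log10(m_rl) = log10(m_inf) / slope_ratio.
    """
    return 10 ** (math.log10(inference_multiplier) / slope_ratio)

print(rl_compute_multiplier(100))  # → 10000.0: 100x inference ≈ 10,000x RL
```

With a half slope, the required RL multiplier is simply the square of the inference multiplier, which is where the 100x-versus-10,000x gap in the analysis comes from.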
The analysis highlights that in OpenAI's initial o1 release, most performance gains came from unlocking inference-scaling capabilities rather than the RL training itself. While RL training provided a modest boost and enabled the model to use 30x longer chains of thought productively, the extended inference time contributed the larger performance improvement. This finding has major cost implications: if headline performance requires 30x more inference compute, deployment costs multiply by the same factor—expenses that must be paid with every model use and cannot be amortized through volume.
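The amortization point can be made concrete with a toy cost model (all dollar figures below are hypothetical, chosen purely for illustration): a one-time training cost is spread over every query served, while the inference multiplier is paid on each query and never amortizes away.

```python
def cost_per_query(train_cost: float, base_infer_cost: float,
                   infer_multiplier: float, n_queries: int) -> float:
    """Amortized cost of serving one query: the one-time training cost
    is spread over n_queries, but the inference-compute multiplier
    applies to every query forever."""
    return train_cost / n_queries + base_infer_cost * infer_multiplier

# Hypothetical figures: $10M training run, $0.002 base inference cost,
# 30x longer chains of thought.
print(cost_per_query(10e6, 0.002, 30, n_queries=1_000_000))      # ≈ 10.06
print(cost_per_query(10e6, 0.002, 30, n_queries=1_000_000_000))  # ≈ 0.07
```

At high volume the training term vanishes and only the 30x inference multiplier remains, which is exactly the recurring expense the analysis flags.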
Ord's analysis demonstrates that across multiple benchmarks (AIME, ARC-AGI) and models (OpenAI's o1, Anthropic's Sonnet 3.7), a consistent pattern emerges: 100x inference-scaling typically drives performance from 20% to 80% accuracy. However, achieving the same improvement through RL-scaling would require 10,000x more training compute (100 squared, due to the half-slope relationship). This stark difference suggests fundamental limits to how efficiently RL can improve reasoning capabilities compared to simply allowing models more time to think.
- These scaling dynamics suggest inference-time compute may be more cost-effective for capability improvements than additional RL training
Editorial Opinion
This analysis reveals a potentially critical constraint on the RL-scaling paradigm that has dominated recent AI development. If matching a performance gain through RL training truly requires squaring the compute multiplier relative to inference (10,000x versus 100x), the economic calculus of AI development shifts dramatically, favoring architectures optimized for inference efficiency over ever-larger training runs. The finding also raises questions about whether we are approaching fundamental limits in how much reasoning ability can be "baked in" through training versus unlocked at inference time, with profound implications for both AI safety (can we align reasoning that emerges at inference?) and business models (recurring inference costs versus one-time training investments).


