PRISM: Mid-Training Emerges as Primary Driver of 3-4x Improvement in LLM Reasoning Benchmarks
Key Takeaways
- Mid-training with ~27B high-quality tokens yields consistent gains (+15-40 math, +5-12 code, +6-13 science) and enables PRISM + RL to achieve 3-4x improvements in reasoning tasks
- Data composition during mid-training is critical: science data unlocks +17-28 point GPQA-Diamond gains in subsequent RL, while RL data mix changes produce minimal differences (<2 points)
- Mid-training restructures 90%+ of model weights while RL applies surgical changes to ~5% of parameters, yet RL only succeeds on models pre-positioned by effective mid-training
Summary
A comprehensive empirical study introduces PRISM, a framework for understanding mid-training design choices in large language models. The researchers ran controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H) at scales from 3B to 24B parameters. They found that mid-training on approximately 27B high-quality tokens yields consistent improvements, +15 to +40 points on math, +5 to +12 on code, and +6 to +13 on science benchmarks, while preserving general performance.
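The central design lever here is how that ~27B-token budget is apportioned across domains. The sketch below is purely illustrative; every weight is a placeholder assumption rather than the paper's actual recipe, and it only shows how such a mixture might be specified and budgeted.

```python
# Hypothetical mid-training data mixture (domain -> proportion).
# All weights are illustrative placeholders, NOT the study's recipe;
# the point is that the mixture, especially the science share, is an
# explicit, tunable design choice at the mid-training stage.
TOTAL_TOKENS = 27e9  # ~27B high-quality tokens, per the study

mixture_weights = {
    "math": 0.40,     # placeholder
    "code": 0.30,     # placeholder
    "science": 0.15,  # placeholder; the GPQA-Diamond result suggests this share matters
    "general": 0.15,  # placeholder
}
assert abs(sum(mixture_weights.values()) - 1.0) < 1e-9

token_budget = {d: int(w * TOTAL_TOKENS) for d, w in mixture_weights.items()}
for domain, tokens in token_budget.items():
    print(f"{domain:>8}: {tokens / 1e9:.2f}B tokens")
```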
When combined with reinforcement learning, the PRISM framework achieved a 3-4x macro-average improvement across six reasoning benchmarks, rising from under 12 points to 29-42. Critically, this RL pipeline only succeeds on mid-trained models; applying RL directly to most base models yields near-zero AIME scores. The research also shows that data composition matters far more at the mid-training stage than at the RL stage: including science data during mid-training unlocks +17 to +28 point gains on GPQA-Diamond, while varying the RL data mix shifts results by less than 2 points.
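For clarity, the macro-average weights each benchmark equally regardless of its size. A minimal sketch, with placeholder benchmark names and illustrative scores rather than the study's reported numbers:

```python
# Macro-average = unweighted mean of per-benchmark scores, so a small
# benchmark like AIME counts as much as a large one. All scores below
# are illustrative placeholders, not the study's reported results.
scores = {
    "AIME": 20.0, "GPQA-Diamond": 35.0, "benchmark_3": 60.0,
    "benchmark_4": 30.0, "benchmark_5": 25.0, "benchmark_6": 40.0,
}
macro_avg = sum(scores.values()) / len(scores)
print(f"macro-average: {macro_avg:.1f}")  # 35.0 for these placeholders
```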
Mechanistic analysis explains why mid-training is so effective. Mid-training densely restructures over 90% of model weights in a comprehensive internal reorganization, while RL makes sparse, front-loaded refinements affecting only about 5% of parameters. Representation analysis using CKA (Centered Kernel Alignment) confirms that RL preserves the representational geometry established during mid-training, with CKA scores above 0.998 across architectures.
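Both measurements are straightforward to reproduce. Below is a minimal NumPy sketch, assuming the linear (Frobenius-norm) form of CKA from Kornblith et al. (2019) and a simple threshold rule for counting changed parameters; the function names, the tolerance, and the choice of linear over kernel CKA are our assumptions, not necessarily the paper's exact pipeline.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim).

    Scores near 1.0 (e.g. the >0.998 reported here) mean the two
    checkpoints embed the same inputs with nearly identical geometry.
    """
    # Center each feature so CKA is invariant to per-dimension mean shifts.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Frobenius-norm form of linear CKA (Kornblith et al., 2019).
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(num / den)

def update_sparsity(w_before: np.ndarray, w_after: np.ndarray, tol: float = 1e-6) -> float:
    """Fraction of parameters that moved by more than `tol` between checkpoints.

    The tolerance is an assumption; any reasonable cutoff separates a
    dense mid-training rewrite (90%+ moved) from sparse RL edits (~5%).
    """
    return float(np.mean(np.abs(w_after - w_before) > tol))

# Usage sketch: X and Y would be hidden states for the same probe prompts,
# extracted at a matching layer of the mid-trained and RL-tuned checkpoints.
```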
Together, these findings offer practical guidance for designing robust mid-training pipelines that position base models for reliable reasoning enhancement.
Editorial Opinion
This research makes a significant methodological contribution by systematically demystifying the interplay between mid-training and reinforcement learning in LLM development. The finding that data composition and weight restructuring during mid-training matter far more than RL tuning challenges conventional wisdom and offers concrete guidance for practitioners. The 3-4x reasoning improvement demonstrates the substantial potential of properly sequenced training pipelines, making this work valuable for anyone seeking to reliably enhance reasoning capabilities in future large language models.