PRISM Study Reveals Mid-Training Strategy Unlocks 3-4x Reasoning Improvements in Large Language Models
Key Takeaways
- Mid-training on ~27B high-quality tokens delivers consistent reasoning gains (+15 to +40 points on math, +5 to +12 on code, +6 to +13 on science) across diverse model architectures and scales
- The mid-training + RL pipeline achieves a 3-4x reasoning improvement over RL alone, with AIME scores rising from near zero to competitive levels
- Data composition during mid-training is the critical factor for downstream RL success—including science data drives +17 to +28 point gains on GPQA-Diamond—while adjustments to the RL data mix yield only marginal gains
Summary
Researchers have published PRISM, a comprehensive empirical study demonstrating that mid-training—continued pre-training on high-quality tokens between initial pre-training and reinforcement learning—significantly enhances reasoning capabilities in large language models. The study, conducted across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H) at scales from 3B to 24B parameters, shows consistent improvements of +15 to +40 points on math benchmarks, +5 to +12 points on coding tasks, and +6 to +13 points on science tasks, while maintaining general performance.
Crucially, the full PRISM pipeline combining mid-training with reinforcement learning achieves a 3-4x improvement on reasoning benchmarks, raising macro-average scores from under 12 to 29-42, whereas applying RL directly to base models yields near-zero AIME scores. The research reveals that data composition during mid-training is the primary driver of performance gains—including science data unlocks +17 to +28 point GPQA-Diamond improvements—while RL configuration changes produce differences of less than 2 points. Mechanistically, mid-training densely restructures over 90% of model weights, placing models in configurations where RL can be effective, while RL makes sparse, targeted refinements to only ~5% of parameters; representation analysis shows that RL preserves mid-training's representational geometry across architectures.
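The dense-versus-sparse contrast reported here can be made concrete with a toy checkpoint-diff analysis. The function, relative-change threshold, and synthetic "checkpoints" below are illustrative assumptions, not the paper's actual methodology:

```python
import numpy as np

def changed_fraction(before, after, rel_tol=1e-3):
    """Fraction of parameters whose relative change exceeds rel_tol.

    `before`/`after` map layer names to weight arrays, standing in for
    two model checkpoints. The threshold is a hypothetical choice.
    """
    changed = total = 0
    for name, w0 in before.items():
        w1 = after[name]
        denom = np.abs(w0) + 1e-8  # guard against division by zero
        changed += np.sum(np.abs(w1 - w0) / denom > rel_tol)
        total += w0.size
    return changed / total

# Toy illustration: a "dense" update (mid-training-like) perturbs every
# weight, while a "sparse" update (RL-like) perturbs only ~5% of them.
rng = np.random.default_rng(0)
base = {"layer": rng.normal(size=10_000)}
dense = {"layer": base["layer"] + rng.normal(scale=0.1, size=10_000)}
mask = rng.random(10_000) < 0.05
sparse = {"layer": base["layer"] + mask * rng.normal(scale=0.1, size=10_000)}
```

Under this metric, `changed_fraction(base, dense)` comes out near 1.0 and `changed_fraction(base, sparse)` near 0.05, mirroring the dense/sparse split the study describes.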
Editorial Opinion
PRISM provides valuable empirical validation for a training paradigm that challenges the efficiency assumptions of direct instruction-tuning approaches. The finding that mid-training's dense weight restructuring creates a prerequisite foundation for RL success suggests that training pipelines have been underutilizing this intermediate phase, and organizations may achieve significantly better reasoning performance by adopting this three-stage approach. However, the computational cost-benefit analysis of extending training pipelines warrants careful consideration before widespread industry adoption.