PRISM: Mid-Training Emerges as Primary Driver of 3-4x Improvement in LLM Reasoning Benchmarks
Key Takeaways
- Mid-training with ~27B high-quality tokens yields consistent gains (+15-40 math, +5-12 code, +6-13 science) and enables PRISM + RL to achieve 3-4x improvements in reasoning tasks
- Data composition during mid-training is critical: science data unlocks +17-28 point GPQA-Diamond gains in subsequent RL, while RL data mix changes produce minimal differences (<2 points)
- Mid-training restructures 90%+ of model weights while RL applies surgical changes to ~5% of parameters, yet RL only succeeds on models pre-positioned by effective mid-training
Summary
A comprehensive empirical study introduces PRISM, a framework for understanding mid-training design choices in large language models. The researchers ran controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H) at scales from 3B to 24B parameters. They found that mid-training on approximately 27B high-quality tokens yields consistent improvements, +15 to +40 points on math, +5 to +12 on code, and +6 to +13 on science benchmarks, while preserving general performance.
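The central design lever here is how that ~27B-token budget is apportioned across domains. The sketch below is purely illustrative; every weight is a placeholder assumption rather than the paper's actual recipe, and it only shows how such a mixture might be specified and budgeted.

```python
# Hypothetical mid-training data mixture (domain -> proportion).
# All weights are illustrative placeholders, NOT the study's recipe;
# the point is that the mixture, especially the science share, is an
# explicit, tunable design choice at the mid-training stage.
TOTAL_TOKENS = 27e9  # ~27B high-quality tokens, per the study

mixture_weights = {
    "math": 0.40,     # placeholder
    "code": 0.30,     # placeholder
    "science": 0.15,  # placeholder; the GPQA-Diamond result suggests this share matters
    "general": 0.15,  # placeholder
}
assert abs(sum(mixture_weights.values()) - 1.0) < 1e-9

token_budget = {d: int(w * TOTAL_TOKENS) for d, w in mixture_weights.items()}
for domain, tokens in token_budget.items():
    print(f"{domain:>8}: {tokens / 1e9:.2f}B tokens")
```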
When combined with reinforcement learning, the PRISM framework achieved a 3-4x macro-average improvement across six reasoning benchmarks, rising from under 12 points to 29-42. Critically, this RL pipeline only succeeds on mid-trained models; applying RL directly to most base models yields near-zero AIME scores. The research also shows that data composition matters far more at the mid-training stage than at the RL stage: including science data during mid-training unlocks +17 to +28 point gains on GPQA-Diamond, while varying the RL data mix shifts results by less than 2 points.
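For clarity, the macro-average weights each benchmark equally regardless of its size. A minimal sketch, with placeholder benchmark names and illustrative scores rather than the study's reported numbers:

```python
# Macro-average = unweighted mean of per-benchmark scores, so a small
# benchmark like AIME counts as much as a large one. All scores below
# are illustrative placeholders, not the study's reported results.
scores = {
    "AIME": 20.0, "GPQA-Diamond": 35.0, "benchmark_3": 60.0,
    "benchmark_4": 30.0, "benchmark_5": 25.0, "benchmark_6": 40.0,
}
macro_avg = sum(scores.values()) / len(scores)
print(f"macro-average: {macro_avg:.1f}")  # 35.0 for these placeholders
```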
Mechanistic analysis explains why mid-training is so effective. Mid-training densely restructures over 90% of model weights in a comprehensive internal reorganization, while RL makes sparse, front-loaded refinements affecting only about 5% of parameters. Representation analysis using CKA (Centered Kernel Alignment) confirms that RL preserves the representational geometry established during mid-training, with CKA scores above 0.998 across architectures.
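Both measurements are straightforward to reproduce. Below is a minimal NumPy sketch, assuming the linear (Frobenius-norm) form of CKA from Kornblith et al. (2019) and a simple threshold rule for counting changed parameters; the function names, the tolerance, and the choice of linear over kernel CKA are our assumptions, not necessarily the paper's exact pipeline.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim).

    Scores near 1.0 (e.g. the >0.998 reported here) mean the two
    checkpoints embed the same inputs with nearly identical geometry.
    """
    # Center each feature so CKA is invariant to per-dimension mean shifts.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Frobenius-norm form of linear CKA (Kornblith et al., 2019).
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(num / den)

def update_sparsity(w_before: np.ndarray, w_after: np.ndarray, tol: float = 1e-6) -> float:
    """Fraction of parameters that moved by more than `tol` between checkpoints.

    The tolerance is an assumption; any reasonable cutoff separates a
    dense mid-training rewrite (90%+ moved) from sparse RL edits (~5%).
    """
    return float(np.mean(np.abs(w_after - w_before) > tol))

# Usage sketch: X and Y would be hidden states for the same probe prompts,
# extracted at a matching layer of the mid-trained and RL-tuned checkpoints.
```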
Together, these findings offer practical guidance for designing robust mid-training pipelines that position base models for reliable reasoning enhancement.
Editorial Opinion
This research makes a significant methodological contribution by systematically demystifying the interplay between mid-training and reinforcement learning in LLM development. The finding that data composition and weight restructuring during mid-training matter far more than RL tuning challenges conventional wisdom and offers concrete guidance for practitioners. The 3-4x reasoning improvement demonstrates the substantial potential of properly sequenced training pipelines, making this work valuable for anyone seeking to reliably enhance reasoning capabilities in future large language models.