BotBeat

Meta · RESEARCH · 2026-05-19

The Hidden Costs of Scale: Why Advanced LLM Training Remains Precarious

Key Takeaways

  • Expert routing in mixture-of-experts architectures can break causality, a training-deployment mismatch with profound consequences for model performance
  • Numerical precision bugs and token dropping have affected industry-leading models including Llama 4, Gemini 2 Pro, and GPT-4, with FP16 rounding errors causing 10x accuracy loss
  • Bias compounds during training while variance averages out, making systematic errors far more damaging than random noise
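
The last takeaway can be illustrated with a toy calculation. The sketch below is not from the article: it assumes a simple additive error model in which every training step contributes a fixed bias plus zero-mean Gaussian noise, and the names `accumulated_error`, `bias`, and `sigma` are hypothetical. Under that assumption the bias term grows linearly with the number of steps, while the noise term grows only with its square root.

```python
import numpy as np

def accumulated_error(n_steps, bias, sigma, seed=0):
    """Toy model (an assumption, not the article's): per-step error = bias + noise.

    Returns the total contribution of the systematic bias and of the
    zero-mean noise after n_steps additive updates.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=n_steps)
    bias_total = bias * n_steps        # grows like n_steps
    noise_total = float(noise.sum())   # typical size ~ sigma * sqrt(n_steps)
    return bias_total, noise_total

# A bias 100x smaller than the per-step noise still dominates over a long run.
bias_total, noise_total = accumulated_error(1_000_000, bias=0.01, sigma=1.0)
print(f"bias contribution:  {bias_total:.1f}")
print(f"noise contribution: {noise_total:.1f}")
```

With one million steps, the noise sum has a standard deviation of about 1,000 while the bias sum is 10,000, even though the per-step bias is a hundredth of the per-step noise.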
Source: Hacker News, https://www.dwarkesh.com/p/notes-on-pretraining-parallelisms

Summary

Dwarkesh Patel has published an in-depth technical analysis of why pretraining runs for large language models fail, drawing on conversations with AI researchers and examining real-world issues encountered during training of major models. The article identifies two primary culprits: breaking causality and introducing bias into training pipelines. Expert routing mechanisms, commonly used in mixture-of-experts models, can create causality violations when token allocation depends on future token preferences—a problem Patel suggests may explain why Llama 4 was underwhelming.
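
One way such a causality violation can arise is sketched below. The example assumes an expert-choice style router, in which each expert keeps its highest-scoring tokens across the whole sequence; the summary above does not name the routing scheme, so this is an illustrative assumption, and `expert_choice_route`, `scores`, and `capacity` are hypothetical names.

```python
import numpy as np

def expert_choice_route(scores, capacity):
    """Assumed expert-choice routing: each expert keeps its top-`capacity` tokens.

    scores: array of shape [num_tokens, num_experts] with router affinities.
    Returns a boolean mask of the same shape; True means the token is
    processed by that expert.
    """
    num_tokens, num_experts = scores.shape
    mask = np.zeros((num_tokens, num_experts), dtype=bool)
    for e in range(num_experts):
        # The ranking runs over the full sequence, including later positions.
        top = np.argsort(-scores[:, e], kind="stable")[:capacity]
        mask[top, e] = True
    return mask

# Four tokens in sequence order, one expert with room for two tokens.
scores = np.array([[0.6], [0.9], [0.1], [0.2]])
before = expert_choice_route(scores, capacity=2)

# Change only the LAST token's affinity.
changed = scores.copy()
changed[3, 0] = 0.95
after = expert_choice_route(changed, capacity=2)

print("token 0 routed before:", bool(before[0, 0]))
print("token 0 routed after: ", bool(after[0, 0]))
```

Token 0 loses its slot purely because of a token that comes after it. During autoregressive generation that later token does not exist yet when token 0 is processed, so the routing seen in training cannot be reproduced at deployment.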

Patel details how token dropping and numerical precision bugs have plagued recent major training efforts across the industry. A particularly striking example involves OpenAI's GPT-4 training, which faced critical slowdowns due to an FP16 floating-point bug in collective operations like all-reduce. FP16 values are spaced logarithmically, so the gap between adjacent representable numbers grows with magnitude; when many small gradients are summed into a large accumulator, each addend can fall below half that gap and be rounded away entirely. The bug took significant effort to identify and fix.
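
The FP16 rounding effect is easy to reproduce on a small scale. The following is a minimal sketch using NumPy's `float16` type. It is not a reconstruction of the GPT-4 bug or of any all-reduce implementation; `sequential_sum` is a hypothetical helper that imitates a running accumulator held in a given precision.

```python
import numpy as np

def sequential_sum(values, dtype):
    """Add values one at a time, rounding the accumulator to `dtype` each step."""
    acc = dtype(0.0)
    for v in values:
        acc = dtype(acc + dtype(v))
    return acc

n = 10_000
small = np.float16(0.01)                 # one "small gradient"
grads = np.full(n, small, dtype=np.float16)
exact = n * float(small)                 # reference total in double precision

fp16_sum = float(sequential_sum(grads, np.float16))
fp32_sum = float(sequential_sum(grads, np.float32))

print(f"reference total:  {exact:.3f}")
print(f"FP16 accumulator: {fp16_sum:.3f}")  # stalls once the addend is below half the spacing
print(f"FP32 accumulator: {fp32_sum:.3f}")

# The same effect in a single step: above 2048, FP16 spacing is 2, so adding 1 is lost.
lost = np.float16(2048) + np.float16(1)
print("2048 + 1 in FP16 =", float(lost))
```

A common mitigation is to keep the accumulator in a wider type such as FP32 while storing the individual values in FP16, which is what the second sum imitates.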

The article explores whether these training failures represent a finite set of solvable problems or an endless parade of new issues emerging at each scale level. Patel's sources suggest the latter: as models grow larger, researchers continue encountering novel bespoke failures that require custom solutions. The implications are significant for AI development timelines, suggesting that scaling will remain a precarious, failure-prone process rather than a smooth path.

  • Training challenges appear to be a never-ending frontier: each new scale threshold reveals novel failure modes rather than a fixed set of solvable problems
  • The precarious nature of pretraining suggests AI labs face persistent engineering challenges that resist commoditization

Editorial Opinion

This technical deep-dive highlights an uncomfortable truth: scaling large language models remains a fragile operation despite years of industry progress. The causality-breaking issues in expert routing and the catastrophic precision bugs that plagued major training runs suggest that scaling intelligence is not an engineering problem to be solved once and replicated, but an ongoing frontier of novel failure modes. If new bespoke issues really do emerge at each scale level, that casts doubt on the assumption that AI labs will achieve smooth, predictable scaling curves in the near future.

Large Language Models (LLMs) · Machine Learning · Deep Learning · MLOps & Infrastructure · Science & Research
