BotBeat

Meta · RESEARCH · 2026-04-22

Meta Achieves >90% Effective Training Time for Recommendation Workloads Through Infrastructure Optimization

Key Takeaways

  • Meta developed Effective Training Time (ETT%) as a metric to quantify training efficiency, achieving >90% for offline workloads by the end of 2025
  • Infrastructure overhead reduction focused on three areas: Time to Start, Time to Recover, and failure prevention, with over 40 technologies implemented
  • PyTorch 2 compilation optimizations and TorchRec improvements have been contributed to open source, enabling broader industry adoption of the efficiency gains
Source: Hacker News (https://pytorch.org/blog/optimizing-effective-training-time-for-metas-internal-recommendation-ranking-workloads/)

Summary

Meta has announced a significant efficiency milestone in AI model training, achieving over 90% Effective Training Time (ETT%) for offline training workloads by the end of 2025. The company developed a comprehensive framework to measure and optimize training efficiency, defining ETT% as the percentage of total end-to-end wall time dedicated to productive training while accounting for overheads like initialization, orchestration, checkpointing, failures, and recovery.
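As a rough illustration of the metric (the exact accounting is Meta-internal), ETT% can be computed from a job timeline as productive training time divided by total end-to-end wall time. The overhead categories and numbers below are hypothetical examples, not Meta's figures:

```python
# Hypothetical sketch of the ETT% calculation.
# The overhead breakdown here is illustrative, not Meta's exact accounting.

def ett_percent(total_wall_s: float, overheads_s: dict) -> float:
    """ETT% = productive training time / total end-to-end wall time."""
    overhead = sum(overheads_s.values())
    productive = total_wall_s - overhead
    return 100.0 * productive / total_wall_s

# Example: a 100-hour job with common overhead sources (hours).
overheads = {
    "time_to_start": 2.0,    # hardware setup + PyTorch 2 compilation
    "checkpointing": 1.5,    # periodic state saves
    "time_to_recover": 3.0,  # restarts after failures
    "orchestration": 1.0,    # scheduling / initialization
}
print(round(ett_percent(100.0, overheads), 1))  # 92.5
```

Under this toy breakdown, 7.5 of the 100 hours are overhead, so ETT% is 92.5, just above Meta's reported 90% target.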

The achievement resulted from optimizing across 40+ technologies focused on three core areas: Time to Start (reducing hardware setup and PyTorch 2 compilation delays), Time to Recover (improving job restart efficiency after failures), and Number of Failures (enhancing infrastructure reliability). Meta's approach directly addresses industry-wide challenges of scaling AI workloads under tight compute budgets and aggressive ROI targets.
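To see why Time to Recover interacts with checkpointing, consider a minimal resume-from-checkpoint model (a generic pattern, not Meta's implementation): a restarted job only repays the steps since its last checkpoint, so wasted work scales with the checkpoint interval rather than with total progress.

```python
# Generic sketch: work lost when a job fails and resumes from its
# most recent checkpoint. Names and numbers are illustrative.

def steps_lost_on_failure(failure_step: int, checkpoint_every: int) -> int:
    """Steps that must be re-run after a failure at `failure_step`,
    given checkpoints taken every `checkpoint_every` steps."""
    last_checkpoint = (failure_step // checkpoint_every) * checkpoint_every
    return failure_step - last_checkpoint

print(steps_lost_on_failure(1037, 100))  # 37 steps redone, not 1037
```

This is why reducing both the number of failures and the cost of each recovery shows up directly in ETT%: either fewer restarts happen, or each restart repays less work.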

Meta has shared several improvements publicly through open-source contributions, including TorchRec sharding plan enhancements and PyTorch 2 compilation optimizations that reduce compile time and recompilation overhead. While some optimizations remain Meta-specific (such as checkpointing and model publishing improvements), the methodology and shared tools provide a blueprint for other organizations seeking to improve training infrastructure efficiency.

The framework addresses common industry bottlenecks, including initialization delays, model checkpointing, and job recovery processes.

Editorial Opinion

Meta's achievement of >90% Effective Training Time represents a meaningful milestone in AI infrastructure optimization, though the remaining 10% suggests there are still significant efficiency gains available. By open-sourcing key improvements like PyTorch 2 optimizations, Meta demonstrates practical commitment to raising industry standards beyond its own operations. However, the complexity of their approach—requiring 40+ technologies across multiple optimization domains—highlights how challenging it remains for smaller organizations to match hyperscaler infrastructure efficiency.

Machine Learning · Deep Learning · MLOps & Infrastructure · Science & Research · Open Source

© 2026 BotBeat