NVIDIA Megatron Core Achieves Breakthrough Scaling for Mixture-of-Experts Model Training
Key Takeaways
- Megatron Core achieves up to 1,233 TFLOPS/GPU on NVIDIA GB300 for large-scale MoE models, demonstrating significant performance gains for sparse model training
- The framework introduces integrated optimizations across memory (recomputation, offloading), communication (optimized dispatchers, overlapping), and computation (Grouped GEMM, CUDA Graphs) to handle MoE-specific systems challenges
- Support for low-precision training (FP8, NVFP4) and efficient long-context capabilities make the solution viable for training trillion-parameter models on clusters scaling to thousands of GPUs
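The Grouped GEMM optimization named above can be illustrated with a small sketch. This is not Megatron Core's implementation; the names, shapes, and top-1 routing here are hypothetical, and the "grouped" kernel is emulated on the CPU. The idea it shows is real: instead of launching one small matmul per expert, tokens are sorted so each expert's group is contiguous, and the per-group matmuls can then be fused into a single kernel launch.

```python
# Hypothetical sketch of the Grouped GEMM idea for MoE expert layers.
# All names and sizes are illustrative, not from Megatron Core.
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_model, d_ff, n_tokens = 4, 8, 16, 32
tokens = rng.standard_normal((n_tokens, d_model))
expert_ids = rng.integers(0, num_experts, size=n_tokens)  # router output (top-1)
weights = rng.standard_normal((num_experts, d_model, d_ff))

# Naive approach: one small GEMM per expert, which underutilizes the GPU
# when per-expert token groups are small.
out_naive = np.empty((n_tokens, d_ff))
for e in range(num_experts):
    mask = expert_ids == e
    out_naive[mask] = tokens[mask] @ weights[e]

# Grouped GEMM idea: sort tokens by expert so each group is contiguous;
# a real grouped kernel then runs all per-group GEMMs in one launch
# (emulated here with a Python loop over contiguous slices).
order = np.argsort(expert_ids, kind="stable")
sorted_tokens = tokens[order]
counts = np.bincount(expert_ids, minlength=num_experts)
offsets = np.concatenate(([0], np.cumsum(counts)))
out_sorted = np.concatenate([
    sorted_tokens[offsets[e]:offsets[e + 1]] @ weights[e]
    for e in range(num_experts)
])
out_grouped = np.empty_like(out_sorted)
out_grouped[order] = out_sorted  # scatter back to original token order

assert np.allclose(out_naive, out_grouped)
```

Both paths produce identical results; the payoff of the grouped layout is purely in kernel-launch overhead and GPU occupancy, which the CPU emulation cannot show.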
Summary
NVIDIA has published a comprehensive technical report on scaling Mixture-of-Experts (MoE) models using Megatron Core, addressing fundamental systems challenges in training sparse large language models. The research demonstrates integrated optimizations across memory, communication, and computation layers, enabling efficient training of models ranging from billions to trillions of parameters on GPU clusters with thousands of processors.
The framework achieves exceptional performance on NVIDIA's latest GB300 and GB200 GPUs, reaching 1,233 and 1,048 TFLOPS/GPU respectively for DeepSeek-V3-685B, and 974 and 919 TFLOPS/GPU for Qwen3-235B. Key innovations include Parallel Folding for flexible multi-dimensional parallelism, support for low-precision training formats (FP8 and NVFP4), and efficient long-context training capabilities.
Megatron Core's production-ready open-source implementation has already been adopted across academia and industry for training major MoE models. The report provides practical guidance on navigating the coupled constraints among memory, communication, and computation, a challenge unique to sparse expert systems, where sparsity lets parameter counts grow far faster than per-token computation. Parallel Folding addresses this coupling directly, enabling flexible multi-dimensional parallelism across parameters, computation, memory, and communication.
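The claim that sparsity lets parameters outpace per-token computation reduces to simple arithmetic. The configuration below (256 routed experts, top-8 routing) is an illustrative assumption resembling recent large MoE models, not a figure taken from the report:

```python
# Back-of-the-envelope sketch of MoE sparsity. The expert counts are
# illustrative assumptions, not measurements from the NVIDIA report.
def active_fraction(num_experts: int, top_k: int) -> float:
    """Fraction of routed-expert parameters a single token activates."""
    return top_k / num_experts

# With 256 experts and top-8 routing, each token touches only ~3% of the
# expert weights, so total parameters can grow ~32x while per-token FLOPs
# stay roughly flat.
frac = active_fraction(num_experts=256, top_k=8)
print(f"{frac:.1%} of expert parameters active per token")  # 3.1%
```

This gap between stored and activated parameters is exactly why MoE training stresses memory and communication far more than raw compute, which is the coupling the report's optimizations target.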
Editorial Opinion
This research represents a critical advancement in scaling sparse neural networks, addressing the fundamental systems challenges that have limited MoE adoption despite their computational efficiency advantages. By co-optimizing across the entire hardware and software stack, NVIDIA's work provides a template for efficiently training the next generation of trillion-parameter models. The open-source nature of Megatron Core and its proven adoption across industry and academia validate its importance as infrastructure for the AI community.


