Baidu Open-Sources LoongForge, High-Performance Training Framework with Up to 5× Speedup
Key Takeaways
- Up to 5× training speedup over mainstream open-source baselines, with production deployments achieving 30–50% improvements
- Unified framework supporting the complete training pipeline (pre-training, continued pre-training, SFT) for LLMs, VLMs, VLAs, and diffusion models
- Native dual-hardware support for NVIDIA GPUs and Kunlun XPUs with advanced parallelism strategies and load balancing
Summary
Baidu Baige has open-sourced LoongForge, a modular and scalable training framework for large language models (LLMs), vision-language models (VLMs), vision-language-action models (VLAs), and diffusion models. Built on Megatron-LM with systematic enhancements, LoongForge delivers up to 5× training speedup over mainstream open-source baselines and natively supports both NVIDIA GPUs and Baidu's proprietary Kunlun XPUs.
The framework introduces several advanced optimization techniques including adaptive FP8 training for mixed-precision efficiency, decoupled encoder-decoder training to eliminate pipeline bottlenecks, MoE-native optimizations for large sparse models, and flexible checkpointing with seamless Megatron-HuggingFace format conversion. LoongForge's heterogeneous parallelism design allows independent tensor/data parallelism and recomputation strategies per model component, enabling optimal throughput and memory efficiency for complex multimodal architectures.
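To make the idea of per-component parallelism concrete, here is a minimal Python sketch. It is not LoongForge's configuration schema or API: the class `ComponentParallelism`, the function `plan`, and the component names are hypothetical, and the sketch assumes no pipeline parallelism, so each component's data-parallel degree is simply the device count divided by its tensor-parallel degree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentParallelism:
    """Hypothetical per-component settings (illustration only, not LoongForge's schema)."""
    name: str
    tensor_parallel: int   # devices that jointly hold one replica of this component
    recompute: bool        # whether activation recomputation is enabled for it

    def data_parallel(self, world_size: int) -> int:
        # Assumption: no pipeline parallelism, so replicas = devices / TP degree.
        if world_size % self.tensor_parallel != 0:
            raise ValueError(
                f"{self.name}: world size {world_size} is not divisible by "
                f"tensor-parallel degree {self.tensor_parallel}"
            )
        return world_size // self.tensor_parallel


def plan(world_size: int, components: list[ComponentParallelism]) -> dict[str, dict]:
    """Derive an independent TP/DP/recompute layout for each model component."""
    return {
        c.name: {
            "tp": c.tensor_parallel,
            "dp": c.data_parallel(world_size),
            "recompute": c.recompute,
        }
        for c in components
    }


# A small vision encoder stays unsharded, while a large decoder is sharded
# eight ways and recomputes activations to save memory.
layout = plan(
    16,
    [
        ComponentParallelism("vision_encoder", tensor_parallel=1, recompute=False),
        ComponentParallelism("llm_decoder", tensor_parallel=8, recompute=True),
    ],
)
for name, cfg in layout.items():
    print(name, cfg)
```

The point of the sketch is only that each component carries its own settings rather than inheriting one global layout; how LoongForge actually expresses and schedules this is described in its own documentation.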
LoongForge builds on years of production refinement as Baidu's internal AIAK-Training-LLM stack, which has powered enterprise customers in education, computer vision, and embodied AI with typical 30–50% speedups and production deployments scaling to 5,000+ Kunlun XPUs. The v0.1.0 open-source release already supports recent high-impact model releases, including LLaVA-OneVision-2.0 and expanded VLA support for GR00T N1.6 with 60%+ training speedups.
LoongForge is already integrated into production models and is publicly available on GitHub with comprehensive documentation and tutorials.
Editorial Opinion
LoongForge's open-source release democratizes access to production-grade training infrastructure that Baidu has refined in production deployments scaling to thousands of Kunlun XPUs. The framework's heterogeneous parallelism design and native support for both NVIDIA GPUs and Kunlun XPUs make it a significant contribution to the open-source training ecosystem, and particularly valuable for teams scaling multimodal models. With support for recent state-of-the-art releases and reported efficiency gains over mainstream open-source baselines, LoongForge could become essential infrastructure for next-generation model development.



