SonicMoE: New Hardware-Efficient Framework Enables Fine-Grained Mixture-of-Experts Models on NVIDIA GPUs
Key Takeaways
- SonicMoE keeps per-layer activation memory constant regardless of expert granularity, eliminating a critical bottleneck in fine-grained MoE training without resorting to FLOP-intensive GEMM recomputation
- The framework delivers a 1.87-4.04x speedup over existing MoE kernels and supports both NVIDIA Hopper and Blackwell architectures through a unified software abstraction
- The solution enables training of increasingly fine-grained and sparse MoE models, in line with emerging scaling laws that show better quality per FLOP from smaller, more numerous experts
Summary
Researchers have unveiled SonicMoE, a hardware-efficient and software-extensible blueprint for training fine-grained Mixture-of-Experts (MoE) models that addresses critical scaling challenges in modern language models. The framework introduces an IO-aware algorithm that keeps activation memory independent of expert granularity, a key bottleneck that has plagued existing MoE training kernels such as ScatterMoE and MoMoE. SonicMoE achieves a 1.87-4.04x relative speedup over existing solutions and now runs at peak throughput on NVIDIA's latest Blackwell GPUs (B200/B300) in addition to Hopper (H100).
The breakthrough is timely, as frontier open-source models increasingly adopt more fine-grained and sparser MoE architectures to improve quality per FLOP. Recent models like DeepSeek V3.2, Kimi K2.5, and Qwen3-Next-80B-A3B-Instruct demonstrate this trend, pushing both granularity and sparsity to new extremes. SonicMoE addresses two fundamental hardware constraints: activation memory that scales with expert granularity, and IO costs that grow as experts become smaller. The solution leverages a unified software abstraction built on QuACK that makes porting across GPU architectures straightforward, and exploits Blackwell hardware features to hide IO costs behind computation.
SonicMoE's IO-aware algorithm and architecture-specific optimizations demonstrate how software innovation can unlock the potential of fine-grained expert models on contemporary hardware.
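To see why activation memory scales with granularity in the first place, consider a toy sketch of a naive MoE forward pass. This is purely illustrative (all names and sizes below are hypothetical and not taken from SonicMoE's actual kernels): a straightforward implementation saves the intermediate hidden state of every (token, expert) pair for the backward pass, so the saved-activation count grows linearly with top-k, which is exactly the footprint an IO-aware, granularity-independent algorithm avoids.

```python
import numpy as np

def naive_moe_forward(x, experts_w1, experts_w2, topk_idx):
    """Toy MoE forward that saves every (token, expert) intermediate
    activation for the backward pass -- the naive bookkeeping whose
    memory footprint grows linearly with top-k (expert granularity)."""
    out = np.zeros_like(x)
    saved = []  # intermediate hidden states kept for backward
    for t in range(x.shape[0]):
        for e in topk_idx[t]:
            h = np.maximum(x[t] @ experts_w1[e], 0.0)  # expert FFN hidden state
            saved.append(h)                            # one buffer per (token, expert)
            out[t] += h @ experts_w2[e]
    return out, saved

# Illustrative sizes (hypothetical; chosen only for the demo)
rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_experts = 8, 16, 32, 4
x = rng.standard_normal((n_tokens, d_model))
w1 = rng.standard_normal((n_experts, d_model, d_ff))
w2 = rng.standard_normal((n_experts, d_ff, d_model))

buffers_per_topk = {}
for top_k in (1, 2, 4):
    idx = np.stack([rng.choice(n_experts, top_k, replace=False)
                    for _ in range(n_tokens)])
    _, saved = naive_moe_forward(x, w1, w2, idx)
    buffers_per_topk[top_k] = len(saved)  # grows as n_tokens * top_k
```

Doubling top-k doubles the number of saved hidden-state buffers in this naive scheme; SonicMoE's contribution, per the summary above, is keeping that footprint constant without recomputing the GEMMs.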
Editorial Opinion
SonicMoE represents an important step forward in making frontier MoE architectures practically trainable at scale. As the field has discovered that finer granularity and higher sparsity improve model quality per FLOP, hardware efficiency has become the limiting factor—not algorithmic innovation. By decoupling activation memory from expert granularity and delivering meaningful speedups, this work removes a genuine obstacle to training the next generation of large-scale sparse models. The fact that it's already optimized for Blackwell shows commendable hardware-software co-design thinking.



