SonicMoE: New Hardware-Efficient Framework Enables Fine-Grained Mixture-of-Experts Models on NVIDIA GPUs
Key Takeaways
- SonicMoE keeps per-layer activation memory constant regardless of expert granularity, eliminating a critical bottleneck in fine-grained MoE training without resorting to FLOP-intensive GEMM recomputation
- The framework delivers a 1.87-4.04x speedup over existing MoE kernels and supports both NVIDIA Hopper and Blackwell architectures through a unified software abstraction
- The solution enables training of increasingly fine-grained and sparse MoE models, in line with emerging scaling laws that show better quality per FLOP from smaller, more numerous experts
Summary
Researchers have unveiled SonicMoE, a hardware-efficient and software-extensible blueprint for training fine-grained Mixture-of-Experts (MoE) models that addresses critical scaling challenges in modern language models. The framework introduces an IO-aware algorithm that keeps activation memory independent of expert granularity, a key bottleneck that has plagued existing MoE training kernels such as ScatterMoE and MoMoE. SonicMoE achieves a 1.87-4.04x relative speedup over existing solutions and now runs at peak throughput on NVIDIA's latest Blackwell GPUs (B200/B300) in addition to Hopper (H100).
The breakthrough is timely, as frontier open-source models increasingly adopt more fine-grained and sparser MoE architectures to improve quality per FLOP. Recent models like DeepSeek V3.2, Kimi K2.5, and Qwen3-Next-80B-A3B-Instruct demonstrate this trend, pushing both granularity and sparsity to new extremes. SonicMoE addresses two fundamental hardware constraints: activation memory that scales with expert granularity, and IO costs that grow as experts become smaller. The solution leverages a unified software abstraction built on QuACK that makes porting across GPU architectures straightforward, and exploits Blackwell hardware features to hide IO costs behind computation.
SonicMoE's IO-aware algorithm and architecture-specific optimizations demonstrate how software innovation can unlock the potential of fine-grained expert models on contemporary hardware.
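To see why activation memory scales with granularity in the first place, consider a toy sketch of a naive MoE forward pass. This is purely illustrative (all names and sizes below are hypothetical and not taken from SonicMoE's actual kernels): a straightforward implementation saves the intermediate hidden state of every (token, expert) pair for the backward pass, so the saved-activation count grows linearly with top-k, which is exactly the footprint an IO-aware, granularity-independent algorithm avoids.

```python
import numpy as np

def naive_moe_forward(x, experts_w1, experts_w2, topk_idx):
    """Toy MoE forward that saves every (token, expert) intermediate
    activation for the backward pass -- the naive bookkeeping whose
    memory footprint grows linearly with top-k (expert granularity)."""
    out = np.zeros_like(x)
    saved = []  # intermediate hidden states kept for backward
    for t in range(x.shape[0]):
        for e in topk_idx[t]:
            h = np.maximum(x[t] @ experts_w1[e], 0.0)  # expert FFN hidden state
            saved.append(h)                            # one buffer per (token, expert)
            out[t] += h @ experts_w2[e]
    return out, saved

# Illustrative sizes (hypothetical; chosen only for the demo)
rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_experts = 8, 16, 32, 4
x = rng.standard_normal((n_tokens, d_model))
w1 = rng.standard_normal((n_experts, d_model, d_ff))
w2 = rng.standard_normal((n_experts, d_ff, d_model))

buffers_per_topk = {}
for top_k in (1, 2, 4):
    idx = np.stack([rng.choice(n_experts, top_k, replace=False)
                    for _ in range(n_tokens)])
    _, saved = naive_moe_forward(x, w1, w2, idx)
    buffers_per_topk[top_k] = len(saved)  # grows as n_tokens * top_k
```

Doubling top-k doubles the number of saved hidden-state buffers in this naive scheme; SonicMoE's contribution, per the summary above, is keeping that footprint constant without recomputing the GEMMs.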
Editorial Opinion
SonicMoE represents an important step forward in making frontier MoE architectures practically trainable at scale. As the field has discovered that finer granularity and higher sparsity improve model quality per FLOP, hardware efficiency has become the limiting factor—not algorithmic innovation. By decoupling activation memory from expert granularity and delivering meaningful speedups, this work removes a genuine obstacle to training the next generation of large-scale sparse models. The fact that it's already optimized for Blackwell shows commendable hardware-software co-design thinking.



