BotBeat
NVIDIA · RESEARCH · 2026-04-22

SonicMoE: New Hardware-Efficient Framework Enables Fine-Grained Mixture-of-Experts Models on NVIDIA GPUs

Key Takeaways

  • SonicMoE achieves a constant per-layer activation-memory footprint regardless of expert granularity, eliminating a critical bottleneck in fine-grained MoE training without requiring extra FLOP-intensive GEMM recomputation
  • The framework delivers a 1.87-4.04x speedup over existing MoE kernels and supports both NVIDIA Hopper and Blackwell architectures through a unified software abstraction
  • The framework enables training of increasingly fine-grained and sparse MoE models, in line with emerging scaling laws that show better quality per FLOP from smaller, more numerous experts
Source: Hacker News (https://dao-lab.ai/blog/2026/sonicmoe-blackwell/)

Summary

Researchers have unveiled SonicMoE, a hardware-efficient and software-extensible blueprint for training fine-grained Mixture-of-Experts (MoE) models that addresses critical scaling challenges in modern language models. The framework introduces an IO-aware algorithm that keeps activation memory independent of expert granularity, removing a key bottleneck that has plagued existing MoE training kernels such as ScatterMoE and MoMoE. SonicMoE achieves a 1.87-4.04x speedup over existing solutions and runs at peak throughput on NVIDIA's latest Blackwell GPUs (B200/B300) in addition to Hopper (H100).
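The granularity bottleneck can be illustrated with a rough back-of-the-envelope model. In a conventional MoE kernel, activations saved for the backward pass are stored per (token, expert) pair, so even when total active parameters per token are held fixed, splitting experts into smaller ones grows the gathered-input buffer linearly with top-k. The function and shapes below are illustrative assumptions for that accounting, not SonicMoE's actual kernel parameters:

```python
def moe_activation_bytes(tokens, d_model, d_expert, top_k, dtype_bytes=2):
    """Rough activation-memory model for a conventional MoE kernel.

    Assumes the kernel materializes, for the backward pass:
      - the gathered inputs for every (token, expert) pair, and
      - each expert's intermediate (FFN hidden) activations.
    Illustrative assumptions only, not SonicMoE's actual layout.
    """
    gathered = tokens * top_k * d_model        # grows linearly with top_k
    intermediate = tokens * top_k * d_expert   # fixed if active params are fixed
    return (gathered + intermediate) * dtype_bytes

# Same total active expert parameters per token (top_k * d_expert constant):
coarse = moe_activation_bytes(tokens=4096, d_model=2048, d_expert=4096, top_k=2)
fine = moe_activation_bytes(tokens=4096, d_model=2048, d_expert=512, top_k=16)
print(fine / coarse)  # the fine-grained config needs ~3.3x the activation memory
```

Under this model the compute per token is identical in both configurations, yet the fine-grained one stores roughly 3.3x more activations; a kernel whose footprint is independent of granularity, which is the property the summary attributes to SonicMoE, would keep the two configurations equal.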

The work is timely: frontier open-source models are increasingly adopting finer-grained, sparser MoE architectures to improve quality per FLOP, with recent models such as DeepSeek V3.2, Kimi K2.5, and Qwen3-Next-80B-A3B-Instruct pushing both granularity and sparsity to new extremes. SonicMoE addresses the two hardware constraints this trend runs into: activation memory that scales with expert granularity, and IO costs that grow as experts become smaller. The framework leverages a unified software abstraction built on QuACK, which makes porting across GPU architectures straightforward, and exploits Blackwell hardware features to hide IO costs behind computation.
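The second constraint, IO cost growing as experts shrink, follows from standard roofline reasoning: a GEMM's arithmetic intensity (FLOPs per byte moved) drops as its dimensions shrink, so per-expert matrix multiplies become memory-bound once experts are small enough. The sketch below is generic roofline arithmetic under assumed shapes, not a description of SonicMoE's kernels:

```python
def gemm_arithmetic_intensity(m, n, k, dtype_bytes=2):
    """FLOPs per byte for an (m x k) @ (k x n) GEMM, counting one read of
    each operand and one write of the output (an idealized, cache-perfect
    model)."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * dtype_bytes
    return flops / bytes_moved

# A coarse expert sees many tokens per step and has a wide FFN dimension...
coarse = gemm_arithmetic_intensity(m=4096, n=4096, k=2048)
# ...while a fine-grained expert sees fewer tokens and is narrower.
fine = gemm_arithmetic_intensity(m=512, n=512, k=2048)
print(coarse, fine)  # intensity drops several-fold for the small expert
```

Once intensity falls below a GPU's FLOPs-to-bandwidth ratio, the GEMM is IO-bound; hiding that IO behind computation, as the summary says SonicMoE does on Blackwell, recovers throughput without changing the math.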

SonicMoE's IO-aware algorithm and architecture-specific optimizations demonstrate how software innovation can unlock the potential of fine-grained expert models on contemporary hardware.

Editorial Opinion

SonicMoE represents an important step forward in making frontier MoE architectures practically trainable at scale. As the field has discovered that finer granularity and higher sparsity improve model quality per FLOP, hardware efficiency has become the limiting factor—not algorithmic innovation. By decoupling activation memory from expert granularity and delivering meaningful speedups, this work removes a genuine obstacle to training the next generation of large-scale sparse models. The fact that it's already optimized for Blackwell shows commendable hardware-software co-design thinking.

Large Language Models (LLMs) · Machine Learning · Deep Learning · MLOps & Infrastructure · AI Hardware

