BotBeat

Doubleword | RESEARCH | 2026-06-08

Doubleword Achieves 15% Expert Load Reduction in MoE Inference Through Input Reordering

Key Takeaways

  • Input reordering based on embedding similarity reduces MoE expert loads by ~15% with zero model or kernel modifications
  • A custom embedding model trained on expert activation patterns captures 73.6% of the theoretical maximum savings (vs. 58.3% for standard embeddings)
  • This translates to a 5.4% wall-clock throughput improvement on Qwen3.5-35B-A3B, even though MoE operations are only 43% of total compute
Source: Hacker News, https://blog.doubleword.ai/moe-expert-coactivations

Summary

Doubleword has demonstrated that reordering prompts before batch inference can significantly reduce memory bandwidth requirements in Mixture-of-Experts (MoE) models. Because MoE architectures load different expert weights for different inputs, clustering similar prompts into the same batch reduces the number of unique experts that must be loaded from memory. Using embedding-based similarity clustering, the researchers achieved an approximately 15% reduction in expert weight loads with no changes to model weights or inference kernels.
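The idea can be illustrated with a minimal sketch. This is not Doubleword's implementation: the helper names (`expert_loads`, `greedy_order`), the nearest-neighbour ordering, and the toy embeddings and expert sets are all assumptions chosen for illustration, and a real model routes tokens per layer rather than assigning one expert set per prompt.

```python
import math
from typing import List, Sequence, Set


def expert_loads(expert_sets: Sequence[Set[int]], order: Sequence[int], batch_size: int) -> int:
    """Sum, over consecutive batches, of the distinct experts each batch touches."""
    total = 0
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        touched = set().union(*(expert_sets[i] for i in batch))
        total += len(touched)
    return total


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_order(embeddings: Sequence[Sequence[float]]) -> List[int]:
    """Hypothetical ordering: chain each prompt to its most similar unvisited neighbour."""
    if not embeddings:
        return []
    order = [0]
    remaining = set(range(1, len(embeddings)))
    while remaining:
        last = embeddings[order[-1]]
        # Ties are broken in favour of the lower index to keep the result deterministic.
        nxt = max(remaining, key=lambda i: (cosine(last, embeddings[i]), -i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Toy data (invented): prompts 0 and 2 are similar, as are prompts 1 and 3.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
expert_sets = [{0, 1}, {2, 3}, {0, 1}, {2, 3}]

arrival = list(range(len(embeddings)))
reordered = greedy_order(embeddings)

baseline = expert_loads(expert_sets, arrival, batch_size=2)
clustered = expert_loads(expert_sets, reordered, batch_size=2)
reduction = 1 - clustered / baseline
print(f"loads: {baseline} -> {clustered} ({reduction:.0%} fewer)")
```

In this toy case every arrival-order batch mixes both prompt groups and so touches all four experts, while the reordered batches each touch only two. The exaggerated saving is a property of the invented data, not a claim about real workloads.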

The technique uses both a pre-trained embedding model (BAAI/bge-small-en-v1.5) and a custom embedding model trained specifically on expert activation patterns. The custom model achieved a 15.6% reduction in expert loads, capturing 73.6% of the theoretical maximum determined through oracle clustering analysis. Testing on Qwen3.5-35B-A3B showed that these reductions translate to a 5.4% wall-clock improvement, a meaningful gain considering that MoE operations make up only 43% of the forward pass.
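The relationship between these figures can be checked with some back-of-the-envelope arithmetic. The sketch below uses only the numbers quoted in this article; the derived values are inferences rather than results reported by Doubleword, and the linear-scaling bound assumes that the 43% figure is a share of forward-pass time and that MoE time is proportional to expert loads.

```python
# Figures quoted in this article.
custom_reduction = 0.156   # expert-load reduction with the custom embedding model
custom_capture = 0.736     # share of the oracle ceiling captured by the custom model
standard_capture = 0.583   # share captured by standard embeddings
moe_share = 0.43           # MoE share of the forward pass
observed_gain = 0.054      # reported wall-clock improvement

# Oracle ceiling implied by "15.6% is 73.6% of the theoretical maximum".
oracle_reduction = custom_reduction / custom_capture

# Load reduction standard embeddings would reach under that same ceiling.
standard_reduction = standard_capture * oracle_reduction

# Assumption: MoE time scales linearly with expert loads and nothing else changes.
ideal_gain = moe_share * custom_reduction

print(f"implied oracle reduction:      {oracle_reduction:.1%}")
print(f"implied standard-embedding:    {standard_reduction:.1%}")
print(f"linear-scaling bound on gain:  {ideal_gain:.1%}")
print(f"observed / bound:              {observed_gain / ideal_gain:.1%}")
```

Under these assumptions the oracle ceiling comes out at roughly 21% and the bound on the wall-clock gain at roughly 6.7%, so the reported 5.4% would be most of what a purely load-proportional model allows.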

The approach generalizes well across different prompt types and datasets. Performance improves further with larger datasets: expert load reductions increased from 12.3% to 17.1% when scaling from 1,000 to 5,000 examples per batch. Even on out-of-domain datasets such as WildChat, the custom embedding model maintained a 12.3% load reduction, demonstrating robust generalization.

  • Performance improves with larger datasets and generalizes across diverse prompt types and out-of-domain data

Editorial Opinion

This work addresses a fundamental inference bottleneck with an elegant, immediately deployable solution. Achieving meaningful throughput gains through intelligent batch reordering and standard embedding techniques, without touching model weights or writing custom kernels, makes it particularly valuable for production MoE serving systems. While a 5.4% wall-clock improvement may seem modest, it is a genuine gain against a critical inference constraint, and the technique will likely become standard practice for batch MoE inference.

Large Language Models (LLMs) | Machine Learning | MLOps & Infrastructure
