Doubleword Achieves 15% Expert Load Reduction Through Request Reordering in MoE Inference
Key Takeaways
- Request reordering based on embedding similarity reduces MoE expert loads by about 15% compared with random batching
- A trained embedding model predicts expert activation patterns more effectively than generic embeddings or category-based clustering
- The optimization achieves roughly 5.4% wall-clock time savings on Qwen3.5-35B-A3B with no changes to model weights or kernels
Summary
Doubleword has developed an inference optimization for Mixture-of-Experts (MoE) language models that reduces expert loads by about 15% by reordering requests before batching. Prompts that are close in embedding space are placed in the same batch so that the experts they activate overlap, which lowers the number of unique experts that must be loaded from high-bandwidth memory, a critical bottleneck in MoE inference.
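The effect of batch composition on expert loads can be illustrated with a toy count. The sketch below is not Doubleword's measurement code: it assumes a single MoE layer, 16 hypothetical experts, and hand-built requests that each activate two experts, and it charges one load per distinct expert touched by a batch. An interleaved order stands in for an unsorted request stream.

```python
def expert_loads(batches):
    """Total loads, assuming each batch loads every distinct expert it touches once."""
    return sum(len(set().union(*batch)) for batch in batches)

def make_batches(requests, batch_size):
    """Split an ordered request list into consecutive fixed-size batches."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]

# Hypothetical toy data: each request is the set of expert ids it activates.
# "Topic A" requests use experts 0-7, "topic B" requests use experts 8-15.
topic_a = [frozenset({i % 8, (i + 1) % 8}) for i in range(16)]
topic_b = [frozenset({8 + i % 8, 8 + (i + 1) % 8}) for i in range(16)]

grouped = topic_a + topic_b  # similar requests are adjacent
interleaved = [r for pair in zip(topic_a, topic_b) for r in pair]  # topics mixed

loads_grouped = expert_loads(make_batches(grouped, 4))
loads_interleaved = expert_loads(make_batches(interleaved, 4))
print(loads_grouped, loads_interleaved)  # 40 48
```

In this toy setup, grouping similar requests cuts the count from 48 to 40 loads. The numbers are a property of the made-up data only and say nothing about the reported results.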
The approach uses a fine-tuned embedding model (a trained variant of BAAI/bge-small-en-v1.5) to predict which experts a sequence is likely to activate, and groups prompts with similar predicted activation patterns into the same batch. Tested on Qwen3.5-35B-A3B (40 MoE layers) with 1,000 prompts, the technique achieved a 15.6% reduction in expert loads compared with random ordering, capturing 73.6% of the theoretical maximum. This translates to approximately 5.4% wall-clock time savings, with expert load reductions reaching about 17% on larger datasets.
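The summary does not say which clustering algorithm is used, so the following is only a minimal sketch of the general idea: order requests so that neighbours in the queue have similar embeddings, then cut the queue into batches. The greedy nearest-neighbour chain, the function names, and the two-dimensional vectors are all illustrative assumptions. In practice the vectors would come from the fine-tuned embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def reorder_by_similarity(embeddings):
    """Greedy chain: start at request 0, then keep appending the unvisited
    request most similar to the last one (ties broken by lower index)."""
    if not embeddings:
        return []
    order = [0]
    remaining = set(range(1, len(embeddings)))
    while remaining:
        last = embeddings[order[-1]]
        nxt = max(remaining, key=lambda i: (cosine(last, embeddings[i]), -i))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Hypothetical embeddings: requests 0, 2, 4 point one way; 1, 3, 5 another.
embeddings = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
order = reorder_by_similarity(embeddings)
batches = [order[i:i + 3] for i in range(0, len(order), 3)]
print(batches)  # [[0, 2, 4], [5, 3, 1]]
```

The greedy chain is quadratic in the number of requests, which is acceptable for a sketch. A production pipeline would more likely use a proper clustering or approximate nearest-neighbour method.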
Critically, this optimization requires no changes to model architecture or compute kernels, making it a practical drop-in improvement for existing MoE platforms. Doubleword's batch inference service is well positioned to use the technique, since the request reordering step fits into its inference pipeline before batching.
- Performance scales with dataset size: load reductions increase from 12.3% to 17.1% as the number of clustering examples grows from 1,000 to 5,000
- MoE inference is fundamentally memory-bandwidth constrained; this technique directly addresses the expert loading bottleneck
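The gap between the 15.6% load reduction and the 5.4% wall-clock saving can be read through a simple back-of-envelope model. The sketch below assumes that wall-clock time splits into an expert-loading part and everything else, and that loading time scales linearly with the number of loads. That linear model is an assumption made here for illustration, not something the source states.

```python
def implied_load_share(load_reduction, wallclock_saving):
    """Fraction of wall-clock time spent on expert loading, under the
    assumed linear model: wallclock_saving = load_share * load_reduction."""
    return wallclock_saving / load_reduction

# Reported figures: 15.6% fewer expert loads, about 5.4% less wall-clock time.
share = implied_load_share(0.156, 0.054)
print(f"{share:.3f}")  # 0.346
```

Under that assumption, roughly a third of the measured wall-clock time would be attributable to expert loading in the tested configuration.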
Editorial Opinion
This is a pragmatic optimization that underscores the growing importance of inference engineering in the LLM era. While a 15% expert load reduction translating to a 5.4% wall-clock improvement may seem modest, such infrastructure-level optimizations compound significantly at scale, particularly for batch inference services processing high-volume prompt streams. The appeal lies in its applicability: because it requires no model changes, the technique is immediately deployable across the MoE ecosystem.


