Together AI Named Preferred Cloud Partner for MiniMax M3, Delivers Substantial Inference Optimizations
Key Takeaways
- ▸Together AI named preferred cloud partner for MiniMax M3; will host the model as a developer endpoint following its public release
- ▸Together AI's optimization work delivered 81–125% throughput improvements via specialized sparse attention kernels and multimodal preprocessing
- ▸MiniMax M3 features a 1M-token context window, native multimodal support, and state-of-the-art coding and agentic performance
Summary
Together AI announced that it is the preferred cloud infrastructure partner for MiniMax's newly launched M3, a state-of-the-art large language model with a 1-million-token context window, native multimodal capabilities, and strong performance on coding and agentic workflows. Together AI will host M3 as a developer endpoint once the model is publicly released. The company presents the partnership as validation of its ability to serve frontier AI models at scale.
To support production deployment of MiniMax M3, Together AI's Inference and Kernel teams built optimizations tailored to the model's architecture: a KV-Block-Major sparse attention kernel, a new paged attention integration for MiniMax's Sparse Attention (MSA) mechanism, an optimized index scoring kernel, and a Rust-based multimodal preprocessing gateway. Together, these deliver throughput improvements of 81–125%, or roughly 1.8x to 2.25x the baseline throughput, depending on the concurrency level.
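The announcement does not describe how the paged attention integration or the KV-Block-Major kernel work internally. As a rough illustration of the general idea behind paged KV caching, the sketch below stores each sequence's keys and values in fixed-size blocks and uses a per-sequence block table to map token positions to physical blocks. The class `PagedKVCache`, its methods, and the grouping of selected tokens by block are all hypothetical and chosen for illustration; they are not Together AI's implementation.

```python
class PagedKVCache:
    """Toy block-table bookkeeping for a paged KV cache (illustrative only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            # No block yet, or the last block is full: take a free one.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def locate(self, seq_id, token_index):
        """Translate a logical token position into (physical_block, offset)."""
        if not 0 <= token_index < self.lengths.get(seq_id, 0):
            raise IndexError("token index out of range")
        table = self.block_tables[seq_id]
        return table[token_index // self.block_size], token_index % self.block_size

    def blocks_for_tokens(self, seq_id, token_indices):
        """Group selected token positions by physical block.

        Assumption: a sparse kernel would rather visit each block once than
        jump between blocks per token. This is one plausible reading of a
        block-major layout, not a documented design.
        """
        groups = {}
        for token_index in token_indices:
            block, offset = self.locate(seq_id, token_index)
            groups.setdefault(block, []).append(offset)
        return groups

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are allocated on demand, two interleaved requests can share one memory pool without either reserving space for its maximum possible length, which is the main reason paged layouts help at high concurrency.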
MiniMax M3's core innovation is its sparse attention architecture, MSA, which addresses long-context processing by limiting the number of tokens each query attends to. Full attention compares every token with every other token, so its cost grows as O(N²) in the sequence length N; when each query attends to a bounded set of tokens, the attention cost grows far more slowly. This design lets the model support a 1-million-token context window while remaining economical to serve. Together AI's optimization work is aimed at making M3 efficient to deploy in production, with reported speedups of over 9x in the prefilling stage and 15x in the decoding stage.
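To make the cost argument concrete, here is a minimal sketch of top-k sparse attention for a single query, next to ordinary dense attention. The article does not specify how MSA selects tokens, so the selection rule here (a cheap dot product over the first `index_dim` components, standing in for an index scoring step) is an assumption, as are all function names.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_weighted_sum(scores, values):
    """Numerically stable softmax over scores, applied to the value vectors."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def dense_attention(query, keys, values):
    """One query attends to every key: O(N) per query, O(N^2) over a sequence."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    return softmax_weighted_sum(scores, values)


def sparse_attention(query, keys, values, top_k, index_dim=2):
    """One query attends only to the top_k keys picked by a cheap index score.

    Assumption: the index score is a dot product over the first index_dim
    components. It stands in for whatever scoring rule MSA actually uses.
    """
    index_scores = [dot(query[:index_dim], k[:index_dim]) for k in keys]
    ranked = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)
    selected = sorted(ranked[:top_k])  # restore original token order
    scale = math.sqrt(len(query))
    scores = [dot(query, keys[i]) / scale for i in selected]
    output = softmax_weighted_sum(scores, [values[i] for i in selected])
    return output, selected


def scored_pairs(n_tokens, top_k=None):
    """Count full-attention (query, key) pairs under causal masking."""
    if top_k is None:
        return n_tokens * (n_tokens + 1) // 2
    return sum(min(position + 1, top_k) for position in range(n_tokens))
```

With these definitions, `scored_pairs(1000)` counts 500,500 query-key pairs for dense causal attention, while `scored_pairs(1000, top_k=10)` counts 9,955. Note that the index scoring pass in this sketch still touches every key, only at lower cost per key, which may be why the announcement lists a dedicated index scoring kernel among the optimizations.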
The partnership demonstrates the growing importance of specialized inference infrastructure for deploying frontier AI models. Together AI's ability to optimize and serve complex models like M3 at production scale supports its position as an infrastructure partner for AI companies working on long-context reasoning, multimodal understanding, and agentic workflows.
- ▸Partnership highlights the critical infrastructure engineering required to deploy advanced LLMs at production scale



