BotBeat

Together AI
PARTNERSHIP · Together AI · 2026-06-03

Together AI Named Preferred Cloud Partner for MiniMax M3, Delivers Substantial Inference Optimizations

Key Takeaways

  • Together AI was named preferred cloud partner for MiniMax M3 and will host the model as a developer endpoint following its public release
  • Together AI's optimization work delivered 81–125% throughput improvements via specialized sparse attention kernels and multimodal preprocessing
  • MiniMax M3 features a 1M-token context window, native multimodal support, and state-of-the-art coding and agentic performance
Source: Hacker News (https://www.together.ai/blog/serving-minimax-m3-for-efficient-inference-unlocking-1m-token-context-and-multimodality-without-regrets)

Summary

Together AI announced it has become the preferred cloud infrastructure partner for MiniMax's newly announced M3 model, a state-of-the-art large language model featuring a 1-million-token context window, native multimodal capabilities, and strong performance on coding and agentic workflows. Together AI will host M3 as a developer endpoint upon its public release. The partnership marks a significant validation of Together AI's capabilities in serving frontier AI models at scale.

To support efficient production deployment of MiniMax M3, Together AI's Inference and Kernel teams developed substantial technical optimizations specifically tailored to the model's unique architecture. These optimizations include a KV-Block-Major sparse attention kernel, a novel paged attention integration for MiniMax's Sparse Attention (MSA) mechanism, a highly optimized index scoring kernel, and a Rust-based multimodal preprocessing gateway. The optimizations collectively deliver 81–125% throughput improvements across different concurrency levels.
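The summary names these kernels but does not describe their internals. As a rough illustration only, the sketch below assumes a vLLM-style paged KV cache: each sequence's cached keys and values live in fixed-size blocks scattered across a shared pool, and a per-sequence block table maps logical block numbers to physical ones. It further assumes, hypothetically, that the sparse attention step selects whole KV blocks, so it only has to gather the selected physical blocks rather than the full context. The class and method names (`PagedKVCache`, `append`, `lookup`, `gather_blocks`) and the tiny block size are made up for this example and are not Together AI's or MiniMax's code.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Hypothetical paged KV cache: fixed-size blocks in a shared pool, indexed per sequence."""

    block_size: int
    # Physical blocks; each holds up to `block_size` cached token entries.
    pool: list = field(default_factory=list)
    # seq_id -> list of physical block ids, in logical (positional) order.
    block_tables: dict = field(default_factory=dict)

    def append(self, seq_id, entry):
        """Cache one token's entry, allocating a new physical block when the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if not table or len(self.pool[table[-1]]) == self.block_size:
            self.pool.append([])
            table.append(len(self.pool) - 1)
        self.pool[table[-1]].append(entry)

    def lookup(self, seq_id, pos):
        """Translate a logical token position to its physical block and offset."""
        physical = self.block_tables[seq_id][pos // self.block_size]
        return self.pool[physical][pos % self.block_size]

    def gather_blocks(self, seq_id, logical_blocks):
        """Return (position, entry) pairs for only the selected logical blocks (assumed block-sparse)."""
        table = self.block_tables[seq_id]
        out = []
        for lb in sorted(logical_blocks):
            for offset, entry in enumerate(self.pool[table[lb]]):
                out.append((lb * self.block_size + offset, entry))
        return out


if __name__ == "__main__":
    cache = PagedKVCache(block_size=4)  # illustrative size only
    for i in range(10):
        # Interleaving two sequences makes their physical blocks non-contiguous.
        cache.append("a", f"a{i}")
        cache.append("b", f"b{i}")
    print(cache.block_tables)  # a -> [0, 2, 4], b -> [1, 3, 5]
    print(cache.gather_blocks("a", {0, 2}))
```

The block table indirection is what lets the cache pack many sequences into one pool without reserving contiguous memory for a full 1M-token context; in this sketch, block-level selection means a query touches only the handful of blocks it was assigned.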

MiniMax M3's core innovation is its Sparse Attention architecture (MSA), which addresses long-context processing challenges by limiting the number of tokens each query attends to. If each query attends to at most k tokens, the attention computation scales as O(N·k) in the sequence length N, rather than the O(N²) of dense attention. This design is what lets the model support a 1-million-token context window while remaining economical to serve. Together AI's optimization work makes M3 efficient to deploy in production, with reported speedups of over 9x in the prefilling stage and 15x in the decoding stage.
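To make the complexity argument concrete, here is a minimal single-query sketch of one common way to limit the tokens a query attends to: top-k selection. It assumes, for illustration, that tokens are chosen by a cheap index score (a dot product over only the first `index_dims` coordinates, a hypothetical stand-in for a real indexer) and that exact softmax attention then runs over the k selected tokens only. This is not MiniMax's MSA; the function names are hypothetical. Note that in this sketch the selection step still scores every key, just with a cheaper operation; the O(N·k) figure applies to the exact attention step.

```python
import heapq
import math


def dot(a, b, dims=None):
    """Dot product, optionally restricted to the first `dims` coordinates."""
    n = len(a) if dims is None else dims
    return sum(x * y for x, y in zip(a[:n], b[:n]))


def topk_sparse_attention(query, keys, values, k, index_dims=1):
    """Single-query top-k sparse attention (illustrative sketch, not MiniMax's MSA)."""
    # 1) Hypothetical cheap index score for every cached key.
    index_scores = [dot(query, key, index_dims) for key in keys]
    # 2) Keep the k highest-scoring positions, then restore positional order.
    selected = heapq.nlargest(min(k, len(keys)), range(len(keys)), key=index_scores.__getitem__)
    selected.sort()
    # 3) Exact scaled dot-product attention over the selected tokens only.
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * dot(query, keys[i]) for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, selected):
        for d, v in enumerate(values[i]):
            out[d] += (w / total) * v
    return out, selected


def attention_pairs(n, k=None):
    """Query-key pairs in the attention step: n*n if dense, n*k if sparse (causal mask ignored)."""
    return n * n if k is None else n * min(k, n)


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
    values = [[1.0], [2.0], [3.0], [4.0]]
    out, selected = topk_sparse_attention([1.0, 0.0], keys, values, k=2)
    print(selected, out)  # selects positions [0, 2]
    n, k = 1_000_000, 2048  # k is an illustrative value, not M3's setting
    print(attention_pairs(n), attention_pairs(n, k))
```

With N = 1,000,000 and the illustrative k = 2,048, `attention_pairs` gives 10^12 dense pairs versus about 2.05 × 10^9 sparse pairs, a reduction of N/k ≈ 488x for the attention step alone.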

The partnership demonstrates the growing importance of specialized inference infrastructure for deploying frontier AI models. Together AI's ability to optimize and serve complex models like M3 at production scale supports its position as a critical infrastructure partner for AI companies pushing the boundaries of long-context reasoning, multimodal understanding, and agentic workflows.

  • Partnership highlights the critical infrastructure engineering required to deploy advanced LLMs at production scale
Large Language Models (LLMs) · Generative AI · MLOps & Infrastructure · Partnerships

© 2026 BotBeat