Microsoft Releases Three New Multimodal AI Models: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2
Key Takeaways
- MAI-Transcribe-1 is a speech-to-text model that supports 25 major languages and runs batch transcription 2.5x faster than Microsoft Azure's Fast offering; Microsoft positions it as having the best price-performance ratio among cloud providers
- MAI-Voice-1 generates realistic speech with emotional nuance at a rate of 60 seconds of audio per second of compute, and now allows custom voices to be created from a few seconds of sample audio
- MAI-Image-2 generates images at least 2x faster than previous versions, with improved visual quality aimed at photographers and designers, and is already in use by enterprise partners such as WPP
Summary
Microsoft has launched three new multimodal AI models on its Microsoft Foundry platform: MAI-Transcribe-1 for speech-to-text transcription, MAI-Voice-1 for voice generation, and MAI-Image-2 for image generation. Microsoft positions the models as delivering world-class quality with significant performance improvements over existing alternatives. MAI-Transcribe-1 supports the 25 most-used languages and offers 2.5x faster batch transcription than Microsoft Azure's Fast offering. MAI-Voice-1 can generate 60 seconds of high-quality audio in one second and now supports custom voice creation from short audio samples. MAI-Image-2 generates images at least 2x faster on Foundry and Copilot than previous versions, with enhanced handling of natural lighting, skin tone accuracy, and text rendering. Pricing starts at $0.36 per hour for MAI-Transcribe-1; MAI-Voice-1 costs $22 per 1M characters, and MAI-Image-2 costs $5 per 1M input tokens and $33 per 1M output tokens. The models are now available to developers through Microsoft Foundry and the MAI Playground.
Across all three models, Microsoft pairs competitive pricing with its speed and quality claims, reflecting a strategy of competing on quality, speed, and cost simultaneously.
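To make the quoted prices concrete, the following Python sketch estimates costs and voice generation time from the figures in the summary. The function names and the example workloads are hypothetical and are not part of any Microsoft SDK. The transcription rate is the quoted starting price, so real bills may differ.

```python
# Rough cost and throughput estimates based only on the figures quoted in the
# summary. All names here are illustrative assumptions, not a Microsoft API.

TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36        # MAI-Transcribe-1, quoted starting price
VOICE_USD_PER_MILLION_CHARS = 22.0          # MAI-Voice-1
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0    # MAI-Image-2
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0  # MAI-Image-2
VOICE_AUDIO_SECONDS_PER_SECOND = 60.0       # quoted generation rate for MAI-Voice-1


def transcribe_cost(audio_hours: float) -> float:
    """Estimated USD cost of transcribing the given hours of audio."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated USD cost of synthesizing speech for the given character count."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of image generation for the given token counts."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


def voice_generation_seconds(audio_seconds: float) -> float:
    """Estimated time to generate audio of the given length at the quoted rate."""
    return audio_seconds / VOICE_AUDIO_SECONDS_PER_SECOND


if __name__ == "__main__":
    # Hypothetical workloads chosen only for illustration.
    print(f"100 hours of transcription: ${transcribe_cost(100):.2f}")
    print(f"500,000 characters of speech: ${voice_cost(500_000):.2f}")
    print(f"Image job (20k input, 200k output tokens): ${image_cost(20_000, 200_000):.2f}")
    print(f"Time to generate 10 minutes of audio: {voice_generation_seconds(600):.1f} s")
```

Keeping the rates as named constants makes it easy to update the estimate if Microsoft changes its pricing or if a different tier applies.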
Editorial Opinion
Microsoft's announcement of these three models represents a significant expansion of its AI capabilities across speech recognition, voice generation, and image generation. By emphasizing speed, quality, and affordability at once, and by pointing to production metrics and enterprise adoption, Microsoft is positioning itself to compete directly with specialized AI providers. The focus on practical applications, from enterprise creative work with WPP to developer access through Foundry, suggests Microsoft is betting on democratizing advanced multimodal AI rather than limiting it to premium tiers.



