Microsoft AI Announces Three New Multimodal Models: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2
Key Takeaways
- Three new multimodal AI models (transcription, voice, and image generation) are now available in Microsoft Foundry, with Microsoft claiming significant performance improvements over existing offerings
- MAI-Transcribe-1 supports 25 languages and offers 2.5x faster batch transcription than existing Azure offerings; MAI-Voice-1 generates realistic speech and supports custom voice creation; MAI-Image-2 provides 2x faster image generation
- The models are positioned as competitively priced against other providers, with efficiency gains that do not sacrifice quality
Summary
Microsoft AI has announced three new advanced models available through Microsoft Foundry: MAI-Transcribe-1 for speech-to-text transcription, MAI-Voice-1 for voice generation, and MAI-Image-2 for image generation. MAI-Transcribe-1 delivers state-of-the-art performance across 25 languages with 2.5x faster batch transcription speeds than existing Azure offerings. MAI-Voice-1 enables custom voice creation from just seconds of audio and can generate 60 seconds of speech in a single second, while MAI-Image-2 provides 2x faster image generation with improved quality for creative professionals.
Microsoft positions all three models as outperforming competitors at competitive prices. Pricing starts at $0.36 per hour for MAI-Transcribe-1 and $22 per million characters for MAI-Voice-1; MAI-Image-2 starts at $5 per million tokens for text input and $33 per million tokens for image output. Early enterprise adopters include WPP, a major marketing and communications group, which is already using MAI-Image-2 for campaign-ready creative work at scale.
- Early enterprise adoption by WPP suggests commercial viability for creative and marketing applications
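To make the quoted starting prices concrete, here is a minimal sketch of a cost estimate in Python. It assumes the "starts at" figures apply linearly to all usage, which the announcement does not confirm, since tiers, minimums, and rounding rules are not stated. The `Workload` class, its field names, and `estimate_cost` are hypothetical helpers for illustration and are not part of any Microsoft API.

```python
from dataclasses import dataclass

# Starting prices quoted in the summary (USD). Treating them as flat,
# linear rates is an assumption; real billing may use tiers or minimums.
TRANSCRIBE_USD_PER_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_TEXT_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_IMAGE_OUTPUT_TOKENS = 33.0


@dataclass
class Workload:
    """Hypothetical usage figures; field names are illustrative only."""
    transcribe_hours: float = 0.0
    voice_characters: int = 0
    image_text_input_tokens: int = 0
    image_output_tokens: int = 0


def estimate_cost(w: Workload) -> dict[str, float]:
    """Return a per-model cost breakdown in USD, rounded to cents."""
    transcribe = w.transcribe_hours * TRANSCRIBE_USD_PER_HOUR
    voice = w.voice_characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS
    image = (
        w.image_text_input_tokens / 1_000_000
        * IMAGE_USD_PER_MILLION_TEXT_INPUT_TOKENS
        + w.image_output_tokens / 1_000_000
        * IMAGE_USD_PER_MILLION_IMAGE_OUTPUT_TOKENS
    )
    return {
        "MAI-Transcribe-1": round(transcribe, 2),
        "MAI-Voice-1": round(voice, 2),
        "MAI-Image-2": round(image, 2),
        "total": round(transcribe + voice + image, 2),
    }


if __name__ == "__main__":
    example = Workload(
        transcribe_hours=100,
        voice_characters=2_000_000,
        image_text_input_tokens=1_000_000,
        image_output_tokens=3_000_000,
    )
    for name, usd in estimate_cost(example).items():
        print(f"{name}: ${usd:.2f}")
```

Under these assumptions, the example workload (100 hours transcribed, 2 million characters of speech, 1 million text input tokens, and 3 million image output tokens) comes to $36 + $44 + $104 = $184.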
Editorial Opinion
Microsoft's announcement of these three models is a significant push to broaden access to high-quality multimodal AI capabilities through Foundry. The emphasis on combining speed, quality, and affordability directly challenges competitors, since developers have traditionally had to trade one of these against the others. If the performance claims hold up in production environments, this could accelerate enterprise adoption of AI-powered features across voice, transcription, and image generation use cases.



