Microsoft Releases Three New Multimodal AI Models: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2

Key Takeaways

▸ MAI-Transcribe-1 delivers state-of-the-art speech-to-text, with batch transcription 2.5x faster than Microsoft Azure's Fast offering and support for 25 major languages at the best price-performance ratio among cloud providers
▸ MAI-Voice-1 enables realistic voice generation with emotional nuance, can generate 60 seconds of audio in one second, and now allows custom voice creation from a few seconds of audio
▸ MAI-Image-2 generates images at least 2x faster than previous versions, with improved visual quality for photographers and designers, and is already in use by major enterprise partners like WPP

Source:

Hacker News: https://microsoft.ai/news/today-were-announcing-3-new-world-class-mai-models-available-in-foundry/

Summary

Microsoft has announced the launch of three new multimodal AI models available on its Microsoft Foundry platform: MAI-Transcribe-1 for speech-to-text transcription, MAI-Voice-1 for voice generation, and MAI-Image-2 for image generation. These models are positioned as delivering world-class quality with significant performance improvements over existing alternatives. MAI-Transcribe-1 supports the top 25 most-used languages and offers 2.5x faster batch transcription speed compared to Microsoft Azure's Fast offering, while MAI-Voice-1 can generate 60 seconds of high-quality audio in just one second and now supports custom voice creation from short audio samples. MAI-Image-2 demonstrates at least 2x faster generation times on Foundry and Copilot compared to previous versions, with enhanced capabilities for natural lighting, skin tone accuracy, and text rendering. All three models are priced competitively, with MAI-Transcribe-1 starting at $0.36 per hour, MAI-Voice-1 at $22 per 1M characters, and MAI-Image-2 at $5 per 1M input tokens and $33 per 1M output tokens. The models are now available to developers through Microsoft Foundry and the MAI Playground.
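To make the listed prices concrete, the arithmetic can be sketched as a small cost estimator. This is a minimal illustration, not an official calculator: it assumes the quoted "starting at" rates apply linearly with no tiers, discounts, or minimums, and the function names and the example workload are hypothetical.

```python
# Listed starting prices from the summary above (USD).
# Assumption: pricing scales linearly with usage; real billing may differ.
TRANSCRIBE_PER_AUDIO_HOUR = 0.36        # MAI-Transcribe-1
VOICE_PER_MILLION_CHARS = 22.0          # MAI-Voice-1
IMAGE_PER_MILLION_INPUT_TOKENS = 5.0    # MAI-Image-2
IMAGE_PER_MILLION_OUTPUT_TOKENS = 33.0  # MAI-Image-2


def transcribe_cost(audio_hours: float) -> float:
    """Estimated cost of transcribing the given hours of audio."""
    return audio_hours * TRANSCRIBE_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated cost of synthesizing speech for the given character count."""
    return characters / 1_000_000 * VOICE_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of image generation from input and output token counts."""
    return (
        input_tokens / 1_000_000 * IMAGE_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_PER_MILLION_OUTPUT_TOKENS
    )


if __name__ == "__main__":
    # Hypothetical monthly workload, chosen only for illustration.
    total = (
        transcribe_cost(audio_hours=100)
        + voice_cost(characters=500_000)
        + image_cost(input_tokens=1_000_000, output_tokens=2_000_000)
    )
    print(f"Estimated monthly cost: ${total:.2f}")
```

Keeping the rates as named constants makes it easy to update the estimate if Microsoft's published pricing changes.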

All three models emphasize competitive pricing paired with superior performance, reflecting Microsoft's strategy of competing on quality, speed, and cost simultaneously.

Editorial Opinion

Microsoft's announcement of these three multimodal models represents a significant expansion of its AI capabilities across speech, voice, and vision domains. By emphasizing speed, quality, and affordability simultaneously—backed by real production metrics and enterprise adoption—Microsoft is positioning itself to compete directly with specialized AI providers. The focus on practical applications, from enterprise creative work with WPP to developer accessibility through Foundry, suggests Microsoft is betting on democratizing advanced multimodal AI rather than limiting it to premium tiers.

