BotBeat

Microsoft
PRODUCT LAUNCH, 2026-04-05

Microsoft Releases Three New Multimodal AI Models: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2

Key Takeaways

  • MAI-Transcribe-1 is a speech-to-text model that Microsoft says runs batch transcription 2.5x faster than Microsoft Azure's Fast offering, supports 25 major languages, and offers the best price-performance ratio among cloud providers
  • MAI-Voice-1 generates realistic speech with emotional nuance, can produce 60 seconds of audio in one second, and now supports custom voice creation from a few seconds of sample audio
  • MAI-Image-2 generates images at least 2x faster than previous versions, with improved visual quality aimed at photographers and designers, and is already in use by enterprise partners such as WPP
Source: Hacker News, https://microsoft.ai/news/today-were-announcing-3-new-world-class-mai-models-available-in-foundry/

Summary

Microsoft has announced three new multimodal AI models on its Microsoft Foundry platform: MAI-Transcribe-1 for speech-to-text transcription, MAI-Voice-1 for voice generation, and MAI-Image-2 for image generation. Microsoft positions them as world-class in quality, with significant performance gains over existing alternatives. MAI-Transcribe-1 supports the 25 most-used languages and offers batch transcription 2.5x faster than Microsoft Azure's Fast offering. MAI-Voice-1 can generate 60 seconds of high-quality audio in one second and now supports custom voice creation from short audio samples. MAI-Image-2 generates images at least 2x faster on Foundry and Copilot than previous versions, with better natural lighting, skin tone accuracy, and text rendering. Pricing starts at $0.36 per hour for MAI-Transcribe-1, $22 per 1M characters for MAI-Voice-1, and $5 per 1M input tokens plus $33 per 1M output tokens for MAI-Image-2. The models are available to developers now through Microsoft Foundry and the MAI Playground.
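To make the quoted prices concrete, here is a minimal cost sketch in Python. It assumes the starting prices above apply linearly with no tiers, minimums, or rounding rules, which the announcement does not confirm. The function names and the example workload are hypothetical and are not part of any Microsoft API.

```python
# Starting prices as quoted in the summary (USD). Assumption: linear billing
# with no volume tiers, minimum charges, or rounding rules.
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0

# Quoted MAI-Voice-1 throughput: 60 seconds of audio per second of compute.
VOICE_AUDIO_SECONDS_PER_SECOND = 60.0


def transcribe_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a given amount of audio."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a given number of input characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from input and output token counts."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


def voice_generation_seconds(audio_seconds: float) -> float:
    """Time to synthesize audio at the quoted best-case throughput."""
    return audio_seconds / VOICE_AUDIO_SECONDS_PER_SECOND


if __name__ == "__main__":
    # Hypothetical workload, chosen only to illustrate the arithmetic.
    print(f"100 h of transcription: ${transcribe_cost(100):.2f}")
    print(f"250,000 characters of speech: ${voice_cost(250_000):.2f}")
    print(f"Image job (2,000 in / 4,000 out tokens): ${image_cost(2_000, 4_000):.3f}")
    print(f"10 min of audio generated in about {voice_generation_seconds(600):.0f} s")
```

Under these assumptions, 100 hours of transcription comes to $36.00 and 250,000 characters of generated speech to $5.50. Actual bills may differ, since the article gives only starting prices.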

  • All three models emphasize competitive pricing paired with superior performance, reflecting Microsoft's strategy of competing on quality, speed, and cost simultaneously

Editorial Opinion

Microsoft's announcement of these three multimodal models is a significant expansion of its AI capabilities across speech, voice, and vision. By emphasizing speed, quality, and affordability at once, and by citing production metrics and enterprise adoption to support those claims, Microsoft is positioning itself to compete directly with specialized AI providers. The focus on practical applications, from enterprise creative work with WPP to developer access through Foundry, suggests Microsoft is betting on democratizing advanced multimodal AI rather than limiting it to premium tiers.

Computer Vision, Generative AI, Multimodal AI, Speech & Audio, Product Launch
