BotBeat

PRODUCT LAUNCH | Microsoft | 2026-04-02

Microsoft AI Launches MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 Models in Foundry

Key Takeaways

  • MAI-Transcribe-1 achieves state-of-the-art multilingual speech-to-text with 2.5x faster processing than comparable Azure services, at competitive pricing
  • MAI-Voice-1 creates custom voices from minimal audio samples and generates 60 seconds of speech in a single second, supporting voice agent development
  • MAI-Image-2 doubles generation speed with improved quality for creative professionals, with early enterprise adoption by WPP and rollout across Copilot, Bing, and PowerPoint
Sources:
Hacker News: https://microsoft.ai/news/today-were-announcing-3-new-world-class-mai-models-available-in-foundry/
Hacker News: https://microsoft.ai/news/state-of-the-art-speech-recognition-with-mai-transcribe-1/

Summary

Microsoft AI has announced three new multimodal AI models now available in Microsoft Foundry: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. MAI-Transcribe-1 delivers state-of-the-art speech-to-text transcription across 25 languages with 2.5x faster batch processing than existing Azure offerings, starting at $0.36 per hour. MAI-Voice-1, the company's top-tier voice generation model, can now create custom voices from just a few seconds of audio and generate 60 seconds of speech in a single second, priced at $22 per 1M characters.
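To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the $0.36 rate applies per hour of audio transcribed and treats it as a flat rate, even though the announcement only gives it as a starting price. The function names are hypothetical helpers for illustration, not part of any Microsoft SDK.

```python
# Figures quoted in the summary; actual billing may differ (assumption:
# the transcription rate is per hour of audio and flat across tiers).
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
VOICE_AUDIO_SECONDS_PER_SECOND = 60.0  # "60 seconds of speech in a single second"


def transcription_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost in USD for a given amount of audio."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost in USD for a given number of input characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def voice_generation_seconds(audio_seconds: float) -> float:
    """Estimated wall-clock time to generate a given length of speech."""
    return audio_seconds / VOICE_AUDIO_SECONDS_PER_SECOND


if __name__ == "__main__":
    print(f"100 h of transcription: ${transcription_cost(100):.2f}")
    print(f"500k characters of speech: ${voice_cost(500_000):.2f}")
    print(f"10 min of speech takes ~{voice_generation_seconds(600):.0f} s to generate")
```

Under these assumptions, 100 hours of audio would cost about $36 to transcribe, and half a million characters of generated speech would cost about $11.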

MAI-Image-2 delivers at least 2x faster generation on Foundry and Copilot while maintaining quality, and is gaining traction with enterprise partners including WPP, one of the world's largest marketing groups. Microsoft emphasizes competitive pricing and efficiency for all three models, positioning them as superior to competing offerings in speed, quality, and cost. The company says the models are designed around human-centric principles, optimizing for natural communication and real-world use by creative professionals, developers, and enterprises.

Editorial Opinion

Microsoft's simultaneous launch of three models signals a comprehensive strategy to compete across speech, voice, and image AI. The emphasis on speed, quality, and affordability, particularly the claim of better performance than competitors at lower cost, makes these models attractive for enterprise adoption. However, the comparison claims warrant scrutiny, and real-world performance will ultimately determine whether the MAI models are genuinely superior on all three dimensions at once.

Tags: Computer Vision, Natural Language Processing (NLP), Generative AI, Multimodal AI, Speech & Audio, Product Launch
