BotBeat

PRODUCT LAUNCH · Microsoft · 2026-04-05

Microsoft Releases Three New Multimodal AI Models: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2

Key Takeaways

  • MAI-Transcribe-1 delivers state-of-the-art speech-to-text for 25 major languages, with batch transcription 2.5x faster than Microsoft Azure's Fast offering and what Microsoft describes as the best price-performance ratio among cloud providers
  • MAI-Voice-1 generates realistic speech with emotional nuance at a rate of 60 seconds of audio per second, and now supports custom voice creation from a few seconds of sample audio
  • MAI-Image-2 generates images at least 2x faster than previous versions, with improved visual quality for photographers and designers, and is already in use by major enterprise partners such as WPP
Source: Hacker News, https://microsoft.ai/news/today-were-announcing-3-new-world-class-mai-models-available-in-foundry/

Summary

Microsoft has launched three new multimodal AI models on its Microsoft Foundry platform: MAI-Transcribe-1 for speech-to-text transcription, MAI-Voice-1 for voice generation, and MAI-Image-2 for image generation. Microsoft positions them as world-class in quality, with significant performance improvements over existing alternatives. MAI-Transcribe-1 supports the 25 most-used languages and runs batch transcription 2.5x faster than Microsoft Azure's Fast offering. MAI-Voice-1 can generate 60 seconds of high-quality audio in one second and now supports custom voice creation from short audio samples. MAI-Image-2 generates images at least 2x faster on Foundry and Copilot than previous versions, with better natural lighting, skin tone accuracy, and text rendering. Pricing starts at $0.36 per hour for MAI-Transcribe-1, $22 per 1M characters for MAI-Voice-1, and $5 per 1M input tokens plus $33 per 1M output tokens for MAI-Image-2. All three models are now available to developers through Microsoft Foundry and the MAI Playground.

  • All three models emphasize competitive pricing paired with superior performance, reflecting Microsoft's strategy of competing on quality, speed, and cost simultaneously
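To make the quoted prices concrete, here is a minimal Python sketch of the arithmetic. The dollar figures and the 60x voice throughput come from the summary above; everything else is an assumption made for illustration. In particular, the function names are hypothetical, "per hour" is read as per hour of input audio, and the article does not say how many tokens a typical image request consumes, so token counts must be supplied by the caller. This is not a Foundry API client.

```python
"""Back-of-the-envelope cost estimates from the prices quoted in the article."""

# Launch prices in USD, as quoted in the summary.
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36  # assumption: "per hour" means per hour of audio
VOICE_USD_PER_1M_CHARS = 22.0
IMAGE_USD_PER_1M_INPUT_TOKENS = 5.0
IMAGE_USD_PER_1M_OUTPUT_TOKENS = 33.0

# Quoted MAI-Voice-1 throughput: 60 seconds of audio per second of generation.
VOICE_AUDIO_SECONDS_PER_SECOND = 60.0


def transcribe_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a given number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a given number of input characters."""
    return characters / 1_000_000 * VOICE_USD_PER_1M_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost; token counts per image are not given in the article."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_1M_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_1M_OUTPUT_TOKENS
    )


def voice_generation_seconds(audio_seconds: float) -> float:
    """Time to generate a clip at the quoted 60x real-time rate."""
    return audio_seconds / VOICE_AUDIO_SECONDS_PER_SECOND


if __name__ == "__main__":
    # Hypothetical workloads, chosen only to illustrate the arithmetic.
    print(f"100 h of transcription: ${transcribe_cost(100):.2f}")
    print(f"2M characters of speech: ${voice_cost(2_000_000):.2f}")
    print(f"1M input + 1M output image tokens: ${image_cost(1_000_000, 1_000_000):.2f}")
    print(f"10-minute clip generation time: {voice_generation_seconds(600):.0f} s")
```

Under these assumptions, 100 hours of transcription comes to $36.00, 2 million characters of speech to $44.00, and a 10-minute voice clip would take about 10 seconds to generate. Actual bills depend on billing details the article does not cover.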

Editorial Opinion

Microsoft's announcement of these three multimodal models represents a significant expansion of its AI capabilities across speech, voice, and vision. By emphasizing speed, quality, and affordability at the same time, and backing those claims with production metrics and enterprise adoption, Microsoft is positioning itself to compete directly with specialized AI providers. The focus on practical applications, from enterprise creative work with WPP to developer access through Foundry, suggests Microsoft is betting on making advanced multimodal AI broadly available rather than limiting it to premium tiers.

Computer Vision · Generative AI · Multimodal AI · Speech & Audio · Product Launch
