BotBeat

Microsoft
PRODUCT LAUNCH | Microsoft | 2026-04-05

Microsoft Releases Three New Multimodal AI Models: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2

Key Takeaways

  • MAI-Transcribe-1 delivers state-of-the-art speech-to-text, with batch transcription 2.5x faster than Microsoft Azure's Fast offering, support for 25 major languages, and what Microsoft describes as the best price-performance ratio among cloud providers
  • MAI-Voice-1 generates realistic speech with emotional nuance, producing 60 seconds of audio in one second, and now allows custom voice creation from a few seconds of audio
  • MAI-Image-2 generates images at least 2x faster with improved visual quality for photographers and designers, and is already in use by major enterprise partners such as WPP
Source: Hacker News, https://microsoft.ai/news/today-were-announcing-3-new-world-class-mai-models-available-in-foundry/

Summary

Microsoft has launched three new multimodal AI models on its Microsoft Foundry platform: MAI-Transcribe-1 for speech-to-text transcription, MAI-Voice-1 for voice generation, and MAI-Image-2 for image generation. Microsoft positions the models as world-class in quality, with significant performance gains over existing alternatives. MAI-Transcribe-1 supports the 25 most-used languages and runs batch transcription 2.5x faster than Microsoft Azure's Fast offering. MAI-Voice-1 can generate 60 seconds of high-quality audio in one second and now supports custom voice creation from short audio samples. MAI-Image-2 generates images at least 2x faster on Foundry and Copilot than previous versions, with improved natural lighting, skin tone accuracy, and text rendering. Pricing starts at $0.36 per hour for MAI-Transcribe-1, $22 per 1M characters for MAI-Voice-1, and $5 per 1M input tokens plus $33 per 1M output tokens for MAI-Image-2. The models are available to developers now through Microsoft Foundry and the MAI Playground.

  • All three models emphasize competitive pricing paired with superior performance, reflecting Microsoft's strategy of competing on quality, speed, and cost simultaneously
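To make the quoted prices concrete, here is a minimal Python sketch that turns them into rough cost estimates. The rates are the "starting at" figures from the summary above, so actual bills may differ. The function names are hypothetical, the transcription rate is assumed to be per hour of audio, and nothing here calls a real Microsoft API. The article does not say how many tokens a typical image consumes, so the image estimate takes token counts as inputs rather than guessing them.

```python
# Rough cost estimator based only on the list prices quoted in the article.
# All names below are illustrative; this is not a Microsoft SDK or API.

TRANSCRIBE_USD_PER_HOUR = 0.36            # MAI-Transcribe-1 (assumed: per hour of audio)
VOICE_USD_PER_MILLION_CHARS = 22.0        # MAI-Voice-1, per 1M characters
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0  # MAI-Image-2, per 1M input tokens
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0  # MAI-Image-2, per 1M output tokens
VOICE_AUDIO_SECONDS_PER_SECOND = 60.0     # claimed MAI-Voice-1 generation throughput


def transcribe_cost(audio_hours: float) -> float:
    """Estimated USD cost to transcribe the given number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_HOUR


def voice_cost(characters: int) -> float:
    """Estimated USD cost to synthesize speech for the given character count."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for image generation from input and output token counts."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


def voice_generation_seconds(audio_seconds: float) -> float:
    """Wall-clock seconds to generate audio at the claimed 60x real-time rate."""
    return audio_seconds / VOICE_AUDIO_SECONDS_PER_SECOND


if __name__ == "__main__":
    # Hypothetical workload, chosen only to illustrate the arithmetic.
    print(f"100 h of transcription: ${transcribe_cost(100):.2f}")
    print(f"500k characters of speech: ${voice_cost(500_000):.2f}")
    print(f"1M input + 2M output image tokens: ${image_cost(1_000_000, 2_000_000):.2f}")
    print(f"10 min of audio generates in about {voice_generation_seconds(600):.0f} s")
```

Because the prices are linear per-unit rates, each estimate is a single multiplication. Volume discounts, minimum charges, or regional pricing, none of which the article mentions, would need separate handling.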

Editorial Opinion

Microsoft's announcement of these three multimodal models represents a significant expansion of its AI capabilities across speech, voice, and vision. By emphasizing speed, quality, and affordability at once, and backing those claims with production metrics and enterprise adoption, Microsoft is positioning itself to compete directly with specialized AI providers. The focus on practical applications, from enterprise creative work with WPP to developer access through Foundry, suggests Microsoft is betting on democratizing advanced multimodal AI rather than limiting it to premium tiers.

Computer Vision, Generative AI, Multimodal AI, Speech & Audio, Product Launch
