VibeVoice: Microsoft's Open-Source Voice AI Suite Reaches Hugging Face Transformers

Key Takeaways

▸ VibeVoice-ASR is now available through Hugging Face Transformers, so developers building speech-to-text applications can load it with a library they likely already use
▸ Both models handle long-form audio (60 minutes for ASR, 90 minutes for TTS); the ASR model supports more than 50 languages
▸ Continuous speech tokenizers running at a 7.5 Hz frame rate, combined with LLM and diffusion frameworks, keep sequences short enough for long-form audio while preserving fidelity

Source: Hacker News, https://github.com/microsoft/VibeVoice

Summary

Microsoft has released VibeVoice, an open-source voice AI framework that includes both automatic speech recognition (ASR) and text-to-speech (TTS) models. The latest milestone came on March 6, 2026, when VibeVoice-ASR was integrated into the Hugging Face Transformers library, making the model available to developers through a widely used library and broadening access to advanced speech processing.

The VibeVoice suite targets long-form audio processing. VibeVoice-ASR can handle 60-minute audio files in a single pass and supports over 50 languages, with speaker diarization, timestamping, and customized hotword recognition. VibeVoice-Realtime-0.5B provides real-time text-to-speech generation with support for multiple languages and speaking styles. Both models use continuous speech tokenizers operating at 7.5 Hz, combined with LLM and diffusion-based architectures, to balance audio quality against computational cost.
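To put the 7.5 Hz frame rate in perspective, the short Python sketch below works out how many tokenizer frames the quoted long-form durations would produce. It is only arithmetic on the figures in this article: the helper names are hypothetical, the 24 kHz sample rate is an assumption chosen for illustration, and the real models may emit more than one token per frame.

```python
# Back-of-the-envelope arithmetic for a 7.5 Hz speech tokenizer.
# Helper names are hypothetical; only the 7.5 Hz rate and the 60/90 minute
# durations come from the article.

FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted in the article


def frames_for_duration(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames covering `minutes` of audio."""
    seconds = minutes * 60
    return round(seconds * frame_rate_hz)


def samples_per_frame(sample_rate_hz: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Return how many raw audio samples one tokenizer frame spans."""
    return sample_rate_hz / frame_rate_hz


if __name__ == "__main__":
    for label, minutes in [("ASR, 60 minutes", 60), ("TTS, 90 minutes", 90)]:
        print(f"{label}: {frames_for_duration(minutes)} frames")

    # ASSUMPTION: 24 kHz audio, used here only to illustrate the ratio.
    print(f"samples per frame at 24 kHz: {samples_per_frame(24_000)}")
```

At this rate an hour of audio maps to tens of thousands of frames rather than the tens of millions of raw samples it contains, which is what makes single-pass long-form processing plausible for an LLM-based model.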

Since first open-sourcing the framework in August 2025, Microsoft has added fine-tuning code, vLLM inference support for faster processing, expanded multilingual capabilities, and technical reports. Microsoft pairs this open-source development with a proactive approach to misuse prevention, which positions VibeVoice as a foundational tool for voice AI research and deployment across industries.

Complete open-source suite includes fine-tuning code, vLLM optimization, and published technical reports; models are available on Hugging Face and in interactive playgrounds

Microsoft

OPEN SOURCE | Microsoft | 2026-04-28

