Microsoft VibeVoice: Open-Source Frontier Voice AI
Microsoft has introduced VibeVoice, an open-source voice AI framework that includes both speech recognition (ASR) and text-to-speech (TTS) models. VibeVoice-ASR processes long-form audio and generates structured transcriptions, while VibeVoice-TTS supports multi-speaker dialogues. Both models are intended to foster collaboration in the speech synthesis community, and VibeVoice-ASR is now available through the Hugging Face Transformers library.
- VibeVoice-ASR is a unified speech-to-text model capable of handling 60-minute long-form audio in a single pass.
- VibeVoice-TTS can synthesize up to 90 minutes of speech with support for multiple speakers.
- VibeVoice employs continuous speech tokenizers to improve audio fidelity and computational efficiency.
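The project describes VibeVoice-ASR's structured transcriptions as containing Who (speaker), When (timestamps), and What (content). As a rough illustration of what working with such output might look like, the sketch below models a transcript as a list of segments. The `Segment` class, its field names, and the helper functions are all hypothetical and are not VibeVoice's actual output schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript unit. Hypothetical field names, not VibeVoice's schema."""
    speaker: str  # Who
    start: float  # When: start time in seconds
    end: float    # When: end time in seconds
    text: str     # What


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS (fractions are truncated)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Return a readable transcript, one line per segment, ordered by start time."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        span = f"{format_timestamp(seg.start)} - {format_timestamp(seg.end)}"
        lines.append(f"[{span}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Sum the speaking duration in seconds for each speaker."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


# Made-up sample data for illustration only.
segments = [
    Segment("Speaker 2", 4.0, 9.5, "Thanks, glad to be here."),
    Segment("Speaker 1", 0.0, 4.0, "Welcome to the show."),
]

if __name__ == "__main__":
    print(render(segments))
    print(talk_time(segments))
```

Keeping speaker, time span, and text together in one record makes downstream tasks such as per-speaker summaries or subtitle export straightforward, whatever format the model actually emits.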
Opening excerpt (first ~120 words)
🎙️ VibeVoice: Open-Source Frontier Voice AI
📰 News
- 2026-03-06: 🚀 VibeVoice ASR is now part of a Transformers release! You can now use our speech recognition model directly through the Hugging Face Transformers library for seamless integration into your projects.
- 2026-01-21: 📣 We open-sourced VibeVoice-ASR, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. Try it in Playground.
- ⭐️ VibeVoice-ASR is natively multilingual, supporting over 50 languages; check the supported languages for details.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at GitHub.