State of the Art Speech Recognition with MAI-Transcribe-1
Meet MAI-Transcribe-1, the most accurate transcription model in the world across 25 languages.
Speech is the most natural way humans communicate, often in noisy environments – conference rooms, phone lines, busy streets – across many languages. Today we’re introducing MAI-Transcribe-1, a robust and efficient multilingual speech-to-text model that gives developers building global products a single model that scales well across languages, accents, and production environments. MAI-Transcribe-1 is now available on Microsoft Foundry.
Best-in-class accuracy on FLEURS
MAI-Transcribe-1 achieves the lowest word error rate (WER) among the speech-to-text models we compared. On FLEURS (25 languages), it outperforms Scribe v2, Whisper-large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite.
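For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the evaluation code behind the numbers above; the function name `word_error_rate` is ours, and it assumes plain whitespace tokenization with no text normalization (benchmark pipelines typically normalize casing and punctuation first).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly,
    with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of five reference words.
    print(word_error_rate("change my flight to tonight", "change my flight tonight"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.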
World class quality across 25 languages
The model maintains competitively high accuracy across all 25 supported languages, making it adaptable for global products and resilient to a wide range of accents or speaking styles.
*Lower is better
Incredible speed and efficiency
MAI-Transcribe-1 transcribes batch audio 2.5x faster than our current Microsoft Azure Fast offering. Speed and efficiency are essential for production workloads, so we worked hard to deliver this performance while maintaining SOTA accuracy across 25 languages.
Outstanding performance in noisy environments
Benchmarks are only part of the story. When it comes to production use cases such as voice agents, meeting transcription, and call center analytics, audio is rarely clean. MAI-Transcribe-1 was built with challenging recording conditions in mind, reliably handling background noise, low-quality audio recordings, and overlapping speech.
Cafe scenario
TRANSCRIPTION:
Hey, so I was hoping to change my flight, if that’s at all possible. It’s currently set for 10 p.m. tonight, but I’m really trying to switch to something earlier, ideally sometime before 6 p.m. Is that something we could maybe look into?
Office Scenario (in Spanish)
TRANSCRIPTION:
Bueno, ya estamos listos, ¿no? Eh, ¿podemos please checar que esté prendido my transcribe one? Sí, está. Super. Entonces, vamos a empezar. Oh, oh, someone else is joining us. Oh, hello. Please come in. Join us. We’ll switch to English, no problem. Sí, sí, sí. Bienvenido. Bienvenido.
Concert scenario
TRANSCRIPTION:
Okay, listen. I have this absolutely unhinged idea, and I need you to roll with me. Help me make an agent that will literally buy tickets for my favorite band the second they are available.
The best price-to-performance of any large cloud provider
We are passing efficiency gains directly to customers: MAI-Transcribe-1 is priced at $0.36 per hour of audio, setting the standard for quality, speed, and price for production ASR.
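To get a feel for what $0.36 per hour of audio means for a workload, here is a rough estimator. It assumes cost scales linearly with audio duration and rounds to the nearest cent; actual billing granularity, minimums, and rounding are not stated here, so treat the helper `estimate_cost_usd` as a hypothetical back-of-the-envelope tool rather than a billing calculator.

```python
from decimal import Decimal

# Price quoted in this post, in US dollars per hour of audio.
PRICE_PER_AUDIO_HOUR_USD = Decimal("0.36")


def estimate_cost_usd(audio_seconds: float) -> Decimal:
    """Estimate transcription cost for a given amount of audio.

    Assumption: linear proration by duration, rounded to whole cents.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = Decimal(str(audio_seconds)) / Decimal(3600)
    return (hours * PRICE_PER_AUDIO_HOUR_USD).quantize(Decimal("0.01"))


if __name__ == "__main__":
    # A 30-minute meeting, then a 1,000-hour audio archive.
    print(estimate_cost_usd(30 * 60))
    print(estimate_cost_usd(1000 * 3600))
```

Under these assumptions, a 1,000-hour archive comes to about $360.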
Powering Microsoft Products
MAI-Transcribe-1 is in phased rollouts with Copilot’s Voice mode and Microsoft Teams to provide accurate conversation transcripts that can be used for various downstream tasks.
Build with MAI-Transcribe-1
MAI-Transcribe-1 is now in public preview on Microsoft Foundry.
You can also experience MAI-Transcribe-1 in the newly launched Microsoft AI Playground.
MAI-Transcribe-1 delivers latency low enough for a wide range of use cases while providing very high accuracy.
Offline applications
MAI-Transcribe-1 supports a wide range of applications, from media and content tasks such as subtitle generation, podcast transcription, and video accessibility, to enterprise needs such as meeting archives, compliance recording, and legal discovery. It can also power analytics workflows, including call center QA, customer insight extraction, and searchable audio libraries, as well as large-scale data pipelines for processing audio archives used in ML training, search indexing, and summarization.
Online applications
Low latency also makes MAI-Transcribe-1 a good choice for real-time tasks such as meeting transcription, video closed captioning, and dictation.
Voice Agents: The complete stack
If you’re building a voice agent, MAI-Transcribe-1 is the foundational layer. Accurate transcription is what allows underlying LLMs to interpret intent effectively. It directly shapes user satisfaction and task completion rates.
By combining MAI-Transcribe-1 (speech-to-text) with MAI-Voice-1 (text-to-speech) and your chosen LLM, you can build a robust solution to power voice experiences.
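The shape of that stack can be sketched as a single conversational turn: audio in, transcript, LLM reply, audio out. The code below is only a structural illustration. `run_turn` and the three callable types are hypothetical names of ours, and the stand-in functions do not call MAI-Transcribe-1, MAI-Voice-1, or any LLM; in a real agent each would wrap the corresponding service client.

```python
from typing import Callable

# Hypothetical interfaces for the three stages of a voice agent turn.
Transcriber = Callable[[bytes], str]   # speech-to-text
Responder = Callable[[str], str]       # LLM: user text -> reply text
Synthesizer = Callable[[str], bytes]   # text-to-speech


def run_turn(
    audio_in: bytes,
    transcribe: Transcriber,
    respond: Responder,
    synthesize: Synthesizer,
) -> bytes:
    """Run one turn: transcribe the user, generate a reply, speak it."""
    user_text = transcribe(audio_in)
    reply_text = respond(user_text)
    return synthesize(reply_text)


if __name__ == "__main__":
    # Stand-in stages so the sketch runs without any external service:
    # "audio" is just UTF-8 text encoded as bytes.
    fake_transcribe = lambda audio: audio.decode("utf-8")
    fake_respond = lambda text: f"You said: {text}"
    fake_synthesize = lambda text: text.encode("utf-8")

    print(run_turn(b"change my flight", fake_transcribe, fake_respond, fake_synthesize))
```

Keeping each stage behind a plain callable makes it easy to swap the LLM or test the pipeline with stubs. It also makes the dependency explicit: any transcription error flows straight into the LLM's input, which is why accuracy at the first stage matters so much.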
Model Card