Introducing MAI-Voice-2

June 2, 2026

Models

Superintelligence team

Today we’re launching MAI-Voice-2 — the most expressive, natural-sounding text-to-speech model we’ve built to date. It’s a significant leap from its predecessor across every dimension that matters to production voice experiences: fidelity, language coverage, speaker consistency, and emotional range. It is built for the products and services where voice quality directly impacts user experience: assistants or customer support that represent your brand, audiobooks that hold attention over hours, and accessibility experiences where voice is the only interface. It’s also built with responsible deployment in mind, with consent guardrails ensuring the technology is as trustworthy as it sounds. MAI-Voice-2 is now available in Microsoft Foundry, and is being integrated into VSCode and the Dynamics 365 Contact Center.

Features and capabilities

Expanding from English‑only to 15 languages while maintaining the same naturalness and expressiveness as English.
Granular emotion control via emotion tags: sad, whispered, excited, etc.
Zero-shot voice prompting using 5-60s of reference audio available for all supported languages, with built-in consent guardrails.
MAI-Voice-2 is preferred over its predecessor MAI-Voice-1 72% of the time.
Stable speaker identity across long-form content – audiobooks, podcasts, lectures.
Code-switching capabilities for select language pairs — such as Hindi-English and Spanish-English — matching the way users naturally mix languages in everyday speech.

Hear it for yourself:

English (emotion: Embarrassed)

So I was just standing there, right? And then (sigh) oh my God, she actually said it to his face. I mean, honestly, good for her.

German (emotion: Confused)

Häh? Warum schicken dir mir eine Mahnung? Das macht keinen Sinn. Ich hab das doch schon vor zwei Wochen bezahlt.

Hindi (emotion: Excited)

अरे यार धीरे बोल, कोई सुन लेगा तो पूरा surprise ही लीक हो जाएगा! इतने साल बाद मुंबई में उससे मिलने वाला हूँ.दिल full Bollywood-mode में है

English (role: Motivational Trainer)

Alright, time to focus. Notice how the egret doesn’t rush the moment, it studies it. Every movement is deliberate, every pause intentional. That’s discipline. That’s control. So when the opportunity appears, you can strike without hesitation. Patience earns the catch.

English (role: Sports Commentator)

With everything on the line, the egret makes its move! Slow through the shallows… watching… waiting… And it’s a sudden strike! Got it! Incredible precision from the long beak! The fish never saw it coming. What a scene! Complete composure under pressure. A masterclass performance here in the pond tonight.

Performance

MAI-Voice-2 generates very natural speech in a controllable way. In side-by-side preference tests, it was preferred over its predecessors 72% of the time. In speaker similarity evaluations, speech generated by MAI-Voice-2 is indistinguishable from recordings of the same voice. Below, you can verify this yourself by trying to identify where the human speech ends and the MAI-Voice-2 output begins.

Bar chart showing MAI-Voice-2 with a 72.1% win rate and MAI-Voice-1 with a 27.9% win rate for overall quality preference out of 2,500 listening tests.

Bar graph showing that, on average across 11 languages, 45.5% of listeners preferred MAI-Voice-2 generated speech, 44% preferred real human recordings, and 10.5% resulted in a tie, out of 2,222 responses.

Guess the human recording vs. MAI‑Voice‑2

Listen to the audio clips below – each blends human recordings with speech generated by MAI‑Voice‑2. Can you tell where the human voice ends and the synthetic voice begins, or vice versa? Or does it sound like one continuous voice?

Human recorded + TTS

Language: English US

Human recorded + TTS

Language: Hindi (India)

Human recorded + TTS

Language: Spanish (Mexico)

TTS + Human recorded

Language: French (France)

Human recorded + TTS

Language: German (Germany)

Supported Languages

We prioritized depth across 15 languages, ensuring for supported languages we support a spectrum of expressive capabilities spanning tonal, pitch accent, stress timed, and syllable timed systems. We plan to continue expanding and refining the expressive range for all supported languages.

MAI-Voice-2 now supports the following languages/locales: English (US), English (Australia), Italian, French, German, Hindi, Spanish (Spain), Spanish (Mexico), Portuguese (Brazil), Portuguese (Portugal), Korean, Chinese (Simplified), Turkish, Russian, Thai, Dutch, Romanian and Hungarian.

In markets where people naturally mix languages, we support code-switching – notably Hindi–English and Spanish–English – reflecting how people actually speak. In internal testing, the model switches languages mid sentence fluidly, without losing prosodic naturalness nor speaker identity.

Hindi + English

Oh my god, just look at this gorgeous sunset! क्या तुमने कभी ऐसा beautiful sky देखा है? It looks just like a painting, with all these stunning colours… गुलाबी, नारंगी, बैंगनी। It’s literally magical 

Spanish (Mexican) + English

Quesadillas, tacos, enchiladas, y guacamole are staples of Mexican cuisine, pero también incluyen ingredients like cilantro, jalapeños, and queso fresco for authentic, traditional, regional preparations. 

Voice Synthesis

Developers can create a custom voice in Microsoft Foundry across all supported languages using just a short reference clip – no retraining or fine tuning required. With only a few seconds of audio (recommended: 5–60 seconds), MAI Voice 2 can generate high quality speech that matches the speaker’s identity, making it easy for companies to bring their own brand voice into products without maintaining a separate voice model.

Consent and Safety

Consent is enforced at the system level: only authorized, licensed voices can be synthesized in production. No unlicensed voice cloning is possible. To gain access to this feature apply here.

Use Cases

Assistants: Branded voices for Copilot, apps, devices, customer support.
Entertainment: Characters for games, podcasts, audiobooks, AR/VR.
Accessibility: Narration for visually impaired users; voice for speech impairments.
Education: Instructors and characters for courses and simulations.
Creators: Turn text into audio with your own voice. No studio required.

Try it out

DuoAI

DuoAI is an experimental experience that gives you a direct way to try MAI‑Voice‑2, MAI‑Transcribe‑1.5, and MAI‑Image‑2.5 models in action – showcasing natural, fluid, expressive dialogue. In the demo, you can engage in a three‑way conversation with two agents and even generate images using MAI‑Image‑2.5. It’s a practical preview of how MAI multimodal models work together to build powerful, customizable voice agents. Try DuoAI now

Note: DuoAI is not meant to showcase the capabilities of the underlying LLM – that component is modular and can be swapped as needed.

You can also explore the models directly in the MAI Playground.

Learn more about MAI-Voice-2

Model card [Link]
Foundry API documentation [Link]
Cookbook [Link]

Build the Future With Us

We’re a lean, fast-moving lab made up of some of the world’s most talented minds. We have an exciting roadmap of compute at MAI, with our next-generation GB200 cluster now operational. And we have an ambitious mission we truly believe in. We’re also fortunate to partner with incredible product teams giving our models the chance to reach billions of users and create immense positive impact. If you’re a brilliant, highly-ambitious and low ego individual, you’ll fit right in—come and join us as we work on our next generation of models!

Explore all jobs

MAI-Image-2.5 launches at No. 2 for image editing on Arena

June 2, 2026

Models

Superintelligence team

MAI-Image-2.5 is our strongest image model yet – and now ranks No. 2 on Arena’s Image Edit leaderboard, ahead of Nano Banana 2.¹

Built for high-quality generation and precise, controllable editing, it brings production-ready image workflows to developers and Microsoft products.

Today, we’re launching MAI-Image-2.5 for maximum fidelity, and MAI-Image-2.5-Flash for fast, scalable production workloads.

Features and capabilities

Step-change in text-to-image quality

MAI-Image-2.5 produces more detailed, coherent images from prompts, with stronger text rendering, product imagery and prompt adherence.

Complex visual reasoning

The model understands scene structure, lighting, scale, and spatial relationships, helping it make edits that fit the image context, such as adding an object with the right perspective and shadows.

Fine-grained edit control

MAI-Image-2.5 supports precise, localized edits, from replacing an object or updating text to removing motion blur, without changing the rest of the image.

Face and identity consistency

MAI-Image-2.5 preserves facial identity across edits, maintaining recognizable likeness even through changes in pose, expression or viewpoint.

Benchmarks

MAI-Image-2.5 achieves Arena scores that surpass GPT-Image-1.5 and Nano Banana Pro 2K, ranking No. 3 for text-to-image and No. 2 on Arena’s image-editing leaderboard.

Across these evaluations, MAI-Image-2.5 demonstrates leading performance in image generation and editing, with strong results across prompt adherence, visual quality, and controlled image modification.

Arena Model Scores

Figure 1. MAI-Image-2.5 Arena scores across all text-to-image categories, compared against MAI-Image-2 and MAI-Image-1 as of June 1st 2026. MAI-Image-2.5 delivers an overall +75 point improvement over MAI-Image-2, with the largest gains in Text Rendering (+107) and Cartoon, Anime & Fantasy (+90).

Bar chart showing MAI-Image-2.5 performance in editing tasks. Green bars indicate it wins most categories like image cleanup, backgrounds, shadows, and text, while competitor wins are fewer. Ties appear in some categories.

Figure 2. MAI-Image-2.5 win rates across 12 editing categories on Arena, evaluated via blind human preference judging against all active models from May 31st to June 1st. Each bar shows the share of matches won by MAI-Image-2.5 (green), won by the competitor (light brown), or judged a tie. Categories are sorted by MAI-Image-2.5 net advantage, defined as (win % minus loss %) descending.. Only categories with ≥100 judged matches are shown; matches where both outputs were rated poor are excluded.

Powering Microsoft products

MAI-Image-2.5 is live on PowerPoint for high-quality image generation and rolling out to OneDrive for precise editing.

In PowerPoint, users can generate presentation-ready visuals and slides from prompts, turning ideas into polished decks faster.

In OneDrive, users can make precise photo edits – removing unwanted distractions, cleaning up backgrounds, and enhancing images while preserving the original scene.

White text on a brown background reads "MAI" in the top left and "Edit MAI-Image-2.5 in OneDrive" in large letters at the bottom left.

Best price-to-performance models

MAI-Image-2.5 is available to developers in Foundry today, delivering premium quality and fine-grained editing control at $5 per 1M text input tokens, $8 per 1M image input tokens, and $47 per 1M image output tokens.

MAI-Image-2.5-Flash offers faster, lower-cost generation and editing at $1.75 per 1M text input tokens, $1.75 per 1M image input tokens, and $19.50 per 1M image output tokens.

Together, they give customers the flexibility to optimize production image workflows for fidelity, speed, or cost, while delivering leading price-to-performance on Arena score.

Safety and limitations

MAI-Image-2.5 includes layered safety guardrails, including prompt and output filtering, to help detect and block harmful or policy-violating content.

Like all image models, MAI-Image-2.5 can reflect biases in its training data and may produce plausible but inaccurate or misleading visual details. Generated images should be reviewed before use in sensitive contexts, including identity, legal, medical, financial, or news-related workflows.

Try it out

MAI-Image-2.5 and MAI-Image-2.5-Flash are now available to developers in Foundry, bringing high-quality image generation and precise, controllable editing to production workflows.

You can also try the models directly in the MAI Playground.

OpenRouter is also making MAI-Image-2.5 available to its developer community:

“We’re excited to bring Microsoft’s MAI models to OpenRouter. MAI-Image-2.5 is one of the strongest image models available today, and expands the set of multimodal capabilities available to developers on OpenRouter. Our goal is simple: when great new models launch, the 9 million developers building on OpenRouter should be able to use them immediately through the same API they already use.”
– Alex Atallah, CEO, OpenRouter

As of June 2, 2026.

Build the Future With Us

Explore all jobs

Our values in operation: Health

June 2, 2026

Partnerships

Superintelligence team

Health is one of the most consequential domains for artificial intelligence. People are increasingly turning to AI for help in moments that matter: interpreting a confusing test result at midnight, deciding whether to seek urgent care, or supporting a loved one through a chronic condition. Better information and support can improve outcomes and make care more equitable.

This is where we demonstrate that our values – Kindness, Trust, Quality, Simplicity, Safety & Security, and Evaluation – are not just marketing slogans. They are operational principles that shape product design, engineering trade-offs, and the partnerships we pursue. We listen closely, test rigorously and improve constantly. That commitment shows up in how we build and evaluate models, how we present answers, and how we design user experiences that reduce anxiety rather than amplify it.

Our research on how people are using our AI to support their health makes the opportunity clear. Across Microsoft AI consumer products we handle tens of millions of health questions every day, and a recent deep analysis of half a million Copilot conversations shows people rely on AI for symptom interpretation, test result explanation, and practical care navigation. That scale and intimacy make health a uniquely important focus for MAI: the potential upside is enormous, and the responsibility is correspondingly high.

Our progress to date demonstrates our values in action. With Microsoft AI Diagnostic Orchestrator (MAI DxO), we showed how an AI system could significantly improve clinical reasoning in complex cases. In research settings, MAI DxO gets to the right answer to some of medicine’s most diagnostically complex and intellectually demanding cases, at an accuracy level above baseline models and the physician cohort we tested against.

Building on this work, our strategic collaboration with Mayo Clinic offers the prospect of making expert medical intelligence accessible to millions more who need it. Together we are developing a frontier AI model designed specifically for healthcare, bringing together Mayo Clinic’s world-leading healthcare expertise, clinical health data and longitudinal insights with our advanced AI, cloud, engineering and superintelligence capabilities. The resulting model will first be deployed within Mayo Clinic’s own environment, the world’s leading hospital system, and will aim to support a broad range of clinical reasoning and healthcare use cases, including earlier diagnoses, more personalized treatment decisions and better patient outcomes. Once validated, organizations worldwide will be able to access the frontier AI health model to better support patients, clinicians and consumers. The health model itself will be owned by Mayo Clinic, reinforcing our mutual commitment to patient trust, clinical rigor and responsible stewardship of clinical data and AI.

Advances in health intelligence must be matched by quality in product development and real world deployment. We recently opened Copilot Health in preview for Microsoft 365 Personal, Family, and Premium subscribers in the US – a phased rollout that will let us learn from real users while maintaining rigorous safety and evaluation. Copilot Health brings together your health records, wearable data, and health history into one place, then applies intelligence to turn them into a coherent story. Delivering value means designing for people: simple interfaces, clear citations, and explanations that help users act. We build trust by designing with representative users like the National Health Council while layering in safeguards and pursuing independent verification: Copilot Health’s development and governance is informed by hundreds of clinicians and external reviewers, and our processes have achieved ISO/IEC 42001 certification for AI management systems. These efforts compound: we expect the frontier AI model for healthcare we create with Mayo Clinic to enhance Copilot Health’s capabilities.

The work ahead is substantial, but the direction is clear. By grounding our work and partnerships in our core values, we have the potential to create trusted AI experiences that can help millions of people when it matters most. With rigorous science, intentional partnerships, and a relentless focus on the people we serve, we are committed to making that future real.

Microsoft Build 2026: MAI keynote transcript

June 2, 2026

Build

Superintelligence team

Since I started working in AI, the compute that we use to train frontier models has increased by a trillion-fold. That’s 12 orders of magnitude in just 15 years.

It’s now clear that a consistent, exponential increase in computation leads to predictable advances in AI capabilities.

In the next few years, we’ll see three more orders of magnitude of compute applied to frontier models.

Intelligence is now a function of compute. Log linear hill-climbing has become the norm. The scaling laws are holding. These are truly extraordinary times.

In this context we at MAI are building towards what we call Humanist Superintelligence.

State of the art AI capabilities that are explicitly designed to serve people and organizations, and not to replace them.

Because the type of AI we build really matters. We need an AI that places humanity first.

That always prioritizes human well-being and human progress.

This is the core philosophy and motivation behind our superintelligence efforts at Microsoft. It shapes everything we do.

And as a platform company, our job – and our commitment – is to keep you developers building at the absolute frontier.

So today we’re very excited to announce a family of seven new models across image, voice, transcription, thinking, and coding.

These are all built with real attention to detail and a commitment to making practical, efficient tools that are tuned for how you work in the real world.

First up: MAI-Image-2.5 and its Flash variant – two super strong models that deliver a step change in quality, now at number 2 on the leaderboards, surpassing the score of Nano Banana 2 on image editing¹

They deliver precision editing with incredible control and consistency.

Flash is for super-efficient production workloads at scale, while 2.5 gives maximum fidelity and professional grade performance.

They’re live in PowerPoint, rolling out to OneDrive², and today they’re also landing on Foundry with market-leading quality per dollar.³

Then we have MAI-Transcribe-1.5. This is the best transcription model in the world, with SOTA accuracy across 43 languages, beating out Gemini and OpenAI models.⁴

We’ve optimized it for real world use so you can produce highly accurate transcripts for your bespoke use cases up to 5x faster than rival models.⁶

It’s now being integrated into Copilot, Teams, GitHub, and Dynamics 365 Contact Centre – and it’s also available in Foundry, where it’s the fastest, most efficient and most cost-effective transcription model of any hyper-scaler.⁵

Paired with that we have MAI-Voice-2, our latest speech generation model.

It has beautiful prosody, native-sounding delivery and fine-grained emotional control, and its available in 15 languages with lots more coming soon.

We’re also announcing Voice-2-Flash – which provides the best value and speed for ultra latency-sensitive Voice Agents, the big thing in 2026.

Next up is our text foundation model, MAI-Thinking-1, our first reasoning model.⁷ It’s exceptionally strong in our target use cases of reasoning and SWE tasks.

It’s a 35B active parameter MoE with a 256K context window. That means it competes in the medium-sized weight class, where it’s certainly punching above its weight.

Independent human raters on Surge prefer it for overall quality in blind side-by-sides versus Sonnet 4.6.

And it’s achieved 97% on AIME 25, the key measure of its general-purpose reasoning abilities.⁹

Most importantly, it’s at 53% on SWE Bench Pro, placing it right alongside Opus 4.6 on one of the toughest coding benchmarks.⁸

We’ve got plenty more work to do as we get it into production so we can hill-climb on many more real-world coding tasks.

What’s most remarkable is that this model has climbed entirely from the bottom, without specifically targeting any of these benchmarks, and with zero distillation.

This is critical. Because it means the model is created with an enterprise-grade, clean and commercially licenced data lineage that you can trust, and put into production with complete confidence.

Finally, I’m incredibly excited to announce MAI-Code-1-Flash, our new inference efficient coding model, especially tuned for VS Code and GitHub Copilot CLI.

Code-1-Flash achieves 51% on SWE Bench Pro¹⁰, despite having just 5B parameters, putting it closer to Haiku in size but cheaper in cost¹¹, but delivering very strong coding performance, and great inference efficiency. It’s rolling out today as one of the default models in VS Code.¹²

Alongside distribution on Foundry and optimization for our 1P products, our models are also going to be widely available for developers on Open Router, as well as Fireworks and Baseten.¹³ This means for the first time you will be able to tune the weights directly yourself.

Across this entire family, safety and security are built in from the start.

Voice models include protections against unauthorized cloning, and all outputs are watermarked.

We’ve reduced over-refusals and improved representation, including for people with disabilities.

We’re also publishing a detailed technical report alongside this release to give you a full and transparent understanding of how we put this together.

And we’re also co-designing our models with our own silicon, optimizing MAI-Thinking-1 on our Maia 200 chip and benchmarking it head-to-head against the GB200.

On top of the 30% improvement that Satya just mentioned, we’re seeing a further 1.4x performance-per-watt gain when running our MAI models on the Maia 200 end to end. This is huge.

Every watt counts at this scale, and silicon-model co-design is a key advantage, helping us deliver you the most efficient thinking and coding agents out there. We’re also super-excited these faster and more efficient MAI models are coming to the N1X Satya mentioned, to deliver you the best performance on Windows.

This is what owning the full stack looks like.

It’s the foundation for Microsoft Frontier Tuning.

Letting you customize MAI models, using our full-stack hill-climbing machine.

It means disciplined, relentless engineering, on a platform you can trust, working on your behalf to create custom agents that you will control.

And of course, the really big thing to happen this last year is RLEs, reinforcement learning environments, unique training gyms for AIs.

They create company and task specific agents, adapted only to you, built on MAI models.

For example, within Microsoft we use our RLEs and MAI models to climb towards the best agentic use cases for Excel.

Our MAI tuned model is comparable to GPT 5.4 on public and private benchmarks, while being up to 10X more efficient.¹⁴

Other early adopters are seeing similar results.

When we tuned our models for McKinsey’s tasks, MAI delivered the highest win rate, outperforming GPT-5.5 on quality, whilst being 10x lower on cost.

This is the advantage of carefully calibrated Frontier Tuning.

And unlike with some of other companies, with MAI you don’t rent intelligence from a shared model that learns from everyone.

Only you keep the benefits of your hard-earned workflows, know-how, data and institutional knowledge.

Only you control the resulting model. So with us, the RLEs and the models you build inside of them become your moat.

This is distinct. It marks a new era in AI.

One final announcement that I’m very excited about.

We’re taking customization, and co-creation of models, to the highest levels possible, on one of the most important applications of AI: healthcare.

We’re proud to be partnering with Mayo Clinic to jointly develop a new frontier model for health, and then deploy it in their world-leading hospital system.

So today marks some very exciting steps on our journey to create humanist superintelligence at Microsoft.

We have an incredible roster of seven new world-class models to keep you all the frontier.

And we’re looking forward to working with you all to co-create unique AI agents, adapted to you.

This is a new era for all of us.

An era of AI that you control on your terms.

Let’s build it together.

Thank you everyone.

Footnotes

Source: Arena Image Edit (Single-Image Edit) leaderboard, as of June 2nd, 2026. Arena lists MAI-Image-2.5 at 1403±9 Arena Score, ahead of Gemini 3 Pro Image Preview 2K at 1388±3 and Gemini 3.1 Flash Image Preview / Nano Banana 2 at 1389±4.
MAI-Image-2.5 model page
Price-to-quality benchmark uses public Arena scores for text-to-image and image-to-image leaderboards and public API pricing from Microsoft, OpenAI, and Google as of May 2026. Cost is normalized to estimated price per 1,000 1024×1024 image generations
Accuracy (Word Error Rate) measured on FLEURS benchmark using the ‘test’ split. FLEURS is a standard public multilingual dataset that we used internally for competitive evaluations. The model achieves state-of-the-art average WER across 43 languages and leads in 18 of them – outperforming GPT-4o-Transcribe, Scribe v2, and Gemini 3.1 Flash Lite. See more at Transcribe 1.5 model page
See more in the model card, which includes pricing details. MAI-Transcribe-1.5 is a batch transcription model, and comparisons are made against similar systems. Benchmarked hyperscalers include Gemini and Azure Speech.
Speed measured by the Artificial Analysis ‘Speed Factor’ benchmark for Speech-to-Text models, whose methodology counts the number of audio-seconds transcribed per second. See the Transcribe 1.5 model page for details, including our comparison of MAI-Transcribe-1.5’s AA Speed Factor against top competing models.
MAI-Thinking-1 model page
Scored 52.8% on SWE Bench Pro
AIME 2025 Benchmark Dataset
Numbers from internal benchmark system with harness, using SWE Bench Pro. We outperform Haiku 4.5 across multiple benchmarks – see more in MAI-Code-1-Flash blog post
In the new token-based billing in GitHub Copilot, MAI-Code-1-Flash is priced cheaper than Claude Haiku 4.5. See pricing details.
MAI-Code-1-Flash is now rolling out to ~10% of individual users as a starting point and users who select Auto in VS Code model picker may get routed to the model.
Baseten x MAI-Thinking-1, OpenRouter x MAI-Image-2.5, OpenRouter x MAI-Voice-2, OpenRouter x MAI-Transcribe-1.5
We project a 10x improvement in output tokens per dollar from the fine-tuned MAI model compared to GPT 5.5, and up to ~10x improvement for GPT-5.4. This estimate is based on public GPT pricing and MAI pricing data scaled across model sizes using serving-cost differentials.

Introducing MAI-Transcribe-1.5

June 2, 2026

Models

Superintelligence team

Today we’re launching our MAI-Transcribe-1.5, the most accurate multilingual speech-to-text model with a best-in-class Word-Error-Rate (WER) across 43 languages.

This latest model has expanded the range of languages available without compromising accuracy and quality.

It’s now being integrated into Copilot, Teams, GitHub, and Dynamics 365 Contact Centre – and it’s also available in Foundry, where it’s the fastest, most efficient and most cost‑effective transcription model of any hyper-scaler.

Features and Capabilities

SOTA accuracy as shown on FLEURS multilingual transcription benchmark, and #3 on the Artificial Analysis leaderboard.
Leading accuracy x speed on Artificial Analysis leaderboard.
Expanded language coverage from 25 to 43.
Can transcribe an hour of audio in under 15 seconds. Up to five times faster on long audio than Gemini 3.1, Scribe v2, GPT-4o-Transcribe.
Includes Keyword Biasing, enabling the model to be aware of domain specific terminology which improves WER by up to 30% on FLEURS.
Optimized for real-world use cases such as being able to handle transcription with noisy backgrounds.

Accuracy

We expanded coverage by 18 new languages without compromising accuracy. On FLEURS – the standard multilingual benchmark – we have achieved best-in-class Word Error Rate across 43 languages, maintaining our position as the most accurate model on the benchmark.

A table showing word-error-rate percentages for 32 languages across four AI models: MALT Transcribe, Sonix v2, OpenAI 3.1 Flash Lite, and GPT-4o Transcribe. GPT-4o has the lowest average error rates for most languages.

On the Artificial Analysis leaderboard we achieved a Word Error Rate of 2.4%, achieving #3 position in a very competitive open benchmark.

Speed

MAI-Transcribe-1.5 is now a leader in terms of accuracy x speed on the Artificial Analysis leaderboard, running up to 5x faster than models of comparable accuracy.

This is particularly impactful when transcribing long audio files, as the model can transcribe an hour of audio in under 15 seconds.

Table comparing MAI-Transcribe-1 and MAI-Transcribe-15 for FLEURS (both ranked #1), overall WER (2.6% vs 2.4%), and transcription speed (both transcribe 1 hour of audio in 53 seconds).

Keyword biasing

A major challenge for many transcription models is when they fail on domain specific words, which often matter the most to users. These often include people and product names, medical terms, internal acronyms, and customer-specific vocabulary which are critical for enterprises.

MAI-Transcribe-1.5 can now bias its predictions toward a list of domain specific keywords provided by the user. The model does not blindly force matches, it uses the shared context to decide when keyword biasing should apply. This dramatically improves recognition of specialized vocabulary while maintaining accuracy on general speech.

When using the keyword biasing, we observe a 30% reduction in Word-Error-Rate (WER) on the FLEURS multilingual benchmark.

English

Without keyword biasing
So, um, for the next phase, Sean will, uh, take care of the documentation. Oif, right, uh, she’ll handle the user testing sessions. Societal is, um, leading the workflow design. Soren will, uh, set up the analytics, and Niamh is going to coordinate the deployment timeline.

With keyword biasing
List of keywords: “Aisling, Shaun, Xochitl, Ljubiša, Søren, Siobhán, Jorge, Nguyễn Phúc, Aoife, Tadhg, Ghislaine, Niamh, Szczepan, Eoin, Kseniya, Wojciech, Xavier, Maoz”
So, um, for the next phase, Shaun will, uh, take care of the documentation. Aoife, right, uh, she’ll handle the user testing sessions. Xochitl is, um, leading the workflow design. Søren will, uh, set up the analytics, and Niamh is going to coordinate the deployment timeline.

What’s next

Diarization – the ability to identify who said what in multi-speaker audio – essential for meetings, interviews, and call center analytics.
A native streaming API, enabling real-time transcription for live applications and voice agents, moving beyond the current batch-first approach.
Expanded language support – giving each new language the same depth of accuracy and robustness as the existing 43 languages.

Try it out

You can also explore the models directly in the MAI Playground.

Learn more about MAI-Transcribe-1.5

Model card [Link]
Foundry API documentation [Link]
Cookbook [Link]

Build the Future With Us

Explore all jobs

Building a hill-climbing machine: Launching seven new MAI models

June 2, 2026

Announcements

Mustafa Suleyman

Updated as of June 8, 2026.

Today we are announcing a family of seven new models developed in-house at Microsoft AI. Beyond these models, we’re building a superintelligence lab – a system and an approach we believe will define the next phase of AI.

This is an extraordinary time in technology. The compute used to train frontier models has increased by a factor of one trillion. Now we expect another thousand-fold increase over the next three years, which in turn means more advanced capabilities, and the continued rollout of ever more effective AI.

This epic compute ramp will change the nature of work, business and daily life. We all have to prepare for this reality. Our job at MAI is to help you do this – to push the frontier, and to build a hill-climbing machine to keep you at the frontier.

Here’s our first steps along the way.

Our Models

Our new models across image, voice, transcription, coding, and reasoning, together form the MAI model family: a multimodal ecosystem designed to work across the kinds of tasks that matter in the real world.

MAI-Thinking-1, Microsoft AI’s flagship reasoning model. It is a medium-sized model that stands among the strongest models in its weight class: it matches leading models on key software engineering benchmarks, and demonstrates advanced mathematical reasoning capabilities, and is preferred to Sonnet 4.6 in our blind human side-by-side evaluations. We trained it from the ground up on clean data, without distillation from third-party models.
MAI-Code-1-Flash is an inference-efficient agentic coding model. This model is tailor-made for and deeply integrated into GitHub Copilot, VS Code and the Microsoft stack, and, with 5 billion active parameters, is comparable to Haiku but cheaper.
MAI-Image-2.5 including its ultra-efficient Flash variant, supports both world-class text-to-image and image editing, surpassing the Arena score of Nano Banana Pro.
MAI Transcribe-1.5 is the best transcription model in the world, with SOTA accuracy. It’s five times faster than competing models, with built-in support for domain-specific terminology across 43 languages.
MAI-Voice-2 brings high-quality, natural-sounding speech generation across 15 languages, with the ability to adapt to a voice from a short sample, alongside strong safeguards against misuse. MAI-Voice-2-Flash, coming soon, does it in a lower cost, ultra-efficient package.

Alongside distribution on Foundry and optimization for our 1P products, our models are also going to be widely available for developers on OpenRouter, as well as Fireworks and Baseten. For the first time developers will be able to tune the weights of the model themselves.

All these models share the same infrastructure and the same commitment to clean, enterprise grade data lineage. We don’t distill from other labs and we don’t rely on opaque data. Our datasets are clean, traceable, and enterprise-grade. They are designed to work together, and to integrate directly into the products people use every day. But the models themselves are only part of the story.

The most important shift lies in what you can do with them.

Adapted for you

AI is moving into a new phase. With reinforcement learning in real-world environments, AI can fully adapt to the specifics of a given workflow for the first time. We call this Microsoft Frontier Tuning. We think it’s the future for how AI shows up. Learn more in the Microsoft 365 blog post.

We think it’s the future for how AI shows up.

In this set-up the most valuable data is yours: the trace of real work an agent completes, the sequence of steps, the decisions, the actions taken that define how tasks actually get done inside an organization.

Our reinforcement learning environments (RLEs) allow your MAI models to learn directly from your workflows. Think of them as training gyms for AI, accessible only to you.

With Frontier Tuning, you’re building your own model, trained on your data, within your environment, controlled by you. Your institutional knowledge becomes part of the model, and it stays yours. What’s even better is that this adaptation then drives efficiency and performance.

Across Microsoft and with customers, Frontier Tuning is showing that custom models are both better and more efficient: our MAI tuned model for Excel matches GPT 5.4 while being up to 10× more efficient. Early adopters are seeing similar gains at the frontier. When tuned for a market-leading organization’s exacting enterprise standards, MAI achieved the highest win rate of any model tested at roughly 10× lower cost.

Developers and businesses have been crying out for AI that delivers on their terms and under their say. We see this as a major step towards delivering that.

Frontier health intelligence with the Mayo Clinic

There are a number of high importance, high sensitivity domains – like health – that require even more intense collaboration. That’s why today we’re also announcing that Microsoft and Mayo Clinic are collaborating to co-create a frontier AI model for healthcare that brings together Mayo Clinic’s world-leading clinical expertise, de-identified clinical data and longitudinal insights with Microsoft’s foundational AI capabilities.

This model will be designed to excel at the broadest scope of clinical reasoning and healthcare use cases, reaching a level that today’s general-purpose systems simply cannot match.

The model will first be deployed within Mayo Clinic’s own environment, the world’s top hospital system, where we expect it to enable a broad range of capabilities, including earlier and more accurate diagnoses and treatment planning. Once validated, the model will be made available to other organizations via Microsoft Foundry, making Mayo Clinic’s expertise accessible to many more who need it.

The frontier AI model will be owned by Mayo Clinic, reinforcing our mutual longstanding commitment to patient trust, clinical rigor, safety and responsible stewardship of clinical health data and AI.

Our Lab

At Microsoft AI, we recognize that there are no shortcuts to the frontier. We train our reasoning models from scratch. We don’t distill from other labs and we don’t rely on unlicensed or opaque data. Our datasets are clean and appropriately licensed. Every component of the system, from architecture to training pipeline to post-training, we built ourselves. We co-design with our own Maia 200 silicon, and are already seeing a 1.4x efficiency boost from these efforts. This is all about long term self-sufficiency for Microsoft and our partners. It’s about models you can trust.

The goal here is to build what we think of as a hill-climbing machine: an organization that can continuously improve, cycle after cycle, as we apply more compute, better data, and sharper evaluation.

We think scientific rigor is critical to this. That’s why for everything we do, we ablate, we measure, we document. We invest heavily in data pipelines. We work in small teams with falsifiable goals over short time periods, matching velocity with quality, focus with ambition. And we are committed to transparency. We want to bring you with us on this journey. That’s why today we are publishing in depth safety and technical reports.

Humanist Superintelligence

This is what we are building in MAI: a family of models, with new versions available now. A lab built on first principles, focused on long-term capability. And a new approach to tuning and ownership that we believe will define the next phase of AI.

Our ultimate goal is what we call Humanist Superintelligence. That means advanced AI systems designed to serve people and organizations, not replace them. These systems must remain tools, shaped by human intent, accountable to human oversight, and ultimately subordinate to human goals. People – you – must always remain in control.

Over the next year, stand by for a rapid scale-up in our compute and capabilities as we push to make this ambition a reality. It’s a new phase for AI, and for us.

– The MAI Team

Build the Future With Us

Explore all jobs

Introducing MAI-Thinking-1

June 2, 2026

Models

Superintelligence team

Updated as of June 8, 2026.

Today we are introducing MAI-Thinking-1, Microsoft AI’s reasoning model. It is a medium-sized model that stands among the strongest models in its weight class. It matches leading models on key software engineering benchmarks, demonstrates advanced mathematical reasoning capabilities, and is preferred to Sonnet 4.6 in our blind human side-by-side evaluations. We don’t distill from other labs and we don’t rely on opaque data. Our datasets are clean, traceable, and enterprise-grade.

MAI-Thinking-1 is a step in our broader work to build towards Humanist Superintelligence: advanced AI capabilities designed to serve people and organizations, not to replace them. The model matters on both axes: what it can do, and how it was built.

The Hill-Climbing Machine

More than a single model, we are excited to introduce our Hill-Climbing Machine: a co-designed pipeline built to make every component of model development climbable, so capabilities improve continually and reliably over time. The aim is a repeatable system that can absorb better data, stronger rewards, more capable environments, and more compute.

Three main pillars guide our philosophy.

First, capabilities should be learned, not inherited. Although faster to acquire, inherited intelligence lacks the steerability essential for real world usage: an imitator is fundamentally tied to the design choices of its teacher and struggles to adapt to new situations. MAI-Thinking-1 was trained without distillation from third party models, forcing our model to truly learn the tasks at hand.

Second, clean data. We trained it from the ground up on clean, traceable and enterprise-grade data, without distillation from third-party models. This matters for quality, provenance, and control. If we cannot account for what shaped a model, we cannot fully understand its behavior or credibly improve it.

Third, self-sufficiency across the entire stack. All the way from co-design of our models with MSFT’s own accelerators through to our reinforcement learning framework, we have focused efforts on in-house training infrastructure. This is a crucial part of building our hill-climbing machine, to ensure we can fully optimize and shape our systems end-to-end to best serve our needs.

Medium-sized model, with strong software engineering performance

MAI-Thinking-1 is a 35B-active, ~1T-total parameters, sparse Mixture of Experts model, a smaller inference footprint than much larger models. Despite this, our model is toe-to-toe with Claude Opus 4.6 on SWE-Bench Pro. That matters for developers and enterprises because model size determines where advanced coding assistance can be deployed, how often it can be used, and whether it can move from exceptional tasks into daily workflows.

We have invested heavily in the training environments needed for agentic coding. Each verified environment is deterministic, executable, and graded by real test suites. This gives the model practice on the kind of multi-step work developers actually do: reading code, editing files, running tests, observing failures, and recovering from intermediate mistakes.

Advanced mathematical reasoning capabilities

MAI-Thinking-1 reaches 97.0% on AIME 2025, and 94.5% on AIME 2026, showing strong mathematical and scientific reasoning for its weight class. Strong performance here gives us confidence that our training loop can create real reasoning gains – climbing all the way from the ground up – from our own data, rewards, and evaluation process, enabling this intelligence to generalize to other domains over time.

Line graph titled "AIME 2025" shows a general upward trend in the y-axis values (ranging from 0.2 to 1.0) over increasing x-axis steps, with fluctuations and small vertical error bars.

Preferred in human side-by-sides vs. Sonnet 4.6

People care about whether a model understands the task, follows instructions, uses the right level of detail, writes clearly, and respects their time.

We built a blind side by side human evaluation with one of our partners, Surge, using their pool of professional raters to measure various models on these traits. The evaluation spanned 1,276 tasks across a wide variety of use cases in both single-turn and multi-turn conversations, with a focus on measuring how helpful each response is and whether it actually advances the user’s goals. In these evaluations, users preferred MAI-Thinking-1 over Claude Sonnet 4.6.

This has been a core focus of post-training. We want the model to be capable without being brittle, concise without being incomplete, and helpful without overreaching. Human preference data gives us a direct signal on whether benchmark improvements translate into better experiences for users.

Enterprise ready

MAI-Thinking-1 is built with enterprise readiness in mind. It supports long context with a 256k token window (enough to fit a 600 page document), function calling, and the flexibility to add developer instructions. We trained the model to follow multiple layers of instructions and aligned its default style to enterprise needs. It’s compatible with the widely used Chat Completions API. All MAI models come with enterprise-grade security and compliance through Microsoft Foundry.

Results

We report results in two views: post-trained MAI-Thinking-1 evaluations, and pre-training metrics for our base model.

Table 1. MAI-Thinking-1 metrics

A comparison table showing language models' performance on STEM and Agentic coding benchmarks. MAI-THINKING1 leads with the highest scores across most benchmarks, outperforming other models like Sonnet 4.6, Opus 4.6, and GPT 5.4.

Post-trained model evaluation results on public STEM and agentic coding benchmarks. Other model numbers are taken from respective official model cards. Scores are percentages unless otherwise noted; dashes indicate unavailable model values.

Table 2. Pre-training metrics

Four bar charts compare bits-per-byte scores (lower is better) of base pre-trained models across Held-Out Code, QA, STEM, and Math domains, showing performance differences by model size and architecture.

Putting humans first

We are building towards Humanist Superintelligence: advanced AI capabilities designed to serve people and organizations, not replace them. Our models must remain subordinate technologies under human control with the goal of upholding human autonomy and being helpful. That means our models must not refuse legitimate requests under the guise of safety and compliance as then they are not truly serving humans.

Striking the delicate balance between being helpful and safe is not easy. For MAI-Thinking-1, we aimed to achieve this balance by treating unsafe compliance and unnecessary refusal as defects in the same reward construction where aggregation is based on severity of potential of harm. Safety is trained with the same reinforcement learning infrastructure used for capability, so safety rewards are part of the same hill-climbing loop ensuring safety is always aligned to the capabilities and not incidental.

As a result, we see that our model can balance ensuring a safety bar on sensitive unsafe requests while also being helpful on non-sensitive content.

Scatter plot titled "Safety vs Helpfulness by Harm Category." Dots indicate MAI-Thinking-1 and Sonnet 4.6 scores by category; y-axis is safety, x-axis is helpfulness, with lines connecting paired results for each harm category.

Availability and access

MAI-Thinking-1 is available in private preview on Microsoft Foundry today. It will be available in public preview on MAI Playground soon.

Build the Future With Us

We’re a lean, fast-moving lab made up of some of the world’s most talented minds. We have an exciting roadmap of compute at MAI, which is ramping quickly and extensively. And we have an ambitious mission we truly believe in. We’re also fortunate to partner with incredible product teams giving our models the chance to reach billions of users and create immense positive impact. If you’re a brilliant, highly-ambitious and low ego individual, you’ll fit right in—come and join us as we work on our next generation of models!

Explore all jobs

Introducing MAI-Code-1-Flash

June 2, 2026

Models

Superintelligence team

Updated as of June 8, 2026.

Today we’re introducing MAI-Code-1-Flash, a new Microsoft coding model built for fast, efficient assistance in everyday developer workflows. We trained it from the ground up on clean, traceable and enterprise-grade data, without distillation from third-party models. The model is rolling out to GitHub Copilot individual users in Visual Studio Code in the model picker and under the default auto picker.

Features and capabilities

Agentic coding in real developer environments, trained and designed for GitHub Copilot harness, to work better together.
Adaptive thinking, stays concise for simple requests and spends more reasoning budget on complex tasks.
Strong instruction-following across single-turn and multi-turn scenarios.

MAI-Code-1-Flash is designed around the simple goal of delivering high-quality coding help with better efficiency. It outperforms Claude Haiku 4.5 with better price to performance across coding benchmarks.

A scatter plot compares coding models on pass rate vs. average token usage. MAI-Code-1-Flash (green) outperforms Claude Haiku 4.5 (orange) across benchmarks, with higher pass rates and lower token use in the highlighted “Ideal Zone.”.

Build for developers, not benchmarks

Coding models are most useful when they perform well in the same environment developers use every day. That is why we built MAI-Code-1-Flash with production workflows at the center, rather than optimizing only for benchmarks. The model was trained directly with GitHub Copilot harnesses used in production. This allows it to learn how to interact with surrounding tools and systems in agentic coding tasks, making it uniquely well suited to real-world Copilot workflows compared to other available models.

During training, we evaluated checkpoints across core software engineering tasks, repository question answering, refactoring, and telemetry-grounded tasks adapted from real GitHub Copilot usage. This alignment between training, evaluation, and production helps offline improvements translate into real-world developer quality.

Designed to maximize value per token

MAI-Code-1-Flash was trained with adaptive solution length control, which helps the model adjust the depth of its response to the task. It can stay concise for simpler requests and spend more reasoning budget when a problem requires deeper analysis or broader code changes. In practice, this means developers start seeing useful output sooner. We see MAI-Code-1-Flash solving harder problems with up to 60% fewer tokens. This helps reduce latency, lower cost, improve return on token, and make interactive workflows feel smoother.

Benchmark results in the production harness

To understand both quality and efficiency, we evaluated MAI-Code-1-Flash against Claude Haiku 4.5 on SWE-Bench Verified, SWE-Bench Pro, SWE-Bench Multilingual, and Terminal Bench 2 using the same production harness that developers use for their everyday coding tasks. We measured task success and the average number of solution tokens required to complete each task.

MAI-Code-1-Flash outperforms Claude Haiku 4.5 across all core coding benchmarks tested, with higher pass rates on all 4 evaluations, including a +16-point lead on the diverse, real-world tasks of SWE-Bench Pro (51.2% vs. 35.2%). It’s not just smarter; it’s leaner, solving harder problems with up to 60% fewer tokens on SWE-Bench Verified, proving that higher accuracy and greater efficiency are no longer a trade-off.

A comparison table of coding benchmarks for MAI-Code-7-Flash and Claude Haiku 4.5, showing pass rates and average token usage for four benchmarks, with MAI-Code-7-Flash outperforming in all categories.

Math, Science, Instruction Following, and Agentic coding tasks

Bar chart comparing four benchmark scores (IF Bench, Advanced IF, Robust IF, τ¹-Bench) for MAI-Code-1-Flash and Claude Haiku 4.5, with MAI-Code-1-Flash consistently scoring higher in all categories.

MAI-Code-1-Flash comes out ahead on every benchmark in the table, with the widest margin on IF Bench precise instruction following (+28.9) and the narrowest on rubric-based Advanced IF (+14.5). The strong instruction-following carries over to agentic tool use.

Furthermore, MAI-Code-1-Flash also outperforms Claude Haiku-4.5 on core reasoning capabilities in math, science, and visual generation coding.

A comparison table shows benchmarks for MAI-Code-T-Flash and Claude Haiku 4.5, listing accuracy and average token usage (K) for tasks like math, science, text reasoning, and coding. MAI-Code-T-Flash leads in all benchmarks.

Standard benchmarks reward memorization as much as reasoning, for example a model that has seen the Monty Hall problem will answer it correctly, but invert the prizes and it fails. We built a 186-question, 34-category benchmark around adversarial traps like inverted classics, impossible tasks, and underdetermined scenarios to see whether models were actually reasoning or just pattern-matching. MAI-Code-1-Flash surpasses Claude Haiku 4.5 overall and reached 85.8% adjusted accuracy, with especially strong performance in reasoning, instruction-following, and recognizing impossible problems. We also see room for the model to grow, since core adversarial categories like Einstellung traps remained below 50% accuracy.

Try it out

MAI-Code-1-Flash is now rolling out to VS Code GitHub Copilot individual users. No additional setup is required. As the rollout progresses, you may see GitHub Copilot route tasks to MAI-Code-1-Flash through the Auto picker, or see the model available directly in the model picker.

Here are a few fun sample apps we built with MAI-Code-1-Flash in VS Code:

We would love to hear from you! Please join the GitHub Community to share your feedback.

Build the Future With Us

Explore all jobs

MAI-Image-2-Efficient: Flagship Quality, 41% Lower Cost

April 14, 2026

Models

MAI Superintelligence Team

A collage featuring a beige knitted sweater against a blue sky, close-up texture shots, and a person wearing the sweater, all set on a brown background with curved white lines.

Available now in Microsoft Foundry and MAI Playground

We built MAI-Image-2 to be our best text-to-image model — photorealistic, expressive, with reliable in-image text.

Today we’re making all that faster and cheaper.

Meet MAI-Image-2-Efficient.

Production-ready quality. Built for speed and scale. 22% faster and 4x more efficient¹. And priced nearly 41% lower — $5 per 1M text input tokens, $19.50 per 1M image output tokens.

That’s not just faster than our own flagship. It’s 40% faster on average than other leading text-to-image models².

MAI-Image-2-Efficient leads on render speed

Full render time (LTR) measured at P50 median, in seconds, across standardized prompts. Lower is better.

Two models, two jobs

MAI-Image-2-Efficient is your production workhorse. Use it when you need volume, speed, and tight cost control — product shots, marketing creatives, UI mockups, branded assets, batch pipelines. It handles short-form text like headlines and labels cleanly, and it’s built to run in real-time, interactive workflows without breaking a sweat.

MAI-Image-2 is your precision tool. Reach for it when the brief demands the highest fidelity — portraits, photorealistic scenes, stylized looks like anime or illustration, and longer or more complex in-image text. This is the model for final deliverables where every detail matters.

Start building now

MAI-Image-2-Efficient is available today in Microsoft Foundry and MAI Playground³. No waitlist, no preview — just plug it in and go. It’s also rolling out across Copilot and Bing, with more surfaces like PowerPoint coming soon.

Partners like Shutterstock are already testing with promising results:

“MAI-Image-2-Efficient shows strong progress in prompt fidelity and creative usability across a range of workflows. In our evaluation work, we look closely at how well models translate intent into consistent, production-ready outputs, and this model is trending in the right direction. That level of reliability is what ultimately matters when teams move from experimentation into real-world use.” – Vanessa Salvo, Principal Product Manager, Shutterstock

This is just the beginning. More models ahead — stay tuned.

A collage of six sections: clothing labels, orange slices with bottles, close-up tomatoes, skin care products with sky background, bottles with figs, and abstract orange and white graphic with the words "THE FUTURE CAN WAIT.

Download Model Card

As tested on April 13, 2026. Compared to MAI-Image-2 when normalized by latency and GPU usage. Throughput per GPU vs MAI-Image-2 on NVIDIA H100 at 1024×1024; measured with optimized batch sizes and matched latency targets. Results vary with batch size, concurrency, and latency constraints.
As tested on April 13, 2026. Compared to Gemini 3.1 Flash (high reasoning), Gemini 3.1 Flash Image and Gemini 3 Pro Image: Measured at p50 latency via AI Studio API (1:1, 1K images; minimal reasoning unless noted; web search disabled). MAI-Image-2, MAI-Image-2e, GPT-Image-1.5-High: Measured at p50 latency via Foundry API.
MAI Playground is available in select markets including the US. Coming soon to EU countries.

Announcing 3 new world class MAI models, available in Foundry

April 2, 2026

Models

Mustafa Suleyman

Updated as of June 9, 2026.

Introducing MAI-Transcribe-1, alongside MAI-Voice-1 and MAI-Image-2. World-class quality at lightning speeds, now available at the most competitive prices.

Available now in Microsoft Foundry and MAI Playground.

MAI-Transcribe-1 delivers state-of-the-art speech-to-text transcription across the top 25 most-used languages ¹ according to the industry-standard FLEURS benchmark. ² Built to deliver world class quality in messy, real-world environments, its batch transcription speed is 2.5x that of existing Microsoft Azure Fast offering. It’s also incredibly efficient, making MAI-Transcribe-1 not just the most accurate, but also lightning fast. It’s now available in Foundry at the best price-performance of any large cloud provider.

Overall Average WER by Model

Lower is better.

MAI-Voice-1 is our top-tier voice generation model. Built to generate natural, realistic speech, rich with nuance, emotional range and expression that preserves speaker identity even across long-form content.

Today we’re adding the ability to safely and securely create your own custom voice in Microsoft Foundry with just a few seconds of audio. MAI-Voice-1 can transform how easily developers can build voice experiences and voice agents – at high quality and high speed.

The model can generate 60 seconds of audio in just a single second, and highly efficient GPU usage delivers that quality and speed affordably. Hearing is believing, so experience it for yourself with Copilot Audio Expressions or Copilot Podcasts.

The text "MAI-Transcribe-1" appears in bold, brown letters centered on a plain, light beige background.

MAI-Image-2 has turbocharged image generation performance and speed on Copilot after debuting as a top 3 model family on the Arena.ai leaderboard. Users experience at least 2x faster generation times on Foundry and Copilot with similar quality, based on real-world production traffic data. Phased rollouts are also underway in Bing and PowerPoint.

MAI-Image-2 was created with photographers, designers, and visual storytellers that demand natural lighting, accurate skin tones and texture, and clear in-image text for diagrams, layouts, and graphics. Once again, speed and quality don’t come at higher costs – MAI-Image-2 is offered at competitive price-to-performance.

Customers are already embracing MAI-Image-2 for creative work. WPP, one of the world’s largest marketing and communications groups, is among the first enterprise partners building with MAI-Image-2 at scale.

“MAI-Image-2 is a genuine game-changer. It’s a platform that not only responds to the intricate nuance of creative direction, but deeply respects the sheer craft involved in generating real-world, campaign-ready images,” said Rob Reilly, Global Chief Creative Officer, WPP. “WPP has some of the best creative talent in the world and MAI-Image-2 is making them even better.”

A woman stands holding an orange umbrella, photographed from a low angle. She wears a white blouse and blue jeans, with the background featuring a plain white wall.

A bottle labeled "Sofily" sits on a wooden table decorated with peach flowers and a vase of orange blooms, bathed in warm sunlight with a shadowy background.

Images created by WPP using MAI-Image-2

MAI Models: Better, faster, and cheaper than our competitors.

We are rapidly deploying these top-tier models to power our own consumer and commercial products. We’re excited to share the quality, speed, and efficiency gains with our Microsoft Foundry customers with very competitive pricing.

· MAI-Transcribe-1 starts at $0.36 per hour.

· MAI-Voice-1 starts at $22 per 1M characters.

· MAI-Image-2 starts at $5 per 1M tokens for text input and $33 per 1M tokens for image output.

Available now on Microsoft Foundry and MAI Playground.

Starting today, every developer can build with MAI models, including MAI-Transcribe-1, through Microsoft Foundry. You can also try them in the MAI Playground.

Interested in MAI models but don’t have Foundry access?
Fill out this form and we’ll be in touch.

Models that are built to be better from the inside out.

At Microsoft AI, we’re building Humanist AI. We have a distinct view when creating our AI models — putting humans at the center, optimizing for how people actually communicate, training for practical use. You’ll see more models from us soon in Foundry and directly in Microsoft products and experiences.

Consistent with our commitment to safe and responsible AI, these MAI models were developed, tested, and rigorously red-teamed. Through Microsoft Foundry, developers get built-in guardrails, governance, and enterprise-grade controls designed to support safe, compliant deployment at scale.

Model Cards

Download Model Card for MAI-Transcribe-1

Download Model Card for MAI-Voice-1

Download Model Card for MAI-Image-2

^1. Top 25 languages by Microsoft product usage.

^2. Out of the top 25 global languages, MAI-Transcribe-1 ranks 1st by FLEURS in 11 core languages. It wins against Whisper-large-v3 on the remaining 14 and Gemini 3.1 Flash on 11 of those 14.

MAI-Thinking-1

MAI-Code-1-Flash

MAI-Image-2.5

MAI-Transcribe-1.5

MAI-Thinking-1

MAI-Code-1-Flash

MAI-Image-2.5

MAI-Transcribe-1.5

Building a hill-climbing machine: Launching seven new MAI models

MAI-Image-2.5 launches at No. 2 for image editing on Arena

Introducing MAI-Code-1-Flash

Introducing MAI-Voice-2

Features and capabilities

Performance

Guess the human recording vs. MAI‑Voice‑2

Supported Languages

Voice Synthesis

Consent and Safety

Use Cases

Try it out

Build the Future With Us

Related Stories

Building a hill-climbing machine: Launching seven new MAI models

Introducing MAI-Transcribe-1.5

Announcing 3 new world class MAI models, available in Foundry

MAI-Image-2.5 launches at No. 2 for image editing on Arena

Features and capabilities

Benchmarks

Powering Microsoft products

Best price-to-performance models

Safety and limitations

Try it out

Build the Future With Us

Related Stories

Building a hill-climbing machine: Launching seven new MAI models

MAI-Image-2-Efficient: Flagship Quality, 41% Lower Cost

Announcing 3 new world class MAI models, available in Foundry

Our values in operation: Health

Related Stories

Building a hill-climbing machine: Launching seven new MAI models

Health Check: How People Use Copilot for Health

The Path to Medical Superintelligence

Microsoft Build 2026: MAI keynote transcript

Related Stories

Building a hill-climbing machine: Launching seven new MAI models

Introducing MAI-Transcribe-1.5

Announcing 3 new world class MAI models, available in Foundry

Introducing MAI-Transcribe-1.5

Features and Capabilities

Accuracy

Speed

Keyword biasing

What’s next

Try it out

Build the Future With Us

Related Stories

Building a hill-climbing machine: Launching seven new MAI models