Microsoft Build 2026: MAI Keynote Transcript

June 2, 2026
Build
Superintelligence team

Since I started working in AI, the compute that we use to train frontier models has increased by a trillion-fold. That’s 12 orders of magnitude in just 15 years.
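As a back-of-the-envelope check on those figures, the short sketch below converts "12 orders of magnitude in 15 years" into an implied yearly growth factor and a doubling time. It assumes smooth, constant exponential growth across the whole period, which is a simplification, and the function names are illustrative rather than taken from the talk.

```python
import math


def annual_growth_factor(orders_of_magnitude: float, years: float) -> float:
    """Yearly multiplier implied by a total growth of 10**orders_of_magnitude."""
    return 10 ** (orders_of_magnitude / years)


def doubling_time_months(orders_of_magnitude: float, years: float) -> float:
    """Months per doubling, assuming constant exponential growth."""
    doublings = orders_of_magnitude * math.log2(10)
    return 12 * years / doublings


# Figures quoted in the talk: a trillion-fold (10**12) increase over 15 years.
growth = annual_growth_factor(12, 15)
doubling = doubling_time_months(12, 15)
print(f"{growth:.2f}x per year, doubling every {doubling:.1f} months")
```

Under that assumption the quoted numbers work out to roughly a 6.3x increase per year, or a doubling about every four and a half months.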

It’s now clear that a consistent, exponential increase in computation leads to predictable advances in AI capabilities.

In the next few years, we’ll see three more orders of magnitude of compute applied to frontier models.

Intelligence is now a function of compute. Log-linear hill-climbing has become the norm. The scaling laws are holding. These are truly extraordinary times.
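To make "log-linear" concrete, here is a minimal sketch in which a capability score rises by a fixed amount for every 10x increase in training compute. The intercept, the slope, and the compute values are hypothetical placeholders chosen for illustration; they are not numbers from the talk or from any published scaling law.

```python
import math


def capability_score(compute_flops: float, intercept: float, slope_per_decade: float) -> float:
    """Log-linear model: score grows linearly in log10(compute)."""
    return intercept + slope_per_decade * math.log10(compute_flops)


# Hypothetical coefficients, for illustration only.
INTERCEPT = 0.0
SLOPE_PER_DECADE = 5.0  # score points gained per 10x of compute

base = capability_score(1e24, INTERCEPT, SLOPE_PER_DECADE)
scaled = capability_score(1e27, INTERCEPT, SLOPE_PER_DECADE)  # three more orders of magnitude
gain = scaled - base
print(f"gain from 1000x more compute: {gain:.1f} points")
```

The point of the sketch is that the gain depends only on the ratio of compute, so three more orders of magnitude add three times the per-decade slope regardless of the starting point.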

In this context we at MAI are building towards what we call Humanist Superintelligence.

State-of-the-art AI capabilities that are explicitly designed to serve people and organizations, and not to replace them.

Because the type of AI we build really matters. We need an AI that places humanity first.

That always prioritizes human well-being and human progress.

This is the core philosophy and motivation behind our superintelligence efforts at Microsoft. It shapes everything we do.

And as a platform company, our job – and our commitment – is to keep you developers building at the absolute frontier.

So today we’re very excited to announce a family of seven new models across image, voice, transcription, thinking, and coding.

These are all built with real attention to detail and a commitment to making practical, efficient tools that are tuned for how you work in the real world.

First up: MAI-Image-2.5 and its Flash variant – two super strong models that deliver a step change in quality, now at number 2 on the leaderboards, surpassing the score of Nano Banana 2 on image editing.[1]

They deliver precision editing with incredible control and consistency.

Flash is for super-efficient production workloads at scale, while 2.5 gives maximum fidelity and professional grade performance.

They’re live in PowerPoint, rolling out to OneDrive[2], and today they’re also landing on Foundry with market-leading quality per dollar.[3]

Then we have MAI-Transcribe-1.5. This is the best transcription model in the world, with SOTA accuracy across 43 languages, beating out Gemini and OpenAI models.[4]

We’ve optimized it for real world use so you can produce highly accurate transcripts for your bespoke use cases up to 5x faster than rival models.[6]

It’s now being integrated into Copilot, Teams, GitHub, and Dynamics 365 Contact Center – and it’s also available in Foundry, where it’s the fastest, most efficient and most cost-effective transcription model of any hyperscaler.[5]
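The footnotes describe the two metrics behind these claims: accuracy as word error rate (WER) and speed as audio-seconds transcribed per second of processing. The sketch below shows how such metrics are commonly computed. It is a generic illustration, not the evaluation harness used for MAI-Transcribe-1.5: it skips the text normalization that real benchmarks apply, and the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Audio-seconds transcribed per second of wall-clock processing."""
    return audio_seconds / processing_seconds


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.3f}")
```

Lower WER is better and a higher speed factor is better, so a model can lead on one metric without leading on the other.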

Paired with that we have MAI-Voice-2, our latest speech generation model.

It has beautiful prosody, native-sounding delivery and fine-grained emotional control, and it’s available in 15 languages with lots more coming soon.

We’re also announcing Voice-2-Flash – which provides the best value and speed for ultra latency-sensitive Voice Agents, the big thing in 2026.

Next up is our text foundation model, MAI-Thinking-1, our first reasoning model.[7] It’s exceptionally strong in our target use cases of reasoning and SWE tasks.

It’s a 35B active parameter MoE with a 256K context window. That means it competes in the medium-sized weight class, where it’s certainly punching above its weight.
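For readers unfamiliar with the term, "active parameters" in a mixture-of-experts (MoE) model refers to the subset of weights actually used for a given token, because a router sends each token to only a few experts. The toy sketch below illustrates generic top-k routing with scalar "experts". The routing scheme, the number of experts, and the value of k are assumptions for illustration only; the talk does not describe how MAI-Thinking-1 routes tokens.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(x: float, experts: Sequence[Callable[[float], float]],
              router_logits: Sequence[float], k: int) -> float:
    """Weighted sum over only the selected experts; the rest stay inactive."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(router_logits, k))


# Four toy experts, two of them active for this input (hypothetical values).
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(logits, k=2)
output = moe_layer(3.0, experts, logits, k=2)
print(routes, output)
```

Here only two of the four experts run for the input, which is the sense in which a model's active parameter count can be much smaller than its total size.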

Independent human raters on Surge prefer it for overall quality in blind side-by-sides versus Sonnet 4.6.

And it’s achieved 97% on AIME 25, the key measure of its general-purpose reasoning abilities.[9]

Most importantly, it’s at 53% on SWE Bench Pro, placing it right alongside Opus 4.6 on one of the toughest coding benchmarks.[8]

We’ve got plenty more work to do as we get it into production so we can hill-climb on many more real-world coding tasks.

What’s most remarkable is that this model has climbed entirely from the bottom, without specifically targeting any of these benchmarks, and with zero distillation.

This is critical. Because it means the model is created with an enterprise-grade, clean and commercially licensed data lineage that you can trust, and put into production with complete confidence.

Finally, I’m incredibly excited to announce MAI-Code-1-Flash, our new inference efficient coding model, especially tuned for VS Code and GitHub Copilot CLI.

Code-1-Flash achieves 51% on SWE Bench Pro[10] despite having just 5B parameters. That puts it closer to Haiku in size but cheaper in cost[11], while delivering very strong coding performance and great inference efficiency. It’s rolling out today as one of the default models in VS Code.[12]

Alongside distribution on Foundry and optimization for our 1P products, our models are also going to be widely available for developers on OpenRouter, as well as Fireworks and Baseten.[13] This means for the first time you will be able to tune the weights directly yourself.

Across this entire family, safety and security are built in from the start.

Voice models include protections against unauthorized cloning, and all outputs are watermarked.

We’ve reduced over-refusals and improved representation, including for people with disabilities.

We’re also publishing a detailed technical report alongside this release to give you a full and transparent understanding of how we put this together.

And we’re also co-designing our models with our own silicon, optimizing MAI-Thinking-1 on our Maia 200 chip and benchmarking it head-to-head against the GB200.

On top of the 30% improvement that Satya just mentioned, we’re seeing a further 1.4x performance-per-watt gain when running our MAI models on the Maia 200 end to end. This is huge.

Every watt counts at this scale, and silicon-model co-design is a key advantage, helping us deliver you the most efficient thinking and coding agents out there. We’re also super-excited these faster and more efficient MAI models are coming to the N1X Satya mentioned, to deliver you the best performance on Windows.

This is what owning the full stack looks like.

It’s the foundation for Microsoft Frontier Tuning.

Letting you customize MAI models, using our full-stack hill-climbing machine.

It means disciplined, relentless engineering, on a platform you can trust, working on your behalf to create custom agents that you will control.

And of course, the really big thing to happen this last year is RLEs, reinforcement learning environments, unique training gyms for AIs.

They create company and task specific agents, adapted only to you, built on MAI models.

For example, within Microsoft we use our RLEs and MAI models to climb towards the best agentic use cases for Excel.

Our MAI-tuned model is comparable to GPT-5.4 on public and private benchmarks, while being up to 10x more efficient.[14]

Other early adopters are seeing similar results.

When we tuned our models for McKinsey’s tasks, MAI delivered the highest win rate, outperforming GPT-5.5 on quality, whilst being 10x lower on cost.

This is the advantage of carefully calibrated Frontier Tuning.

And unlike with some other companies, with MAI you don’t rent intelligence from a shared model that learns from everyone.

Only you keep the benefits of your hard-earned workflows, know-how, data and institutional knowledge.

Only you control the resulting model. So with us, the RLEs and the models you build inside of them become your moat.

This is distinct. It marks a new era in AI.

One final announcement that I’m very excited about.

We’re taking customization, and co-creation of models, to the highest levels possible, on one of the most important applications of AI: healthcare.

We’re proud to be partnering with Mayo Clinic to jointly develop a new frontier model for health, and then deploy it in their world-leading hospital system.

So today marks some very exciting steps on our journey to create humanist superintelligence at Microsoft.

We have an incredible roster of seven new world-class models to keep you all at the frontier.

And we’re looking forward to working with you all to co-create unique AI agents, adapted to you.

This is a new era for all of us.

An era of AI that you control on your terms.

Let’s build it together.

Thank you everyone.

Footnotes

  1. Source: Arena Image Edit (Single-Image Edit) leaderboard, as of June 2nd, 2026. Arena lists MAI-Image-2.5 at 1403±9 Arena Score, ahead of Gemini 3 Pro Image Preview 2K at 1388±3 and Gemini 3.1 Flash Image Preview / Nano Banana 2 at 1389±4.
  2. MAI-Image-2.5 model page
  3. Price-to-quality benchmark uses public Arena scores for text-to-image and image-to-image leaderboards and public API pricing from Microsoft, OpenAI, and Google as of May 2026. Cost is normalized to estimated price per 1,000 1024×1024 image generations.
  4. Accuracy (Word Error Rate) measured on FLEURS benchmark using the ‘test’ split. FLEURS is a standard public multilingual dataset that we used internally for competitive evaluations. The model achieves state-of-the-art average WER across 43 languages and leads in 18 of them – outperforming GPT-4o-Transcribe, Scribe v2, and Gemini 3.1 Flash Lite. See more at Transcribe 1.5 model page
  5. See more in the model card, which includes pricing details. MAI-Transcribe-1.5 is a batch transcription model, and comparisons are made against similar systems. Benchmarked hyperscalers include Gemini and Azure Speech.
  6. Speed measured by the Artificial Analysis ‘Speed Factor’ benchmark for Speech-to-Text models, whose methodology counts the number of audio-seconds transcribed per second. See the Transcribe 1.5 model page for details, including our comparison of MAI-Transcribe-1.5’s AA Speed Factor against top competing models.
  7. MAI-Thinking-1 model page
  8. Scored 52.8% on SWE Bench Pro
  9. AIME 2025 Benchmark Dataset
  10. Numbers from internal benchmark system with harness, using SWE Bench Pro. We outperform Haiku 4.5 across multiple benchmarks – see more in MAI-Code-1-Flash blog post
  11. In the new token-based billing in GitHub Copilot, MAI-Code-1-Flash is priced cheaper than Claude Haiku 4.5. See pricing details.
  12. MAI-Code-1-Flash is now rolling out to ~10% of individual users as a starting point and users who select Auto in VS Code model picker may get routed to the model.
  13. Baseten x MAI-Thinking-1, OpenRouter x MAI-Image-2.5, OpenRouter x MAI-Voice-2, OpenRouter x MAI-Transcribe-1.5
  14. We project a 10x improvement in output tokens per dollar from the fine-tuned MAI model compared to GPT-5.5, and up to ~10x improvement compared to GPT-5.4. This estimate is based on public GPT pricing and MAI pricing data scaled across model sizes using serving-cost differentials.
