It’s About Time: The Copilot Usage Report 2025
At MAI, we don’t just build AI tools, we care about how real people interact with them.
So as 2025 wraps up, we’ve gone headfirst into a mountain of de-identified data, searching for the quirks, surprises, and secret patterns that shape everyday life with Copilot. We’re finding out just how deeply it fits into people’s daily rhythms, and how human its uses have become: we often turn to AI for the things that matter most, like our health. We analyzed a sample of 37.5 million conversations to find out how people actually use it out in the world.
(Note: our system doesn’t just de-identify conversations; it extracts only a summary of each conversation, from which we learn the topic and the intent, maintaining full privacy.)
From health tips that never sleep, to the differences between weekday and weekend usage, to February’s annual “how do I survive Valentine’s Day?” spike, our findings show that Copilot is way more than a tool: it’s a vital companion for life’s big and small moments. And if you’ve ever pondered philosophy at 2 a.m. or needed advice on everything from wellness to winning at life, you’re in good company. So has everybody else.
Our work shows that AI is all about people, a trusted advisor slotting effortlessly into your life and your day. It’s about your health, your work, your play, and your relationships. It meets you where you are.
Read all about it in our paper, but here are some of our takeaways.
1. Health Is Always on Our Minds—Especially on Mobile
No matter the day, month, or time, health-related topics dominate how people use Copilot on their mobile devices. Whether it’s tracking wellness, searching for health tips, or managing daily routines, our users consistently turn to Copilot for support in living healthier lives. This trend held steady throughout the year, showing just how central health is to our everyday digital habits. When it comes to mobile, with its intimacy and immediacy, nothing tops our health.
Most common Topic-Intent pairing conversations, on mobile.
Health is consistently the most common topic. Interestingly, language-related chats peak earlier in the year, while entertainment sees a steady rise.
2. When Programming and Gaming Cross Paths
August brought a unique twist: programming and gaming topics started to overlap in unexpected ways. Our data showed that users were just as likely to dive into coding projects as they were to explore games, but on different days of the week! This crossover hints at a vibrant, creative community that loves to code during the week and play on the weekends in equal measure.
August topic ranks for programming and games.
There is a clear change in rank between programming and games through the days of the week, with programming rising from Monday to Friday and games shining on the weekends.
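To make the idea of a topic "rank" per weekday concrete, here is a minimal sketch of how such a ranking could be computed from labelled conversation summaries. The counts, the topic labels, and the `topic_ranks_by_day` helper are all invented for illustration; they are not Copilot data or the method used in the paper.

```python
from collections import Counter, defaultdict

# Hypothetical (weekday, topic) labels, one per conversation summary.
# These counts are invented for illustration; they are not Copilot data.
conversations = (
    [("Mon", "Programming")] * 5 + [("Mon", "Games")] * 3 + [("Mon", "Health")] * 8
    + [("Sat", "Programming")] * 2 + [("Sat", "Games")] * 6 + [("Sat", "Health")] * 8
)

def topic_ranks_by_day(rows):
    """Return {day: {topic: rank}}, where rank 1 is the most frequent topic."""
    counts = defaultdict(Counter)
    for day, topic in rows:
        counts[day][topic] += 1
    ranks = {}
    for day, counter in counts.items():
        # Sort by descending count, then by name so ties are deterministic.
        ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        ranks[day] = {topic: i + 1 for i, (topic, _) in enumerate(ordered)}
    return ranks

ranks = topic_ranks_by_day(conversations)
print(ranks["Mon"]["Programming"], ranks["Mon"]["Games"])  # 2 3
print(ranks["Sat"]["Programming"], ranks["Sat"]["Games"])  # 3 2
```

In this toy sample, programming outranks games on Monday and the order flips on Saturday, which is the kind of crossover the chart describes.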
3. February’s Big Moment
February stood out for another reason: Copilot helped users navigate a significant date in their calendar year. Whether they were preparing for Valentine’s Day or facing the day itself and their relationships, we saw a spike in activity as people turned to Copilot for guidance, reminders, and support. It’s a great reminder of how digital tools can make life’s important moments a little easier to manage.
Ranking of “Personal Growth and Wellness” and “Relationship” conversations
February brings concerns of personal growth before Valentine’s Day, with a clear peak of relationship-related conversations on the day itself.
4. Late-night Sessions
The larger-than-life questions seem to rise during the early hours of the morning, with “Religion and Philosophy” climbing through the ranks. By comparison, travel conversations happen most often during commuting hours.
Average rank of Travel and Religion and Philosophy conversations per hour of the day.
Whilst people have more travel-related conversations during the day, it’s in the early hours of the morning that we see a rise of Religion and Philosophy conversations.
5. Advice on the Rise
While searching for information remains Copilot’s most popular feature, we’ve seen a clear rise in people seeking advice—especially on personal topics. Whether it’s navigating relationships, making life decisions, or just needing a bit of guidance, more users are turning to Copilot for thoughtful support, not just quick answers. This growing trend highlights how digital tools are becoming trusted companions for life’s everyday questions.
Why These Insights Matter
By analyzing only high-level topics and intents, we can draw out all these insights while maximizing user data privacy. Understanding these patterns helps us make Copilot even better. By seeing what matters most to our users (health, creativity, and support during key moments), we can design features that truly fit into their lives. It’s also clear from these uses that what Copilot says matters. They show why it’s so important that we hold ourselves to a high bar for quality.
There’ll be lots more to come on this and more in the New Year.
Build the Future With Us
We’re a lean, fast-moving lab made up of some of the world’s most talented minds. We have an exciting roadmap of compute at MAI, with our next-generation GB200 cluster now operational. And we have an ambitious mission we truly believe in. We’re also fortunate to partner with incredible product teams giving our models the chance to reach billions of users and create immense positive impact. If you’re a brilliant, highly-ambitious and low ego individual, you’ll fit right in—come and join us as we work on our next generation of models!
Related Stories
Towards Humanist Superintelligence
A humanist future
Here’s a question that’s not getting the attention it deserves: what kind of AI does the world really want? I think it’s probably the most important question of our time.
For several years now, progress has been phenomenal. We’re breezing past the great milestones. The Turing Test, a guiding inspiration for many in the field for 70 years, was effectively passed without any fanfare and hardly any acknowledgement. With the arrival of thinking and reasoning models, we’ve crossed an inflection point on the journey towards superintelligence. If AGI is often seen as the point at which an AI can match human performance at all tasks, then superintelligence is when it can go far beyond that performance.
Instead of endlessly debating capabilities or timing, it’s time to think hard about the purpose of technology, what we want from it, what its limitations should be, and how we’re going to ensure this incredible tech always benefits humanity.
At Microsoft AI, we’re working towards Humanist Superintelligence (HSI): incredibly advanced AI capabilities that always work for, in service of, people and humanity more generally. We think of it as systems that are problem-oriented and tend towards the domain specific. Not an unbounded and unlimited entity with high degrees of autonomy – but AI that is carefully calibrated, contextualized, within limits. We want to both explore and prioritize how the most advanced forms of AI can keep humanity in control while at the same time accelerating our path towards tackling our most pressing global challenges.
To do this we have formed the MAI Superintelligence Team, led by me as part of Microsoft AI. We want it to be the world’s best place to research and build AI, bar none. I think about it as humanist superintelligence to clearly indicate this isn’t about some directionless technological goal, an empty challenge, a mountain for its own sake. We are doing this to solve real concrete problems and do it in such a way that it remains grounded and controllable. We are not building an ill-defined and ethereal superintelligence; we are building a practical technology explicitly designed only to serve humanity.
In doing this we reject narratives about a race to AGI, and instead see it as part of a wider and deeply human endeavour to improve our lives and future prospects. We also reject binaries of boom and doom; we’re in this for the long haul to deliver tangible, specific, safe benefits for billions of people. We feel a deep responsibility to get this right.
The history of humanism is one of an enduring ability to fight off orthodoxy, totalitarian tendencies and pessimism, and to help us preserve human dignity and the freedom to reason in pursuit of moral human progress. In that spirit, we think this approach will help humanity unlock almost all the benefits of AI, while avoiding the most extreme risks.
Climbing the exponential slope
The rate of progress has been eye-watering. This year it feels like everyone in AI is talking about the dawn of superintelligence. Such a system will have an open-ended ability of “learning to learn”, the ultimate meta skill. It would therefore likely continue improving, going far beyond human-level performance across all conceivable activities. It will be more valuable than anything we’ve ever known.
But to what end?
The prize for humanity is enormous. A world of rapid advances in living standards and science, and a time of new art forms, culture and growth. It’s a truly inspiring mission, and one that has motivated me for decades. We should celebrate and accelerate technology because it’s been the greatest engine of human progress in history. That’s why we need much, much more of it.
In the last 250 years, our intelligence drove the most beautiful process of scientific discovery and entrepreneurial application, one that has more than doubled life expectancy from 30 to 75. It’s our intelligence and the technologies we’ve invented that have delivered food, light, shelter, healthcare, entertainment and knowledge to a population that grew from 1 billion to 8 billion people in that period.
It’s technology that enables us to fly around the globe, treat an infection with antibiotics, stare into the furthest reaches of outer space, and, yes, share a cat meme with millions of people we’ve never met. Walk into any modern supermarket, hospital, school or office and what you’re seeing is a marvel of human ingenuity. AI is the next phase in this journey. This is what Satya means when he talks about increasing global GDP growth to 10%; a transformative boost. As a platform of platforms, this is core to Microsoft’s mission of enabling others to create and invent at global scale.
When you hear about AI, then, this is what it’s worth keeping in mind. This is about making us collectively the best version of ourselves. AI is the path to better healthcare for everyone. AI is how our society levels up, escapes an increasingly zero-sum world. It’s how we grow the economy to increase wealth broadly, and enable a higher standard of living across society. Or let me put it another way: take AI out of the picture and the gains over the next decades look much harder to come by. It’s the next step on the long road of human creativity and invention, pushing the boundaries of what we can make, think and do. It’s how we discover new kinds of energy generation, new modes of entertainment.
AI – HSI – is how we rebuild.
Containment is necessary
At the same time we have to ask ourselves, how are we going to contain (secure and control), let alone align (make it “care” enough about humans not to harm us) a system that is – by design – intended to keep getting smarter than us? We simply don’t know what might emerge from autonomous, constantly evolving and improving systems that know every aspect of our science and society.
And since this kind of superintelligence can continuously improve itself, we’ll need to contain and align it not just once, but constantly, in perpetuity.
And it gets more complicated. It’s not just the “we” in today’s frontier AI research labs that have to do it. All of humanity needs to do it, together, all the time. Every commercial lab, every start up, every government, all need to be constantly alert and engaged in a project of alignment and containment, and that’s before we even deal with the bad actors and the crazy garage tinkerers.
No AI developer, no safety researcher, no policy expert, no person I’ve encountered has a reassuring answer to this question. How do we guarantee it’s safe? If you think that’s overly dramatic, I’d love to hear your rebuttal. Perhaps I’m missing something.
Creating superintelligence is one thing; but creating provable, robust containment and alignment alongside it is the urgent challenge facing humanity in the 21st century. And until we have that answer, we need to understand all the avenues facing us – both towards and away from superintelligence, or perhaps to an altogether alternative form of it.
The purpose of technology
Technology’s purpose is to help advance human civilization. It should help everyone live happier, healthier lives. It should help us invent a future where humanity and our environment truly prosper.
I think Albert Einstein put it best when he said: “The concern for man and his destiny must always be the chief interest of all technical effort… in order that the creations of our mind shall be a blessing and not a curse to mankind.”
Any technology that doesn’t achieve this is a failure. And we should reject it.
That remains the test of the coming wave of superintelligence and it’s the question we must ask over and over: how do we know, for sure, that this technology will do much more good than harm? As we get closer to superintelligence in the coming years, how certain are we that we won’t lose control? And who makes that assessment? And most importantly, amid the uncertainty of that question, what kind of superintelligence should we build, with what limitations and guardrails?
These questions are central to everything we do at the MAI Superintelligence Team and guide us day to day as we make decisions. The core, long term interests of human beings should be clearly prioritized over any research and development agenda.
Towards humanist superintelligence
I think we technologists need to do a better job of imagining a future that most people in the world actually want to live in.
Humanist superintelligence (HSI) offers an alternative vision anchored on both a non-negotiable human-centrism and a commitment to accelerating technological innovation… but in that order. The order is key. It means proactively avoiding harm and then accelerating.
Instead of being designed to beat all humans at all tasks and dominate everything, HSI begins rooted in specific societal challenges that improve human well-being. Our recent paper on expert AI medical diagnosis is a great directional example of this (more on this below).
It’s clearly showing signs of progress towards a medical superintelligence and when it makes its way into production it will be truly transformational. And yet since it’s envisaged as a more focused series of domain specific superintelligences, it poses less severe alignment or containment challenges.
Quite simply, HSI is built to get all the goodness of science and invention without the “uncontrollable risks” part. It is, we hope, a common-sense approach to the field.
It may seem absurd to have to declare it, but HSI is a vision to ensure humanity remains at the top of the food chain. It’s a vision of AI that’s always on humanity’s side. That always works for all of us. That helps support and grow human roles, not take them away; that makes us smarter, not the opposite as some increasingly fear. That always serves our interests and makes our planet healthier, wealthier and protects our fragile natural environment, regardless of the status of frontier safety and alignment research.
We owe it to the future to deliver a palpably improved world from the one we inherited. Sometimes it’s easy to overlook the amazing things technology has already delivered. When you put a jacket on because the office AC is too low or get frustrated by the lines at airport check-in during the holidays or agonize about what to watch on your smart TV: that’s the extraordinary privilege afforded to us by technology. Each moment would have bewildered our ancestors. And so would our grumbling. If we get this right, something similar is possible again.
Where Humanist Superintelligence will count
Here are three application domains that inspire us at Microsoft AI. There are, however, many more, and I’ll be outlining them in future.
An AI companion for everyone – Everyone who wants one will have a perfect and cheap AI companion helping them learn, act, be productive and feel supported. Many of us feel ground down by the everyday mental load; overwhelmed and distracted; rattled by a persistent drumbeat of information and pressures that never seems to stop. If we get it right, an AI companion will help shoulder that load, get things done, and be a personal and creative sounding board. AI companions will be personalized, adapting to the contours of our lives but not afraid to push back in our best interests, built to always support, rather than replace, human connection, and designed with trust and responsibility at their heart.
AI Companions will also have a profound impact on how we learn. They’ll work with the strengths and weaknesses of every student, alongside teachers, to ensure they can achieve their full potential and encourage their intellectual curiosity. That means tailored learning methods, adaptive curricula, completely customized exercises. “One size fits all” education will seem as bizarre to the next generation as rote learning Latin does to us.
Medical Superintelligence – We will see the arrival of medical superintelligence in the next few years. This is the kind of domain specific humanist superintelligence we need more than anything. We’ll have expert level performance at the full range of diagnostics, alongside highly capable planning and prediction in operational clinical settings. For as long as I’ve been working in AI, solving this challenge has been my passion. It will mean world-class clinical knowledge and intervention / treatment is available everywhere.
As I mentioned above, our recent work demonstrates the value of this narrower form of domain specific superintelligence. The New England Journal of Medicine includes a Case Challenge in every issue – a list of symptoms and a patient to diagnose. It’s fiendishly difficult with pass rates of low single digit percentages even for domain experts let alone the average doctor. Our orchestrator, MAI-DxO, managed to reach 85% across the Case Challenges. Human doctors max out at about 20%, and need to order many more expensive tests. In our view both clinicians and patients alike would welcome the extra support. This work just hints at the potential to revolutionize healthcare.
Plentiful clean energy – Energy drives the cost of everything. We need more of it, more cheaply and more cleanly. Electricity consumption is estimated to rise 34% through 2050, driven in no small part by the rise in datacentre demand. I predict we will have cheap and abundant renewable generation and storage before 2040, and AI will play a big part in delivering it. It will help create and manage new workflows for designing and deploying new scientific breakthroughs. These advances will help produce everything from new carbon negative materials to far cheaper and lighter batteries, to far more efficient utilization of existing resources like grid infrastructure, water systems, manufacturing processes and supply chains. It will suggest and help implement viable carbon removal strategies at meaningful scale. And AI will also help push breakthroughs that finally crack fusion power.
These breakthroughs alongside many others are coming with HSI, and they’ll profoundly improve our civilization. They will make a transformative difference to billions of people. This next decade may well be the most productive in history. And yet, the risks are growing faster than ever before.
A safer superintelligence
Alongside spelling out very precisely the kind of superintelligence we should build, the time has come to also consider what societal boundaries, norms and laws we want around this process. At MAI this is a discussion, and a set of actions, that we welcome.
Doing this requires real trade-offs and tough decisions that come in environments of immense competitive pressure and also opportunity. There are numerous challenges and obstacles to both delivering the vision and avoiding the downsides, including around recruitment, security, mindset, the structure of the market and the calibration of optimum research paths that steer the course between harnessing upside and avoiding those downsides. There is at present a collective action problem: less safe models of superintelligence could potentially develop faster and operate more freely.
Overcoming this, as with all such problems, is an immense challenge that will require meaningful coordination across companies and governments and beyond. But it starts I believe with a willingness to be open about vision, open to conversations with others in the field, regulators, the public. That’s why I’m publishing this – to start a process and to make clear that we are not building a superintelligence at any cost, with no limits. There’s a lot more to say (and of course do) on all of it, and over the next months and years you can expect more from me and MAI to candidly explain and explore our work in this area.
Humans matter more than AI
Ultimately what HSI requires is an industry shift in approach. Are those building AI optimizing for AI or for humanity, and who gets to judge? At Microsoft AI, we believe humans matter more than AI. We want to build AI that deeply reflects our wider mission to empower every person on the planet.
Humanist superintelligence keeps us humans at the centre of the picture. It’s AI that’s on humanity’s team, a subordinate, controllable AI, one that won’t, that can’t open a Pandora’s Box. Contained, value aligned, safe – these are basics but not enough. HSI keeps humanity in the driving seat, always. Optimized for specific domains, with real restrictions on autonomy, my hope is that this can avoid some of the risks and leave precious space for human flourishing, for us to keep improving, engaging and trying, as we always have.
Unlocking the true benefits of the most advanced forms of AI is not something we can do alone. Accountability and oversight are to be welcomed when the stakes are this high. Superintelligence could be the best invention ever – but only if it puts the interests of humans above everything else. Only if it’s in service to humanity.
This – humanist, applied – is the superintelligence I believe the world wants. It’s the superintelligence I want to build. And it’s the superintelligence we’re going to build on MAI’s Superintelligence Team.
Copilot Fall Release: A big step forward in making AI more personal, useful, and human-centered
Introducing MAI-Image-1, debuting in the top 10 on LMArena
We have begun launching MAI-Image-1 into select Microsoft products!
Try it in Bing Image Creator: Available at bing.com/create, in the Bing mobile app, or right from the Bing search bar, Bing Image Creator is built to meet people where they already search and create. MAI-Image-1 is now an option alongside DALL-E 3 and GPT-4o in the model menu, enabling you to experiment and pick the model that best matches your creative goals.
Try it in Copilot Audio Expressions: Now, when you select Story Mode, Audio Expressions will use MAI-Image-1 to visualize your story with a unique image.
MAI-Image-1 is currently available in all countries that can access Bing Image Creator and Copilot Labs.
Today, we’re announcing MAI-Image-1, our first image generation model developed entirely in-house, debuting in the top 10 text-to-image models on LMArena.
At Microsoft AI, we’re creating AI for everyone – a supportive, helpful presence always in the service of humanity. We’ve shared how purpose-built models are essential for this mission, and we announced our first two in-house models in August. MAI-Image-1 marks the next step on our journey and paves the way for more immersive, creative and dynamic experiences inside our products.
We trained this model with the goal of delivering genuine value for creators, and we put a lot of care into avoiding repetitive or generically-stylized outputs. For example, we prioritized rigorous data selection and nuanced evaluation focused on tasks that closely mirror real-world creative use cases – taking into account feedback from professionals in the creative industries. This model is designed to deliver real flexibility, visual diversity and practical value.
MAI-Image-1 excels at generating photorealistic imagery, including lighting effects (e.g., bounce light, reflections), landscapes, and much more, particularly when compared to many larger, slower models. Its combination of speed and quality means users can get their ideas on screen faster, iterate through them quickly, and then transfer their work to other tools to continue refining.
Two in-house models in support of our mission
At Microsoft AI (MAI) we believe AI should be used to empower every person on the planet. We are creating AI for everyone, a supportive, helpful presence always in the service of humanity. It will be the gateway to a universe of knowledge and a set of capabilities that enable people and organizations to achieve more. Responsible, reliable, filled with personality and expertise, we are focused on creating applied AI as a platform for category defining and deeply trusted products that understand each of our unique needs.
Since last year, we’ve been focused on building the foundation for this vision, with a world class team and infrastructure. To fully meet our goals, MAI requires purpose-built models. Today, we’re excited to preview the first steps to making this a reality.
- First, we’re releasing MAI-Voice-1, our first highly expressive and natural speech generation model, which is available in Copilot Daily and Podcasts, and as a brand new Copilot Labs experience to try out here. Voice is the interface of the future for AI companions and MAI-Voice-1 delivers high-fidelity, expressive audio across both single and multi-speaker scenarios.
- Second, we have begun public testing of MAI-1-preview on LMArena, a popular platform for community model evaluation. This represents MAI’s first foundation model trained end-to-end and offers a glimpse of future offerings inside Copilot. We are actively spinning the flywheel to deliver improved models. We’ll have much more to share in the coming months. Stay tuned!
We have big ambitions for where we go next. Not only will we pursue further advances here, but we believe that orchestrating a range of specialized models serving different user intents and use cases will unlock immense value. There will be a lot more to come from this team on both fronts in the near future. We’re excited by the work ahead as we aim to deliver leading models and put them into the hands of people globally.
Try MAI-Voice-1 in Copilot and Copilot Labs
MAI-Voice-1 is a lightning-fast speech generation model, with an ability to generate a full minute of audio in under a second on a single GPU, making it one of the most efficient speech systems available today.
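As a quick back-of-the-envelope check on what that speed claim implies, the sketch below computes a real-time factor: seconds of audio produced per second of generation time. The `real_time_factor` helper is a hypothetical name, and the 1.0 second figure is simply the upper bound implied by "under a second", not a measured latency.

```python
def real_time_factor(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Seconds of audio produced per second of compute (higher is faster)."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return audio_seconds / wall_clock_seconds

# "A full minute of audio in under a second": taking exactly 1.0 s as the
# upper bound on generation time gives a lower bound on the speed-up.
lower_bound = real_time_factor(60.0, 1.0)
print(lower_bound)  # 60.0
```

In other words, the stated figure corresponds to generating speech at least 60 times faster than it takes to play back.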
MAI-Voice-1 is already powering our Copilot Daily and Podcasts features. We are also launching it in Copilot Labs where you can try our expressive speech and storytelling demos. Imagine creating a “choose your own adventure” story with just a simple prompt, or crafting a bespoke guided meditation to help you sleep. Give it a try!
Try MAI-1-preview in LMArena
MAI-1-preview is an in-house mixture-of-experts model, pre-trained and post-trained on ~15,000 NVIDIA H100 GPUs. This model is designed to provide powerful capabilities to consumers seeking to benefit from models that specialize in following instructions and providing helpful responses to everyday queries.
We will be rolling MAI-1-preview out for certain text use cases within Copilot over the coming weeks to learn and improve from user feedback. We will continue to use the very best models from our team, our partners, and the latest innovations from the open-source community to power our products. This approach gives us the flexibility to deliver the best outcomes across millions of unique interactions every day.
In addition to LMArena, we are also making this model available to trusted testers – apply for API access here. We’re excited to collect early feedback to learn more about where the model performs well and how we can make it better. Stay tuned for more.
The Path to Medical Superintelligence
Benchmarked against real-world case records published each week in the New England Journal of Medicine, we show that the Microsoft AI Diagnostic Orchestrator (MAI-DxO) correctly diagnoses up to 85% of NEJM case proceedings, a rate more than four times higher than a group of experienced physicians. MAI-DxO also gets to the correct diagnosis more cost-effectively than physicians.
As demand for healthcare continues to grow, costs are rising at an unsustainable pace, and billions of people face multiple barriers to better health – including inaccurate and delayed diagnoses. Increasingly, people are turning to digital tools for medical advice and support. Across Microsoft’s AI consumer products like Bing and Copilot, we see over 50 million health-related sessions every day. From a first-time knee-pain query to a late-night search for an urgent-care clinic, search engines and AI companions are quickly becoming the new front line in healthcare.
We want to do more to help, and we believe generative AI can be transformational. That’s why, at the end of 2024, we launched a dedicated consumer health effort at Microsoft AI, led by clinicians, designers, engineers, and AI scientists. This effort complements Microsoft’s broader health initiatives and builds on our longstanding commitment to partnership and innovation. Existing solutions include RAD-DINO, which helps accelerate and improve radiology workflows, and Microsoft Dragon Copilot, our pioneering voice-first AI assistant for clinicians.
For AI to make a difference, clinicians and patients alike must be able to trust its performance. That’s where our new benchmarks and AI orchestrator come in.
Medical Case Challenges and Benchmarks
To practice medicine in the United States, physicians need to pass the United States Medical Licensing Examination (USMLE), a rigorous and standardized assessment of clinical knowledge and decision making. USMLE questions were among the earliest benchmarks used to evaluate AI systems in medicine, offering a structured way to compare model performance – both against each other and against human clinicians.
In just three years, generative AI has advanced to the point of achieving near-perfect scores on the USMLE and similar exams. But these tests primarily rely on multiple-choice questions, which favor memorization over deep understanding. By reducing medicine to one-shot answers on multiple-choice questions, such benchmarks overstate the apparent competence of AI systems and obscure their limitations.
At Microsoft AI, we’re working to advance and evaluate clinical reasoning capabilities. To move beyond the limitations of multiple-choice questions, we’ve focused on sequential diagnosis, a cornerstone of real-world medical decision making. In this process, a clinician begins with an initial patient presentation and then iteratively selects questions and diagnostic tests to arrive at a final diagnosis. For example, a patient presenting with cough and fever may lead the clinician to order and review blood tests and a chest X-ray before they feel confident about diagnosing pneumonia.
Each week, the New England Journal of Medicine (NEJM) – one of the world’s leading medical journals – publishes a Case Record of the Massachusetts General Hospital, presenting a patient’s care journey in a detailed, narrative format. These cases are among the most diagnostically complex and intellectually demanding in clinical medicine, often requiring multiple specialists and diagnostic tests to reach a definitive diagnosis.
How does AI perform? To answer this, we created interactive case challenges drawn from the NEJM case series – what we call the Sequential Diagnosis Benchmark (SD Bench). This benchmark transforms 304 recent NEJM cases into stepwise diagnostic encounters where models – or human physicians – can iteratively ask questions and order tests. As new information becomes available, the model or clinician updates their reasoning, gradually narrowing toward a final diagnosis. This diagnosis can then be compared to the gold-standard outcome published in the NEJM.
Each requested investigation also incurs a (virtual) cost, reflecting real-world healthcare expenditures. This allows us to evaluate performance across two key dimensions: diagnostic accuracy and resource expenditure. You can watch how an AI system progresses through one of these challenges in this short video.
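To make the mechanics concrete, here is a minimal sketch of how a stepwise encounter with cost tracking might be modeled. It is not the SD Bench implementation: the `Encounter` class, its method names, the example findings, and the dollar amounts are all hypothetical, and treating questions as free while only tests incur a cost is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Encounter:
    """One stepwise case: findings are revealed only when requested."""
    presentation: str
    question_answers: dict  # hypothetical: question -> answer
    test_results: dict      # hypothetical: test name -> (result, cost in USD)
    true_diagnosis: str
    spent: float = 0.0
    log: list = field(default_factory=list)

    def ask(self, question: str) -> str:
        # Assumption: questions are free in this sketch; only tests cost money.
        self.log.append(("ask", question))
        return self.question_answers.get(question, "No information available.")

    def order(self, test: str) -> str:
        # Each ordered test adds its (virtual) cost to the running total.
        result, cost = self.test_results.get(test, ("Test not available.", 0.0))
        self.spent += cost
        self.log.append(("order", test))
        return result

    def diagnose(self, diagnosis: str) -> dict:
        # Exact string matching stands in for the real grading procedure.
        correct = diagnosis.strip().lower() == self.true_diagnosis.lower()
        return {"correct": correct, "cost": self.spent, "steps": len(self.log)}


# Illustrative case with made-up findings and prices.
case = Encounter(
    presentation="Adult with cough and fever",
    question_answers={"Duration of symptoms?": "Five days"},
    test_results={
        "blood tests": ("Elevated white cell count", 45.0),
        "chest x-ray": ("Right lower lobe consolidation", 120.0),
    },
    true_diagnosis="Pneumonia",
)
case.ask("Duration of symptoms?")
case.order("blood tests")
case.order("chest x-ray")
outcome = case.diagnose("pneumonia")
```

Scoring each encounter on both `correct` and `cost` is what lets accuracy and resource expenditure be compared side by side.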
Getting to a Correct Diagnosis
We evaluated a comprehensive suite of frontier generative AI models against the 304 NEJM cases. The foundation models tested included GPT, Llama, Claude, Gemini, Grok, and DeepSeek.
Beyond baseline benchmarking, we also developed the Microsoft AI Diagnostic Orchestrator (MAI-DxO), a system designed to emulate a virtual panel of physicians with diverse diagnostic approaches collaborating to solve diagnostic cases. We believe that orchestrating multiple language models will be critical to managing complex clinical workflows. Orchestrators can integrate diverse data sources more effectively than individual models, while also enhancing safety, transparency, and adaptability in response to evolving medical needs. This model-agnostic approach promotes auditability and resilience, key attributes in high-stakes, fast-evolving clinical environments.
Fig 1. The MAI-Dx Orchestrator turns any language model into a virtual panel of clinicians: it can ask follow-up questions, order tests, or deliver a diagnosis, then run a cost check and verify its own reasoning before deciding whether to proceed.
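The loop described in the caption can be sketched roughly as follows. This is an illustration only, not MAI-DxO's actual design: `run_orchestrator`, its callbacks, the evidence threshold, and the prices are all hypothetical, and a scripted function stands in for the language model panel.

```python
from typing import Callable


def run_orchestrator(
    propose: Callable[[list], tuple],   # hypothetical stand-in for the model panel
    execute: Callable[[tuple], str],    # hypothetical: returns the new finding
    test_cost: Callable[[str], float],  # hypothetical price lookup
    budget: float,
    min_evidence: int = 2,
    max_steps: int = 10,
) -> dict:
    history, spent = [], 0.0
    for _ in range(max_steps):
        kind, payload = propose(history)  # kind: "ask", "test", or "diagnose"
        if kind == "test":
            cost = test_cost(payload)
            if spent + cost > budget:
                # Cost check failed: record the refusal so the proposer can adapt.
                history.append(("note", f"declined {payload}: over budget"))
                continue
            spent += cost
        if kind == "diagnose":
            # Self-check: require some gathered evidence before committing.
            evidence = [h for h in history if h[0] in ("ask", "test")]
            if len(evidence) >= min_evidence:
                return {"diagnosis": payload, "spent": spent, "history": history}
            history.append(("note", "diagnosis deferred: not enough evidence"))
            continue
        history.append((kind, payload, execute((kind, payload))))
    return {"diagnosis": None, "spent": spent, "history": history}


def scripted_propose(history):
    # A fixed script replaces the language model for this example.
    script = [
        ("diagnose", "pneumonia"),        # premature, will be deferred
        ("ask", "duration of symptoms"),
        ("test", "ct scan"),              # exceeds the budget, will be declined
        ("test", "chest x-ray"),
        ("diagnose", "pneumonia"),
    ]
    return script[min(len(history), len(script) - 1)]


prices = {"ct scan": 1200.0, "chest x-ray": 120.0}  # made-up prices
result = run_orchestrator(
    propose=scripted_propose,
    execute=lambda action: f"finding for {action[1]}",
    test_cost=lambda name: prices[name],
    budget=200.0,
)
```

Keeping the cost check and the self-check outside the proposer is one way to make the loop model-agnostic: any function that maps the history to a next action can be swapped in.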
MAI-DxO boosted the diagnostic performance of every model we tested. The best performing setup was MAI-DxO paired with OpenAI’s o3, which correctly solved 85.5% of the NEJM benchmark cases. For comparison, we also evaluated 21 practicing physicians from the US and UK, each with 5-20 years of clinical experience. On the same tasks, these experts achieved a mean accuracy of 20% across completed cases.
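As a quick check on how these two figures relate, the ratio can be computed directly. The variable names are our own; the two accuracy values are the ones quoted above.

```python
# Accuracy figures quoted in the text, in percent.
orchestrator_accuracy = 85.5  # MAI-DxO paired with o3
physician_accuracy = 20.0     # mean across the 21 practicing physicians

ratio = orchestrator_accuracy / physician_accuracy
more_than_four_times = ratio > 4
```

The ratio comes out at roughly 4.3, consistent with a rate more than four times higher than the physician group.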
MAI-DxO is configurable, enabling it to operate within defined cost constraints. This allows for explicit exploration of the cost-value trade-offs inherent in diagnostic decision making. Without such constraints, an AI system might otherwise default to ordering every possible test – regardless of cost, patient discomfort, or delays in care. Importantly, we found that MAI-DxO delivered both higher diagnostic accuracy and lower overall testing costs than physicians or any individual foundation model tested.
Comparison of AI powered diagnostic agents by accuracy and average diagnostic test cost per case. Top performing agents appear toward the top left quadrant, reflecting higher accuracy and lower cost. The lower dotted line represents the performance range of the best individual foundation models. The purple line traces the performance of MAI-DxO across different configurations. The red cross indicates the average performance of 21 practicing physicians.
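One way to read a chart like this is to pick out the agents that no other agent beats on both cost and accuracy. The sketch below computes that set; the function name and every data point are hypothetical and are not taken from the figure.

```python
def pareto_frontier(agents):
    """Return agents not dominated on (lower cost, higher accuracy)."""
    frontier = []
    # Sort by cost ascending, breaking ties by accuracy descending.
    for name, cost, accuracy in sorted(agents, key=lambda a: (a[1], -a[2])):
        # Keep an agent only if it beats every cheaper agent on accuracy.
        if not frontier or accuracy > frontier[-1][2]:
            frontier.append((name, cost, accuracy))
    return frontier


# Made-up (name, average cost per case in USD, accuracy) tuples.
agents = [
    ("agent A", 2000.0, 0.50),
    ("agent B", 2500.0, 0.80),
    ("agent C", 7000.0, 0.75),
    ("agent D", 3000.0, 0.20),
    ("agent E", 7500.0, 0.85),
]
frontier_names = [name for name, _, _ in pareto_frontier(agents)]
```

With these invented numbers, agents C and D drop out because agent B is both cheaper and more accurate than either.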
What’s Next?
Physicians are typically characterized by the breadth or depth of their expertise. Generalists, like family physicians, manage a wide array of conditions across ages and organ systems. Specialists, such as rheumatologists, focus deeply on a single system, disease area or even condition. No single physician, however, can span the full complexity of the NEJM case series. AI, on the other hand, doesn’t face this trade-off. It can blend both breadth and depth of expertise, demonstrating clinical reasoning capabilities that, in many respects, exceed those of any individual physician.
This kind of reasoning has the potential to reshape healthcare. AI could empower patients to self-manage routine aspects of care and equip clinicians with advanced decision support for complex cases. Our findings also suggest that AI could reduce unnecessary healthcare costs. U.S. health spending is nearing 20% of U.S. GDP, with up to 25% of that estimated to be wasted, having little influence on patient outcomes.
Of course, our research has important limitations. Although MAI-DxO excels at tackling the most complex diagnostic challenges, further testing is needed to assess its performance on more common, everyday presentations. Clinicians in our study worked without access to colleagues, textbooks, or even generative AI, which may feature in their normal clinical practice. This was done to enable a fair comparison to raw human performance.
A novel aspect of this work is its attention to cost. While real-world health costs vary across geographies and systems, and include many downstream factors that we don’t account for, we apply a consistent methodology across all agents and physicians evaluated to help quantify high level trade-offs between diagnostic accuracy and resource use.
For us, this is just the first step. We’re energized by the opportunities ahead. Important challenges remain before generative AI can be safely and responsibly deployed across healthcare. We need evidence drawn from real clinical environments, alongside appropriate governance and regulatory frameworks to ensure reliability, safety, and efficacy. That’s why we’re partnering with leading health organizations to rigorously test and validate these approaches—an essential step before any broader roll out.
Together with our partners, we strongly believe that the future of healthcare will be shaped by augmenting human expertise and empathy with the power of machine intelligence. We are excited to take the next steps in making that vision a reality.
Further information
SD Bench and MAI-DxO are research demonstrations only and are not currently available as public benchmarks or orchestrators. You can find more detail on the underlying methodology and results in a pre-print paper published alongside this blog. We are in the process of submitting this work for external peer review and are actively working with partners to explore the potential to release SD Bench as a public benchmark.
Acknowledgments
We are grateful to NEJM Group for permission to use the NEJM cases in the research reported in this blog post. The research described here has benefited from the insights of many people. We are grateful to the authors named on the arXiv paper and the wider team at MAI. We also thank further colleagues both inside and outside of Microsoft for sharing their insights, including Bryan Bunning, Nando de Freitas, Andrija Milicevic, Hoifung Poon, David Rhew, Karén Simonyan, Eric Topol, and Jim Weinstein. Gianluca Fontana and Kevin Hawkins (Prova Health) provided support on the health economics and outcomes section.
Q&A
Is this AI safe to use for healthcare?
The work presented here is not yet approved for clinical use and would only be approved after rigorous safety testing, clinical validation, and regulatory reviews. For now, this represents exciting initial research. At the heart of any plans to deploy this technology in the real world is our commitment to safety, trust, and quality, ensuring that any healthcare solutions are clinically grounded, ethically designed, and transparently communicated.
Will AI replace doctors?
While AI is becoming a powerful tool in healthcare, our team of practicing clinicians believes AI represents a complement to doctors and other health professionals. While this technology is advancing rapidly, their clinical roles are much broader than simply making a diagnosis. They need to navigate ambiguity and build trust with patients and their families in a way that AI isn’t set up to do. Clinical roles will, we believe, evolve with AI, giving clinicians the ability to automate routine tasks, identify diseases earlier, personalize treatment plans, and potentially prevent some diseases altogether. For consumers, AI will provide better tools for self-management and shared decision making.
What is an AI orchestrator?
In the context of generative AI, an orchestrator is like a digital conductor, helping to coordinate multiple steps in achieving a complex task. In healthcare, the role of orchestration is crucial given the high stakes of each decision. Our orchestrator sits above the underlying language models, making sure each step in reaching a diagnosis is handled systematically, reducing the risk of errors and offering the stability, consistency and transparency needed to ultimately build trust with users.
Why have you looked at costs?
We initially wanted to understand whether the AI was simply requesting excessive diagnostic workups to reach the right diagnosis. What we found was that our Orchestrator was able to reach the correct answer with much less money spent on testing. In some ways this is not a surprise as diagnostic over-testing is recognized as being a widespread challenge, accounting for millions of unnecessary tests annually in the US. This work suggests AI creates an opportunity for clinicians – and consumers – to reach a faster, more accurate diagnosis while reducing costs.
Related Stories
AI companions will change our lives
An AI companion for everyone
We’re living through a technological paradigm shift. In a few short years, our computers have learned to speak our languages, see what we see and hear what we hear.
Yet technology for its own sake counts for nothing. What matters is how it feels to people and what impact it has on societies. It’s about how it changes lives, opens doors, expands minds and relieves pressure. It is perhaps the greatest amplifier of human well-being in history, one of the most effective ways to create tangible and lasting benefits for billions of people.
And yet technology is, and must always remain, in service to humanity: an enabler and a path to deepening our common bonds and shared understanding, our energy and imagination, our creativity and our capacity for everything from invention to forming relationships.
Copilot will be there for you, in your corner, by your side and always strongly aligned with your interests.
Mustafa Suleyman, CEO Microsoft AI
In the field of AI, we often get caught up in the technical details. We spend our time talking about parameters and compute. The focus is on training runs, datacenters and the latest techniques. This is natural and inevitable when operating on the frontiers of something new, where the details do really matter. But I think it’s important that in doing all of this, getting stuck right in the technical weeds, we don’t lose sight of not only what we are building, but why we are building it.
At Microsoft AI, we are creating an AI companion for everyone.
I truly believe we can create a calmer, more helpful and supportive era of technology, quite unlike anything we’ve seen before. Great technology experiences are about how you feel, not what’s under the hood. It should be about what you experience, not what we are building.
Copilot will be there for you, in your corner, by your side and always strongly aligned with your interests. It understands the context of your life, while safeguarding your privacy, data and security, remembering the details that are most helpful in any situation. It gives you access to a universe of knowledge, simplifying and decluttering the daily barrage of information, and offering support and encouragement when you want it.
“Some people worry that AI will diminish what makes us unique as humans. My life’s work has been to ensure it does precisely the opposite.”
Over time it’ll adapt to your mannerisms and develop capabilities built around your preferences and needs. We are not creating a static tool so much as establishing a dynamic, emergent and evolving interaction. It will provide you with unwavering support to help you show up the way you really want in your everyday life, a new means of facilitating human connections and accomplishments alike.
With your permission, Copilot will ultimately be able to act on your behalf, smoothing life’s complexities and giving you more time to focus on what matters to you. It’ll be an advocate for you in many of life’s most important moments. It’ll accompany you to that doctor’s appointment, take notes and follow up at the right time. It’ll share the load of planning and preparing for your child’s birthday party. And it’ll be there at the end of the day to help you think through a tricky life decision.
Some people worry that AI will diminish what makes us unique as humans. My life’s work has been to ensure it does precisely the opposite. We choose what we create. This is something we must do together. Our task is to ensure AI always enriches people’s lives and strengthens our bonds with others, while supporting our uniqueness and endlessly complex humanity.
This is a new era of technology that doesn’t just “solve problems”; it’s there to support you, teach you and help you. In this sense, Copilot really is different from the last wave of the web and mobile. This is the beginning of a fundamental shift in what’s possible for all of us. It’s a long journey that will take years. With our latest updates to Copilot, you are seeing only the first careful steps in this direction.
Patience and care with our deployments are at the very foundation of our approach. My commitment is to be accountable at every stage, work with you and listen to you.
Respect and deep compassion for our users and for society is the core purpose behind everything we do. It comes first. This is a journey we promise to take together. I couldn’t be more excited to embark on it with you.
Related Stories
What is AI anyway?
When it comes to artificial intelligence, what are we actually creating? Even those closest to its development are struggling to describe exactly where things are headed, says Microsoft AI CEO Mustafa Suleyman, one of the primary architects of the AI models many of us use today. He offers an honest and compelling new vision for the future of AI, proposing an unignorable metaphor — a new digital species — to focus attention on this extraordinary moment. (Followed by a Q&A with head of TED Chris Anderson)