MAI-Voice-2

Turn text into expressive, natural-sounding speech in seconds.

Features

MAI-Voice-2 produces natural, expressive speech from text or a short reference clip, with built-in guardrails ensuring only authorized, consented voices can be used.

Realistic expression

Organic pacing, tone, and emotional range that sound like a person, not a text-to-speech engine.

Voices: Acacia, Elm, Birch, Grove

Emotions: Joy, Anger, Disgust, Fear, Sadness

Instant voice matching

Capture any voice from a short reference clip, no fine-tuning needed.
Stable, high-fidelity output that preserves speaker consistency across audiobooks, podcasts, and lectures.
Lectures
Audiobooks
Podcasts
Courses
Documentaries

Natural and expressive across 15 languages

Fluid, emotionally rich speech in 15 languages, without sacrificing quality.

German

Spanish

French

Hindi

Indonesian

Italian

Korean

Dutch

Portuguese

Russian

Thai

Turkish

Vietnamese

Chinese

A white egret stands in a dark, tranquil pond surrounded by lush, colorful foliage and lily pads, with the bird’s reflection visible in the water.

Using the Model

Text-to-speech made natural

Shakespearean Wisdom

Behold the silver wanderer of the reeds, gliding soft upon the mirrored dark. With patient poise it waits between the worlds of water, wind, and whispered evening light. A creature not in haste, yet never still, teaching us grace through every careful step.

Motivational Trainer

Alright, time to focus. Notice how the egret doesn’t rush the moment, it studies it. Every movement is deliberate, every pause intentional. That’s discipline. That’s control. So when the opportunity appears, you can strike without hesitation. Patience earns the catch.


Sports Commentator

With everything on the line, the egret makes its move! Slow through the shallows… watching… waiting… And it’s a sudden strike! Got it! Incredible precision from the long beak! The fish never saw it coming. What a scene! Complete composure under pressure. A masterclass performance here in the pond tonight.

Performance

Leading in expressiveness and naturalness

MAI-Voice-2 delivers expressive real-time and long-form generation, with stable output and low latency.

Listen across languages

Joy Example

Sadness Example

Version Comparison

MAI-Voice-2

  • Languages supported: 15
  • Voice cloning: Multilingual
  • Price: $0.22 per 1M characters

MAI-Voice-1

  • Languages supported: 1
  • Voice prompting / cloning: English
  • Price: $0.22 per 1M characters
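
To get a feel for the listed price of $0.22 per 1M characters, here is a minimal sketch of the arithmetic. It assumes cost scales linearly with character count; how characters are actually counted, and any rounding, minimums, or tiers, are not stated on this page. The function name `estimate_cost_usd` is hypothetical and used only for illustration.

```python
from decimal import Decimal

# Listed price in USD per 1,000,000 characters (taken from the comparison above).
PRICE_PER_MILLION_CHARS = Decimal("0.22")


def estimate_cost_usd(num_chars: int) -> Decimal:
    """Rough cost estimate, assuming strictly linear per-character pricing.

    Assumption: no minimum charge, rounding, or tiering is applied.
    """
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return Decimal(num_chars) * PRICE_PER_MILLION_CHARS / Decimal(1_000_000)


if __name__ == "__main__":
    # A 500,000-character manuscript, roughly the size of a full-length book.
    print(f"${estimate_cost_usd(500_000)}")
```

Under these assumptions, a 500,000-character manuscript works out to 500,000 × 0.22 / 1,000,000 = $0.11.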

Try MAI-Voice-2

MAI Playground

Experiment with all other MAI models.
Try in Playground

Microsoft Foundry (Azure Speech)

Build and deploy MAI-Voice with Azure Speech.
Try in Azure Speech