What is Mistral Voxtral?

Voxtral is an open-source text-to-speech (TTS) model from Mistral that produces natural-sounding speech. In native-speaker blind tests, it outperformed ElevenLabs with 63% preference on standard voices and roughly 70% on custom voices. Importantly, it's small enough to run on edge devices including smartwatches.

How does it compare to existing TTS models?

ElevenLabs has been the gold standard for high-quality AI voice generation. Voxtral's blind test results — 63% preference over ElevenLabs among native speakers — represent a significant leap for an open-source model. Most open-source TTS has historically lagged proprietary offerings by a meaningful margin.

What did Cohere release the same day?

Cohere launched Transcribe, an open-source speech-to-text model that reached #1 on HuggingFace's speech recognition leaderboard. Combined with Voxtral, March 27 saw major open-source advances in both directions of the voice AI stack — generation and transcription — in a single day.

Mistral's New Open-Source TTS Model Beats ElevenLabs — and Fits on a Smartwatch

Mistral has released Voxtral, an open-source text-to-speech model that beats ElevenLabs in native-speaker blind tests and is compact enough to run on a smartwatch. The release landed on the same day Cohere launched Transcribe, an open-source speech-to-text model that hit the top of HuggingFace's leaderboard.

In a single day, the open-source community produced credible challengers to the leading proprietary models on both ends of the voice AI stack.

Voxtral's Performance

The headline number: in blind tests with native speakers, Voxtral was preferred over ElevenLabs 63% of the time on standard voices and approximately 70% of the time on custom voices. Both figures sit well above the 50% that would indicate no preference. ElevenLabs has been the benchmark for commercial-quality AI voice generation, and Mistral's model is beating it.
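
To put a preference rate like 63% in context, it helps to ask how many listener judgments sit behind it. The article does not report a sample size, so the sketch below uses hypothetical counts (30 and 200 judgments) purely for illustration. The function name `wilson_interval` is also an assumption of this example and is not part of Mistral's evaluation. It computes a standard Wilson score interval for a binomial proportion.

```python
import math


def wilson_interval(wins: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = wins / trials
    z2_n = z * z / trials
    denom = 1 + z2_n
    # The interval is centered on a value shrunk slightly toward 0.5.
    center = (p + z2_n / 2) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical sample sizes: the source does not say how many judgments were collected.
for trials in (30, 200):
    wins = round(0.63 * trials)  # about 63% preference
    low, high = wilson_interval(wins, trials)
    print(f"n={trials}: {wins / trials:.0%} preferred, 95% interval {low:.2f} to {high:.2f}")
```

Under these made-up counts, the smaller sample's interval still reaches below 50% while the larger one does not, which is why the number of listeners matters as much as the headline percentage.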

The size achievement is equally notable. Running a high-quality TTS model on a smartwatch would have seemed implausible a year ago — voice generation is typically compute-intensive. Mistral's compression and efficiency work has pushed Voxtral into genuinely edge-deployable territory.

That combination — better than the leading commercial product, runs locally on constrained hardware — describes exactly the kind of open-source capability jump that disrupts market dynamics. Companies and developers building voice applications can now deploy a model that sounds better than the dominant commercial alternative, for free, with no API costs and no data leaving the device.

Cohere Transcribe: The Other Direction

Cohere's Transcribe took the top spot on HuggingFace's speech-to-text leaderboard on release day. While Mistral addressed voice generation (text-to-speech), Cohere addressed voice recognition (speech-to-text) — together, the two releases cover the full voice interface stack.

HuggingFace leaderboard position on launch day doesn't always reflect sustained performance as the community does more thorough testing. Still, a first-day #1 ranking for Cohere's model, landing the same day Mistral's model beat ElevenLabs in blind tests, is a meaningful signal about where open-source voice capabilities have arrived.

The Voice Layer Heats Up

These releases are part of a broader pattern accelerating this week. Sanas crossed $60 million in annual recurring revenue with its real-time translation product across 13 languages. Google launched Gemini 3.1 Flash Live, its highest-quality voice model, powering a global rollout of Search Live. Apple is opening Siri to rival AI assistants via a new Extensions framework in iOS 27.

Voice is no longer a secondary feature of AI platforms. It's becoming the primary interface for a significant portion of AI interactions — in cars, on wearables, through smart speakers, and increasingly through the phone's native assistant layer.

The open-source advancement matters because voice AI has historically been more proprietary than text generation. Proprietary vendors have dominated voice with products like ElevenLabs' text-to-speech and Speech-to-Speech offerings and OpenAI's voice modes. Voxtral and Transcribe represent the moment when open-source voice caught up with the best proprietary offerings or, in Voxtral's case, appears to have surpassed them.

What This Means for Developers

For anyone building a voice-enabled application, today's releases are a straightforward upgrade path. Voxtral delivers ElevenLabs-beating quality without per-character API costs. Transcribe provides top-of-leaderboard speech recognition without cloud dependency.

The edge deployment story — Voxtral fitting on a smartwatch — opens markets that were previously inaccessible. Offline voice applications, privacy-first voice interfaces, embedded hardware with no cloud connectivity: all of these become significantly more viable with a TTS model that matches commercial quality while running locally.

The year of voice AI started months ago. Today it got a lot more open.

