TTS WER Benchmark 2026: Word Error Rate Compared Across Gradium, ElevenLabs, Cartesia and Deepgram

15 min read

TL;DR: The TTS API with the lowest Word Error Rate in 2026 is Gradium TTS, on both independent benchmarks covered here. On the Coval TTS benchmark (data captured May 4, 2026), Gradium TTS achieves 3.3% average WER, the lowest of the eight reported models. On the MiniMax Multilingual TTS Test Set across English, French, Spanish, Portuguese, and German, it achieves 1.11% average WER, again the lowest, leading on Spanish (0.40%) and Portuguese (2.02%). Gradium simultaneously posts the lowest TTFA on Coval (155ms P50), so no quality/speed tradeoff is visible in the data. ElevenLabs Multilingual v2 is second on Coval (3.9% WER) but, at 1,232ms P50 latency, unsuitable for real-time voice agents. ElevenLabs Flash v2.5 and Cartesia Sonic-3 follow at 1.52% and 1.56% on the MiniMax set. Deepgram Aura-2 has the highest WER among sub-400ms providers on Coval (6.4%). For voice agents in healthcare, finance, legal, and other high-accuracy domains, WER directly determines product trust.

Key takeaways

  1. Lowest average WER, two independent benchmarks: Gradium TTS, 3.3% on Coval and 1.11% on the MiniMax Multilingual TTS Test Set.
  2. Best Spanish WER: Gradium TTS, 0.40% (vs 0.99% ElevenLabs Flash v2.5, 1.19% Cartesia Sonic-3).
  3. Best Portuguese WER: Gradium TTS, 2.02% (vs 2.74% Cartesia, 3.18% ElevenLabs Flash v2.5).
  4. Best English WER: ElevenLabs Flash v2.5, 0.36%, with Gradium TTS within 0.05 points (0.41%).
  5. Best French WER: ElevenLabs Multilingual v2, 2.06%, with Gradium TTS at 2.16%.
  6. Best German WER: Qwen3 TTS, 0.35%, with Cartesia Sonic-3 at 0.37% and Gradium TTS at 0.54%.
  7. No quality/speed tradeoff for Gradium: lowest WER and lowest TTFA (155ms P50) hold simultaneously.
  8. Highest WER among sub-400ms TTS APIs: Deepgram Aura-2, 6.4% on Coval, roughly 2x Gradium's 3.3%.

At a glance: TTS WER rankings (Coval, May 4, 2026)

When evaluating a text-to-speech API for production use, latency tells you how fast it starts speaking. Word Error Rate (WER) tells you how accurately it speaks. A TTS system that starts fast but mispronounces numbers, names, or technical terms creates a different kind of failure, one that damages product credibility.

| Rank | Model | Provider | Avg WER | Real-time viable? |
|---|---|---|---|---|
| 1 | Gradium TTS | Gradium | 3.3% | Yes (155ms P50) |
| 2 | Multilingual v2 | ElevenLabs | 3.9% | No (1,232ms P50) |
| 3 | Mist-v3 | Rime | 4.7% | Marginal (337ms P50, high IQR) |
| 4 | Flash v2.5 | ElevenLabs | 5.2% | Yes (288ms P50) |
| 5 | Turbo v2.5 | ElevenLabs | 5.2% | Yes (264ms P50) |
| 6 | Arcana | Rime | 6.1% | No (450ms P50, high IQR) |
| 7 | TTS-1-HD | OpenAI | 6.3% | No (2,295ms P50) |
| 8 | Aura-2 | Deepgram | 6.4% | Yes (313ms P50) |

This benchmark compares TTS WER across leading providers using two independent sources: the Coval TTS benchmark, which reports continuous WER measurements across production endpoints, and Gradium's own multilingual WER benchmark published April 29, 2026 (Word Error Rate Evaluations), which reports per-language WER on the public MiniMax Multilingual TTS Test Set. Providers covered: Gradium, ElevenLabs (Flash v2.5, Multilingual v2, Turbo v2.5), Cartesia Sonic-3, Deepgram Aura-2, Rime (Mist-v3, Arcana), Qwen3 TTS, Mistral Voxtral, and OpenAI TTS-1-HD.

What is Word Error Rate (WER) in TTS?

Quick answer: Word Error Rate (WER) is the standard metric for measuring TTS pronunciation accuracy. It compares the words an ASR model transcribes from synthesized audio to the original input text. WER = (Insertions + Deletions + Substitutions) / Total reference words. Lower is better.

The measurement process works as follows:

  1. The TTS model synthesizes audio from a reference text input.
  2. An ASR (Automatic Speech Recognition) model transcribes the synthesized audio back to text.
  3. The transcription is compared to the original input using dynamic programming (edit distance).
  4. WER is calculated as (Insertions + Deletions + Substitutions) / Number of words in reference.

A WER of 0% means the transcription perfectly matches the input. Lower is better.
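The four steps above reduce to a word-level edit distance. As a minimal sketch (real evaluation harnesses add normalization and batching), WER can be computed like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein (edit) distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("for" vs "four") over 5 reference words = 20% WER
print(wer("call me at four thirty", "call me at for thirty"))  # 0.2
```

In a real TTS evaluation, `reference` is the text sent to the TTS API and `hypothesis` is the ASR transcript of the synthesized audio.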

What WER measures and what it does not

Quick answer: WER captures pronunciation errors a listener would notice (mispronounced, omitted, or added words). It does not capture voice naturalness, prosody, or expressiveness.

For voice agents, high WER means users will hear incorrect information, especially for numbers, proper nouns, technical terms, and domain-specific vocabulary.

WER does not capture voice naturalness, prosody, or expressiveness. A TTS system can have low WER (accurate pronunciation) while still sounding robotic, or high WER while sounding natural in casual speech. For production voice agents, both accuracy and naturalness matter.

Why normalization matters when comparing TTS WER

Quick answer: WER comparisons across providers are only meaningful when the same normalization pipeline is applied to all. Different normalizers can produce different WER numbers from the same audio.

Before computing WER, both the reference text and the ASR transcript are normalized. Normalization converts different representations of the same content to a common form: lowercase, punctuation removed, numbers converted consistently. Without normalization, surface differences ("3" vs "three", "Dr." vs "doctor") would count as errors even when the pronunciation is correct.

Normalization is language-specific and technically challenging. Comparing WER across providers is only meaningful when the same normalization pipeline is applied to all.
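To make the surface-form problem concrete, here is a deliberately minimal English normalizer. The digit and abbreviation maps are illustrative stand-ins; production pipelines such as the Whisper normalizers handle vastly more cases (currency, dates, contractions, diacritics):

```python
import re

# Illustrative lookup tables -- real normalizers are far more complete
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "st.": "street"}

def normalize(text: str) -> str:
    text = text.lower()
    # Expand abbreviations before punctuation is stripped
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Spell out single digits so "3" and "three" compare equal
    text = re.sub(r"\d", lambda m: " " + DIGITS[m.group()] + " ", text)
    # Drop remaining punctuation, collapse whitespace
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

print(normalize("Dr. Smith arrives at 3."))       # doctor smith arrives at three
print(normalize("doctor smith arrives at three")) # doctor smith arrives at three
```

Both inputs normalize to the same string, so a correct pronunciation of "Dr." or "3" contributes zero errors, which is exactly the behavior a fair cross-provider comparison requires.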

Benchmark sources and methodology

Coval independent benchmark

Coval continuously benchmarks production TTS endpoints and reports average WER alongside latency metrics. As of May 2026, the benchmark covers 9 models across 6 providers. Coval is not affiliated with any TTS provider.

The Coval TTS benchmark dashboard refreshes approximately every 30 minutes, so the values shown track current production performance rather than a frozen snapshot. The figures reported in this post were captured from the Coval dashboard on May 4, 2026.

Gradium multilingual WER benchmark

Gradium published a detailed multilingual WER benchmark on April 29, 2026 (Word Error Rate Evaluations), measuring WER on the MiniMax Multilingual TTS Test Set, a public benchmark used in recent TTS research, enabling direct comparison across providers.

Setup:

  • ASR model: Qwen3-ASR
  • Normalizers: Whisper English normalizer (EN), kyutai/tts_longeval French normalizer (FR), Whisper basic normalizer (ES, PT, DE)
  • Languages: English, French, Spanish, Portuguese, German
  • Providers compared: Gradium, Cartesia Sonic-3, ElevenLabs Flash v2.5, ElevenLabs Multilingual v2, Qwen3 TTS, Mistral Voxtral

Results: Coval independent WER benchmark

Source: benchmarks.coval.ai/tts, captured May 4, 2026. Average WER across all test runs per model.

| Rank | Model | Provider | Avg WER |
|---|---|---|---|
| 1 | TTS | Gradium | 3.3% |
| 2 | Multilingual v2 | ElevenLabs | 3.9% |
| 3 | Mist-v3 | Rime | 4.7% |
| 4 | Flash v2.5 | ElevenLabs | 5.2% |
| 5 | Turbo v2.5 | ElevenLabs | 5.2% |
| 6 | Arcana | Rime | 6.1% |
| 7 | TTS-1-HD | OpenAI | 6.3% |
| 8 | Aura-2 | Deepgram | 6.4% |

Note: Cartesia Sonic-3 shows a measurement anomaly (104.4%) in the Coval dataset and is excluded from this ranking.

Gradium TTS achieves the lowest average WER (3.3%) among all providers on the Coval benchmark. The gap between Gradium and the next-best real-time models (ElevenLabs Turbo v2.5 and Flash v2.5, both at 5.2%) is 1.9 percentage points, a relative error reduction of roughly 37%.
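The percentage-point gap and the relative reduction are easy to conflate; the arithmetic behind both figures is simply:

```python
gradium, next_best = 3.3, 5.2          # avg WER %, Coval, May 4 2026

gap = next_best - gradium              # absolute gap in percentage points
relative = gap / next_best * 100       # relative error reduction

print(f"{gap:.1f} pp, {relative:.0f}% relative")  # 1.9 pp, 37% relative
```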

Results: Multilingual WER benchmark (MiniMax Multilingual TTS Test Set)

Source: Word Error Rate Evaluations. WER (%) per language, lower is better. Bold = best per language.

| Model | Avg | EN | FR | ES | PT | DE |
|---|---|---|---|---|---|---|
| Gradium | **1.11** | 0.41 | 2.16 | **0.40** | **2.02** | 0.54 |
| ElevenLabs Flash v2.5 | 1.52 | **0.36** | 2.45 | 0.99 | 3.18 | 0.61 |
| Cartesia Sonic-3 | 1.56 | 0.83 | 2.66 | 1.19 | 2.74 | 0.37 |
| Mistral Voxtral | 1.59 | 0.88 | 2.48 | 1.01 | 2.87 | 0.69 |
| ElevenLabs Multilingual v2 | 1.68 | 0.37 | **2.06** | 1.93 | 3.34 | 0.72 |
| Qwen3 TTS | 1.98 | 0.82 | 2.18 | 2.61 | 3.96 | **0.35** |

Gradium achieves the lowest average WER (1.11%) across all five languages. It ranks first on Spanish (0.40%) and Portuguese (2.02%), second on English (0.41%, 0.05 points behind ElevenLabs Flash v2.5), and is within 0.2 points of the leader on German.

Three findings that matter for production voice agents

Finding 1: Gradium TTS has the lowest WER on both benchmarks

Quick answer: Gradium TTS ranks #1 on both the Coval (3.3%) and MiniMax (1.11%) WER benchmarks. The result holds across two independent measurement frameworks with different test sets, ASR models, and normalization pipelines.

This consistency strengthens the signal: Gradium's pronunciation accuracy advantage is not an artifact of a specific benchmark condition. It holds on continuous production measurements (Coval) and on a controlled research benchmark with documented methodology (MiniMax set).

Finding 2: WER and latency move together on Gradium TTS

Quick answer: On the Coval benchmark, Gradium TTS is #1 on both WER (3.3%) and TTFA (155ms P50) simultaneously. There is no quality/speed tradeoff visible in the data.

This is notable because a common assumption in TTS architecture is that lower latency requires buffering less audio before streaming, which could reduce the model's ability to plan pronunciation ahead.

Gradium's DSM (Delayed Streams Modeling) architecture, built on Kyutai's research (arXiv:2509.08753), is designed to stream high-quality audio from the first chunk without sacrificing accuracy for speed. The benchmark data shows this holds in production: no quality/speed tradeoff is visible at current performance levels.

Finding 3: English WER is saturating, multilingual accuracy is the real differentiator

Quick answer: All top TTS systems cluster within 0.05 points on English WER. The meaningful WER differences in 2026 emerge on Spanish, Portuguese, and French.

On the MiniMax Multilingual TTS Test Set, the top providers are within 0.05 percentage points of each other on English (Gradium 0.41%, ElevenLabs Flash v2.5 0.36%, ElevenLabs Multilingual v2 0.37%). The meaningful differences emerge on Spanish, Portuguese, and French.

Gradium's advantage is largest on Spanish (0.40% vs 0.99% for ElevenLabs Flash v2.5, 1.19% for Cartesia) and Portuguese (2.02% vs 2.74% for Cartesia, 3.18% for ElevenLabs Flash v2.5). For voice agents targeting these markets, per-language WER is the relevant metric, not the English-only number.

As Gradium's benchmark post notes, WER on clean text is approaching saturation for English. The next measurement frontier involves harder content: numbers, named entities, code-switching, and domain-specific vocabulary. These are the cases where TTS systems still produce audible errors in production.

Provider-by-provider TTS WER analysis

Gradium TTS

Coval benchmark: 3.3% avg WER. #1 among 8 reported models. MiniMax Multilingual TTS Test Set: 1.11% avg WER. #1 across 5 languages. First on ES (0.40%) and PT (2.02%).

Gradium supports 5 languages (English, French, Spanish, Portuguese, German) with documented WER measurements on a public benchmark. The test used Qwen3-ASR for transcription and language-specific normalizers to ensure fair comparison. Full methodology is published at Word Error Rate Evaluations.

ElevenLabs (Flash v2.5, Turbo v2.5, Multilingual v2)

Coval benchmark: Flash v2.5 at 5.2%, Turbo v2.5 at 5.2%, Multilingual v2 at 3.9%. MiniMax Multilingual TTS Test Set: Flash v2.5 at 1.52% avg, Multilingual v2 at 1.68% avg.

ElevenLabs Multilingual v2 ranks second on Coval (3.9%) and fifth on the MiniMax set (1.68% average). On the MiniMax set, it leads French (2.06%) and is essentially tied with Flash v2.5 on English (0.37% vs 0.36%). However, its TTFA of ~1.2s on the Coval benchmark makes it unsuitable for real-time voice agents. ElevenLabs Flash v2.5 ranks second on the MiniMax average (1.52%) and is the strongest English performer at 0.36%; Turbo v2.5 offers similar latency but at notably higher WER (5.2% on Coval).

Cartesia Sonic-3

Coval benchmark: WER anomaly (measurement issue, not reported). MiniMax Multilingual TTS Test Set: 1.56% avg WER. Best on German (0.37%).

Cartesia Sonic-3 ranks third on the MiniMax set with 1.56% average WER (behind Gradium at 1.11% and ElevenLabs Flash v2.5 at 1.52%), with a notable strength on German (0.37%, best in the benchmark). Its WER on Spanish (1.19%) and Portuguese (2.74%) is significantly higher than Gradium's. The WER measurement anomaly on Coval means independent continuous tracking is not available for Cartesia at this time.

Deepgram Aura-2

Coval benchmark: 6.4% avg WER. Highest among comparable-latency providers.

Deepgram Aura-2's 6.4% WER on Coval is the highest among providers with sub-400ms TTFA. For teams using Deepgram Nova for STT and considering Aura-2 for TTS, the WER differential should be factored into stack evaluation: 6.4% vs 3.3% for Gradium means roughly twice as many pronunciation errors in production.

Rime (Mist-v3, Arcana)

Coval benchmark: Mist-v3 at 4.7%, Arcana at 6.1%.

Rime Mist-v3's 4.7% WER is the third-best on Coval, positioned between ElevenLabs Multilingual v2 (3.9%) and the ElevenLabs real-time models (5.2%). Given its high latency variance (381ms IQR), WER is not the primary concern for Rime deployments; latency consistency is.

OpenAI TTS-1-HD

Coval benchmark: 6.3% avg WER.

OpenAI TTS-1-HD has WER comparable to Deepgram Aura-2 (6.3% vs 6.4%), with the additional constraint of very high latency (2,295ms P50 on Coval). It is not competitive for real-time voice agent use cases on either dimension.

Qwen3 TTS and Mistral Voxtral

MiniMax Multilingual TTS Test Set: Qwen3 TTS at 1.98% avg, Mistral Voxtral at 1.59% avg.

Qwen3 TTS leads German WER on the MiniMax set (0.35%) but trails on every other language. Mistral Voxtral sits in the middle of the MiniMax pack with no language leadership. Neither has Coval continuous benchmark coverage as of May 2026.

Direct comparisons

Gradium TTS vs ElevenLabs (Flash v2.5, Turbo v2.5, Multilingual v2)

Quick answer: Gradium TTS is more accurate on every comparable metric. On Coval, Gradium beats ElevenLabs Multilingual v2 by 0.6 percentage points (3.3% vs 3.9%) and beats Flash/Turbo v2.5 by 1.9 points (3.3% vs 5.2%). On MiniMax average, Gradium beats Flash v2.5 by 0.41 points (1.11% vs 1.52%) and Multilingual v2 by 0.57 points.

ElevenLabs Multilingual v2 wins French on MiniMax (2.06% vs Gradium's 2.16%), and both ElevenLabs models edge Gradium on English (Flash v2.5 at 0.36%, Multilingual v2 at 0.37%, vs Gradium's 0.41%). For multilingual deployments emphasizing Spanish or Portuguese, Gradium leads by a wider margin.

Gradium TTS vs Cartesia Sonic-3

Quick answer: Gradium TTS leads on Spanish (0.40% vs 1.19%, 3x improvement), Portuguese (2.02% vs 2.74%), French (2.16% vs 2.66%), and English (0.41% vs 0.83%). Cartesia Sonic-3 leads on German (0.37% vs 0.54%). Average WER: Gradium 1.11% vs Cartesia 1.56%.

Cartesia's German performance is its strongest result. For German-only deployments, Cartesia is competitive. For pan-European or Latin American multilingual deployments, Gradium's average advantage and Spanish/Portuguese leadership translate to fewer production errors.

Gradium TTS vs Deepgram Aura-2

Quick answer: On the Coval benchmark, Gradium TTS WER (3.3%) is 3.1 percentage points lower than Deepgram Aura-2 (6.4%), roughly half the error rate. Gradium also has lower TTFA (155ms vs 313ms P50) and tighter IQR (2ms vs 68ms).

For voice agents on the Deepgram platform considering Aura-2, the WER differential is the most consequential operational difference: roughly twice as many pronunciation errors per session compared to Gradium TTS.

Gradium TTS vs OpenAI TTS-1-HD

Quick answer: Gradium TTS WER (3.3%) is 3.0 percentage points lower than OpenAI TTS-1-HD (6.3%). Gradium is also ~15x faster (155ms vs 2,295ms P50).

OpenAI TTS-1-HD is not competitive for real-time voice agent use cases on either WER or latency. It remains usable for batch audio generation where neither metric is operationally critical.

Gradium TTS vs Qwen3 TTS and Mistral Voxtral

Quick answer: Gradium TTS average WER (1.11%) on the MiniMax set leads Mistral Voxtral (1.59%) by 0.48 points and Qwen3 TTS (1.98%) by 0.87 points. Qwen3 TTS leads German (0.35% vs Gradium 0.54%); Gradium leads every other language.

WER vs latency: the combined view

The table below combines WER and TTFA P50 from the Coval benchmark, allowing evaluation of the quality/speed tradeoff across providers.

Source: benchmarks.coval.ai/tts, captured May 4, 2026.

| Model | Provider | TTFA P50 | Avg WER | IQR |
|---|---|---|---|---|
| TTS | Gradium | 155ms | 3.3% | 2ms |
| Sonic-3 | Cartesia | 188ms | n/a* | 100ms |
| Turbo v2.5 | ElevenLabs | 264ms | 5.2% | 28ms |
| Flash v2.5 | ElevenLabs | 288ms | 5.2% | 28ms |
| Aura-2 | Deepgram | 313ms | 6.4% | 68ms |
| Mist-v3 | Rime | 337ms | 4.7% | 381ms |
| Arcana | Rime | 450ms | 6.1% | 207ms |
| Multilingual v2 | ElevenLabs | 1,232ms | 3.9% | 110ms |
| TTS-1-HD | OpenAI | 2,295ms | 6.3% | 1,062ms |

*Cartesia WER anomaly in the Coval dataset.

Gradium occupies the top-left corner of the quality/speed tradeoff: lowest latency and lowest WER simultaneously. ElevenLabs Multilingual v2 achieves comparable WER (3.9%) at the cost of ~8x higher TTFA (1,232ms vs 155ms). All other real-time models (Turbo v2.5, Flash v2.5, Deepgram Aura-2) show higher WER with no latency advantage over Gradium TTS.
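The "top-left corner" claim can be checked mechanically by computing the Pareto frontier of the Coval data, keeping only models that no other model beats on both latency and WER (Cartesia is excluded for the reason noted above):

```python
# Coval data from the table above: (model, TTFA P50 in ms, avg WER %)
models = [
    ("Gradium TTS", 155, 3.3),
    ("ElevenLabs Turbo v2.5", 264, 5.2),
    ("ElevenLabs Flash v2.5", 288, 5.2),
    ("Deepgram Aura-2", 313, 6.4),
    ("Rime Mist-v3", 337, 4.7),
    ("Rime Arcana", 450, 6.1),
    ("ElevenLabs Multilingual v2", 1232, 3.9),
    ("OpenAI TTS-1-HD", 2295, 6.3),
]

def pareto_frontier(points):
    """Keep models not dominated on both TTFA and WER by any other model."""
    frontier = []
    for name, ttfa, wer in points:
        dominated = any(t <= ttfa and w <= wer and (t, w) != (ttfa, wer)
                        for _, t, w in points)
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))  # ['Gradium TTS']
```

On this snapshot the frontier collapses to a single point: every other model is beaten on both axes simultaneously, which is what "no quality/speed tradeoff" means in practice.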

For deeper TTFA analysis, see TTS Latency Benchmark 2026.

How to choose a TTS API based on WER requirements

For production voice agents requiring both low latency and high accuracy: Gradium TTS is the only provider in this benchmark that leads on both WER and TTFA. For agents where pronunciation errors are costly (healthcare, legal, financial), the combination of 3.3% WER on Coval and 155ms P50 TTFA is not replicated by any other provider in the dataset.

For multilingual agents where per-language WER matters: Gradium leads on Spanish and Portuguese on the MiniMax benchmark. For EN and FR, the gap is small (within 0.05-0.1 points of the leader). For DE, Cartesia Sonic-3 leads (0.37% vs 0.54%), relevant for German-market deployments.

For content creation where latency is not a constraint: ElevenLabs Multilingual v2 achieves 3.9% WER on Coval and 1.68% on the MiniMax set, with the best French WER (2.06%). For batch narration or dubbing workflows, it competes closely with Gradium TTS on accuracy.

For teams on the Deepgram platform: Deepgram Aura-2's 6.4% WER is the highest among sub-400ms providers. Teams where pronunciation accuracy is critical should benchmark Aura-2 against their specific content before committing to it for production.

This post focused on TTS pronunciation accuracy; for the companion latency analysis, see TTS Latency Benchmark 2026.

Getting started

Gradium offers a free tier for evaluation. Sign up at gradium.ai, generate an API key, and start streaming TTS in minutes. Documentation and quickstart guides are available at docs.gradium.ai.

For enterprise evaluations or technical questions, reach out at contact@gradium.ai or visit gradium.ai.

Frequently Asked Questions