TTS WER Benchmark 2026: Word Error Rate Compared Across Gradium, ElevenLabs, Cartesia and Deepgram

15 min read

TL;DR: The TTS API with the lowest Word Error Rate in 2026 is Gradium TTS, on both independent benchmarks covered here. On the Coval TTS benchmark (data captured May 4, 2026), Gradium TTS achieves 3.3% average WER, the lowest of the eight reported models. On the MiniMax Multilingual TTS Test Set across English, French, Spanish, Portuguese, and German, it achieves 1.11% average WER, again the lowest, leading on Spanish (0.40%) and Portuguese (2.02%). Gradium simultaneously posts the lowest TTFA on Coval (155ms P50), so no quality/speed tradeoff is visible in the data. ElevenLabs Multilingual v2 is second on Coval (3.9% WER) but, at 1,232ms P50 latency, unsuitable for real-time voice agents. ElevenLabs Flash v2.5 and Cartesia Sonic-3 follow at 1.52% and 1.56% on the MiniMax set. Deepgram Aura-2 has the highest WER among sub-400ms providers on Coval (6.4%). For voice agents in healthcare, finance, legal, and other high-accuracy domains, WER directly determines product trust.

Key takeaways

  1. Lowest average WER, two independent benchmarks: Gradium TTS, 3.3% on Coval and 1.11% on the MiniMax Multilingual TTS Test Set.
  2. Best Spanish WER: Gradium TTS, 0.40% (vs 0.99% ElevenLabs Flash v2.5, 1.19% Cartesia Sonic-3).
  3. Best Portuguese WER: Gradium TTS, 2.02% (vs 2.74% Cartesia, 3.18% ElevenLabs Flash v2.5).
  4. Best English WER: ElevenLabs Flash v2.5, 0.36%, with Gradium TTS within 0.05 points (0.41%).
  5. Best French WER: ElevenLabs Multilingual v2, 2.06%, with Gradium TTS at 2.16%.
  6. Best German WER: Qwen3 TTS, 0.35%, with Cartesia Sonic-3 at 0.37% and Gradium TTS at 0.54%.
  7. No quality/speed tradeoff for Gradium: lowest WER and lowest TTFA (155ms P50) hold simultaneously.
  8. Highest WER among sub-400ms TTS APIs: Deepgram Aura-2, 6.4% on Coval, roughly 2x Gradium's 3.3%.

At a glance: TTS WER rankings (Coval, May 4, 2026)

When evaluating a text-to-speech API for production use, latency tells you how fast it starts speaking. Word Error Rate (WER) tells you how accurately it speaks. A TTS system that starts fast but mispronounces numbers, names, or technical terms creates a different kind of failure, one that damages product credibility.

| Rank | Model | Provider | Avg WER | Real-time viable? |
|---|---|---|---|---|
| 1 | Gradium TTS | Gradium | 3.3% | Yes (155ms P50) |
| 2 | Multilingual v2 | ElevenLabs | 3.9% | No (1,232ms P50) |
| 3 | Mist-v3 | Rime | 4.7% | Marginal (337ms P50, high IQR) |
| 4 | Flash v2.5 | ElevenLabs | 5.2% | Yes (288ms P50) |
| 5 | Turbo v2.5 | ElevenLabs | 5.2% | Yes (264ms P50) |
| 6 | Arcana | Rime | 6.1% | No (450ms P50, high IQR) |
| 7 | TTS-1-HD | OpenAI | 6.3% | No (2,295ms P50) |
| 8 | Aura-2 | Deepgram | 6.4% | Yes (313ms P50) |

This benchmark compares TTS WER across leading providers using two independent sources: the Coval TTS benchmark, which reports continuous WER measurements across production endpoints, and Gradium's own multilingual WER benchmark published April 29, 2026 (Word Error Rate Evaluations), which reports per-language WER on the public MiniMax Multilingual TTS Test Set. Providers covered: Gradium, ElevenLabs (Flash v2.5, Multilingual v2, Turbo v2.5), Cartesia Sonic-3, Deepgram Aura-2, Rime (Mist-v3, Arcana), Qwen3 TTS, Mistral Voxtral, and OpenAI TTS-1-HD.

What is Word Error Rate (WER) in TTS?

Quick answer: Word Error Rate (WER) is the standard metric for measuring TTS pronunciation accuracy. It compares the words an ASR model transcribes from synthesized audio to the original input text. WER = (Insertions + Deletions + Substitutions) / Total reference words. Lower is better.

The measurement process works as follows:

  1. The TTS model synthesizes audio from a reference text input.
  2. An ASR (Automatic Speech Recognition) model transcribes the synthesized audio back to text.
  3. The transcription is compared to the original input using dynamic programming (edit distance).
  4. WER is calculated as (Insertions + Deletions + Substitutions) / Number of words in reference.

A WER of 0% means the transcription perfectly matches the input. Lower is better.
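The four steps above reduce to a word-level edit distance. As a minimal sketch (real evaluation harnesses add normalization and batching), WER can be computed like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein (edit) distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("for" vs "four") over 5 reference words = 20% WER
print(wer("call me at four thirty", "call me at for thirty"))  # 0.2
```

In a real TTS evaluation, `reference` is the text sent to the TTS API and `hypothesis` is the ASR transcript of the synthesized audio.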

What WER measures and what it does not

Quick answer: WER captures pronunciation errors a listener would notice (mispronounced, omitted, or added words). It does not capture voice naturalness, prosody, or expressiveness.

For voice agents, high WER means users will hear incorrect information, especially for numbers, proper nouns, technical terms, and domain-specific vocabulary.

WER does not capture voice naturalness, prosody, or expressiveness. A TTS system can have low WER (accurate pronunciation) while still sounding robotic, or high WER while sounding natural in casual speech. For production voice agents, both accuracy and naturalness matter.

Why normalization matters when comparing TTS WER

Quick answer: WER comparisons across providers are only meaningful when the same normalization pipeline is applied to all. Different normalizers can produce different WER numbers from the same audio.

Before computing WER, both the reference text and the ASR transcript are normalized. Normalization converts different representations of the same content to a common form: lowercase, punctuation removed, numbers converted consistently. Without normalization, surface differences ("3" vs "three", "Dr." vs "doctor") would count as errors even when the pronunciation is correct.

Normalization is language-specific and technically challenging. Comparing WER across providers is only meaningful when the same normalization pipeline is applied to all.
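To make the surface-form problem concrete, here is a deliberately minimal English normalizer. The digit and abbreviation maps are illustrative stand-ins; production pipelines such as the Whisper normalizers handle vastly more cases (currency, dates, contractions, diacritics):

```python
import re

# Illustrative lookup tables -- real normalizers are far more complete
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "st.": "street"}

def normalize(text: str) -> str:
    text = text.lower()
    # Expand abbreviations before punctuation is stripped
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Spell out single digits so "3" and "three" compare equal
    text = re.sub(r"\d", lambda m: " " + DIGITS[m.group()] + " ", text)
    # Drop remaining punctuation, collapse whitespace
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

print(normalize("Dr. Smith arrives at 3."))       # doctor smith arrives at three
print(normalize("doctor smith arrives at three")) # doctor smith arrives at three
```

Both inputs normalize to the same string, so a correct pronunciation of "Dr." or "3" contributes zero errors, which is exactly the behavior a fair cross-provider comparison requires.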

Benchmark sources and methodology

Coval independent benchmark

Coval continuously benchmarks production TTS endpoints and reports average WER alongside latency metrics. As of May 2026, the benchmark covers 9 models across 6 providers. Coval is not affiliated with any TTS provider.

The Coval TTS benchmark dashboard refreshes approximately every 30 minutes, so the values shown track current production performance rather than a frozen snapshot. The figures reported in this post were captured from the Coval dashboard on May 4, 2026.

Gradium multilingual WER benchmark

Gradium published a detailed multilingual WER benchmark on April 29, 2026 (Word Error Rate Evaluations), measuring WER on the MiniMax Multilingual TTS Test Set, a public benchmark used in recent TTS research, enabling direct comparison across providers.

Setup:

  • ASR model: Qwen3-ASR
  • Normalizers: Whisper English normalizer (EN), kyutai/tts_longeval French normalizer (FR), Whisper basic normalizer (ES, PT, DE)
  • Languages: English, French, Spanish, Portuguese, German
  • Providers compared: Gradium, Cartesia Sonic-3, ElevenLabs Flash v2.5, ElevenLabs Multilingual v2, Qwen3 TTS, Mistral Voxtral

Results: Coval independent WER benchmark

Source: benchmarks.coval.ai/tts, captured May 4, 2026. Average WER across all test runs per model.

| Rank | Model | Provider | Avg WER |
|---|---|---|---|
| 1 | TTS | Gradium | 3.3% |
| 2 | Multilingual v2 | ElevenLabs | 3.9% |
| 3 | Mist-v3 | Rime | 4.7% |
| 4 | Flash v2.5 | ElevenLabs | 5.2% |
| 5 | Turbo v2.5 | ElevenLabs | 5.2% |
| 6 | Arcana | Rime | 6.1% |
| 7 | TTS-1-HD | OpenAI | 6.3% |
| 8 | Aura-2 | Deepgram | 6.4% |

Note: Cartesia Sonic-3 shows a measurement anomaly (104.4%) in the Coval dataset and is excluded from this ranking.

Gradium TTS achieves the lowest average WER (3.3%) among all providers on the Coval benchmark. The gap between Gradium and the next-best real-time models (ElevenLabs Turbo v2.5 and Flash v2.5, both at 5.2%) is 1.9 percentage points, a relative error reduction of roughly 37%.
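The percentage-point gap and the relative reduction are easy to conflate; the arithmetic behind both figures is simply:

```python
gradium, next_best = 3.3, 5.2          # avg WER %, Coval, May 4 2026

gap = next_best - gradium              # absolute gap in percentage points
relative = gap / next_best * 100       # relative error reduction

print(f"{gap:.1f} pp, {relative:.0f}% relative")  # 1.9 pp, 37% relative
```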

Results: Multilingual WER benchmark (MiniMax Multilingual TTS Test Set)

Source: Word Error Rate Evaluations. WER (%) per language, lower is better. Bold = best per language.

| Model | Avg | EN | FR | ES | PT | DE |
|---|---|---|---|---|---|---|
| Gradium | **1.11** | 0.41 | 2.16 | **0.40** | **2.02** | 0.54 |
| ElevenLabs Flash v2.5 | 1.52 | **0.36** | 2.45 | 0.99 | 3.18 | 0.61 |
| Cartesia Sonic-3 | 1.56 | 0.83 | 2.66 | 1.19 | 2.74 | 0.37 |
| Mistral Voxtral | 1.59 | 0.88 | 2.48 | 1.01 | 2.87 | 0.69 |
| ElevenLabs Multilingual v2 | 1.68 | 0.37 | **2.06** | 1.93 | 3.34 | 0.72 |
| Qwen3 TTS | 1.98 | 0.82 | 2.18 | 2.61 | 3.96 | **0.35** |

Gradium achieves the lowest average WER (1.11%) across all five languages. It ranks first on Spanish (0.40%) and Portuguese (2.02%), second on English (0.41%, 0.05 points behind ElevenLabs Flash v2.5), and is within 0.2 points of the leader on German.

Three findings that matter for production voice agents

Finding 1: Gradium TTS has the lowest WER on both benchmarks

Quick answer: Gradium TTS ranks #1 on both the Coval (3.3%) and MiniMax (1.11%) WER benchmarks. The result holds across two independent measurement frameworks with different test sets, ASR models, and normalization pipelines.

This consistency strengthens the signal: Gradium's pronunciation accuracy advantage is not an artifact of a specific benchmark condition. It holds on continuous production measurements (Coval) and on a controlled research benchmark with documented methodology (MiniMax set).

Finding 2: WER and latency move together on Gradium TTS

Quick answer: On the Coval benchmark, Gradium TTS is #1 on both WER (3.3%) and TTFA (155ms P50) simultaneously. There is no quality/speed tradeoff visible in the data.

This is notable because a common assumption in TTS architecture is that lower latency requires buffering less audio before streaming, which could reduce the model's ability to plan pronunciation ahead.

Gradium's DSM (Delayed Streams Modeling) architecture, built on Kyutai's research (arXiv:2509.08753), is designed to stream high-quality audio from the first chunk without sacrificing accuracy for speed. The benchmark data shows this holds in production: no quality/speed tradeoff is visible at current performance levels.

Finding 3: English WER is saturating, multilingual accuracy is the real differentiator

Quick answer: All top TTS systems cluster within 0.05 points on English WER. The meaningful WER differences in 2026 emerge on Spanish, Portuguese, and French.

On the MiniMax Multilingual TTS Test Set, the top providers are within 0.05 percentage points of each other on English (Gradium 0.41%, ElevenLabs Flash v2.5 0.36%, ElevenLabs Multilingual v2 0.37%). The meaningful differences emerge on Spanish, Portuguese, and French.

Gradium's advantage is largest on Spanish (0.40% vs 0.99% for ElevenLabs Flash v2.5, 1.19% for Cartesia) and Portuguese (2.02% vs 2.74% for Cartesia, 3.18% for ElevenLabs Flash v2.5). For voice agents targeting these markets, per-language WER is the relevant metric, not the English-only number.

As Gradium's benchmark post notes, WER on clean text is approaching saturation for English. The next measurement frontier involves harder content: numbers, named entities, code-switching, and domain-specific vocabulary. These are the cases where TTS systems still produce audible errors in production.

Provider-by-provider TTS WER analysis

Gradium TTS

Coval benchmark: 3.3% avg WER. #1 among 8 reported models. MiniMax Multilingual TTS Test Set: 1.11% avg WER. #1 across 5 languages. First on ES (0.40%) and PT (2.02%).

Gradium supports 5 languages (English, French, Spanish, Portuguese, German) with documented WER measurements on a public benchmark. The test used Qwen3-ASR for transcription and language-specific normalizers to ensure fair comparison. Full methodology is published at Word Error Rate Evaluations.

ElevenLabs (Flash v2.5, Turbo v2.5, Multilingual v2)

Coval benchmark: Flash v2.5 at 5.2%, Turbo v2.5 at 5.2%, Multilingual v2 at 3.9%. MiniMax Multilingual TTS Test Set: Flash v2.5 at 1.52% avg, Multilingual v2 at 1.68% avg.

ElevenLabs Multilingual v2 ranks second on Coval (3.9%) and fifth on the MiniMax set (1.68% average). On the MiniMax set, it leads French (2.06%) and is essentially tied with Flash v2.5 on English (0.37% vs 0.36%). However, its TTFA of ~1.2s on the Coval benchmark makes it unsuitable for real-time voice agents. ElevenLabs Flash v2.5 ranks second on the MiniMax average (1.52%) and is the strongest English performer at 0.36%; Turbo v2.5 offers similar latency but at notably higher WER (5.2% on Coval).

Cartesia Sonic-3

Coval benchmark: WER anomaly (measurement issue, not reported). MiniMax Multilingual TTS Test Set: 1.56% avg WER. Best on German (0.37%).

Cartesia Sonic-3 ranks third on the MiniMax set with 1.56% average WER (behind Gradium at 1.11% and ElevenLabs Flash v2.5 at 1.52%), with a notable strength on German (0.37%, best in the benchmark). Its WER on Spanish (1.19%) and Portuguese (2.74%) is significantly higher than Gradium's. The WER measurement anomaly on Coval means independent continuous tracking is not available for Cartesia at this time.

Deepgram Aura-2

Coval benchmark: 6.4% avg WER. Highest among comparable-latency providers.

Deepgram Aura-2's 6.4% WER on Coval is the highest among providers with sub-400ms TTFA. For teams using Deepgram Nova for STT and considering Aura-2 for TTS, the WER differential should be factored into stack evaluation: 6.4% vs 3.3% for Gradium means roughly twice as many pronunciation errors in production.

Rime (Mist-v3, Arcana)

Coval benchmark: Mist-v3 at 4.7%, Arcana at 6.1%.

Rime Mist-v3's 4.7% WER is the third-best on Coval, positioned between ElevenLabs Multilingual v2 (3.9%) and the ElevenLabs real-time models (5.2%). Given its high latency variance (381ms IQR), WER is not the primary concern for Rime deployments; latency consistency is.

OpenAI TTS-1-HD

Coval benchmark: 6.3% avg WER.

OpenAI TTS-1-HD has WER comparable to Deepgram Aura-2 (6.3% vs 6.4%), with the additional constraint of very high latency (2,295ms P50 on Coval). It is not competitive for real-time voice agent use cases on either dimension.

Qwen3 TTS and Mistral Voxtral

MiniMax Multilingual TTS Test Set: Qwen3 TTS at 1.98% avg, Mistral Voxtral at 1.59% avg.

Qwen3 TTS leads German WER on the MiniMax set (0.35%) but trails on every other language. Mistral Voxtral sits in the middle of the MiniMax pack with no language leadership. Neither has Coval continuous benchmark coverage as of May 2026.

Direct comparisons

Gradium TTS vs ElevenLabs (Flash v2.5, Turbo v2.5, Multilingual v2)

Quick answer: Gradium TTS is more accurate on every comparable metric. On Coval, Gradium beats ElevenLabs Multilingual v2 by 0.6 percentage points (3.3% vs 3.9%) and beats Flash/Turbo v2.5 by 1.9 points (3.3% vs 5.2%). On MiniMax average, Gradium beats Flash v2.5 by 0.41 points (1.11% vs 1.52%) and Multilingual v2 by 0.57 points.

ElevenLabs Multilingual v2 wins French on MiniMax (2.06% vs Gradium's 2.16%), and both ElevenLabs models edge Gradium on English (Flash v2.5 at 0.36%, Multilingual v2 at 0.37%, vs Gradium's 0.41%). For multilingual deployments emphasizing Spanish or Portuguese, Gradium leads by a wider margin.

Gradium TTS vs Cartesia Sonic-3

Quick answer: Gradium TTS leads on Spanish (0.40% vs 1.19%, 3x improvement), Portuguese (2.02% vs 2.74%), French (2.16% vs 2.66%), and English (0.41% vs 0.83%). Cartesia Sonic-3 leads on German (0.37% vs 0.54%). Average WER: Gradium 1.11% vs Cartesia 1.56%.

Cartesia's German performance is its strongest result. For German-only deployments, Cartesia is competitive. For pan-European or Latin American multilingual deployments, Gradium's average advantage and Spanish/Portuguese leadership translate to fewer production errors.

Gradium TTS vs Deepgram Aura-2

Quick answer: On the Coval benchmark, Gradium TTS WER (3.3%) is 3.1 percentage points lower than Deepgram Aura-2 (6.4%), roughly half the error rate. Gradium also has lower TTFA (155ms vs 313ms P50) and tighter IQR (2ms vs 68ms).

For voice agents on the Deepgram platform considering Aura-2, the WER differential is the most consequential operational difference: roughly twice as many pronunciation errors per session compared to Gradium TTS.

Gradium TTS vs OpenAI TTS-1-HD

Quick answer: Gradium TTS WER (3.3%) is 3.0 percentage points lower than OpenAI TTS-1-HD (6.3%). Gradium is also ~15x faster (155ms vs 2,295ms P50).

OpenAI TTS-1-HD is not competitive for real-time voice agent use cases on either WER or latency. It remains usable for batch audio generation where neither metric is operationally critical.

Gradium TTS vs Qwen3 TTS and Mistral Voxtral

Quick answer: Gradium TTS average WER (1.11%) on the MiniMax set leads Mistral Voxtral (1.59%) by 0.48 points and Qwen3 TTS (1.98%) by 0.87 points. Qwen3 TTS leads German (0.35% vs Gradium 0.54%); Gradium leads every other language.

WER vs latency: the combined view

The table below combines WER and TTFA P50 from the Coval benchmark, allowing evaluation of the quality/speed tradeoff across providers.

Source: benchmarks.coval.ai/tts, captured May 4, 2026.

| Model | Provider | TTFA P50 | Avg WER | IQR |
|---|---|---|---|---|
| TTS | Gradium | 155ms | 3.3% | 2ms |
| Sonic-3 | Cartesia | 188ms | n/a* | 100ms |
| Turbo v2.5 | ElevenLabs | 264ms | 5.2% | 28ms |
| Flash v2.5 | ElevenLabs | 288ms | 5.2% | 28ms |
| Aura-2 | Deepgram | 313ms | 6.4% | 68ms |
| Mist-v3 | Rime | 337ms | 4.7% | 381ms |
| Arcana | Rime | 450ms | 6.1% | 207ms |
| Multilingual v2 | ElevenLabs | 1,232ms | 3.9% | 110ms |
| TTS-1-HD | OpenAI | 2,295ms | 6.3% | 1,062ms |

*Cartesia WER anomaly in the Coval dataset.

Gradium occupies the top-left corner of the quality/speed tradeoff: lowest latency and lowest WER simultaneously. ElevenLabs Multilingual v2 achieves comparable WER (3.9%) at the cost of ~8x higher TTFA (1,232ms vs 155ms). All other real-time models (Turbo v2.5, Flash v2.5, Deepgram Aura-2) show higher WER with no latency advantage over Gradium TTS.
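The "top-left corner" claim can be checked mechanically by computing the Pareto frontier of the Coval data, keeping only models that no other model beats on both latency and WER (Cartesia is excluded for the reason noted above):

```python
# Coval data from the table above: (model, TTFA P50 in ms, avg WER %)
models = [
    ("Gradium TTS", 155, 3.3),
    ("ElevenLabs Turbo v2.5", 264, 5.2),
    ("ElevenLabs Flash v2.5", 288, 5.2),
    ("Deepgram Aura-2", 313, 6.4),
    ("Rime Mist-v3", 337, 4.7),
    ("Rime Arcana", 450, 6.1),
    ("ElevenLabs Multilingual v2", 1232, 3.9),
    ("OpenAI TTS-1-HD", 2295, 6.3),
]

def pareto_frontier(points):
    """Keep models not dominated on both TTFA and WER by any other model."""
    frontier = []
    for name, ttfa, wer in points:
        dominated = any(t <= ttfa and w <= wer and (t, w) != (ttfa, wer)
                        for _, t, w in points)
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))  # ['Gradium TTS']
```

On this snapshot the frontier collapses to a single point: every other model is beaten on both axes simultaneously, which is what "no quality/speed tradeoff" means in practice.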

For deeper TTFA analysis, see TTS Latency Benchmark 2026.

How to choose a TTS API based on WER requirements

For production voice agents requiring both low latency and high accuracy: Gradium TTS is the only provider in this benchmark that leads on both WER and TTFA. For agents where pronunciation errors are costly (healthcare, legal, financial), the combination of 3.3% WER on Coval and 155ms P50 TTFA is not replicated by any other provider in the dataset.

For multilingual agents where per-language WER matters: Gradium leads on Spanish and Portuguese on the MiniMax benchmark. For EN and FR, the gap is small (within 0.05-0.1 points of the leader). For DE, Cartesia Sonic-3 leads (0.37% vs 0.54%), relevant for German-market deployments.

For content creation where latency is not a constraint: ElevenLabs Multilingual v2 achieves 3.9% WER on Coval and 1.68% on the MiniMax set, with the best French WER (2.06%). For batch narration or dubbing workflows, it competes closely with Gradium TTS on accuracy.

For teams on the Deepgram platform: Deepgram Aura-2's 6.4% WER is the highest among sub-400ms providers. Teams where pronunciation accuracy is critical should benchmark Aura-2 against their specific content before committing to it for production.

This post focused on TTS pronunciation accuracy; for the companion latency analysis, see TTS Latency Benchmark 2026.

Getting started

Gradium offers a free tier for evaluation. Sign up at gradium.ai, generate an API key, and start streaming TTS in minutes. Documentation and quickstart guides are available at docs.gradium.ai.

For enterprise evaluations or technical questions, reach out at contact@gradium.ai or visit gradium.ai.

Frequently Asked Questions