Gradium vs ElevenLabs for Voice Agents: TTFA, WER and IQR Compared (2026 Coval Data)

TL;DR: For real-time voice agents in 2026, Gradium TTS leads ElevenLabs on the three metrics that matter, measured on the independent Coval TTS benchmark (data captured May 4, 2026): 155ms P50 TTFA vs 264ms (ElevenLabs Turbo v2.5), 2ms IQR vs 28ms (14x tighter), and 3.3% average WER vs 5.2% (Turbo and Flash v2.5). On the MiniMax Multilingual TTS Test Set across English, French, Spanish, Portuguese, and German, Gradium TTS averages 1.11% WER vs 1.52% for ElevenLabs Flash v2.5 and 1.68% for Multilingual v2. ElevenLabs leads on two specific results worth noting: Multilingual v2 has the lowest French WER (2.06% vs 2.16%) and Flash v2.5 has the lowest English WER by 0.05 points (0.36% vs 0.41%). ElevenLabs Multilingual v2 comes close to Gradium's WER on Coval (3.9% vs 3.3%), but its 1,232ms P50 latency makes it unsuitable for real-time use. Gradium is also approximately 3-4x less expensive than ElevenLabs for comparable TTS volume. ElevenLabs covers 32 languages vs Gradium's 5 (EN, FR, ES, PT, DE).

Key takeaways

  1. Lowest TTFA: Gradium TTS at 155ms P50 on Coval, 109ms faster than ElevenLabs Turbo v2.5 (264ms) and 133ms faster than Flash v2.5 (288ms).
  2. Lowest IQR (most consistent latency): Gradium TTS at 2ms IQR, 14x tighter than ElevenLabs Turbo v2.5 and Flash v2.5 (28ms each), 55x tighter than Multilingual v2 (110ms).
  3. Lowest WER on Coval: Gradium TTS at 3.3%, vs ElevenLabs Multilingual v2 (3.9%), Flash v2.5 (5.2%) and Turbo v2.5 (5.2%).
  4. Lowest multilingual WER on MiniMax set: Gradium TTS at 1.11% average across 5 languages, ahead of ElevenLabs Flash v2.5 (1.52%) and Multilingual v2 (1.68%).
  5. ElevenLabs leads on two specific languages: Flash v2.5 leads English (0.36% vs Gradium 0.41%), Multilingual v2 leads French (2.06% vs Gradium 2.16%).
  6. Gradium leads Spanish (0.40% vs 0.99%) and Portuguese (2.02% vs 3.18%) vs ElevenLabs Flash v2.5 by wider margins.
  7. No real-time tradeoff for Gradium: lowest TTFA and lowest WER hold simultaneously. ElevenLabs forces a tradeoff: Multilingual v2 has competitive WER but 1,232ms P50 latency.
  8. Pricing: Gradium is approximately 3-4x less expensive than ElevenLabs for comparable TTS volume.
  9. Language coverage tradeoff: ElevenLabs supports 32 languages vs Gradium's 5 (EN, FR, ES, PT, DE).

Bottom line for voice agents

Quick verdict: For real-time voice agents in 2026, Gradium TTS is a better choice than any ElevenLabs model on the metrics that determine production user experience: TTFA, IQR, and WER. ElevenLabs is the better choice when (a) the product needs language coverage beyond English, French, Spanish, Portuguese, and German, or (b) the use case is batch content creation rather than real-time conversation. On pricing, Gradium is approximately 3-4x cheaper at comparable volume.

At a glance: Gradium vs ElevenLabs (Coval, May 4, 2026)

Gradium and ElevenLabs are two of the most frequently evaluated TTS providers by teams building real-time voice agents. They start from different positions: ElevenLabs is the established leader in voice quality for content creation, with three distinct models covering different latency and quality profiles. Gradium is a newer entrant built specifically for real-time voice agent infrastructure, with a single model optimized for streaming latency and pronunciation accuracy.

Metric                             | Gradium TTS            | ElevenLabs Turbo v2.5 | ElevenLabs Flash v2.5 | ElevenLabs Multilingual v2
TTFA P50 (Coval)                   | 155ms                  | 264ms                 | 288ms                 | 1,232ms
TTFA P75 (Coval)                   | 156ms                  | 279ms                 | 304ms                 | 1,288ms
IQR (Coval)                        | 2ms                    | 28ms                  | 28ms                  | 110ms
Avg WER (Coval)                    | 3.3%                   | 5.2%                  | 5.2%                  | 3.9%
Avg WER multilingual (MiniMax set) | 1.11%                  | n/a                   | 1.52%                 | 1.68%
Languages                          | 5 (EN, FR, ES, PT, DE) | 32                    | 32                    | 32
Pricing vs Gradium                 | Reference              | ~3-4x more expensive  | ~3-4x more expensive  | ~3-4x more expensive

This comparison uses independent benchmark data from Coval (benchmarks.coval.ai/tts) and Gradium's own published benchmarks (Time to First Audio, Word Error Rate Evaluations) to compare the two providers across the three metrics that determine real-time voice agent performance: TTFA (Time to First Audio), WER (Word Error Rate), and latency IQR (consistency).

TTFA: latency comparison

Quick answer: On Coval, Gradium TTS delivers 155ms P50 TTFA, 109ms faster than ElevenLabs Turbo v2.5 (264ms), 133ms faster than Flash v2.5 (288ms), and 1,077ms faster than Multilingual v2 (1,232ms). The advantage holds at every percentile measured.

Coval independent benchmark

On the Coval independent TTS benchmark, Gradium TTS achieves 155ms P50 TTFA, the fastest result among all 9 models tested, including all three ElevenLabs models.

  • ElevenLabs Turbo v2.5: 264ms P50 (+109ms vs Gradium)
  • ElevenLabs Flash v2.5: 288ms P50 (+133ms vs Gradium)
  • ElevenLabs Multilingual v2: 1,232ms P50 (+1,077ms vs Gradium)

The gap is consistent across percentiles. At P75:

  • Gradium TTS: 156ms
  • ElevenLabs Turbo v2.5: 279ms (+123ms)
  • ElevenLabs Flash v2.5: 304ms (+148ms)

ElevenLabs Flash v2.5 is positioned by ElevenLabs as their low-latency real-time model. In the Coval benchmark, it is actually slightly slower than Turbo v2.5 at both P50 (288ms vs 264ms) and P75 (304ms vs 279ms).

Gradium self-reported benchmark

Gradium published a controlled TTFA benchmark in March 2026, measured from the Paris office using WebSocket APIs with documented methodology (100 queries per model, first 5 discarded, controlled network latency). Source: Time to First Audio.
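
The methodology described above (100 queries per model, first 5 discarded) can be sketched roughly as follows. This is a minimal illustration, not Gradium's actual harness: `open_stream` is a hypothetical callable standing in for whatever client opens a streaming TTS request and yields audio chunks, and `fake_stream` simulates one locally so the sketch runs without network access.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def measure_ttfa_ms(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Milliseconds from issuing a request to receiving the first audio chunk."""
    start = time.perf_counter()
    for _chunk in open_stream():
        # Stop at the first chunk: TTFA ignores the rest of the stream.
        return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any audio arrived")


def run_benchmark(open_stream, queries: int = 100, discard: int = 5) -> List[float]:
    """Run `queries` requests and drop the first `discard` as warm-up."""
    samples = [measure_ttfa_ms(open_stream) for _ in range(queries)]
    return samples[discard:]


def summarize(samples: List[float]) -> Dict[str, float]:
    """P25/P50/P75/P95, linearly interpolated between data points."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p25": cuts[24], "p50": cuts[49], "p75": cuts[74], "p95": cuts[94]}


def fake_stream() -> Iterable[bytes]:
    """Hypothetical stand-in for a real TTS client: ~10ms until first chunk."""
    time.sleep(0.010)
    yield b"\x00" * 320


if __name__ == "__main__":
    print(summarize(run_benchmark(fake_stream, queries=20, discard=5)))
```

To compare providers this way, each real client would be wrapped in its own `open_stream` callable, with the timer started before the request is sent so that connection setup is included or excluded consistently across providers.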

Standard WebSocket (with connection establishment):

Model                      | P25   | P50   | P75   | P95
Gradium                    | 255ms | 258ms | 263ms | 274ms
ElevenLabs Turbo v2.5      | 294ms | 304ms | 311ms | 324ms
ElevenLabs Flash v2.5      | 317ms | 324ms | 333ms | 351ms
ElevenLabs Multilingual v2 | 690ms | 706ms | 720ms | 742ms

With WebSocket multiplexing (no per-turn connection overhead):

Model                      | P25   | P50   | P75   | P95
Gradium                    | 212ms | 214ms | 219ms | 228ms
ElevenLabs Turbo v2.5      | 248ms | 257ms | 263ms | 278ms
ElevenLabs Flash v2.5      | 271ms | 277ms | 284ms | 302ms
ElevenLabs Multilingual v2 | 643ms | 657ms | 672ms | 688ms

WebSocket multiplexing uses a persistent connection to eliminate the ~50ms per-turn connection overhead. With multiplexing, Gradium reaches 214ms P50 and ElevenLabs Turbo v2.5 reaches 257ms P50. Gradium's TTFA advantage holds across both scenarios.
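
As a quick check on the ~50ms figure, the per-turn connection overhead implied by the two tables above is the difference between the standard and multiplexed P50 values. The sketch below uses only the published numbers; the variable names are illustrative.

```python
# P50 TTFA in ms, copied from Gradium's self-reported tables above.
STANDARD_P50 = {
    "Gradium": 258,
    "ElevenLabs Turbo v2.5": 304,
    "ElevenLabs Flash v2.5": 324,
    "ElevenLabs Multilingual v2": 706,
}
MULTIPLEXED_P50 = {
    "Gradium": 214,
    "ElevenLabs Turbo v2.5": 257,
    "ElevenLabs Flash v2.5": 277,
    "ElevenLabs Multilingual v2": 657,
}

# Overhead removed by reusing a persistent connection, per model.
overhead_ms = {m: STANDARD_P50[m] - MULTIPLEXED_P50[m] for m in STANDARD_P50}

# Gradium's P50 lead over Turbo v2.5 in each scenario.
lead_over_turbo_ms = {
    "standard": STANDARD_P50["ElevenLabs Turbo v2.5"] - STANDARD_P50["Gradium"],
    "multiplexed": MULTIPLEXED_P50["ElevenLabs Turbo v2.5"] - MULTIPLEXED_P50["Gradium"],
}

if __name__ == "__main__":
    print(overhead_ms)
    print(lead_over_turbo_ms)
```

The implied overhead falls between 44ms and 49ms for every model, and Gradium's P50 lead over Turbo v2.5 is 46ms without multiplexing and 43ms with it.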

Both the Coval and Gradium benchmarks are consistent in their relative ranking: Gradium is faster than all three ElevenLabs models at every measured percentile. The absolute values differ between benchmarks due to different measurement infrastructure, network conditions, and text inputs, which is expected and documented.

IQR: latency consistency

Quick answer: On Coval, Gradium TTS has 2ms IQR (P25 154ms, P75 156ms). ElevenLabs Turbo v2.5 and Flash v2.5 have 28ms IQR (14x wider). Multilingual v2 has 110ms IQR (55x wider). Lower IQR means more uniform user experience across thousands of concurrent calls.

The IQR (interquartile range) measures the spread of latency values between P25 and P75, a direct indicator of how predictable TTS response time is in production.

Source: Coval TTS benchmark, captured May 4, 2026.

Model                      | P25     | P50     | P75     | IQR   | Std Dev
Gradium TTS                | 154ms   | 155ms   | 156ms   | 2ms   | 80ms
ElevenLabs Turbo v2.5      | 251ms   | 264ms   | 279ms   | 28ms  | 39ms
ElevenLabs Flash v2.5      | 276ms   | 288ms   | 304ms   | 28ms  | 40ms
ElevenLabs Multilingual v2 | 1,178ms | 1,232ms | 1,288ms | 110ms | n/a

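Gradium's IQR of 2ms means the middle 50% of requests (those between P25 and P75) complete within a 2ms window, from 154ms to 156ms. For the typical request, this is near-deterministic latency. IQR by construction ignores the tails, and the 80ms standard deviation in the table suggests occasional outliers outside that window.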

ElevenLabs Turbo v2.5 and Flash v2.5 both show 28ms IQR, 14 times wider than Gradium. In absolute terms this is moderate, but it means users will notice latency variation across turns in a conversation. At P75 of 279ms (Turbo) or 304ms (Flash), a significant fraction of turns approaches or exceeds the 300ms conversational threshold.

ElevenLabs Multilingual v2 shows 110ms IQR at latencies that are already far above real-time thresholds.
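
The IQR figures and the 14x and 55x ratios follow directly from the published quartiles. The sketch below recomputes them and also shows how IQR could be derived from raw latency samples; `iqr_ms` is an illustrative helper, and the interpolation method Coval uses for its percentiles is not stated, so the "inclusive" method here is an assumption.

```python
import statistics
from typing import Dict, Sequence, Tuple


def iqr_ms(samples: Sequence[float]) -> float:
    """Interquartile range (P75 - P25) of raw latency samples."""
    p25, _p50, p75 = statistics.quantiles(samples, n=4, method="inclusive")
    return p75 - p25


# (P25, P75) in ms from the Coval table above.
PUBLISHED: Dict[str, Tuple[int, int]] = {
    "Gradium TTS": (154, 156),
    "ElevenLabs Turbo v2.5": (251, 279),
    "ElevenLabs Flash v2.5": (276, 304),
    "ElevenLabs Multilingual v2": (1178, 1288),
}

iqr = {model: p75 - p25 for model, (p25, p75) in PUBLISHED.items()}
ratio_vs_gradium = {model: iqr[model] / iqr["Gradium TTS"] for model in iqr}

if __name__ == "__main__":
    for model in iqr:
        print(f"{model}: IQR {iqr[model]}ms, {ratio_vs_gradium[model]:.0f}x Gradium")
```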

Why IQR matters more than P50 in production: A voice agent handling thousands of concurrent sessions will see many turns land at P75 or P95, not just at P50. With Gradium's 2ms IQR, a P75 turn is only 2ms slower than a P25 turn, so the experience is uniform. With ElevenLabs' 28ms IQR (Turbo v2.5 and Flash v2.5), that spread is 28ms, so a meaningful portion of turns will be slower than the typical response.

WER: pronunciation accuracy

Quick answer: On Coval, Gradium TTS averages 3.3% WER vs ElevenLabs Multilingual v2 at 3.9% (real-time-unviable at 1,232ms latency), Turbo v2.5 at 5.2%, and Flash v2.5 at 5.2%. On the MiniMax Multilingual TTS Test Set across 5 languages, Gradium averages 1.11% vs ElevenLabs Flash v2.5 at 1.52% and Multilingual v2 at 1.68%.

Coval WER ranking

Source: benchmarks.coval.ai/tts, captured May 4, 2026.

Model                      | Avg WER (Coval)
Gradium TTS                | 3.3%
ElevenLabs Multilingual v2 | 3.9%
ElevenLabs Flash v2.5      | 5.2%
ElevenLabs Turbo v2.5      | 5.2%

Gradium achieves the lowest WER (3.3%). ElevenLabs Multilingual v2 is second at 3.9%, but at P50 latency of 1,232ms it is not usable for real-time voice agents. Among the real-time ElevenLabs models (Turbo v2.5 and Flash v2.5), WER is 5.2% on both, approximately 58% higher than Gradium (3.3%).
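
For reference, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference text and a transcript, divided by the number of reference words. The sketch below assumes lowercase, whitespace-based tokenization. How Coval normalizes text, and which speech recognizer transcribes the synthesized audio, is not stated here, so treat this as an illustration of the metric rather than the benchmark's pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In a TTS setting, the reference would be the input text and the hypothesis a transcript of the generated audio, so a lower WER indicates that the synthesized speech was transcribed closer to what was requested.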

Multilingual WER: MiniMax TTS Test Set

Source: Word Error Rate Evaluations. WER (%) per language. An asterisk (*) marks the best value per language among the models compared here.

Model                      | Avg  | EN    | FR    | ES    | PT    | DE
Gradium                    | 1.11 | 0.41  | 2.16  | 0.40* | 2.02* | 0.54*
ElevenLabs Flash v2.5      | 1.52 | 0.36* | 2.45  | 0.99  | 3.18  | 0.61
ElevenLabs Multilingual v2 | 1.68 | 0.37  | 2.06* | 1.93  | 3.34  | 0.72

Gradium leads on average (1.11% vs 1.52% for ElevenLabs Flash v2.5) and on Spanish, Portuguese, and German within this Gradium-vs-ElevenLabs comparison. ElevenLabs Flash v2.5 leads narrowly on English (0.36% vs 0.41% for Gradium, a gap of 0.05 percentage points). ElevenLabs Multilingual v2 leads on French (2.06% vs 2.16% for Gradium).

For teams building multilingual agents covering EN, FR, ES, PT, and DE: Gradium produces the best average WER across the five languages, with the clearest advantages on Spanish (0.40% vs 0.99%, ~2.5x lower error rate) and Portuguese (2.02% vs 3.18%, ~36% lower).
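
The averages and relative gaps quoted above can be reproduced from the per-language values. The sketch below uses only the numbers in the MiniMax table; the helper names are illustrative.

```python
# WER (%) per language from the MiniMax Multilingual TTS Test Set table.
WER = {
    "Gradium": {"EN": 0.41, "FR": 2.16, "ES": 0.40, "PT": 2.02, "DE": 0.54},
    "ElevenLabs Flash v2.5": {"EN": 0.36, "FR": 2.45, "ES": 0.99, "PT": 3.18, "DE": 0.61},
    "ElevenLabs Multilingual v2": {"EN": 0.37, "FR": 2.06, "ES": 1.93, "PT": 3.34, "DE": 0.72},
}


def average(model: str) -> float:
    """Unweighted mean WER across the five languages, rounded to 2 decimals."""
    values = WER[model].values()
    return round(sum(values) / len(values), 2)


def best_per_language() -> dict:
    """Model with the lowest WER for each language."""
    languages = WER["Gradium"].keys()
    return {lang: min(WER, key=lambda m: WER[m][lang]) for lang in languages}


def relative_reduction(lang: str, model: str, baseline: str) -> float:
    """Fraction by which `model` reduces WER relative to `baseline`."""
    return 1 - WER[model][lang] / WER[baseline][lang]


if __name__ == "__main__":
    print({m: average(m) for m in WER})
    print(best_per_language())
    print(f"PT: {relative_reduction('PT', 'Gradium', 'ElevenLabs Flash v2.5'):.0%} lower")
    print(f"ES: {WER['ElevenLabs Flash v2.5']['ES'] / WER['Gradium']['ES']:.1f}x lower")
```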

The combined picture: latency and accuracy together

Quick answer: Gradium TTS is the only provider on the Coval benchmark that achieves both the lowest TTFA (155ms P50) and the lowest WER (3.3%) simultaneously. Every ElevenLabs model forces a tradeoff: real-time speed (Turbo, Flash) at higher WER, or comparable WER (Multilingual v2) at non-real-time latency.

The question most teams building voice agents actually ask is: which provider gives the best accuracy without sacrificing response time?

On the Coval benchmark, Gradium is the only provider that achieves both the lowest TTFA (155ms P50) and the lowest WER (3.3%) simultaneously. No ElevenLabs model achieves both:

  • ElevenLabs Multilingual v2 achieves near-comparable WER (3.9%) but at 1,232ms P50, eight times slower than Gradium and far above the real-time threshold.
  • ElevenLabs Turbo v2.5 achieves real-time-viable latency (264ms) but at 5.2% WER, approximately 58% higher than Gradium.
  • ElevenLabs Flash v2.5 is positioned as the low-latency option but shows 288ms P50 and 5.2% WER on Coval: slower than Turbo v2.5 at the same WER.

This means teams using ElevenLabs for real-time voice agents face a forced tradeoff: either accept higher WER (Turbo/Flash) or accept latency that is 8x higher (Multilingual v2). Gradium does not require this tradeoff.
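
One way to make this tradeoff concrete is to filter models by a latency budget and then pick the lowest WER among those that qualify. The sketch below uses the Coval figures quoted above and the 300ms conversational threshold mentioned earlier as the budget; the `Model` type and `pick_lowest_wer` helper are illustrative, and applying the budget to P50 rather than a higher percentile is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Model:
    name: str
    ttfa_p50_ms: int
    wer_pct: float


# Coval figures (captured May 4, 2026) as quoted in this article.
MODELS = [
    Model("Gradium TTS", 155, 3.3),
    Model("ElevenLabs Turbo v2.5", 264, 5.2),
    Model("ElevenLabs Flash v2.5", 288, 5.2),
    Model("ElevenLabs Multilingual v2", 1232, 3.9),
]


def pick_lowest_wer(models: List[Model], budget_ms: float) -> Optional[Model]:
    """Lowest-WER model whose P50 TTFA fits the budget; ties go to the faster one."""
    eligible = [m for m in models if m.ttfa_p50_ms <= budget_ms]
    if not eligible:
        return None
    return min(eligible, key=lambda m: (m.wer_pct, m.ttfa_p50_ms))


if __name__ == "__main__":
    elevenlabs = [m for m in MODELS if m.name.startswith("ElevenLabs")]
    print(pick_lowest_wer(MODELS, 300))                # all providers, real-time budget
    print(pick_lowest_wer(elevenlabs, 300))            # ElevenLabs only, real-time budget
    print(pick_lowest_wer(elevenlabs, float("inf")))   # ElevenLabs only, no budget
```

With a 300ms budget, the ElevenLabs-only selection falls back to Turbo v2.5 at 5.2% WER. Multilingual v2 at 3.9% is selected only once the budget is lifted.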

Language coverage

Quick answer: ElevenLabs supports 32 languages across all three models. Gradium TTS supports 5 (English, French, Spanish, Portuguese, German). For products targeting any of those 5 languages, Gradium leads on average WER and latency. For products requiring broader coverage, ElevenLabs is the practical choice.

Gradium supports 5 languages: English, French, Spanish, Portuguese, and German. All five have documented WER measurements on the MiniMax Multilingual TTS Test Set.

ElevenLabs supports 32 languages across all three models. For products targeting markets outside Gradium's 5 supported languages, ElevenLabs is the option that provides broad coverage with documented TTS quality.

For products targeting EN, FR, ES, PT, or DE specifically: Gradium's per-language WER is lower on average, and its real-time latency performance (155ms P50, 2ms IQR) is superior to all ElevenLabs models in the Coval benchmark.

Pricing

Quick answer: Gradium is approximately 3-4x less expensive than ElevenLabs for comparable TTS volume. See the pricing page for plan details. All Gradium plans include voice cloning and WebSocket streaming.

For teams scaling voice agent deployments in production, the 3-4x pricing advantage compounds with the latency and WER advantages: lower cost per interaction, faster responses, and fewer pronunciation errors in the same package.

When to choose Gradium vs ElevenLabs

When Gradium is the right choice

Real-time voice agents in EN, FR, ES, PT, or DE: Gradium delivers the lowest TTFA (155ms P50 on Coval, 258ms on Gradium's own benchmark), the lowest WER (3.3% on Coval, 1.11% on MiniMax), and the most consistent latency (IQR 2ms) among all providers in the Coval benchmark. At 3-4x lower pricing than ElevenLabs, the production economics are favorable at scale.

Products where both latency and pronunciation accuracy matter: Gradium is the only provider in the Coval benchmark that achieves both the lowest latency and the lowest WER simultaneously. No ElevenLabs model replicates this combination.

High-concurrency deployments: Gradium supports 1,000+ concurrent sessions with 99.9% uptime SLA and cloud, dedicated, self-hosted, and on-premises deployment options including HIPAA-compliant infrastructure.

When ElevenLabs is the right choice

Languages beyond EN, FR, ES, PT, DE: ElevenLabs supports 32 languages. For products requiring broad language coverage outside Gradium's 5 supported languages, ElevenLabs is the primary option with documented TTS quality.

Content creation and narration: ElevenLabs Multilingual v2's voice quality and voice library make it well suited for batch audio generation (audiobooks, dubbing, narration) where latency is not a constraint and voice naturalness is the primary criterion.

Large pre-built voice library requirements: ElevenLabs offers an extensive catalog of pre-built voices across languages. For products requiring a wide selection of public voices without custom cloning, ElevenLabs' library is broader.

Getting started

Gradium offers a free tier for evaluation. Sign up at gradium.ai, generate an API key, and start streaming TTS in minutes. Documentation and quickstart guides are available at docs.gradium.ai.

For enterprise evaluations or technical questions, reach out at contact@gradium.ai or visit gradium.ai.

Frequently Asked Questions