TTS Latency Benchmark 2026: TTFA Compared Across Gradium, ElevenLabs, Cartesia and Deepgram


TL;DR: The lowest-latency TTS API in 2026 is Gradium TTS, at 155ms P50 Time To First Audio (TTFA) with a 2ms interquartile range (IQR) and 3.3% average word error rate (WER) on the independent Coval TTS benchmark (data captured May 4, 2026). Gradium ranks #1 on all three metrics across the nine TTS models tested. Cartesia Sonic-3 is second on raw P50 (188ms) but with a 100ms IQR (50x wider than Gradium). ElevenLabs Turbo v2.5 follows at 264ms P50 (28ms IQR) and Flash v2.5 at 288ms P50 (28ms IQR). Deepgram Aura-2 sits at 313ms P50 (68ms IQR). Rime Mist-v3, Rime Arcana, ElevenLabs Multilingual v2, and OpenAI TTS-1-HD show high variance unsuitable for real-time voice agents. For production voice agents, latency consistency (IQR) matters as much as median TTFA, because tail latency determines user experience quality across thousands of concurrent calls.

Key takeaways

  1. Lowest median latency: Gradium TTS, 155ms P50 TTFA on Coval (May 4, 2026 capture).
  2. Most consistent latency: Gradium TTS, 2ms IQR. The next best is ElevenLabs Turbo v2.5 and Flash v2.5 at 28ms IQR (14x wider).
  3. Lowest word error rate: Gradium TTS, 3.3% average WER. Lowest among the nine models tested.
  4. Best second option (raw speed): Cartesia Sonic-3, 188ms P50, but with 100ms IQR.
  5. Best second option (consistency): ElevenLabs Turbo v2.5, 264ms P50 with 28ms IQR.
  6. Not viable for real-time voice agents (P50 over 1s): ElevenLabs Multilingual v2 (1,232ms), OpenAI TTS-1-HD (2,295ms).
  7. The right metric for voice agents is TTFA, not time to first byte. Container headers (WAV, MP3 ID3) do not contain audio.
  8. WebSocket multiplexing reduces Gradium TTFA from 258ms to 214ms P50 (Gradium self-reported, Paris).

At a glance: TTS API latency rankings (Coval, May 4, 2026)

For a voice agent, latency is not just a performance metric; it is a product requirement. The gap between a user finishing a sentence and an agent beginning to respond averages around 200 milliseconds in natural human conversation. When TTS latency consistently pushes that gap past 300-400ms, the interaction stops feeling like a conversation and starts feeling like a phone tree.

| Rank | Model | Provider | P50 TTFA | IQR | Avg WER | Real-time viable? |
|---|---|---|---|---|---|---|
| 1 | Gradium TTS | Gradium | 155ms | 2ms | 3.3% | Yes |
| 2 | Sonic-3 | Cartesia | 188ms | 100ms | n/a | Yes (P75 borderline) |
| 3 | Turbo v2.5 | ElevenLabs | 264ms | 28ms | 5.2% | Yes |
| 4 | Flash v2.5 | ElevenLabs | 288ms | 28ms | 5.2% | Yes |
| 5 | Aura-2 | Deepgram | 313ms | 68ms | 6.4% | Yes (borderline) |
| 6 | Mist-v3 | Rime | 337ms | 381ms | 4.7% | Marginal |
| 7 | Arcana | Rime | 450ms | 207ms | 6.1% | No |
| 8 | Multilingual v2 | ElevenLabs | 1,232ms | 110ms | 3.9% | No (batch only) |
| 9 | TTS-1-HD | OpenAI | 2,295ms | 1,062ms | 6.3% | No (batch only) |

This benchmark compares the leading TTS APIs on the metric that determines this experience: Time to First Audio (TTFA). It draws from two sources: the independent Coval TTS benchmark, which continuously tests production TTS endpoints, and Gradium's own published benchmark (Time to First Audio: Measuring and reducing TTS latency in voice agents), which documents controlled methodology and P25 through P95 results. Providers covered: Gradium, ElevenLabs (Turbo v2.5, Flash v2.5, Multilingual v2), Cartesia Sonic-3, Deepgram Aura-2, Rime (Mist-v3, Arcana), and OpenAI TTS-1-HD.

What is TTFA and why it is the right metric for voice agents

Quick answer: Time to First Audio (TTFA) is the elapsed time between sending a TTS request and receiving the first playable audio sample. It is the only latency metric that correlates with the user-perceived responsiveness of a real-time voice agent.

Time to First Audio (TTFA) is distinct from total synthesis time (the time to generate the full audio) and from time to first byte (which includes container headers that carry no audio content).

For voice agents, TTFA is the correct latency metric because:

  • Audio playback starts as soon as the first chunk arrives. The agent can begin speaking while the rest of the sentence is still being synthesized.
  • Total synthesis time is irrelevant for streaming TTS: the user hears audio before generation completes.
  • Time to first byte is misleading: some providers return WAV headers or MP3 ID3 tags within milliseconds while the first actual audio arrives much later. A benchmark reporting time to first byte captures metadata delivery, not speech delivery.

Measuring TTFA correctly requires parsing past the container format. For WAV, discard the 44-byte header. For Ogg/Opus, skip identification and comment header pages. For MP3, skip ID3 tags and detect the first valid MPEG audio frame.
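As a minimal sketch of the WAV case (the simplest of the three formats, with its fixed 44-byte canonical PCM header), a TTFA measurement can count bytes across streamed chunks and record the clock only once the stream moves past the header. The chunk iterator and header size here are illustrative; Ogg/Opus and MP3 require real frame detection rather than a fixed offset.

```python
import time

WAV_HEADER_BYTES = 44  # canonical PCM WAV header size (illustrative)

def time_to_first_audio(chunks, header_bytes=WAV_HEADER_BYTES):
    """Seconds from start of iteration until the first byte PAST the
    container header arrives. `chunks` is any iterable of byte strings,
    e.g. a streaming HTTP or WebSocket response. Returns None if the
    stream ends before any audio byte is seen."""
    start = time.monotonic()
    seen = 0
    for chunk in chunks:
        seen += len(chunk)
        if seen > header_bytes:  # this chunk contains actual audio
            return time.monotonic() - start
    return None  # header-only stream: no audio was delivered

# Simulated stream: a bare 44-byte WAV header, then a chunk of samples.
fake_stream = iter([b"\x00" * 44, b"\x01" * 512])
assert time_to_first_audio(fake_stream) is not None
```

Note that a naive time-to-first-byte measurement would stop the clock on the first chunk here, even though it contains no audio at all.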

Why latency consistency matters as much as the median

Quick answer: P50 (median) describes typical latency. P75, P95, and IQR describe what users feel when the system is under load. A low P50 with a high P95 produces an inconsistent product experience.

For production voice agents handling thousands of concurrent sessions, a low P50 with a high P95 creates an inconsistent user experience: most calls feel fast, but a meaningful percentage feel broken.

The interquartile range (IQR), the gap between P25 and P75, is a direct measure of latency consistency. A low IQR means the TTS API delivers predictable latency regardless of load variation. A high IQR means latency spikes occur regularly in production.
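The quartiles and IQR can be computed directly from a list of TTFA samples with the standard library; the sample values below are illustrative, not benchmark data.

```python
from statistics import quantiles

def latency_spread(samples_ms):
    """Quartiles and IQR for a list of TTFA samples in milliseconds."""
    p25, p50, p75 = quantiles(samples_ms, n=4)  # default 'exclusive' method
    return {"p25": p25, "p50": p50, "p75": p75, "iqr": p75 - p25}

# A tight cluster like Gradium's (P25 154ms / P75 156ms) yields IQR = 2ms.
spread = latency_spread([154, 154, 155, 155, 155, 156, 156])
assert spread["iqr"] == 2.0
```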

What is a good TTFA for a voice agent in 2026?

Quick answer: Below 300ms P50 with a tight P25-P95 spread. Below 200ms P50 is excellent. Above 400ms P50 starts to feel non-conversational.

Natural human conversation has a modal turn-taking gap of around 200ms. End-to-end voice agent latency includes STT, LLM, tool calls, and TTS. TTS sits at the end of this chain, so its TTFA is added directly to user-perceived latency. Gradium TTS at 155ms P50 leaves headroom for the upstream stages.

Benchmark sources and methodology

Coval independent benchmark

Coval is an independent voice AI evaluation platform that continuously benchmarks production TTS endpoints. The TTS benchmark measures TTFA in real conditions, reporting P25, P50, P75, IQR, mean, median, and standard deviation across hundreds to over a thousand runs per model. Coval is not affiliated with any TTS provider.

The Coval TTS benchmark refreshes approximately every 30 minutes, so the values shown on the dashboard track current production performance rather than a frozen snapshot. The figures reported in this post were captured from the Coval dashboard on May 4, 2026.

As of May 2026, the Coval TTS benchmark includes 9 models: Gradium TTS, Cartesia Sonic-3, ElevenLabs Turbo v2.5, ElevenLabs Flash v2.5, ElevenLabs Multilingual v2, Deepgram Aura-2, Rime Mist-v3, Rime Arcana, and OpenAI TTS-1-HD.

Gradium self-reported benchmark

Gradium published a controlled TTFA benchmark in March 2026 (Time to First Audio) with documented methodology:

  • WebSocket APIs used for all providers (HTTP POST for OpenAI, which has no WebSocket API)
  • Standardized input text of 15-25 words
  • Same output format and sample rate across all providers
  • Measurements from the Gradium Paris office
  • Network latency controlled (~5ms ping to Gradium and ElevenLabs endpoints, ~3ms to OpenAI)
  • 100 queries per model, first 5 discarded (warm state)
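The query-loop portion of that methodology can be sketched as follows; `measure_once` stands in for a real synthesis request and is an assumption, not Gradium's harness.

```python
from statistics import quantiles

def run_benchmark(measure_once, runs=100, warmup=5):
    """Collect `runs` TTFA samples, discard the first `warmup` (cold
    connections and caches would skew the distribution), then report
    P25/P50/P75/P95. `measure_once` is any zero-argument callable
    returning a single TTFA sample in milliseconds."""
    samples = [measure_once() for _ in range(runs)]
    kept = samples[warmup:]  # warm-state measurements only
    p25, p50, p75 = quantiles(kept, n=4)
    p95 = quantiles(kept, n=20)[18]  # 19 cut points; index 18 is P95
    return {"p25": p25, "p50": p50, "p75": p75, "p95": p95}

# Stand-in for a real TTS call: a constant 155ms "measurement".
stats = run_benchmark(lambda: 155.0)
assert stats["p50"] == 155.0
```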

Results: Coval independent TTS benchmark

Model performance heatmap

All latency values in milliseconds. Source: benchmarks.coval.ai/tts, captured May 4, 2026. The Coval dashboard refreshes approximately every 30 minutes; current numbers may differ slightly.

| Model | Provider | P25 | P50 | P75 | IQR | Avg WER |
|---|---|---|---|---|---|---|
| TTS | Gradium | 154ms | 155ms | 156ms | 2ms | 3.3% |
| Sonic-3 | Cartesia | 168ms | 188ms | 269ms | 100ms | —* |
| Turbo v2.5 | ElevenLabs | 251ms | 264ms | 279ms | 28ms | 5.2% |
| Aura-2 | Deepgram | 274ms | 313ms | 342ms | 68ms | 6.4% |
| Flash v2.5 | ElevenLabs | 276ms | 288ms | 304ms | 28ms | 5.2% |
| Mist-v3 | Rime | 281ms | 337ms | 662ms | 381ms | 4.7% |
| Arcana | Rime | 430ms | 450ms | 636ms | 207ms | 6.1% |
| Multilingual v2 | ElevenLabs | 1,178ms | 1,232ms | 1,288ms | 110ms | 3.9% |
| TTS-1-HD | OpenAI | 1,870ms | 2,295ms | 2,932ms | 1,062ms | 6.3% |

*Cartesia WER shows a measurement anomaly in the Coval dataset and is not reported here.

Latency distribution statistics

Source: Coval latency variation charts.

| Model | Provider | Runs | Mean | Median | Std Dev |
|---|---|---|---|---|---|
| TTS | Gradium | 750 | 169ms | 155ms | 80ms |
| Sonic-3 | Cartesia | 1,471 | 226ms | 188ms | 118ms |
| Turbo v2.5 | ElevenLabs | 1,471 | 271ms | 264ms | 39ms |
| Flash v2.5 | ElevenLabs | 1,470 | 296ms | 288ms | 40ms |
| Aura-2 | Deepgram | 1,470 | 329ms | 314ms | 184ms |
| Mist-v3 | Rime | 1,464 | 734ms | 332ms | 674ms |
| Arcana | Rime | 1,467 | 632ms | 449ms | 720ms |

Results: Gradium self-reported benchmark

Standard WebSocket (with connection establishment)

Source: Time to First Audio. Measured from Paris, 100 queries per model, WebSocket APIs.

| Model | P25 | P50 | P75 | P95 |
|---|---|---|---|---|
| Gradium | 255ms | 258ms | 263ms | 274ms |
| ElevenLabs Turbo v2.5 | 294ms | 304ms | 311ms | 324ms |
| ElevenLabs Flash v2.5 | 317ms | 324ms | 333ms | 351ms |
| Mistral Voxtral TTS | 346ms | 369ms | 400ms | 566ms |
| OpenAI GPT-4o Mini TTS | 400ms | 420ms | 439ms | 483ms |
| ElevenLabs Multilingual v2 | 690ms | 706ms | 720ms | 742ms |
| OpenAI TTS-1 | 722ms | 969ms | 1,232ms | 1,807ms |

With WebSocket multiplexing (no connection overhead)

Using a persistent WebSocket connection with multiplexed sessions eliminates the ~50ms per-turn connection overhead.

| Model | P25 | P50 | P75 | P95 |
|---|---|---|---|---|
| Gradium | 212ms | 214ms | 219ms | 228ms |
| ElevenLabs Turbo v2.5 | 248ms | 257ms | 263ms | 278ms |
| ElevenLabs Flash v2.5 | 271ms | 277ms | 284ms | 302ms |
| ElevenLabs Multilingual v2 | 643ms | 657ms | 672ms | 688ms |

Three findings that matter for production voice agents

Finding 1: Gradium TTS delivers the lowest P50 TTFA in 2026

Quick answer: Gradium TTS is the lowest-latency TTS API in the 2026 benchmark cycle, at 155ms P50 TTFA on Coval and 258ms P50 (214ms with multiplexing) on Gradium's own Paris-based benchmark.

On the Coval independent benchmark, Gradium TTS achieves 155ms P50 TTFA, the fastest result in the benchmark. The next fastest is Cartesia Sonic-3 at 188ms (+33ms), followed by ElevenLabs Turbo v2.5 at 264ms (+109ms) and ElevenLabs Flash v2.5 at 288ms (+133ms).

On Gradium's self-reported benchmark (measured from Paris with documented methodology), Gradium achieves 258ms P50 standard WebSocket and 214ms P50 with multiplexing. Both benchmarks place Gradium ahead of all tested ElevenLabs models, Mistral Voxtral, and OpenAI TTS.

The difference in absolute values between the two benchmarks reflects different measurement conditions (infrastructure location, network proximity, text length). Both are consistent in the relative ranking: Gradium leads all tested providers.

Finding 2: Gradium TTS has the most consistent latency (IQR: 2ms)

Quick answer: Gradium TTS has a 2ms IQR on Coval, 14x tighter than ElevenLabs (28ms), 34x tighter than Deepgram (68ms), 50x tighter than Cartesia (100ms), and 531x tighter than OpenAI TTS-1-HD (1,062ms).

The most operationally significant result in the Coval benchmark is not the P50, it is the IQR of 2ms.

Latency IQR measures the spread between P25 and P75. A 2ms IQR means the middle 50% of all Gradium requests fall within a 2ms window between P25 and P75. This is near-deterministic latency.

By comparison:

  • ElevenLabs Turbo v2.5: 28ms IQR (14x wider than Gradium)
  • ElevenLabs Flash v2.5: 28ms IQR (14x wider)
  • Deepgram Aura-2: 68ms IQR (34x wider)
  • Cartesia Sonic-3: 100ms IQR (50x wider)
  • Rime Mist-v3: 381ms IQR (190x wider)
  • OpenAI TTS-1-HD: 1,062ms IQR (531x wider)

For production voice agents, latency consistency determines user experience quality more than median latency. A median of 155ms with 2ms IQR means the vast majority of turns feel identical. A median of 264ms with 28ms IQR means users notice variation. At 381ms IQR, latency spikes are a visible UX problem.

Finding 3: Gradium TTS has the lowest average WER

Quick answer: Gradium TTS achieves 3.3% average WER on Coval, the lowest of any TTS model in the benchmark. Lowest TTFA and lowest WER hold simultaneously.

On the Coval TTS benchmark, Gradium TTS achieves 3.3% average WER, the lowest of all providers in the benchmark. The ranking:

  1. Gradium TTS: 3.3%
  2. ElevenLabs Multilingual v2: 3.9%
  3. Rime Mist-v3: 4.7%
  4. ElevenLabs Flash v2.5: 5.2%
  5. ElevenLabs Turbo v2.5: 5.2%
  6. Rime Arcana: 6.1%
  7. OpenAI TTS-1-HD: 6.3%
  8. Deepgram Aura-2: 6.4%

This result is consistent with Gradium's own multilingual WER benchmark published on April 29, 2026 (Word Error Rate Evaluations), which reports 1.11% average WER on the MiniMax Multilingual TTS Test Set across five languages (EN, FR, ES, PT, DE), the lowest average across all providers in that benchmark as well.

A provider that achieves the lowest TTFA and the lowest WER simultaneously is not making a quality/speed tradeoff. Both metrics move together because Gradium's DSM (Delayed Streams Modeling) architecture is designed to stream high-quality audio from the first chunk rather than buffering for quality.

Provider-by-provider TTS latency analysis

Gradium TTS

Coval benchmark: P50 155ms, IQR 2ms, WER 3.3%. #1 on all three metrics among the 9 models tested.

Gradium is a real-time voice AI platform whose architecture is built on Kyutai's research (Delayed Streams Modeling, arXiv:2509.08753). The DSM architecture enables batched generation while preserving streaming capabilities, combined with CUDA graph optimization and configurable codebook depth (8, 16, or 32 codebooks) for quality/latency tradeoffs.

Gradium supports WebSocket multiplexing: a persistent connection handles multiple sessions without per-turn connection overhead, reducing TTFA from ~258ms to ~214ms P50 in production deployments. This is documented in Gradium's API documentation at docs.gradium.ai/guides/multiplexing.
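The client side of that pattern reduces to routing frames from one shared connection into per-session buffers. The sketch below shows only that routing logic; the frame shape (a dict with `session` and `audio` keys) is an assumption for illustration, not Gradium's actual wire format.

```python
from collections import defaultdict

class SessionMux:
    """Toy demultiplexer: many logical TTS turns share one persistent
    WebSocket, each inbound frame tagged with a session id, so no turn
    pays the connection handshake. Frame shape is hypothetical."""

    def __init__(self):
        self._audio = defaultdict(list)

    def feed(self, frame):
        # Assumed frame shape: {"session": str, "audio": bytes}
        self._audio[frame["session"]].append(frame["audio"])

    def audio_for(self, session_id):
        """All audio received so far for one logical session."""
        return b"".join(self._audio[session_id])

# Interleaved frames for two concurrent turns on one connection.
mux = SessionMux()
mux.feed({"session": "turn-1", "audio": b"\x01"})
mux.feed({"session": "turn-2", "audio": b"\x02"})
mux.feed({"session": "turn-1", "audio": b"\x03"})
assert mux.audio_for("turn-1") == b"\x01\x03"
```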

Deployment options: Cloud API (multiple regions), inference partner deployments, dedicated instances, self-hosted, and on-premises (HIPAA compliant).

Pricing: see the pricing page for plan details. All plans include voice cloning and WebSocket streaming.

ElevenLabs (Turbo v2.5, Flash v2.5, Multilingual v2)

Coval benchmark: Turbo v2.5 at P50 264ms, IQR 28ms, WER 5.2%. Flash v2.5 at P50 288ms, IQR 28ms, WER 5.2%. Multilingual v2 at P50 1,232ms, IQR 110ms, WER 3.9%.

ElevenLabs offers three models relevant to latency benchmarking. Turbo v2.5 is the fastest real-time model. Flash v2.5 was marketed as the low-latency option but in the Coval benchmark trails Turbo v2.5 slightly. Multilingual v2 is a high-quality model with near-human voice naturalness but is not suited for real-time voice agents given its ~1.2s P50 latency.

All ElevenLabs models show consistent IQR (28ms for Turbo and Flash), meaning their latency distribution is tighter than Deepgram or Rime, but substantially wider than Gradium.

Cartesia Sonic-3

Coval benchmark: P50 188ms, IQR 100ms.

Cartesia Sonic-3 is the second-fastest model on the Coval benchmark by P50 latency. However, its IQR of 100ms is 50x wider than Gradium's, and its P75 of 269ms means a significant fraction of requests approach or exceed the 300ms conversational threshold. Cartesia's SSM (State Space Model) architecture is designed for consistent P99 performance, but the Coval data shows meaningful latency spread in production conditions.

Deepgram Aura-2

Coval benchmark: P50 313ms, IQR 68ms, WER 6.4%.

Deepgram Aura-2 is positioned as a low-latency TTS option within the Deepgram platform (which also includes Nova STT). In the Coval benchmark, it falls in the middle of the latency rankings with the highest WER among the comparable-latency providers. Its standard deviation of 184ms indicates meaningful outliers in production conditions.

Rime (Mist-v3, Arcana)

Coval benchmark: Mist-v3 P50 337ms, IQR 381ms, WER 4.7%. Arcana P50 450ms, IQR 207ms, WER 6.1%.

Rime's Mist-v3 shows particularly high variance: mean of 734ms against a median of 332ms, with a standard deviation of 674ms. This gap between mean and median indicates significant outliers that pull the average up. The IQR of 381ms means latency is highly unpredictable in the P25-P75 range itself. Arcana shows similar variance patterns.

OpenAI TTS-1-HD

Coval benchmark: P50 2,295ms, IQR 1,062ms, WER 6.3%.

OpenAI TTS-1-HD is not suited for real-time voice agent applications given its P50 TTFA of over 2 seconds. It is included for completeness and is best suited for batch audio generation use cases where latency is not a constraint.

Direct comparisons

Gradium TTS vs ElevenLabs Turbo v2.5

Quick answer: Gradium TTS is 109ms faster on P50 (155ms vs 264ms) and 14x more consistent on IQR (2ms vs 28ms) on the Coval benchmark. Gradium also has a lower WER (3.3% vs 5.2%). ElevenLabs covers more languages (32 vs 5).

For voice agents prioritizing latency, consistency, and pronunciation accuracy, Gradium TTS leads. For voice agents that require coverage beyond English, French, German, Spanish, and Portuguese, ElevenLabs Turbo v2.5 is currently the best option that still fits the real-time latency budget.

Gradium TTS vs Cartesia Sonic-3

Quick answer: Gradium TTS is 33ms faster on P50 (155ms vs 188ms) and 50x more consistent on IQR (2ms vs 100ms). Cartesia Sonic-3 covers more languages (40+ vs 5).

Cartesia Sonic-3 places second on raw P50 latency, but its 100ms IQR means a significant fraction of requests cross the 300ms conversational threshold. For voice agents where tail latency matters, Gradium's 2ms IQR is the operationally safer choice. For multilingual content generation outside the five languages Gradium supports natively, Cartesia is competitive.

Gradium TTS vs Deepgram Aura-2

Quick answer: Gradium TTS is 158ms faster on P50 (155ms vs 313ms), 34x more consistent on IQR (2ms vs 68ms), and lower WER (3.3% vs 6.4%).

Deepgram Aura-2 has a use case for teams already running Deepgram Nova STT in their pipeline and prioritizing single-vendor integration over latency. For latency-critical deployments, Gradium TTS is a clear upgrade across all three measured metrics.

Gradium TTS vs OpenAI TTS-1-HD

Quick answer: Gradium TTS is 2,140ms faster on P50 (155ms vs 2,295ms) and 531x more consistent on IQR (2ms vs 1,062ms). OpenAI TTS-1-HD is not viable for real-time voice agents.

OpenAI TTS-1-HD is appropriate for batch audio generation (narration, dubbing, podcast assembly) where latency is not a constraint. It is not appropriate for conversational voice agents.

Gradium TTS vs Rime (Mist-v3, Arcana)

Quick answer: Gradium TTS is 182ms faster than Rime Mist-v3 on P50 (155ms vs 337ms) and 190x more consistent on IQR (2ms vs 381ms). Against Rime Arcana, Gradium is 295ms faster on P50 and 103x more consistent.

Rime's variance patterns make both Mist-v3 and Arcana risky for production voice agents at scale. The mean-vs-median gap (Mist-v3 mean 734ms vs median 332ms) indicates significant outliers.

How to choose a TTS API based on latency requirements

For real-time conversational voice agents: Gradium TTS delivers the best combination of P50 TTFA (155ms on Coval, 258ms self-reported standard WebSocket) and latency consistency (IQR 2ms). The IQR advantage is particularly relevant for production deployments where tail latency determines user experience quality.

For real-time agents with broad language requirements (32+ languages): ElevenLabs Turbo v2.5 at 264ms P50 and 28ms IQR is the best option when language coverage beyond Gradium's 5 languages is required. Flash v2.5 shows slightly higher P50 (288ms) in the Coval benchmark despite its low-latency positioning.

For lowest absolute latency among content-creation-focused models: Cartesia Sonic-3 at 188ms P50 is the second-fastest on Coval but with higher variance (100ms IQR). Its broader language coverage (40+) makes it relevant for teams prioritizing multilingual coverage at competitive TTFA.

For teams already on the Deepgram platform: Deepgram Aura-2 provides STT and TTS in the same vendor, reducing integration overhead. Its P50 of 313ms is acceptable for many voice agent use cases if Deepgram Nova STT is already in the stack.

For batch content generation (narration, dubbing): ElevenLabs Multilingual v2 at 3.9% WER offers strong voice quality for non-real-time applications where latency is not a constraint.

This post focused on TTFA benchmarks across leading TTS providers.

Getting started

Gradium offers a free tier for evaluation. Sign up at gradium.ai, generate an API key, and start streaming TTS in minutes. Documentation and quickstart guides are available at docs.gradium.ai.

For enterprise evaluations or technical questions, reach out at contact@gradium.ai or visit gradium.ai.

Frequently Asked Questions