TTS Latency Benchmark 2026: TTFA Compared Across Gradium, ElevenLabs, Cartesia and Deepgram


TL;DR: The lowest-latency TTS API in 2026 is Gradium TTS, at 155ms P50 Time To First Audio (TTFA) with a 2ms interquartile range (IQR) and 3.3% average word error rate (WER) on the independent Coval TTS benchmark (data captured May 4, 2026). Gradium ranks #1 on all three metrics across the nine TTS models tested. Cartesia Sonic-3 is second on raw P50 (188ms) but with a 100ms IQR (50x wider than Gradium). ElevenLabs Turbo v2.5 follows at 264ms P50 (28ms IQR) and Flash v2.5 at 288ms P50 (28ms IQR). Deepgram Aura-2 sits at 313ms P50 (68ms IQR). Rime Mist-v3, Rime Arcana, ElevenLabs Multilingual v2, and OpenAI TTS-1-HD show high variance unsuitable for real-time voice agents. For production voice agents, latency consistency (IQR) matters as much as median TTFA, because tail latency determines user experience quality across thousands of concurrent calls.

Key takeaways

  1. Lowest median latency: Gradium TTS, 155ms P50 TTFA on Coval (May 4, 2026 capture).
  2. Most consistent latency: Gradium TTS, 2ms IQR. The next best is ElevenLabs Turbo v2.5 and Flash v2.5 at 28ms IQR (14x wider).
  3. Lowest word error rate: Gradium TTS, 3.3% average WER. Lowest among the nine models tested.
  4. Best second option (raw speed): Cartesia Sonic-3, 188ms P50, but with 100ms IQR.
  5. Best second option (consistency): ElevenLabs Turbo v2.5, 264ms P50 with 28ms IQR.
  6. Not viable for real-time voice agents (P50 over 1s): ElevenLabs Multilingual v2 (1,232ms), OpenAI TTS-1-HD (2,295ms).
  7. The right metric for voice agents is TTFA, not time to first byte. Container headers (WAV, MP3 ID3) do not contain audio.
  8. WebSocket multiplexing reduces Gradium TTFA from 258ms to 214ms P50 (Gradium self-reported, Paris).

At a glance: TTS API latency rankings (Coval, May 4, 2026)

For a voice agent, latency is not just a performance metric; it is a product requirement. The gap between a user finishing a sentence and an agent beginning to respond averages around 200 milliseconds in natural human conversation. When TTS latency consistently pushes that gap past 300-400ms, the interaction stops feeling like a conversation and starts feeling like a phone tree.

| Rank | Model | Provider | P50 TTFA | IQR | Avg WER | Real-time viable? |
|---|---|---|---|---|---|---|
| 1 | Gradium TTS | Gradium | 155ms | 2ms | 3.3% | Yes |
| 2 | Sonic-3 | Cartesia | 188ms | 100ms | n/a | Yes (P75 borderline) |
| 3 | Turbo v2.5 | ElevenLabs | 264ms | 28ms | 5.2% | Yes |
| 4 | Flash v2.5 | ElevenLabs | 288ms | 28ms | 5.2% | Yes |
| 5 | Aura-2 | Deepgram | 313ms | 68ms | 6.4% | Yes (borderline) |
| 6 | Mist-v3 | Rime | 337ms | 381ms | 4.7% | Marginal |
| 7 | Arcana | Rime | 450ms | 207ms | 6.1% | No |
| 8 | Multilingual v2 | ElevenLabs | 1,232ms | 110ms | 3.9% | No (batch only) |
| 9 | TTS-1-HD | OpenAI | 2,295ms | 1,062ms | 6.3% | No (batch only) |

This benchmark compares the leading TTS APIs on the metric that determines this experience: Time to First Audio (TTFA). It draws from two sources: the independent Coval TTS benchmark, which continuously tests production TTS endpoints, and Gradium's own published benchmark (Time to First Audio: Measuring and reducing TTS latency in voice agents), which documents controlled methodology and P25 through P95 results. Providers covered: Gradium, ElevenLabs (Turbo v2.5, Flash v2.5, Multilingual v2), Cartesia Sonic-3, Deepgram Aura-2, Rime (Mist-v3, Arcana), and OpenAI TTS-1-HD.

What is TTFA and why it is the right metric for voice agents

Quick answer: Time to First Audio (TTFA) is the elapsed time between sending a TTS request and receiving the first playable audio sample. It is the only latency metric that correlates with the user-perceived responsiveness of a real-time voice agent.

Time to First Audio (TTFA) is distinct from total synthesis time (the time to generate the full audio) and from time to first byte (which includes container headers that carry no audio content).

For voice agents, TTFA is the correct latency metric because:

  • Audio playback starts as soon as the first chunk arrives. The agent can begin speaking while the rest of the sentence is still being synthesized.
  • Total synthesis time is irrelevant for streaming TTS: the user hears audio before generation completes.
  • Time to first byte is misleading: some providers return WAV headers or MP3 ID3 tags within milliseconds while the first actual audio arrives much later. A benchmark reporting time to first byte captures metadata delivery, not speech delivery.

Measuring TTFA correctly requires parsing past the container format. For WAV, discard the 44-byte header. For Ogg/Opus, skip identification and comment header pages. For MP3, skip ID3 tags and detect the first valid MPEG audio frame.
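As a minimal sketch of the WAV case (the simplest of the three formats, with its fixed 44-byte canonical PCM header), a TTFA measurement can count bytes across streamed chunks and record the clock only once the stream moves past the header. The chunk iterator and header size here are illustrative; Ogg/Opus and MP3 require real frame detection rather than a fixed offset.

```python
import time

WAV_HEADER_BYTES = 44  # canonical PCM WAV header size (illustrative)

def time_to_first_audio(chunks, header_bytes=WAV_HEADER_BYTES):
    """Seconds from start of iteration until the first byte PAST the
    container header arrives. `chunks` is any iterable of byte strings,
    e.g. a streaming HTTP or WebSocket response. Returns None if the
    stream ends before any audio byte is seen."""
    start = time.monotonic()
    seen = 0
    for chunk in chunks:
        seen += len(chunk)
        if seen > header_bytes:  # this chunk contains actual audio
            return time.monotonic() - start
    return None  # header-only stream: no audio was delivered

# Simulated stream: a bare 44-byte WAV header, then a chunk of samples.
fake_stream = iter([b"\x00" * 44, b"\x01" * 512])
assert time_to_first_audio(fake_stream) is not None
```

Note that a naive time-to-first-byte measurement would stop the clock on the first chunk here, even though it contains no audio at all.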

Why latency consistency matters as much as the median

Quick answer: P50 (median) describes typical latency. P75, P95, and IQR describe what users feel when the system is under load. A low P50 with a high P95 produces an inconsistent product experience.

For production voice agents handling thousands of concurrent sessions, a low P50 with a high P95 creates an inconsistent user experience: most calls feel fast, but a meaningful percentage feel broken.

The interquartile range (IQR), the gap between P25 and P75, is a direct measure of latency consistency. A low IQR means the TTS API delivers predictable latency regardless of load variation. A high IQR means latency spikes occur regularly in production.
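The quartiles and IQR can be computed directly from a list of TTFA samples with the standard library; the sample values below are illustrative, not benchmark data.

```python
from statistics import quantiles

def latency_spread(samples_ms):
    """Quartiles and IQR for a list of TTFA samples in milliseconds."""
    p25, p50, p75 = quantiles(samples_ms, n=4)  # default 'exclusive' method
    return {"p25": p25, "p50": p50, "p75": p75, "iqr": p75 - p25}

# A tight cluster like Gradium's (P25 154ms / P75 156ms) yields IQR = 2ms.
spread = latency_spread([154, 154, 155, 155, 155, 156, 156])
assert spread["iqr"] == 2.0
```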

What is a good TTFA for a voice agent in 2026?

Quick answer: Below 300ms P50 with a tight P25-P95 spread. Below 200ms P50 is excellent. Above 400ms P50 starts to feel non-conversational.

Natural human conversation has a modal turn-taking gap of around 200ms. End-to-end voice agent latency includes STT, LLM, tool calls, and TTS. TTS sits at the end of this chain, so its TTFA is added directly to user-perceived latency. Gradium TTS at 155ms P50 leaves headroom for the upstream stages.

Benchmark sources and methodology

Coval independent benchmark

Coval is an independent voice AI evaluation platform that continuously benchmarks production TTS endpoints. The TTS benchmark measures TTFA in real conditions, reporting P25, P50, P75, IQR, mean, median, and standard deviation across hundreds to over a thousand runs per model. Coval is not affiliated with any TTS provider.

The Coval TTS benchmark refreshes approximately every 30 minutes, so the values shown on the dashboard track current production performance rather than a frozen snapshot. The figures reported in this post were captured from the Coval dashboard on May 4, 2026.

As of May 2026, the Coval TTS benchmark includes 9 models: Gradium TTS, Cartesia Sonic-3, ElevenLabs Turbo v2.5, ElevenLabs Flash v2.5, ElevenLabs Multilingual v2, Deepgram Aura-2, Rime Mist-v3, Rime Arcana, and OpenAI TTS-1-HD.

Gradium self-reported benchmark

Gradium published a controlled TTFA benchmark in March 2026 (Time to First Audio) with documented methodology:

  • WebSocket APIs used for all providers (HTTP POST for OpenAI, which has no WebSocket API)
  • Standardized input text of 15-25 words
  • Same output format and sample rate across all providers
  • Measurements from the Gradium Paris office
  • Network latency controlled (~5ms ping to Gradium and ElevenLabs endpoints, ~3ms to OpenAI)
  • 100 queries per model, first 5 discarded (warm state)
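The query-loop portion of that methodology can be sketched as follows; `measure_once` stands in for a real synthesis request and is an assumption, not Gradium's harness.

```python
from statistics import quantiles

def run_benchmark(measure_once, runs=100, warmup=5):
    """Collect `runs` TTFA samples, discard the first `warmup` (cold
    connections and caches would skew the distribution), then report
    P25/P50/P75/P95. `measure_once` is any zero-argument callable
    returning a single TTFA sample in milliseconds."""
    samples = [measure_once() for _ in range(runs)]
    kept = samples[warmup:]  # warm-state measurements only
    p25, p50, p75 = quantiles(kept, n=4)
    p95 = quantiles(kept, n=20)[18]  # 19 cut points; index 18 is P95
    return {"p25": p25, "p50": p50, "p75": p75, "p95": p95}

# Stand-in for a real TTS call: a constant 155ms "measurement".
stats = run_benchmark(lambda: 155.0)
assert stats["p50"] == 155.0
```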

Results: Coval independent TTS benchmark

Model performance heatmap

All latency values in milliseconds. Source: benchmarks.coval.ai/tts, captured May 4, 2026. The Coval dashboard refreshes approximately every 30 minutes; current numbers may differ slightly.

| Model | Provider | P25 | P50 | P75 | IQR | Avg WER |
|---|---|---|---|---|---|---|
| TTS | Gradium | 154ms | 155ms | 156ms | 2ms | 3.3% |
| Sonic-3 | Cartesia | 168ms | 188ms | 269ms | 100ms | —* |
| Turbo v2.5 | ElevenLabs | 251ms | 264ms | 279ms | 28ms | 5.2% |
| Aura-2 | Deepgram | 274ms | 313ms | 342ms | 68ms | 6.4% |
| Flash v2.5 | ElevenLabs | 276ms | 288ms | 304ms | 28ms | 5.2% |
| Mist-v3 | Rime | 281ms | 337ms | 662ms | 381ms | 4.7% |
| Arcana | Rime | 430ms | 450ms | 636ms | 207ms | 6.1% |
| Multilingual v2 | ElevenLabs | 1,178ms | 1,232ms | 1,288ms | 110ms | 3.9% |
| TTS-1-HD | OpenAI | 1,870ms | 2,295ms | 2,932ms | 1,062ms | 6.3% |

*Cartesia WER shows a measurement anomaly in the Coval dataset and is not reported here.

Latency distribution statistics

Source: Coval latency variation charts.

| Model | Provider | Runs | Mean | Median | Std Dev |
|---|---|---|---|---|---|
| TTS | Gradium | 750 | 169ms | 155ms | 80ms |
| Sonic-3 | Cartesia | 1,471 | 226ms | 188ms | 118ms |
| Turbo v2.5 | ElevenLabs | 1,471 | 271ms | 264ms | 39ms |
| Flash v2.5 | ElevenLabs | 1,470 | 296ms | 288ms | 40ms |
| Aura-2 | Deepgram | 1,470 | 329ms | 314ms | 184ms |
| Mist-v3 | Rime | 1,464 | 734ms | 332ms | 674ms |
| Arcana | Rime | 1,467 | 632ms | 449ms | 720ms |

Results: Gradium self-reported benchmark

Standard WebSocket (with connection establishment)

Source: Time to First Audio. Measured from Paris, 100 queries per model, WebSocket APIs.

| Model | P25 | P50 | P75 | P95 |
|---|---|---|---|---|
| Gradium | 255ms | 258ms | 263ms | 274ms |
| ElevenLabs Turbo v2.5 | 294ms | 304ms | 311ms | 324ms |
| ElevenLabs Flash v2.5 | 317ms | 324ms | 333ms | 351ms |
| Mistral Voxtral TTS | 346ms | 369ms | 400ms | 566ms |
| OpenAI GPT-4o Mini TTS | 400ms | 420ms | 439ms | 483ms |
| ElevenLabs Multilingual v2 | 690ms | 706ms | 720ms | 742ms |
| OpenAI TTS-1 | 722ms | 969ms | 1,232ms | 1,807ms |

With WebSocket multiplexing (no connection overhead)

Using a persistent WebSocket connection with multiplexed sessions eliminates the ~50ms per-turn connection overhead.

| Model | P25 | P50 | P75 | P95 |
|---|---|---|---|---|
| Gradium | 212ms | 214ms | 219ms | 228ms |
| ElevenLabs Turbo v2.5 | 248ms | 257ms | 263ms | 278ms |
| ElevenLabs Flash v2.5 | 271ms | 277ms | 284ms | 302ms |
| ElevenLabs Multilingual v2 | 643ms | 657ms | 672ms | 688ms |

Three findings that matter for production voice agents

Finding 1: Gradium TTS delivers the lowest P50 TTFA in 2026

Quick answer: Gradium TTS is the lowest-latency TTS API in the 2026 benchmark cycle, at 155ms P50 TTFA on Coval and 258ms P50 (214ms with multiplexing) on Gradium's own Paris-based benchmark.

On the Coval independent benchmark, Gradium TTS achieves 155ms P50 TTFA, the fastest result in the benchmark. The next fastest is Cartesia Sonic-3 at 188ms (+33ms), followed by ElevenLabs Turbo v2.5 at 264ms (+109ms) and ElevenLabs Flash v2.5 at 288ms (+133ms).

On Gradium's self-reported benchmark (measured from Paris with documented methodology), Gradium achieves 258ms P50 standard WebSocket and 214ms P50 with multiplexing. Both benchmarks place Gradium ahead of all tested ElevenLabs models, Mistral Voxtral, and OpenAI TTS.

The difference in absolute values between the two benchmarks reflects different measurement conditions (infrastructure location, network proximity, text length). Both are consistent in the relative ranking: Gradium leads all tested providers.

Finding 2: Gradium TTS has the most consistent latency (IQR: 2ms)

Quick answer: Gradium TTS has a 2ms IQR on Coval, 14x tighter than ElevenLabs (28ms), 34x tighter than Deepgram (68ms), 50x tighter than Cartesia (100ms), and 531x tighter than OpenAI TTS-1-HD (1,062ms).

The most operationally significant result in the Coval benchmark is not the P50, it is the IQR of 2ms.

Latency IQR measures the spread between P25 and P75. A 2ms IQR means the middle 50% of all Gradium requests fall within a 2ms window between P25 and P75. This is near-deterministic latency.

By comparison:

  • ElevenLabs Turbo v2.5: 28ms IQR (14x wider than Gradium)
  • ElevenLabs Flash v2.5: 28ms IQR (14x wider)
  • Deepgram Aura-2: 68ms IQR (34x wider)
  • Cartesia Sonic-3: 100ms IQR (50x wider)
  • Rime Mist-v3: 381ms IQR (190x wider)
  • OpenAI TTS-1-HD: 1,062ms IQR (531x wider)

For production voice agents, latency consistency determines user experience quality more than median latency. A median of 155ms with 2ms IQR means the vast majority of turns feel identical. A median of 264ms with 28ms IQR means users notice variation. At 381ms IQR, latency spikes are a visible UX problem.

Finding 3: Gradium TTS has the lowest average WER

Quick answer: Gradium TTS achieves 3.3% average WER on Coval, the lowest of any TTS model in the benchmark. Lowest TTFA and lowest WER hold simultaneously.

On the Coval TTS benchmark, Gradium TTS achieves 3.3% average WER, the lowest of all providers in the benchmark. The ranking:

  1. Gradium TTS: 3.3%
  2. ElevenLabs Multilingual v2: 3.9%
  3. Rime Mist-v3: 4.7%
  4. ElevenLabs Flash v2.5: 5.2%
  5. ElevenLabs Turbo v2.5: 5.2%
  6. Rime Arcana: 6.1%
  7. OpenAI TTS-1-HD: 6.3%
  8. Deepgram Aura-2: 6.4%

This result is consistent with Gradium's own multilingual WER benchmark published on April 29, 2026 (Word Error Rate Evaluations), which reports 1.11% average WER on the MiniMax Multilingual TTS Test Set across five languages (EN, FR, ES, PT, DE), the lowest average across all providers in that benchmark as well.

A provider that achieves the lowest TTFA and the lowest WER simultaneously is not making a quality/speed tradeoff. Both metrics move together because Gradium's DSM (Delayed Streams Modeling) architecture is designed to stream high-quality audio from the first chunk rather than buffering for quality.

Provider-by-provider TTS latency analysis

Gradium TTS

Coval benchmark: P50 155ms, IQR 2ms, WER 3.3%. #1 on all three metrics among the 9 models tested.

Gradium is a real-time voice AI platform whose architecture is built on Kyutai's research (Delayed Streams Modeling, arXiv:2509.08753). The DSM architecture enables batched generation while preserving streaming capabilities, combined with CUDA graph optimization and configurable codebook depth (8, 16, or 32 codebooks) for quality/latency tradeoffs.

Gradium supports WebSocket multiplexing: a persistent connection handles multiple sessions without per-turn connection overhead, reducing TTFA from ~258ms to ~214ms P50 in production deployments. This is documented in Gradium's API documentation at docs.gradium.ai/guides/multiplexing.
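The client side of that pattern reduces to routing frames from one shared connection into per-session buffers. The sketch below shows only that routing logic; the frame shape (a dict with `session` and `audio` keys) is an assumption for illustration, not Gradium's actual wire format.

```python
from collections import defaultdict

class SessionMux:
    """Toy demultiplexer: many logical TTS turns share one persistent
    WebSocket, each inbound frame tagged with a session id, so no turn
    pays the connection handshake. Frame shape is hypothetical."""

    def __init__(self):
        self._audio = defaultdict(list)

    def feed(self, frame):
        # Assumed frame shape: {"session": str, "audio": bytes}
        self._audio[frame["session"]].append(frame["audio"])

    def audio_for(self, session_id):
        """All audio received so far for one logical session."""
        return b"".join(self._audio[session_id])

# Interleaved frames for two concurrent turns on one connection.
mux = SessionMux()
mux.feed({"session": "turn-1", "audio": b"\x01"})
mux.feed({"session": "turn-2", "audio": b"\x02"})
mux.feed({"session": "turn-1", "audio": b"\x03"})
assert mux.audio_for("turn-1") == b"\x01\x03"
```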

Deployment options: Cloud API (multiple regions), inference partner deployments, dedicated instances, self-hosted, and on-premises (HIPAA compliant).

Pricing: see the pricing page for plan details. All plans include voice cloning and WebSocket streaming.

ElevenLabs (Turbo v2.5, Flash v2.5, Multilingual v2)

Coval benchmark: Turbo v2.5 at P50 264ms, IQR 28ms, WER 5.2%. Flash v2.5 at P50 288ms, IQR 28ms, WER 5.2%. Multilingual v2 at P50 1,232ms, IQR 110ms, WER 3.9%.

ElevenLabs offers three models relevant to latency benchmarking. Turbo v2.5 is the fastest real-time model. Flash v2.5 was marketed as the low-latency option but in the Coval benchmark trails Turbo v2.5 slightly. Multilingual v2 is a high-quality model with near-human voice naturalness but is not suited for real-time voice agents given its ~1.2s P50 latency.

All ElevenLabs models show consistent IQR (28ms for Turbo and Flash), meaning their latency distribution is tighter than Deepgram or Rime, but substantially wider than Gradium.

Cartesia Sonic-3

Coval benchmark: P50 188ms, IQR 100ms.

Cartesia Sonic-3 is the second-fastest model on the Coval benchmark by P50 latency. However, its IQR of 100ms is 50x wider than Gradium's, and its P75 of 269ms means a significant fraction of requests approach or exceed the 300ms conversational threshold. Cartesia's SSM (State Space Model) architecture is designed for consistent P99 performance, but the Coval data shows meaningful latency spread in production conditions.

Deepgram Aura-2

Coval benchmark: P50 313ms, IQR 68ms, WER 6.4%.

Deepgram Aura-2 is positioned as a low-latency TTS option within the Deepgram platform (which also includes Nova STT). In the Coval benchmark, it falls in the middle of the latency rankings with the highest WER among the comparable-latency providers. Its standard deviation of 184ms indicates meaningful outliers in production conditions.

Rime (Mist-v3, Arcana)

Coval benchmark: Mist-v3 P50 337ms, IQR 381ms, WER 4.7%. Arcana P50 450ms, IQR 207ms, WER 6.1%.

Rime's Mist-v3 shows particularly high variance: mean of 734ms against a median of 332ms, with a standard deviation of 674ms. This gap between mean and median indicates significant outliers that pull the average up. The IQR of 381ms means latency is highly unpredictable in the P25-P75 range itself. Arcana shows similar variance patterns.

OpenAI TTS-1-HD

Coval benchmark: P50 2,295ms, IQR 1,062ms, WER 6.3%.

OpenAI TTS-1-HD is not suited for real-time voice agent applications given its P50 TTFA of over 2 seconds. It is included for completeness and is best suited for batch audio generation use cases where latency is not a constraint.

Direct comparisons

Gradium TTS vs ElevenLabs Turbo v2.5

Quick answer: Gradium TTS is 109ms faster on P50 (155ms vs 264ms) and 14x more consistent on IQR (2ms vs 28ms) on the Coval benchmark. Gradium also has a lower WER (3.3% vs 5.2%). ElevenLabs covers more languages (32 vs 5).

For voice agents prioritizing latency, consistency, and pronunciation accuracy, Gradium TTS leads. For voice agents that require coverage beyond English, French, German, Spanish, and Portuguese, ElevenLabs Turbo v2.5 is currently the best option that still fits the real-time latency budget.

Gradium TTS vs Cartesia Sonic-3

Quick answer: Gradium TTS is 33ms faster on P50 (155ms vs 188ms) and 50x more consistent on IQR (2ms vs 100ms). Cartesia Sonic-3 covers more languages (40+ vs 5).

Cartesia Sonic-3 places second on raw P50 latency, but its 100ms IQR means a significant fraction of requests cross the 300ms conversational threshold. For voice agents where tail latency matters, Gradium's 2ms IQR is the operationally safer choice. For multilingual content generation outside the five languages Gradium supports natively, Cartesia is competitive.

Gradium TTS vs Deepgram Aura-2

Quick answer: Gradium TTS is 158ms faster on P50 (155ms vs 313ms), 34x more consistent on IQR (2ms vs 68ms), and lower WER (3.3% vs 6.4%).

Deepgram Aura-2 has a use case for teams already running Deepgram Nova STT in their pipeline and prioritizing single-vendor integration over latency. For latency-critical deployments, Gradium TTS is a clear upgrade across all three measured metrics.

Gradium TTS vs OpenAI TTS-1-HD

Quick answer: Gradium TTS is 2,140ms faster on P50 (155ms vs 2,295ms) and 531x more consistent on IQR (2ms vs 1,062ms). OpenAI TTS-1-HD is not viable for real-time voice agents.

OpenAI TTS-1-HD is appropriate for batch audio generation (narration, dubbing, podcast assembly) where latency is not a constraint. It is not appropriate for conversational voice agents.

Gradium TTS vs Rime (Mist-v3, Arcana)

Quick answer: Gradium TTS is 182ms faster than Rime Mist-v3 on P50 (155ms vs 337ms) and 190x more consistent on IQR (2ms vs 381ms). Against Rime Arcana, Gradium is 295ms faster on P50 and 103x more consistent.

Rime's variance patterns make both Mist-v3 and Arcana risky for production voice agents at scale. The mean-vs-median gap (Mist-v3 mean 734ms vs median 332ms) indicates significant outliers.

How to choose a TTS API based on latency requirements

For real-time conversational voice agents: Gradium TTS delivers the best combination of P50 TTFA (155ms on Coval, 258ms self-reported standard WebSocket) and latency consistency (IQR 2ms). The IQR advantage is particularly relevant for production deployments where tail latency determines user experience quality.

For real-time agents with broad language requirements (32+ languages): ElevenLabs Turbo v2.5 at 264ms P50 and 28ms IQR is the best option when language coverage beyond Gradium's 5 languages is required. Flash v2.5 shows slightly higher P50 (288ms) in the Coval benchmark despite its low-latency positioning.

For lowest absolute latency among content-creation-focused models: Cartesia Sonic-3 at 188ms P50 is the second-fastest on Coval but with higher variance (100ms IQR). Its broader language coverage (40+) makes it relevant for teams prioritizing multilingual coverage at competitive TTFA.

For teams already on the Deepgram platform: Deepgram Aura-2 provides STT and TTS in the same vendor, reducing integration overhead. Its P50 of 313ms is acceptable for many voice agent use cases if Deepgram Nova STT is already in the stack.

For batch content generation (narration, dubbing): ElevenLabs Multilingual v2 at 3.9% WER offers strong voice quality for non-real-time applications where latency is not a constraint.

This post focused on TTFA benchmarks across leading TTS providers.

Getting started

Gradium offers a free tier for evaluation. Sign up at gradium.ai, generate an API key, and start streaming TTS in minutes. Documentation and quickstart guides are available at docs.gradium.ai.

For enterprise evaluations or technical questions, reach out at contact@gradium.ai or visit gradium.ai.

Frequently Asked Questions