Best Low-Latency TTS APIs in 2026: TTFA, P99 and Pipeline Impact

14 min read

Latency is the most operationally critical dimension of a TTS API in real-time applications. A 100 ms difference in time to first audio separates a voice agent that feels responsive from one that creates a measurable pause before every response. The 200–300 ms threshold above which humans perceive conversational delay sits squarely inside the operating range of every TTS provider on the market, which means the choice of TTS model, together with its transport, architecture, and configuration options, directly determines whether your voice agent feels natural or sluggish.

This guide focuses on the latency characteristics of the leading TTS APIs in 2026: what TTFA each achieves, how architecture affects consistency at P99, how configuration options allow per-deployment tuning, and how TTS latency fits into the broader voice agent pipeline. We cover Gradium, ElevenLabs Flash v2.5, Cartesia Sonic-3, and Deepgram Aura-2, drawing on the independent Coval benchmark and Gradium's own published TTFA methodology.

What Is TTS Latency and Why Does It Matter?

Text-To-Speech latency in a real-time context is measured as TTFA (Time to First Audio): the time elapsed between the API receiving the text input and delivering the first audio chunk to the client. It does not measure the time to render the complete audio, which scales with text length. TTFA determines when audio playback can begin.
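
To make the definition concrete, here is a minimal Python sketch that measures TTFA against a streaming HTTP TTS endpoint. The URL and request payload shape are placeholders, not any specific provider's API.

```python
import time
import requests  # pip install requests

def measure_ttfa_ms(url: str, text: str, api_key: str) -> float:
    """Return the milliseconds elapsed until the first audio chunk arrives."""
    start = time.perf_counter()
    with requests.post(
        url,
        json={"text": text},                       # placeholder payload shape
        headers={"Authorization": f"Bearer {api_key}"},
        stream=True,                               # don't buffer the whole response
        timeout=10,
    ) as resp:
        resp.raise_for_status()
        # TTFA ends at the first audio chunk, not at the full render.
        for chunk in resp.iter_content(chunk_size=None):
            if chunk:
                return (time.perf_counter() - start) * 1000
    raise RuntimeError("stream ended before any audio arrived")
```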

In batch or file-based TTS use cases (audiobook generation, content dubbing), TTFA is largely irrelevant; total render time and audio quality take priority.

In real-time use cases (voice agents, AI phone calls, interactive assistants), TTFA is the primary latency metric. The human threshold for perceiving a conversational pause as awkward is approximately 200–300 ms. A TTS API contributing more than 300 ms TTFA to a pipeline is a structural bottleneck, regardless of how good the underlying audio sounds.

How Does TTS Latency Fit Into the Full Voice Agent Pipeline?

TTS latency does not exist in isolation. A complete voice agent turn involves three latency-contributing stages plus transport overhead:

| Stage | Component | Typical range | Notes |
| --- | --- | --- | --- |
| Transcription | STT (end-of-utterance to transcript) | 100–300 ms | Varies by model and streaming config |
| LLM inference | Language model (first token) | 200–500 ms | Depends on model size and infrastructure |
| Speech synthesis | TTS (TTFA) | 75–300 ms | Subject of this guide |
| Network + connection | WebSocket vs HTTP overhead | 20–100 ms per turn | Eliminated with persistent WebSocket |

For conversation to feel natural, the full pipeline should complete in under 800 ms. With STT at 150 ms and LLM at 250 ms, the TTS budget is approximately 200–400 ms, depending on network conditions. This is why TTS APIs with TTFA above 300 ms create a structural problem in production voice agents, regardless of audio quality.
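
That budget is easy to sanity-check. The sketch below uses mid-range assumptions from the table above rather than measured values.

```python
# Back-of-the-envelope turn budget using mid-range values from the table above.
NATURALNESS_BUDGET_MS = 800   # full-pipeline target for natural conversation

stt_ms = 150                  # end-of-utterance to transcript
llm_first_token_ms = 250      # LLM time to first token
network_ms = 50               # per-turn transport overhead (~0 with a persistent WebSocket)

tts_budget_ms = NATURALNESS_BUDGET_MS - (stt_ms + llm_first_token_ms + network_ms)
print(f"TTS TTFA budget: {tts_budget_ms} ms")  # -> 350 ms
```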

With streaming LLMs that emit the first token quickly, the TTS API can begin synthesizing before the full LLM response is available. This LLM–TTS interleaving reduces effective end-to-end latency, but requires the TTS API to support streaming input (receiving and synthesizing text incrementally as tokens arrive). All providers compared in this guide support streaming input; not all use streaming transport.
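
A minimal sketch of this interleaving, assuming a hypothetical tts_session object with send_text() and flush() coroutines wrapping a provider's streaming-input WebSocket (no specific provider's API is implied):

```python
from typing import AsyncIterator

async def interleave(llm_tokens: AsyncIterator[str], tts_session) -> None:
    """Forward LLM tokens to a streaming-input TTS session as they arrive.

    `llm_tokens` is any async iterator of text fragments (e.g. a streaming
    chat-completion response); `tts_session` is a hypothetical wrapper
    exposing send_text() and flush() coroutines.
    """
    buffer = ""
    async for token in llm_tokens:
        buffer += token
        # Flush on clause boundaries so the TTS engine receives natural
        # prosodic units instead of single tokens.
        if buffer.rstrip().endswith((".", "!", "?", ",", ";", ":")):
            await tts_session.send_text(buffer)
            buffer = ""
    if buffer:
        await tts_session.send_text(buffer)
    await tts_session.flush()  # signal end of input so the last audio is emitted
```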

Why Do P50 and P99 Both Matter?

Published latency benchmarks typically report P50 (median): the latency at or below which 50% of requests complete. P50 represents typical-case performance and is the figure providers tend to lead with.

For production systems, P99 (the latency at or below which 99% of requests complete) is the operationally relevant metric. P99 determines the worst 1% of user experiences, which in high-volume deployments represents thousands of interactions per day.

An API with a 90 ms P50 but a 600 ms P99 produces frequent conversation-breaking pauses in production. An API with a 250 ms P50 and a 280 ms P99 delivers highly consistent performance. For voice agents, P99 consistency often matters more than headline P50 figures, and tail latency is where architectural choices (transformer vs State Space Model, batched vs streaming inference) show up most clearly.
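
The effect is easy to reproduce with Python's standard library. This sketch simulates the 90 ms P50 / 600 ms P99 API described above, where a 3% slow path is invisible at the median:

```python
import random
import statistics

# Simulate an API that is fast at the median but spikes in the tail:
# 97% of requests land near 90 ms, 3% hit a slow path near 600 ms.
samples = [
    random.gauss(90, 10) if random.random() < 0.97 else random.gauss(600, 50)
    for _ in range(100_000)
]

q = statistics.quantiles(samples, n=100)  # 99 percentile cut points
print(f"P50 = {q[49]:.0f} ms, P99 = {q[98]:.0f} ms")
# -> P50 stays near 90 ms while P99 lands around 600 ms: the median hides the tail
```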

How Does WebSocket Streaming Affect TTS Latency?

Beyond model inference time, the transport layer contributes latency on every request.

  • HTTP chunked transfer encoding opens a new TCP connection per request. The connection handshake (TCP + TLS) adds approximately 40–100 ms per turn, accumulated across every exchange in a conversation. In a 10-turn conversation, this adds 400–1,000 ms of cumulative overhead that does not appear in published TTFA benchmarks (which typically measure a single request).
  • WebSocket-based streaming maintains a persistent bidirectional connection. After the initial handshake, subsequent turns incur no connection overhead. For multi-turn voice agents, WebSocket architecture is a significant latency advantage that does not show up in single-request benchmarks (see the sketch after this list).
  • WebSocket multiplexing takes this further: reusing a single connection concurrently across multiple streams, further reducing per-turn overhead. Gradium supports WebSocket multiplexing, reducing its effective TTFA to 214 ms in multiplexed configurations (vs 228 ms in standard 32-codebook configuration).
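
A sketch of the persistent-connection pattern using the Python websockets library; the URL and message framing are placeholders, since each provider defines its own protocol on top of the socket:

```python
import asyncio
import websockets  # pip install websockets

async def run_conversation(turns: list[str], url: str) -> None:
    """Synthesize every turn of a conversation over one persistent socket."""
    async with websockets.connect(url) as ws:   # TCP + TLS handshake paid once
        for text in turns:
            await ws.send(text)                 # per-turn cost: one frame, no handshake
            audio_chunk = await ws.recv()       # first audio chunk for this turn
            # ... hand audio_chunk to the playback buffer, keep receiving as needed

# Usage (placeholder URL):
# asyncio.run(run_conversation(["Hello!", "Tell me more."], "wss://tts.example/stream"))
```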

ElevenLabs Flash v2.5

ElevenLabs Flash v2.5 was explicitly designed for real-time conversational AI as a lower-latency alternative to ElevenLabs Multilingual v2, which prioritizes audio quality over latency (1,232 ms P50 on the Coval benchmark). On the same independent benchmark, Flash v2.5 achieves 288 ms P50 TTFA.

Latency Profile

  • TTFA: 288 ms P50 (Coval benchmark)
  • Transport: HTTP chunked transfer encoding (not a persistent WebSocket)
  • P99: not publicly published

What Is the Latency Tradeoff?

ElevenLabs Flash v2.5 (288 ms P50) is slower than Cartesia Sonic-3 (188 ms P50) and Gradium (155 ms P50) on independent measurement. In a pipeline where STT contributes 150 ms and LLM first-token contributes 250 ms, the difference between 155 ms and 288 ms TTS TTFA is 133 ms of total pipeline latency, enough to push a borderline pipeline over the 800 ms naturalness threshold.

The transport layer compounds the gap: ElevenLabs Flash v2.5 uses HTTP streaming, which adds connection overhead per turn (40–100 ms) in multi-turn conversations. This overhead is not captured in single-request TTFA benchmarks but is material in production agents.

Languages and Pricing

32 languages, matching ElevenLabs' broader Multilingual catalogue. Pricing is credit-based at 0.5 credits per character; the effective per-character cost depends on the subscription tier.

Best For

ElevenLabs Flash v2.5 is the right choice when ElevenLabs' voice catalogue is a hard requirement, broad language coverage matters, and the use case is low to medium turn-frequency (where per-request connection overhead is less material). For a deeper Gradium-vs-ElevenLabs breakdown, including TTS, STT, voice cloning, and deployment, see the ElevenLabs alternative comparison.

Cartesia Sonic-3

Cartesia built Sonic-3 on State Space Model (SSM) architecture. SSMs are inherently more efficient for sequential token generation than standard transformer architectures, which translates to more predictable latency distribution, particularly at P99.

Latency Profile

  • TTFA: 188 ms P50 (Coval benchmark)
  • Transport: WebSocket and REST (both available)
  • P99 consistency: the SSM architecture produces low-variance latency; a narrow gap between P50 and P99 is a key differentiator of Sonic-3 vs transformer-based models.

Why Does P99 Matter for Sonic-3?

Transformer-based TTS models can exhibit latency spikes under load or for certain input patterns (longer inputs, unusual phoneme sequences, dense punctuation). SSM architecture reduces these variance sources. In production deployments running thousands of concurrent sessions, P99 consistency determines the frequency of conversation-breaking pauses.

Cartesia does not publish specific P99 figures, but the SSM architecture rationale for low-variance latency is documented and shows up in production reports.

Languages and Pricing

40+ languages with regional accent variants. Pricing is approximately $0.03 per minute, with Pro ($4/month), Startup ($39/month), and Scale ($239/month) tiers (annual billing).

Best For

Cartesia Sonic-3 is the right choice when P99 consistency (not just P50) is the primary latency requirement. The SSM architecture makes it particularly suited to high-concurrency deployments where tail latency impacts a meaningful share of user interactions. For a side-by-side Gradium-vs-Cartesia breakdown, see the Cartesia alternative comparison.

Gradium

Gradium was founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, co-founders of Kyutai. Kyutai released Moshi (real-time Speech-To-Speech) and Hibiki (live Speech-To-Speech translation). Gradium's latency profile is distinct from the other providers in this comparison because it exposes configurable precision–latency tradeoffs through its codebook architecture, derived from Kyutai's Delayed Streams Modeling research.

Latency Profile

On the independent Coval benchmark, Gradium achieves 155 ms P50 TTFA in its default API configuration. On Gradium's own published TTFA methodology (Paris, 15–25 word sentence, WebSocket, 100 queries, warm), Gradium measures P50 258 ms and P95 274 ms end-to-end, or P50 214 ms and P95 228 ms excluding connection establishment.

Gradium implements its precision–latency tradeoff through Residual Vector Quantization with a configurable number of codebooks. More codebooks produce higher audio fidelity but increase TTFA; fewer codebooks reduce TTFA at the cost of some audio resolution. The codebook count is set per session via json_config; see the json_config guide for the full configuration surface, and the sketch after the table below.

| Codebook config | TTFA (self-reported) | Audio-to-real-time ratio | Recommended deployment |
| --- | --- | --- | --- |
| 8 codebooks | 160 ms | 7.71x | Notifications, alerts, high-frequency turns |
| 16 codebooks | 185 ms | 6.16x | High-volume production deployments |
| 32 codebooks | 228 ms | 4.39x | Premium voice agents, brand voices |
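
As a sketch of how per-session tuning might look in practice: the payload below is illustrative only, and the field names are assumptions rather than Gradium's actual schema; consult the json_config guide for the real configuration surface.

```python
import json

# Illustrative per-session TTS config. Field names ("codebooks",
# "sample_rate", "language") are assumptions for this sketch; the
# authoritative schema is in Gradium's json_config guide.
json_config = {
    "codebooks": 16,        # 8 = lowest TTFA, 32 = highest fidelity (table above)
    "sample_rate": 24000,
    "language": "en",
}

# Typically sent once at session setup, e.g. as the first WebSocket message.
setup_message = json.dumps(json_config)
```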

WebSocket Multiplexing

Gradium's TTS API is WebSocket-native. Connection multiplexing reuses a single persistent connection across multiple conversation turns, reducing effective TTFA from 228 ms to 214 ms in the 32-codebook configuration by eliminating per-turn connection overhead.

In a 10-turn conversation with 50 ms connection overhead per turn, multiplexing saves approximately 450 ms of accumulated latency compared to HTTP-per-request architectures. This saving is invisible in single-request benchmarks but material in production multi-turn agents.
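
The arithmetic behind that figure: on a multiplexed connection, only the first turn pays the handshake.

```python
turns = 10
handshake_ms = 50  # TCP + TLS setup cost per new connection

http_per_request = turns * handshake_ms  # reconnect every turn: 500 ms
multiplexed_ws = handshake_ms            # one handshake, then reuse: 50 ms
print(f"saved: {http_per_request - multiplexed_ws} ms")  # -> 450 ms
```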

CUDA Graph Optimization

Gradium's inference stack uses CUDA Graph optimization, which reduces model inference overhead on NVIDIA GPUs (L4, A10, H100). This contributes to low variance between P50 and P95 latency figures, and is part of why Gradium's published end-to-end P95 (274 ms) sits only 16 ms above its P50 (258 ms).

Total Pipeline Latency

With a streaming LLM (GPT-4 Turbo, Claude), Gradium's full pipeline (STT + LLM + TTS) achieves 420–520 ms total turn latency, compared to 2.5–5.5 seconds with non-streaming architectures. The WebSocket-native design enables LLM–TTS interleaving: TTS synthesis begins as the LLM emits its first tokens, before the full response is available. For an end-to-end voice agent walkthrough, see how to build a voice AI agent with Gradium and LiveKit.

Languages

English, French, Spanish, German, and Portuguese, with more languages added regularly. Mid-sentence code-switching is supported across all five languages with no latency penalty.

Pricing

Credit-based plans starting at $0/month for the free tier (45,000 credits, roughly 1 hour of TTS) and scaling to $1,615/month on Plan L. See Gradium pricing for the full plan breakdown.

Best For

Gradium offers the best balance between configurable low latency, WebSocket-native streaming, and unified TTS + STT infrastructure. Teams that need to tune latency vs audio quality per deployment (different codebook configs for different endpoints) benefit from Gradium's precision controls. The multiplexing advantage is particularly valuable in high-turn-frequency voice agent deployments. For a deeper dive on Gradium for voice agents specifically, see the best Text-To-Speech API for voice agents.

Deepgram Aura-2

Deepgram Aura-2 is a low-latency TTS model integrated into Deepgram's STT-first platform. It is designed to pair with Deepgram Nova-3 in voice agent pipelines.

Latency Profile

  • TTFA: 313 ms P50 on the Coval benchmark
  • Transport: WebSocket
  • Languages: 7 (English, Spanish, French, German, Dutch, Italian, Japanese)

Deepgram's Voice Agent API uses Aura-2 as the TTS component within a bundled STT + LLM + TTS orchestration endpoint, reducing integration overhead for teams already using Deepgram Nova.

Pricing

$0.030 per 1,000 characters ($0.027 at Growth tier).

Best For

Deepgram Aura-2 is the right choice for teams already using Deepgram Nova-3 for STT who want to consolidate latency-critical infrastructure on a single vendor. HIPAA-compliant on-premise deployment is available. For a side-by-side Gradium-vs-Deepgram breakdown, including TTS, voice cloning, and deployment, see the Deepgram alternative comparison.

How Should You Match TTS Latency to Your Use Case?

Not all real-time applications have the same latency tolerance. The table below maps TTFA ranges to application requirements.

| TTFA range | Perception | Suitable use cases | Providers in this range |
| --- | --- | --- | --- |
| Under 100 ms | Imperceptible delay | Ultra-low-latency voice agents, real-time gaming dialogue | No major provider benchmarked in this range by the Coval benchmark (2026) |
| 100–200 ms | Not perceived as delay in conversation | Voice agents, AI phone calls, interactive assistants | Gradium (155 ms P50 Coval), Gradium 8-codebook self-reported (160 ms), Gradium 16-codebook self-reported (185 ms), Cartesia Sonic-3 (188 ms P50 Coval) |
| 200–300 ms | At the threshold of a perceptible pause | Voice agents (acceptable), content generation | Gradium multiplexed WS self-reported (214 ms), Gradium 32-codebook self-reported (228 ms), ElevenLabs Flash v2.5 (288 ms P50 Coval) |
| 300 ms+ | Perceptible pause | Content creation, batch narration, non-real-time use | Deepgram Aura-2 (313 ms P50 Coval), ElevenLabs Multilingual v2 (1,232 ms P50 Coval), most general-purpose TTS |

How Should You Choose a Low-Latency TTS API?

  • Choose ElevenLabs Flash v2.5 if ElevenLabs' voice quality and broad language catalogue (32 languages) are primary criteria alongside competitive latency. Plan for HTTP streaming overhead in multi-turn deployments.
  • Choose Cartesia Sonic-3 if P99 consistency matters as much as P50, or if broad language coverage (40+) is required alongside low latency. The SSM architecture delivers the most predictable tail latency in this comparison.
  • Choose Gradium if you need the lowest independently benchmarked TTFA (155 ms P50 on the Coval benchmark), configurable TTFA via codebook config (160–228 ms self-reported), WebSocket-native streaming with multiplexing, unified TTS + STT infrastructure, and the ability to tune precision vs latency per deployment context. The full pipeline latency of 420–520 ms (STT + LLM + TTS) with streaming LLMs is among the lowest in production voice agent stacks.
  • Choose Deepgram Aura-2 if you are already on Deepgram Nova for STT and want consistent low-latency TTS without adding a new vendor.

Also comparing Cartesia, ElevenLabs, or Deepgram head-to-head with Gradium? Each comparison goes deeper on TTS quality, STT, voice cloning, and deployment options.

Frequently Asked Questions