Best Low-Latency TTS APIs in 2026: TTFA, P99 and Pipeline Impact

14 min read

Latency is the most operationally critical dimension of a TTS API in real-time applications. A 100 ms difference in time to first audio separates a voice agent that feels responsive from one that creates a measurable pause before every response. The 200–300 ms threshold above which humans perceive conversational delay sits squarely inside the operating range of every TTS provider on the market, which means the choice of TTS model, together with its transport, architecture, and configuration options, directly determines whether your voice agent feels natural or sluggish.

This guide focuses on the latency characteristics of the leading TTS APIs in 2026: what TTFA each achieves, how architecture affects consistency at P99, how configuration options allow per-deployment tuning, and how TTS latency fits into the broader voice agent pipeline. We cover Gradium, ElevenLabs Flash v2.5, Cartesia Sonic-3, and Deepgram Aura-2, drawing on the independent Coval benchmark and Gradium's own published TTFA methodology.

What Is TTS Latency and Why Does It Matter?

Text-To-Speech latency in a real-time context is measured as TTFA (Time to First Audio): the time elapsed between the API receiving the text input and delivering the first audio chunk to the client. It does not measure the time to render the complete audio, which scales with text length. TTFA determines when audio playback can begin.
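
To make the definition concrete, here is a minimal Python sketch that measures TTFA against a streaming HTTP TTS endpoint. The URL and request payload shape are placeholders, not any specific provider's API.

```python
import time
import requests  # pip install requests

def measure_ttfa_ms(url: str, text: str, api_key: str) -> float:
    """Return the milliseconds elapsed until the first audio chunk arrives."""
    start = time.perf_counter()
    with requests.post(
        url,
        json={"text": text},                       # placeholder payload shape
        headers={"Authorization": f"Bearer {api_key}"},
        stream=True,                               # don't buffer the whole response
        timeout=10,
    ) as resp:
        resp.raise_for_status()
        # TTFA ends at the first audio chunk, not at the full render.
        for chunk in resp.iter_content(chunk_size=None):
            if chunk:
                return (time.perf_counter() - start) * 1000
    raise RuntimeError("stream ended before any audio arrived")
```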

In batch or file-based TTS use cases (audiobook generation, content dubbing), TTFA is largely irrelevant; total render time and audio quality take priority.

In real-time use cases (voice agents, AI phone calls, interactive assistants), TTFA is the primary latency metric. The human threshold for perceiving a conversational pause as awkward is approximately 200–300 ms. A TTS API contributing more than 300 ms TTFA to a pipeline is a structural bottleneck, regardless of how good the underlying audio sounds.

How Does TTS Latency Fit Into the Full Voice Agent Pipeline?

TTS latency does not exist in isolation. A complete voice agent turn involves three latency-contributing stages plus transport overhead:

| Stage | Component | Typical range | Notes |
| --- | --- | --- | --- |
| Transcription | STT (end-of-utterance to transcript) | 100–300 ms | Varies by model and streaming config |
| LLM inference | Language model (first token) | 200–500 ms | Depends on model size and infrastructure |
| Speech synthesis | TTS (TTFA) | 75–300 ms | Subject of this guide |
| Network + connection | WebSocket vs HTTP overhead | 20–100 ms per turn | Eliminated with persistent WebSocket |

For conversation to feel natural, the full pipeline should complete in under 800 ms. With STT at 150 ms and LLM at 250 ms, the TTS budget is approximately 200–400 ms, depending on network conditions. This is why TTS APIs with TTFA above 300 ms create a structural problem in production voice agents, regardless of audio quality.
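
That budget is easy to sanity-check. The sketch below uses mid-range assumptions from the table above rather than measured values.

```python
# Back-of-the-envelope turn budget using mid-range values from the table above.
NATURALNESS_BUDGET_MS = 800   # full-pipeline target for natural conversation

stt_ms = 150                  # end-of-utterance to transcript
llm_first_token_ms = 250      # LLM time to first token
network_ms = 50               # per-turn transport overhead (~0 with a persistent WebSocket)

tts_budget_ms = NATURALNESS_BUDGET_MS - (stt_ms + llm_first_token_ms + network_ms)
print(f"TTS TTFA budget: {tts_budget_ms} ms")  # -> 350 ms
```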

With streaming LLMs that emit the first token quickly, the TTS API can begin synthesizing before the full LLM response is available. This LLM–TTS interleaving reduces effective end-to-end latency, but requires the TTS API to support streaming input (receiving and synthesizing text incrementally as tokens arrive). All providers compared in this guide support streaming input; not all use streaming transport.
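
A minimal sketch of this interleaving, assuming a hypothetical tts_session object with send_text() and flush() coroutines wrapping a provider's streaming-input WebSocket (no specific provider's API is implied):

```python
from typing import AsyncIterator

async def interleave(llm_tokens: AsyncIterator[str], tts_session) -> None:
    """Forward LLM tokens to a streaming-input TTS session as they arrive.

    `llm_tokens` is any async iterator of text fragments (e.g. a streaming
    chat-completion response); `tts_session` is a hypothetical wrapper
    exposing send_text() and flush() coroutines.
    """
    buffer = ""
    async for token in llm_tokens:
        buffer += token
        # Flush on clause boundaries so the TTS engine receives natural
        # prosodic units instead of single tokens.
        if buffer.rstrip().endswith((".", "!", "?", ",", ";", ":")):
            await tts_session.send_text(buffer)
            buffer = ""
    if buffer:
        await tts_session.send_text(buffer)
    await tts_session.flush()  # signal end of input so the last audio is emitted
```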

Why Do P50 and P99 Both Matter?

Published latency benchmarks typically report P50 (median): the latency at or below which 50% of requests complete. P50 represents typical-case performance and is the figure providers tend to lead with.

For production systems, P99 (the latency at or below which 99% of requests complete) is the operationally relevant metric. P99 determines the worst 1% of user experiences, which in high-volume deployments represents thousands of interactions per day.

An API with a 90 ms P50 but a 600 ms P99 produces frequent conversation-breaking pauses in production. An API with a 250 ms P50 and a 280 ms P99 delivers highly consistent performance. For voice agents, P99 consistency often matters more than headline P50 figures, and tail latency is where architectural choices (transformer vs State Space Model, batched vs streaming inference) show up most clearly.
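
The effect is easy to reproduce with Python's standard library. This sketch simulates the 90 ms P50 / 600 ms P99 API described above, where a 3% slow path is invisible at the median:

```python
import random
import statistics

# Simulate an API that is fast at the median but spikes in the tail:
# 97% of requests land near 90 ms, 3% hit a slow path near 600 ms.
samples = [
    random.gauss(90, 10) if random.random() < 0.97 else random.gauss(600, 50)
    for _ in range(100_000)
]

q = statistics.quantiles(samples, n=100)  # 99 percentile cut points
print(f"P50 = {q[49]:.0f} ms, P99 = {q[98]:.0f} ms")
# -> P50 stays near 90 ms while P99 lands around 600 ms: the median hides the tail
```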

How Does WebSocket Streaming Affect TTS Latency?

Beyond model inference time, the transport layer contributes latency on every request.

  • HTTP chunked transfer encoding opens a new TCP connection per request. The connection handshake (TCP + TLS) adds approximately 40–100 ms per turn, accumulated across every exchange in a conversation. In a 10-turn conversation, this adds 400–1,000 ms of cumulative overhead that does not appear in published TTFA benchmarks (which typically measure a single request).
  • WebSocket-based streaming maintains a persistent bidirectional connection. After the initial handshake, subsequent turns incur no connection overhead. For multi-turn voice agents, WebSocket architecture is a significant latency advantage that does not show up in single-request benchmarks (see the sketch after this list).
  • WebSocket multiplexing takes this further: reusing a single connection concurrently across multiple streams, further reducing per-turn overhead. Gradium supports WebSocket multiplexing, reducing its effective TTFA to 214 ms in multiplexed configurations (vs 228 ms in standard 32-codebook configuration).
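
A sketch of the persistent-connection pattern using the Python websockets library; the URL and message framing are placeholders, since each provider defines its own protocol on top of the socket:

```python
import asyncio
import websockets  # pip install websockets

async def run_conversation(turns: list[str], url: str) -> None:
    """Synthesize every turn of a conversation over one persistent socket."""
    async with websockets.connect(url) as ws:   # TCP + TLS handshake paid once
        for text in turns:
            await ws.send(text)                 # per-turn cost: one frame, no handshake
            audio_chunk = await ws.recv()       # first audio chunk for this turn
            # ... hand audio_chunk to the playback buffer, keep receiving as needed

# Usage (placeholder URL):
# asyncio.run(run_conversation(["Hello!", "Tell me more."], "wss://tts.example/stream"))
```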

ElevenLabs Flash v2.5

ElevenLabs Flash v2.5 was explicitly designed for real-time conversational AI as a lower-latency alternative to ElevenLabs Multilingual v2, which prioritizes audio quality over latency (1,232 ms P50 on the Coval benchmark). On the same independent benchmark, Flash v2.5 achieves 288 ms P50 TTFA.

Latency Profile

  • TTFA: 288 ms P50 (Coval benchmark)
  • Transport: HTTP chunked transfer encoding (not a persistent WebSocket)
  • P99: not publicly published

What Is the Latency Tradeoff?

ElevenLabs Flash v2.5 (288 ms P50) is slower than Cartesia Sonic-3 (188 ms P50) and Gradium (155 ms P50) on independent measurement. In a pipeline where STT contributes 150 ms and LLM first-token contributes 250 ms, the difference between 155 ms and 288 ms TTS TTFA is 133 ms of total pipeline latency, enough to push a borderline pipeline over the 800 ms naturalness threshold.

The transport layer compounds the gap: ElevenLabs Flash v2.5 uses HTTP streaming, which adds connection overhead per turn (40–100 ms) in multi-turn conversations. This overhead is not captured in single-request TTFA benchmarks but is material in production agents.

Languages and Pricing

32 languages, matching ElevenLabs' broader Multilingual catalogue. Pricing is credit-based at 0.5 credits per character; the effective per-character cost depends on the subscription tier.

Best For

ElevenLabs Flash v2.5 is the right choice when ElevenLabs' voice catalogue is a hard requirement, broad language coverage matters, and the use case is low to medium turn-frequency (where per-request connection overhead is less material). For a deeper Gradium-vs-ElevenLabs breakdown, including TTS, STT, voice cloning, and deployment, see the ElevenLabs alternative comparison.

Cartesia Sonic-3

Cartesia built Sonic-3 on State Space Model (SSM) architecture. SSMs are inherently more efficient for sequential token generation than standard transformer architectures, which translates to more predictable latency distribution, particularly at P99.

Latency Profile

  • TTFA: 188 ms P50 (Coval benchmark)
  • Transport: WebSocket and REST (both available)
  • P99 consistency: the SSM architecture produces low-variance latency; a narrow gap between P50 and P99 is a key differentiator of Sonic-3 vs transformer-based models.

Why Does P99 Matter for Sonic-3?

Transformer-based TTS models can exhibit latency spikes under load or for certain input patterns (longer inputs, unusual phoneme sequences, dense punctuation). SSM architecture reduces these variance sources. In production deployments running thousands of concurrent sessions, P99 consistency determines the frequency of conversation-breaking pauses.

Cartesia does not publish specific P99 figures, but the SSM architecture rationale for low-variance latency is documented and shows up in production reports.

Languages and Pricing

40+ languages with regional accent variants. Pricing is approximately $0.03 per minute, with Pro ($4/month), Startup ($39/month), and Scale ($239/month) tiers (annual billing).

Best For

Cartesia Sonic-3 is the right choice when P99 consistency (not just P50) is the primary latency requirement. The SSM architecture makes it particularly suited to high-concurrency deployments where tail latency impacts a meaningful share of user interactions. For a side-by-side Gradium-vs-Cartesia breakdown, see the Cartesia alternative comparison.

Gradium

Gradium was founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, co-founders of Kyutai. Kyutai released Moshi (real-time Speech-To-Speech) and Hibiki (live Speech-To-Speech translation). Gradium's latency profile is distinct from the other providers in this comparison because it exposes configurable precision–latency tradeoffs through its codebook architecture, derived from Kyutai's Delayed Streams Modeling research.

Latency Profile

On the independent Coval benchmark, Gradium achieves 155 ms P50 TTFA in its default API configuration. On Gradium's own published TTFA methodology (Paris, 15–25 word sentence, WebSocket, 100 queries, warm), Gradium measures P50 258 ms and P95 274 ms end-to-end, or P50 214 ms and P95 228 ms excluding connection establishment.

Gradium implements its precision–latency tradeoff through Residual Vector Quantization with a configurable number of codebooks. More codebooks produce higher audio fidelity but increase TTFA; fewer codebooks reduce TTFA at the cost of some audio resolution. The codebook count is set per session via json_config; see the json_config guide for the full configuration surface, and the sketch after the table below.

| Codebook config | TTFA (self-reported) | Audio-to-real-time ratio | Recommended deployment |
| --- | --- | --- | --- |
| 8 codebooks | 160 ms | 7.71x | Notifications, alerts, high-frequency turns |
| 16 codebooks | 185 ms | 6.16x | High-volume production deployments |
| 32 codebooks | 228 ms | 4.39x | Premium voice agents, brand voices |
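
As a sketch of how per-session tuning might look in practice: the payload below is illustrative only, and the field names are assumptions rather than Gradium's actual schema; consult the json_config guide for the real configuration surface.

```python
import json

# Illustrative per-session TTS config. Field names ("codebooks",
# "sample_rate", "language") are assumptions for this sketch; the
# authoritative schema is in Gradium's json_config guide.
json_config = {
    "codebooks": 16,        # 8 = lowest TTFA, 32 = highest fidelity (table above)
    "sample_rate": 24000,
    "language": "en",
}

# Typically sent once at session setup, e.g. as the first WebSocket message.
setup_message = json.dumps(json_config)
```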

WebSocket Multiplexing

Gradium's TTS API is WebSocket-native. Connection multiplexing reuses a single persistent connection across multiple conversation turns, reducing effective TTFA from 228 ms to 214 ms in the 32-codebook configuration by eliminating per-turn connection overhead.

In a 10-turn conversation with 50 ms connection overhead per turn, multiplexing saves approximately 450 ms of accumulated latency compared to HTTP-per-request architectures. This saving is invisible in single-request benchmarks but material in production multi-turn agents.
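
The arithmetic behind that figure: on a multiplexed connection, only the first turn pays the handshake.

```python
turns = 10
handshake_ms = 50  # TCP + TLS setup cost per new connection

http_per_request = turns * handshake_ms  # reconnect every turn: 500 ms
multiplexed_ws = handshake_ms            # one handshake, then reuse: 50 ms
print(f"saved: {http_per_request - multiplexed_ws} ms")  # -> 450 ms
```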

CUDA Graph Optimization

Gradium's inference stack uses CUDA Graph optimization, which reduces model inference overhead on NVIDIA GPUs (L4, A10, H100). This contributes to low variance between P50 and P95 latency figures, and is part of why Gradium's published end-to-end P95 (274 ms) sits only 16 ms above its P50 (258 ms).

Total Pipeline Latency

With a streaming LLM (GPT-4 Turbo, Claude), Gradium's full pipeline (STT + LLM + TTS) achieves 420–520 ms total turn latency, compared to 2.5–5.5 seconds with non-streaming architectures. The WebSocket-native design enables LLM–TTS interleaving: TTS synthesis begins as the LLM emits its first tokens, before the full response is available. For an end-to-end voice agent walkthrough, see how to build a voice AI agent with Gradium and LiveKit.

Languages

English, French, Spanish, German, and Portuguese, with more languages added regularly. Mid-sentence code-switching is supported across all five languages with no latency penalty.

Pricing

Credit-based plans starting at $0/month for the free tier (45,000 credits, roughly 1 hour of TTS) and scaling to $1,615/month on Plan L. See Gradium pricing for the full plan breakdown.

Best For

Gradium offers the best balance between configurable low latency, WebSocket-native streaming, and unified TTS + STT infrastructure. Teams that need to tune latency vs audio quality per deployment (different codebook configs for different endpoints) benefit from Gradium's precision controls. The multiplexing advantage is particularly valuable in high-turn-frequency voice agent deployments. For a deeper dive on Gradium for voice agents specifically, see the best Text-To-Speech API for voice agents.

Deepgram Aura-2

Deepgram Aura-2 is a low-latency TTS model integrated into Deepgram's STT-first platform. It is designed to pair with Deepgram Nova-3 in voice agent pipelines.

Latency Profile

  • TTFA: 313 ms P50 on the Coval benchmark
  • Transport: WebSocket
  • Languages: 7 (English, Spanish, French, German, Dutch, Italian, Japanese)

Deepgram's Voice Agent API uses Aura-2 as the TTS component within a bundled STT + LLM + TTS orchestration endpoint, reducing integration overhead for teams already using Deepgram Nova.

Pricing

$0.030 per 1,000 characters ($0.027 at Growth tier).

Best For

Deepgram Aura-2 is the right choice for teams already using Deepgram Nova-3 for STT who want to consolidate latency-critical infrastructure on a single vendor. HIPAA-compliant on-premise deployment is available. For a side-by-side Gradium-vs-Deepgram breakdown, including TTS, voice cloning, and deployment, see the Deepgram alternative comparison.

How Should You Match TTS Latency to Your Use Case?

Not all real-time applications have the same latency tolerance. The table below maps TTFA ranges to application requirements.

| TTFA range | Perception | Suitable use cases | Providers in this range |
| --- | --- | --- | --- |
| Under 100 ms | Imperceptible delay | Ultra-low-latency voice agents, real-time gaming dialogue | No major provider benchmarked in this range by the Coval benchmark (2026) |
| 100–200 ms | Not perceived as delay in conversation | Voice agents, AI phone calls, interactive assistants | Gradium (155 ms P50 Coval), Gradium 8-codebook self-reported (160 ms), Gradium 16-codebook self-reported (185 ms), Cartesia Sonic-3 (188 ms P50 Coval) |
| 200–300 ms | At the threshold of a perceptible pause | Voice agents (acceptable), content generation | Gradium multiplexed WS self-reported (214 ms), Gradium 32-codebook self-reported (228 ms), ElevenLabs Flash v2.5 (288 ms P50 Coval) |
| 300 ms+ | Perceptible pause | Content creation, batch narration, non-real-time use | Deepgram Aura-2 (313 ms P50 Coval), ElevenLabs Multilingual v2 (1,232 ms P50 Coval), most general-purpose TTS |

How Should You Choose a Low-Latency TTS API?

  • Choose ElevenLabs Flash v2.5 if ElevenLabs' voice quality and broad language catalogue (32 languages) are primary criteria alongside competitive latency. Plan for HTTP streaming overhead in multi-turn deployments.
  • Choose Cartesia Sonic-3 if P99 consistency matters as much as P50, or if broad language coverage (40+) is required alongside low latency. The SSM architecture delivers the most predictable tail latency in this comparison.
  • Choose Gradium if you need the lowest independently benchmarked TTFA (155 ms P50 on the Coval benchmark), configurable TTFA via codebook config (160–228 ms self-reported), WebSocket-native streaming with multiplexing, unified TTS + STT infrastructure, and the ability to tune precision vs latency per deployment context. The full pipeline latency of 420–520 ms (STT + LLM + TTS) with streaming LLMs is among the lowest in production voice agent stacks.
  • Choose Deepgram Aura-2 if you are already on Deepgram Nova for STT and want consistent low-latency TTS without adding a new vendor.

Also comparing Cartesia, ElevenLabs, or Deepgram head-to-head with Gradium? Each comparison goes deeper on TTS quality, STT, voice cloning, and deployment options.

Frequently Asked Questions