
Time to First Audio: Measuring and Reducing TTS Latency in Voice Agents

March 24, 2026 · 4 min read

In natural conversation, the gap between one person finishing a sentence and the other starting to respond averages around 200 milliseconds. For voice agents this is the target to match. When such an agent is built on a cascaded pipeline involving a speech-to-text model (STT), an LLM, and a text-to-speech model (TTS), every millisecond spent in one component is a millisecond unavailable to the others.

This post covers how to measure TTS latency, why cascaded architectures make it especially critical, and how Gradium compares to ElevenLabs, Mistral, and OpenAI on the metric that matters most: Time to First Audio.

What Is Time to First Audio (TTFA) in Text-to-Speech?

The key metric for real-time TTS is Time to First Audio (TTFA): the elapsed time from sending the request to receiving the first playable audio samples. At Gradium, TTFA is one of the core performance metrics we optimize for.

TTFA is straightforward to define but easy to measure incorrectly, because most TTS APIs use streaming responses that begin transmitting data before the full utterance is synthesized. When a streaming TTS API responds, the first bytes are typically some audio container metadata such as WAV headers, Ogg identification and comment pages, or MP3 ID3 tags. These bytes carry no audio content. Naively measuring the time to first byte captures the arrival of this metadata, not the point at which the client can actually start playing sound.

The difference can be significant. A server might respond with headers and container metadata within 50ms, while the first actual audio samples arrive 200ms later. A naive benchmark would report 50ms. The user experiences 250ms.

Detecting the First Real Audio Byte

To measure TTFA accurately, you need to parse past the container format and timestamp the arrival of the first chunk containing encoded audio samples. For WAV, discard the initial 44-byte header; for Ogg/Opus, skip the identification and comment header pages; for MP3, skip any ID3 tags and then detect the first valid MPEG audio frame.

The first audio chunks may also contain silence; discarding it would require analyzing the audio itself.
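As a concrete sketch, the header-skipping logic for a raw WAV stream might look like this; the chunk iterable stands in for whatever streaming HTTP or websocket client you use, and the 44-byte figure assumes a canonical PCM WAV header with no extra chunks:

```python
import time

WAV_HEADER_BYTES = 44  # canonical PCM WAV header; real files may carry extra chunks

def measure_ttfa_wav(chunks, start=None):
    """Return the elapsed time until the first chunk containing actual
    PCM samples, skipping the WAV container header.  Pass the timestamp
    taken when the request was sent as `start`."""
    if start is None:
        start = time.monotonic()
    skipped = 0
    for chunk in chunks:
        if skipped < WAV_HEADER_BYTES:
            take = min(len(chunk), WAV_HEADER_BYTES - skipped)
            skipped += take
            chunk = chunk[take:]
        if chunk:  # first bytes past the header: playable audio
            return time.monotonic() - start
    return None  # stream ended without any audio samples
```

Note that the header may be split across several chunks, or arrive in the same chunk as the first audio samples; the running `skipped` counter handles both cases.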

Benchmarking TTS Providers

When benchmarking multiple TTS providers, the following variables need to be controlled:

  • Input text. Use a standardized sentence of 15-25 words, representative of typical voice agent output.
  • Output format and sample rate. Use the same format and sample rate across all providers. Different formats have different encoding overhead (Opus requires compression, PCM does not), and different sample rates may involve upsampling steps. Mixing formats means part of the TTFA difference reflects encoding and transmission costs rather than synthesis performance.
  • Geographic region. All providers should be tested from the same client location; we measured everything from our Paris office.
  • Network latency. Measure the raw TCP ping to each provider's endpoint before benchmarking. If one provider has 5ms ping and another has 40ms, you're measuring network distance, not engine speed. From our office, we measured a ping of ~5ms to our API endpoint and to the ElevenLabs one, and a ping of ~3ms to the OpenAI API.
  • Warm state. TTS APIs may cache voice or account data, making the first queries slower than steady-state ones. We ran 100 queries per model and discarded the first 5.
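A minimal harness for the last two points (a raw TCP connect timer as a ping proxy, plus warm-up discarding and percentile reporting) might look like this; `measure_once` stands in for whatever per-query TTFA measurement you use:

```python
import socket
import statistics
import time

def tcp_ping_ms(host, port=443):
    """Time a raw TCP connect (roughly one network round trip) to check
    that providers sit at comparable network distance before benchmarking."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=5):
        pass
    return (time.monotonic() - start) * 1000

def run_benchmark(measure_once, n_queries=100, n_warmup=5):
    """Collect n_queries TTFA samples (in ms), discard the warm-up queries
    that may hit cold caches, and report P25/P50/P75/P95."""
    samples = [measure_once() for _ in range(n_queries)]
    samples = sorted(samples[n_warmup:])  # drop cold-start queries, then sort
    # statistics.quantiles with n=100 yields the 1st..99th percentiles
    q = statistics.quantiles(samples, n=100, method="inclusive")
    return {"P25": q[24], "P50": q[49], "P75": q[74], "P95": q[94]}
```

The TCP connect time slightly overstates a raw ping (it includes socket setup), but it is a good enough proxy to catch a provider that is tens of milliseconds farther away.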

We used websocket APIs in this benchmark, as they are the most amenable to real-time use cases. For the OpenAI TTS models we used POST requests, since there is no websocket API to our knowledge.

TTS Latency Benchmark: Gradium vs. ElevenLabs vs. Mistral vs. OpenAI

At Gradium, we designed our inference pipeline with TTFA as a primary constraint. Our models are based on the Delayed Streams Modeling (DSM) architecture (arXiv paper), which enables batched generation while preserving streaming capabilities. Combined with CUDA graph optimization and configurable codebook depth (see our blog post), this allows us to achieve consistently low TTFA across batch sizes.

Model                   P25     P50     P75      P95
Gradium                 255 ms  258 ms  263 ms   274 ms
Eleven Turbo v2.5       294 ms  304 ms  311 ms   324 ms
Eleven Flash v2.5       317 ms  324 ms  333 ms   351 ms
Mistral Voxtral TTS     346 ms  369 ms  400 ms   566 ms
OpenAI GPT-4o Mini      400 ms  420 ms  439 ms   483 ms
Eleven Multilingual v2  690 ms  706 ms  720 ms   742 ms
OpenAI TTS-1            722 ms  969 ms  1232 ms  1807 ms

The plot below shows the same data, excluding OpenAI TTS-1, whose latencies are too large to fit on the plot.

Gradium delivers first audio faster than all alternatives while maintaining comparable or better voice quality. This gap directly translates to more headroom in the latency budget for tool calls and LLM reasoning.

How to Optimize End-to-End Voice Agent Latency

Even with a fast TTS engine, architectural choices in the surrounding pipeline can add tens or hundreds of milliseconds.

Colocate Your Pipeline Components

If the components of your pipeline run in different data centers, every hop between them adds latency, and the total can reach hundreds of milliseconds. Make sure the agent coordination server sits close to the voice API provider, the LLM, and the end user.

Stream LLM Output into TTS

All Gradium models are streaming: you can feed input one word at a time and receive audio output as soon as it’s available. By streaming LLM tokens directly into TTS, the overall latency is reduced, and a response that takes 2 seconds to fully generate can begin playing audio before the LLM generation is done.
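The pattern can be illustrated with asyncio, using stand-ins for the LLM and TTS streams; the function names, token list, and placeholder audio strings below are hypothetical, not a real Gradium or LLM client:

```python
import asyncio

async def llm_tokens():
    """Stand-in for a streaming LLM: yields tokens as they are generated."""
    for tok in ["Hello", " there", ",", " how", " can", " I", " help", "?"]:
        await asyncio.sleep(0.01)  # simulated per-token generation delay
        yield tok

async def tts_session(text_queue, audio_out):
    """Stand-in for a streaming TTS session: consumes text incrementally
    and emits audio as soon as it is ready."""
    while (tok := await text_queue.get()) is not None:
        audio_out.append(f"<audio:{tok.strip()}>")  # placeholder for audio bytes

async def speak():
    queue, audio = asyncio.Queue(), []
    tts = asyncio.create_task(tts_session(queue, audio))
    async for tok in llm_tokens():  # forward each token as soon as it arrives,
        await queue.put(tok)        # without waiting for the full LLM response
    await queue.put(None)           # end-of-stream sentinel
    await tts
    return audio
```

The key property is that the TTS task starts producing audio after the first few tokens, while later tokens are still being generated.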

Reduce Connection Overhead with Multiplexing

A voice agent may use a separate websocket connection for each turn of speech. This requires establishing a new connection each time, which in our benchmarks takes on the order of 50 ms.

To get around this, create a single persistent websocket connection and use it to multiplex a session per turn of speech. For optimal latency, this persistent connection can even be opened when the agent launches rather than on the first turn of speech. On the Gradium API, connect to the usual websocket endpoint and, for each session, send a new setup message with close_ws_on_eos set to false and a distinct client_req_id. Include the client request id in every message sent to the API; responses echo it back so the different sessions can be separated.

More details can be found in Gradium’s Multiplexing documentation.
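The per-session bookkeeping might be sketched as follows. Only close_ws_on_eos and client_req_id come from the description above; the other field names (type, voice) and the JSON framing are illustrative, not the exact Gradium wire format:

```python
import itertools
import json

_req_counter = itertools.count(1)

def setup_message(voice="default"):
    """Build the setup message that opens a new session on the shared
    persistent websocket connection."""
    client_req_id = f"session-{next(_req_counter)}"
    msg = {
        "type": "setup",                 # illustrative framing
        "voice": voice,                  # illustrative field
        "close_ws_on_eos": False,        # keep the websocket open after this session
        "client_req_id": client_req_id,  # distinct id, echoed back in responses
    }
    return client_req_id, json.dumps(msg)

def route_response(raw, sessions):
    """Dispatch an incoming message to its session using client_req_id."""
    msg = json.loads(raw)
    sessions.setdefault(msg["client_req_id"], []).append(msg)
```

Because every response carries the client_req_id it was sent with, a single reader loop can demultiplex concurrent sessions into per-session queues.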

The table below shows the TTFA values excluding the connection establishment times.

Model                   P25     P50     P75     P95
Gradium                 212 ms  214 ms  219 ms  228 ms
Eleven Turbo v2.5       248 ms  257 ms  263 ms  278 ms
Eleven Flash v2.5       271 ms  277 ms  284 ms  302 ms
Eleven Multilingual v2  643 ms  657 ms  672 ms  688 ms

Gradium Deployment Options

Depending on your latency requirements, data constraints, and infrastructure, Gradium can be deployed in several ways:

  • Cloud API. The fastest way to get started. Access Gradium TTS through our hosted API, with endpoints in multiple regions.
  • Inference partner deployments. We deploy Gradium’s API with our own infrastructure partners in multiple locations worldwide, letting you colocate with your existing LLM providers.
  • Dedicated instances. Reserved compute with guaranteed capacity and deterministic latency, no shared-infrastructure variance.
  • Self-hosted. We also offer private cloud options for self-hosted inference, letting you run TTS inference on your own GPU infrastructure.
  • On-premises. For teams with strict data sovereignty or regulatory constraints (healthcare, financial services, …), we deploy Gradium directly within your infrastructure. Audio data never leaves your environment.

Beyond Latency

This post focused on latency because it's the most measurable and often the most neglected bottleneck in voice agent pipelines. But latency alone doesn't make a great voice agent. Production systems also need natural-sounding output and a robust orchestration layer that manages streaming, interruptions, and turn-taking end-to-end. Gradium is designed to fit into that full stack, not just win an isolated benchmark.

If you're building a voice agent and want to discuss which deployment option fits your architecture, we'd love to hear from you. Reach out at contact@gradium.ai or visit gradium.ai.

You can also start integrating Gradium right away: Get started.