Gradium #1 on Coval TTS Benchmarks

In natural conversation, the gap between one person finishing a sentence and the other starting to respond averages around 200 milliseconds.
For voice agents built on cascaded pipelines (STT, LLM, tool calls, TTS), TTS is the last stage before the user hears anything, and it inherits every upstream delay.
The latency budget for TTS itself is typically 200-300ms Time to First Audio (TTFA).
As of May 13, 2026, Gradium ranks first on Coval's TTS benchmarks across all latency metrics: P50 TTFA and latency range (the P25-P75 spread). This does not come at the expense of quality: Gradium also delivers state-of-the-art Word Error Rate (WER).
Coval is an independent voice AI evaluation platform (YC S24); its TTS benchmark results are published at benchmarks.coval.ai/tts. Gradium develops audio language models that power text-to-speech, speech-to-text, and voice cloning through a single API. This post covers the methodology and the results for Gradium, ElevenLabs (flash_v2.5 and Multilingual v2), Cartesia (sonic-3), Rime (arcana), and OpenAI.
What Coval is and why its TTS benchmarks matter
Coval was founded in 2024 by Brooke Hopkins, who previously built evaluation job infrastructure for self-driving at Waymo. The platform applies the testing rigor self-driving teams developed over a decade (simulation-driven testing, continuous benchmarking, trajectory-based analysis) to voice and chat AI evaluation.
Most TTS benchmarks come from the vendors themselves. The conditions get picked to flatter the system being measured: studio-quality reference text, simple inputs, controlled load, P50 numbers without spread. The measurements are real, but they are not comparable across providers, and they do not necessarily reflect production conditions.
Coval's approach is the opposite: standardized conditions across providers, continuously updated, with methodology and code made public. For voice agent teams, this is the closest thing the industry has to an apples-to-apples TTS comparison.
What Coval measures today: TTFA, latency range, and TTS WER
The TTS suite measures three properties that determine whether a voice agent works in production.
Time to First Audio (TTFA) is the time from when a request is sent until the first streamed chunk containing audio samples is received from the API. TTFA is reported at P50, P75, P95, and P99. For deeper methodology, see Time to First Audio: Measuring and Reducing TTS Latency in Voice Agents.
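The definition above can be sketched in a few lines: start a clock, request a stream, and stop the clock at the first non-empty audio chunk. This is a minimal illustration, not Coval's harness; `stream_request` stands in for whatever streaming TTS client you use.

```python
import time

def measure_ttfa_ms(stream_request):
    """Return time-to-first-audio in milliseconds for one streaming request.

    stream_request: a callable that starts a TTS request and returns an
    iterator of audio chunks (bytes). TTFA is the elapsed time from the
    call until the first non-empty chunk arrives.
    """
    start = time.monotonic()
    for chunk in stream_request():
        if chunk:  # first chunk actually containing audio samples
            return (time.monotonic() - start) * 1000.0
    raise RuntimeError("stream ended without producing audio")
```

Note that the clock starts before the request is issued, so connection setup and server-side queuing are included, which is what the user actually experiences.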
Latency range is the P25-P75 spread of TTFA across requests. Conversational latency is governed by both the worst case and the median. A model with 200ms P50 and 600ms P75 will feel worse than one with 250ms P50 and 300ms P75.
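Given a batch of TTFA samples, the P50 and the P25-P75 spread can be computed with the standard library alone. A minimal sketch, using `statistics.quantiles` with the inclusive method (linear interpolation over the observed samples):

```python
import statistics

def latency_summary(ttfa_ms):
    """Summarize TTFA samples (ms): median plus the P25-P75 spread."""
    cuts = statistics.quantiles(sorted(ttfa_ms), n=100, method="inclusive")
    p25, p50, p75 = cuts[24], cuts[49], cuts[74]  # cut points are 1-indexed percentiles
    return {"p50": p50, "p25": p25, "p75": p75, "spread": p75 - p25}
```

A provider with a small spread is predictable under load; a provider with a large spread forces you to budget for its P75, not its P50.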
Word Error Rate (WER) for TTS is an intelligibility metric. The synthesized audio is transcribed with a reference STT model and the resulting transcript is compared to the input text. Low WER means the model articulates correctly, handles complex pronunciations cleanly, and does not hallucinate or drop words. WER on TTS is distinct from WER on STT, which measures transcription accuracy on real human speech.
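Once the synthesized audio has been transcribed, WER reduces to a word-level edit distance between the transcript and the input text: substitutions, deletions, and insertions, divided by the reference word count. A self-contained sketch (production evaluations typically also normalize punctuation and numbers before comparing):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    via word-level Levenshtein distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For TTS evaluation, `reference` is the input text and `hypothesis` is the STT transcript of the synthesized audio, so a dropped or hallucinated word shows up directly as an insertion or deletion.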
TTS benchmark results: Gradium vs. ElevenLabs, Cartesia, Rime, and OpenAI
As of May 13, 2026, Gradium ranks first overall: the Gradium model is state of the art on every measured metric, consistent with our mission to provide low-latency voice models that scale.
Methodology: Coval's benchmark methodology and code are open source at github.com/coval-ai/benchmarks. The full input dataset is published and covers the kinds of utterances voice agents handle in production. Two examples:
"There's a slight delay with your $347.89 order, but we expect it to ship by Friday afternoon."
"Hi Ms. Garcia, your appointment with Dr. Peterson is scheduled for Tuesday, March 5th at 10:30 AM."
Why Gradium ranks first on Coval's TTS benchmarks
Gradium develops audio language models (ALMs), an audio-native counterpart to LLMs. Rather than training a TTS model and stitching it into a pipeline with bespoke tuning, an ALM is trained on paired audio and text and performs multiple voice tasks within a single architecture. This paradigm was originally introduced by Gradium's founders at Google and Meta and has since become the dominant approach across the industry. Gradium's models are based on Delayed Streams Modeling (DSM, arXiv:2509.08753), which enables batched generation while preserving streaming capabilities.
For more details, see Optimizing Quality vs. Latency in Real-Time TTS.
How to reproduce Coval's TTS benchmarks
Coval's benchmark code is open source: github.com/coval-ai/benchmarks. The suite can be run against any combination of providers, with custom audio data and test scenarios.
For voice agent teams, this is more useful than relying on either vendor-reported numbers or generic public results. Production conditions are specific to your text inputs, your latency budget, your concurrent load.
What Coval's public TTS benchmark doesn't (yet) cover
The public Coval suite covers latency, consistency, and intelligibility. It does not yet include:
- Languages beyond English. Our internal benchmarks show consistent WER across languages: see The most accurate multilingual text-to-speech, by the numbers.
- Voice cloning fidelity. This is also benchmarked on the Gradium blog with blind ELO comparisons across four languages: Why Your Voice Cloning Sounds Fake (And How to Fix It).
Get started with Gradium TTS and Coval
If you're building a voice agent and want to talk through TTS evaluation, benchmarking methodology, or how Gradium fits into your stack, we'd love to hear from you. Reach out at contact@gradium.ai or visit gradium.ai.
You can also start integrating Gradium right away: Get started.
To run your own TTS evaluations: benchmarks.coval.ai/tts for live results, github.com/coval-ai/benchmarks for the code. For partnership and enterprise enquiries, contact Coval at sales@coval.dev.
References
- Gradium, Optimizing Quality vs. Latency in Real-Time TTS.
- Gradium, Why Your Voice Cloning Sounds Fake (And How to Fix It).
- Gradium API documentation, pronunciation dictionaries: gradium.ai/api_docs.html.
- Coval voice AI benchmarks: benchmarks.coval.ai. Methodology and code: github.com/coval-ai/benchmarks.