Which STT API has the lowest WER in the 2026 Coval benchmark?

Gradium STT records the lowest average WER in this benchmark at 2.4 percent, with a standard deviation of 5.1. ElevenLabs Scribe v2 is second at 3.1 percent WER. AssemblyAI Universal Streaming is third at 4.2 percent. Deepgram Nova 3 and Nova 2 record 25.3 percent and 25.2 percent respectively. Source: Coval STT benchmark at benchmarks.coval.ai/stt, n=2,400 runs per model.

Which STT API has the fastest TTFT in this benchmark?

Deepgram Nova 3 is the fastest model, used as the baseline in the Coval latency rankings. Its median TTFT is 992 ms with a standard deviation of 125 ms across 2,400 runs. Deepgram Nova 2 records an identical median (992 ms) with higher variance (Std 210 ms).

Is Deepgram's STT accuracy really 25 percent WER?

In this Coval benchmark, Deepgram Nova 3 records 25.3 percent average WER and Nova 2 records 25.2 percent WER across 2,400 runs each. This is a specific benchmark result under specific production conditions and should not be interpreted as Deepgram's accuracy on all input types or use cases. The benchmark is designed for voice agent applications and the WER figures reflect performance on those inputs. The result is consistent between Nova 3 and Nova 2, which adds confidence to the finding.

Why is there a tradeoff between TTFT and WER in STT for voice agents?

Fast STT models optimize for returning transcription tokens quickly, which requires different architectural choices than optimizing for accuracy across varied, complex inputs. Models like Deepgram Nova 3 are designed for real-time transcription speed, which produces low TTFT at the cost of accuracy on difficult inputs. Models like Gradium STT or ElevenLabs Scribe v2 prioritize transcription quality across the full distribution of voice agent inputs, which requires more processing time.

What is TTFT in STT and why does it matter for voice agents?

TTFT (Time to First Token) measures how quickly an STT model returns the first transcribed tokens after receiving audio input. In a voice agent pipeline (STT transcribes, LLM generates a response, TTS synthesizes speech), the STT TTFT is the first step contributing to the user's perceived wait time. A model with 2,080 ms TTFT (ElevenLabs Scribe v2) adds over two seconds to the start of the response chain before the LLM has seen a single word. For live conversational voice agents, TTFT is a direct latency constraint.
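As a rough illustration of what a TTFT measurement looks like in code, the sketch below times the gap between starting a transcription stream and receiving its first token. Everything here is hypothetical: `measure_ttft_ms` and `fake_stt_stream` are illustrative names, the fake stream stands in for a real STT client, and Coval's actual measurement harness is not described in this article.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_ttft_ms(start_stream: Callable[[], Iterable[str]]) -> Tuple[float, str]:
    """Return (milliseconds until the first token, the first token)."""
    started = time.perf_counter()
    for token in start_stream():
        # Stop the clock as soon as the first token arrives.
        elapsed_ms = (time.perf_counter() - started) * 1000
        return elapsed_ms, token
    raise RuntimeError("stream ended without producing a token")


def fake_stt_stream():
    """Hypothetical stand-in for an STT client: waits 50 ms, then yields tokens."""
    time.sleep(0.05)
    yield "hello"
    yield "world"


ttft_ms, first_token = measure_ttft_ms(fake_stt_stream)
print(f"first token {first_token!r} after {ttft_ms:.0f} ms")
```

In a real pipeline the timer would start when audio is handed to the provider's streaming client, so network time counts toward the result.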

What does WER standard deviation tell us about an STT model?

WER standard deviation measures how consistently a model performs across different inputs. A low average WER with a high standard deviation means the model is very accurate on some inputs and poor on others, which is harder to reason about in production than a slightly higher but stable WER. In this benchmark, Gradium STT records the lowest WER standard deviation (5.1), meaning its 2.4 percent average WER is stable across varied inputs. Deepgram Nova 3 records a 30.1 WER standard deviation against a 25.3 percent average, indicating large variation run to run.
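To make the difference between average and spread concrete, here is a minimal sketch with two made-up sets of per-run WER values that share the same average but differ in standard deviation. The numbers are illustrative only and are not benchmark data. The sketch also assumes a population standard deviation; the article does not say which variant Coval reports.

```python
from statistics import mean, pstdev


def summarize_wer(per_run_wer):
    """Return (average WER, population standard deviation) for per-run WER in percent."""
    return mean(per_run_wer), pstdev(per_run_wer)


# Hypothetical per-run WER values in percent (not taken from the benchmark).
stable_runs = [2.0, 3.0, 2.0, 3.0]
erratic_runs = [0.0, 0.0, 0.0, 10.0]

stable_avg, stable_std = summarize_wer(stable_runs)
erratic_avg, erratic_std = summarize_wer(erratic_runs)

print(f"stable:  avg {stable_avg:.1f}%, std {stable_std:.2f}")
print(f"erratic: avg {erratic_avg:.1f}%, std {erratic_std:.2f}")
```

Both sets average 2.5 percent, but the second hides one run at 10 percent. That is the kind of behavior a high WER standard deviation flags.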

Does Gradium's STT include semantic VAD?

Yes. Gradium's STT API includes semantic voice activity detection, which determines when a speaker has finished a complete thought rather than just stopped making sound. Semantic VAD is not measured in this TTFT/WER benchmark but is relevant for the quality of turn-taking in live voice agents. It is included in all Gradium plans as part of the integrated TTS plus STT platform.

How does this STT benchmark compare to the Coval TTS benchmark?

The Coval TTS benchmark (May 4, 2026) measures TTS performance: TTFA (Time to First Audio), WER of the synthesized speech, and latency IQR. The STT benchmark measures the inverse: TTFT (Time to First Token from audio input) and WER of the transcription. Both use Coval's production benchmark infrastructure. The metrics are not directly comparable because they measure different steps in the voice pipeline, but together they characterize the full latency and accuracy profile of a provider's voice stack.

What is the latency vs accuracy tradeoff in STT for voice agents?

The Coval benchmark shows that no single STT model leads on both latency and accuracy. The fastest models (Deepgram Nova 3 and Nova 2, 992 ms median TTFT) record the highest WER (25 percent+). The most accurate model (Gradium STT, 2.4 percent WER) sits at 1,560 ms median TTFT. AssemblyAI Universal Streaming sits in the middle (1,061 ms TTFT, 4.2 percent WER). The right choice depends on whether the use case tolerates 25 percent WER for the latency savings, or requires accurate transcription at the cost of higher TTFT.

Does Gradium offer a free plan to test STT?

Yes. The free plan is zero dollars per month with 45,000 credits, covering both TTS and STT. STT uses 3 credits per second, so the free plan provides approximately 4 hours of STT per month for evaluation and non-commercial use. Paid plans start at 13 dollars per month for the XS tier.
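The "approximately 4 hours" figure follows directly from the credit numbers above. A quick check, assuming every free-plan credit is spent on STT and none on TTS:

```python
FREE_PLAN_CREDITS = 45_000   # credits per month on the free plan
STT_CREDITS_PER_SECOND = 3   # STT cost stated above


def stt_hours(credits, credits_per_second=STT_CREDITS_PER_SECOND):
    """Convert a credit balance into hours of STT audio."""
    seconds = credits / credits_per_second
    return seconds / 3600


free_hours = stt_hours(FREE_PLAN_CREDITS)
print(f"{free_hours:.2f} hours")  # prints "4.17 hours"
```

Any credits used for TTS in the same month reduce the STT hours accordingly, since both draw from the same balance.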

Where can I find the live Coval STT benchmark?

The live Coval STT benchmark dashboard is at benchmarks.coval.ai/stt. It refreshes continuously as new runs are recorded, so the values shown track current production performance rather than a frozen snapshot. The figures in this article are from the May 2026 snapshot referenced in the source data.

Gradium was co-founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, who previously co-founded Kyutai. Kyutai released world-first open systems including Moshi (real-time speech-to-speech) and Hibiki (live speech-to-speech translation).

STT API Benchmark 2026: Latency and Accuracy for Voice Agents

Choosing a speech-to-text (STT) API for a production voice agent involves a tradeoff that is not visible in marketing materials: the fastest models are often the least accurate, and the most accurate models are often the slowest. Understanding where each provider sits on that curve, under real production conditions, is what determines which STT model is the right fit for a specific use case.

This article covers an independent Coval benchmark comparing five speech-to-text APIs on Time to First Token (TTFT) and Word Error Rate (WER) for voice agent applications: Gradium STT, Deepgram Nova 3, Deepgram Nova 2, AssemblyAI Universal Streaming, and ElevenLabs Scribe v2. Each model was evaluated across 2,400 runs. The benchmark exposes a structural tradeoff in the 2026 STT market: the two fastest models (Deepgram Nova 3 and Nova 2) record the highest WER by a significant margin (25.2 to 25.3%), while the most accurate model (Gradium STT, 2.4% WER) sits further down the latency distribution. No single provider leads on both dimensions.

Why Are STT Benchmarks for Voice Agents Different?

STT APIs for voice agents are evaluated differently from transcription APIs for batch processing. In batch transcription, the only metric that matters is accuracy: the job runs offline, latency is irrelevant, and the output is a text file reviewed by a human.

In a live voice agent, two metrics define production quality:

Time to First Token (TTFT): how quickly does the STT model return the first transcribed tokens after the user finishes speaking? In a turn-based voice agent architecture (user speaks, STT transcribes, LLM generates, TTS synthesizes), TTFT is the first step in the response latency chain. A slow STT model adds directly to the time the user waits before hearing a response.
Word Error Rate (WER): how accurately does the model transcribe what the user said? A misheard or otherwise incorrect transcription produces a bad LLM input, which produces a bad agent response. In voice agents handling names, numbers, addresses, or product codes, WER is a direct proxy for failure rate on structured data.

Both metrics matter. A model with low WER but 2-second TTFT will feel broken in conversation. A model with sub-second TTFT but 25% WER will misinterpret one in four words, making the agent unreliable in any context beyond simple commands.
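For readers who have not computed WER before, the sketch below implements the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The only normalization it assumes is lowercasing and whitespace splitting. Coval's exact text normalization is not described in this article and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of six: "four" transcribed as "for".
wer = word_error_rate("my order number is four two", "my order number is for two")
print(f"{wer:.1%}")
```

A single substituted digit is enough to break an order lookup, which is why WER matters more on structured inputs than the raw percentage suggests.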

What Is the Benchmark Setup?

The Coval benchmark measures STT performance for voice agent applications under production conditions. Key parameters:

Models tested: Gradium STT, Deepgram Nova 3, Deepgram Nova 2, AssemblyAI Universal Streaming, ElevenLabs Scribe v2
Runs per model: 2,400
Primary latency metric: TTFT (Time to First Token), measured in milliseconds
Primary accuracy metric: average WER (Word Error Rate), lower is better
Latency ranking: relative delta from Deepgram Nova 3 (baseline, fastest model)

Note: statistical outliers above 2.3 seconds are excluded from the latency distribution charts (the Coval dashboard refreshes continuously and the exclusion count varies snapshot to snapshot).

What Are the Latency Results?

Performance Rankings (Latency Delta from Nova 3 Baseline)

The latency table uses Deepgram Nova 3 as the baseline (position 1, fastest). All other models are expressed as deltas relative to Nova 3.

Rank	Model	Provider	P25 Delta	P50 Delta	P75 Delta
1	Nova 3	Deepgram	baseline	baseline	baseline
2	Nova 2	Deepgram	+19 ms	+22 ms	+25 ms
3	Universal Streaming	AssemblyAI	+52 ms	+57 ms	+67 ms
4	Gradium STT	Gradium	+593 ms	+596 ms	+601 ms
5	Scribe v2	ElevenLabs	+1,040 ms	+1,053 ms	+1,064 ms

Source: Coval STT benchmark. n=2,400 runs per model. Latency deltas relative to Deepgram Nova 3 (fastest baseline).

Absolute TTFT Distribution (Latency Variation)

Model	Provider	Mean TTFT	Median TTFT	Std Dev	Runs
Nova 3	Deepgram	1,020 ms	992 ms	125 ms	2,400
Nova 2	Deepgram	1,042 ms	992 ms	210 ms	2,400
Universal Streaming	AssemblyAI	1,079 ms	1,061 ms	237 ms	2,400
Gradium STT	Gradium	1,617 ms	1,560 ms	409 ms	2,400
Scribe v2	ElevenLabs	2,073 ms	2,080 ms	60 ms	2,400

Source: Coval STT benchmark. Statistical outliers above 2.3 s excluded from distribution.
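The columns in this table are ordinary summary statistics over per-run TTFT samples. The sketch below shows one way to compute them with the 2.3 second cutoff applied first. It is only an illustration: the sample values are made up, the use of a population standard deviation is an assumption, and the article states the cutoff only for the distribution charts, so applying it before computing the statistics is also an assumption.

```python
from statistics import mean, median, pstdev

OUTLIER_CUTOFF_MS = 2_300  # cutoff stated for the distribution charts


def summarize_ttft(samples_ms, cutoff_ms=OUTLIER_CUTOFF_MS):
    """Drop samples above the cutoff, then summarize what remains."""
    kept = [s for s in samples_ms if s <= cutoff_ms]
    return {
        "mean_ms": mean(kept),
        "median_ms": median(kept),
        "std_ms": pstdev(kept),
        "runs_kept": len(kept),
        "runs_excluded": len(samples_ms) - len(kept),
    }


# Hypothetical TTFT samples in milliseconds (not benchmark data).
summary = summarize_ttft([950, 990, 1000, 1060, 3100])
print(summary)
```

A mean above the median, as in the Nova 3 row, is what a right-skewed distribution with a few slow runs looks like.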

How Should You Read the Latency Data?

Deepgram Nova 3 and Nova 2 sit at the same median TTFT (992 ms) but differ in standard deviation: Nova 3 at 125 ms Std vs Nova 2 at 210 ms Std. Nova 3 is both faster on average and more consistent in its latency distribution.

AssemblyAI Universal Streaming is 57 ms slower at P50 than Nova 3, with a 237 ms standard deviation, the highest of the three faster models. The tail latency is wider than either Deepgram model.

Gradium STT records a median TTFT of 1,560 ms, placing it 596 ms behind Nova 3 at P50. The standard deviation of 409 ms is the highest in the benchmark, indicating meaningful latency variation across runs. The Delta IQR (interquartile range of the latency delta) is only 8 ms, however, which suggests that the middle half of the distribution is tight and that most of the variation comes from the tails.

ElevenLabs Scribe v2 is the slowest model on TTFT (median 2,080 ms, mean 2,073 ms) but records the lowest standard deviation of all five models (60 ms). Its latency is slow but highly predictable: the distribution is narrow and compact.
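The Delta IQR quoted for Gradium STT is simply the P75 delta minus the P25 delta. A small sketch using the published deltas from the rankings table:

```python
# (P25 delta, P75 delta) in ms relative to Nova 3, copied from the rankings table.
DELTAS_MS = {
    "Nova 2": (19, 25),
    "Universal Streaming": (52, 67),
    "Gradium STT": (593, 601),
    "Scribe v2": (1040, 1064),
}


def delta_iqr(p25_ms, p75_ms):
    """Interquartile range: the width of the middle 50% of the distribution."""
    return p75_ms - p25_ms


iqr_by_model = {model: delta_iqr(*quartiles) for model, quartiles in DELTAS_MS.items()}
for model, iqr in iqr_by_model.items():
    print(f"{model}: {iqr} ms")
```

For Scribe v2 this gives 24 ms, while the combined table later in the article lists 25 ms. The 1 ms gap may come from rounding in the published percentiles.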

What Are the Accuracy Results?

Accuracy by Model

Model	Provider	Avg WER	WER Std Dev
Gradium STT	Gradium	2.4%	5.1
Scribe v2	ElevenLabs	3.1%	5.7
Universal Streaming	AssemblyAI	4.2%	6.1
Nova 2	Deepgram	25.2%	29.1
Nova 3	Deepgram	25.3%	30.1

Source: Coval STT benchmark. n=2,400 runs per model.

How Should You Read the Accuracy Data?

The WER ranking produces a different ordering than the latency ranking. Gradium STT records the lowest WER (2.4%) and ElevenLabs Scribe v2 the second lowest (3.1%). AssemblyAI Universal Streaming follows at 4.2%. The top three on accuracy are the bottom three on latency.

The most striking finding is the Deepgram WER: Nova 3 records 25.3% WER and Nova 2 records 25.2%, in the same benchmark. Despite being the two fastest models, both Deepgram models produce roughly one word error for every four words transcribed in this benchmark. This is a tradeoff that is invisible when evaluating TTFT alone.

The WER standard deviation also reveals consistency differences. Gradium records 5.1 WER standard deviation, the lowest of all five models. ElevenLabs Scribe v2 records 5.7. The Deepgram models record 29.1 and 30.1, indicating that their WER is not only high on average but highly variable across runs.

How Do All Five Models Compare in One Heatmap?

Model	P25 Delta (ms)	P50 Delta (ms)	P75 Delta (ms)	Delta IQR (ms)	Avg WER	WER Std Dev
Gradium STT	593	596	601	8	2.4%	5.1
Deepgram Nova 2	19	22	25	6	25.2%	29.1
Deepgram Nova 3	0	0	0	0	25.3%	30.1
ElevenLabs Scribe v2	1,040	1,053	1,064	25	3.1%	5.7
AssemblyAI Universal Streaming	52	57	67	15	4.2%	6.1

Source: Coval STT benchmark. Latency deltas relative to Deepgram Nova 3 (baseline). n=2,400 runs per model.

The heatmap makes the fundamental tradeoff visible in one view. No single model leads on both latency and accuracy. The fastest models (Deepgram Nova 3 and Nova 2) record the worst WER. The most accurate model (Gradium STT, 2.4% WER) has the second-highest latency delta (+596 ms P50 vs baseline), behind only ElevenLabs Scribe v2 (+1,053 ms). The middle-ground model (AssemblyAI Universal Streaming at 4.2% WER and +57 ms P50) represents a different balance point.

How Does the Latency vs Accuracy Tradeoff Play Out in Practice?

For a voice agent where the user is waiting for a spoken response, the total perceived latency is the sum of: STT processing time, LLM response generation, and TTS time to first audio. The STT step is only one component of the full pipeline.

If the STT model contributes 992 ms (Nova 3) vs 1,560 ms (Gradium STT), the difference is 568 ms. Whether that difference is decisive depends on what the other pipeline components contribute. In a pipeline where TTS adds 155 ms (Coval TTS benchmark, Gradium TTS) and the LLM adds 300 to 500 ms, the total comes to roughly 1.4 to 1.6 seconds with Deepgram Nova 3 and 2.0 to 2.2 seconds with Gradium STT: a meaningful but not disqualifying increase in total response time.
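The pipeline arithmetic can be written out as a small latency budget. This sketch assumes the three stages run strictly in sequence, as the sum described above implies, and uses 400 ms for the LLM. That value is an assumed midpoint of the 300 to 500 ms range, not a measured figure.

```python
# Median STT TTFT from the Coval STT benchmark (ms).
STT_MEDIAN_TTFT_MS = {"Deepgram Nova 3": 992, "Gradium STT": 1560}
TTS_TTFA_MS = 155  # Gradium TTS figure quoted from the Coval TTS benchmark
LLM_MS = 400       # assumption: midpoint of the 300 to 500 ms range


def total_latency_ms(stt_ms, llm_ms=LLM_MS, tts_ms=TTS_TTFA_MS):
    """Perceived latency for a strictly sequential STT -> LLM -> TTS pipeline."""
    return stt_ms + llm_ms + tts_ms


totals = {name: total_latency_ms(ms) for name, ms in STT_MEDIAN_TTFT_MS.items()}
gap_ms = totals["Gradium STT"] - totals["Deepgram Nova 3"]
gap_share = gap_ms / totals["Gradium STT"]

for name, total in totals.items():
    print(f"{name}: {total} ms")
print(f"gap: {gap_ms} ms ({gap_share:.0%} of the slower pipeline)")
```

With these inputs the gap is 568 ms, about 27% of the slower pipeline's total. Pipelines that overlap stages, for example by streaming partial transcripts into the LLM, would see a different split.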

If the STT model contributes 25.3% WER (Nova 3) vs 2.4% WER (Gradium STT), the difference is 22.9 percentage points. In a voice agent handling real-world inputs, a 25% WER means roughly one word in four reaching the LLM is wrong. For a customer support agent handling order numbers, appointment times, or billing questions, this produces incorrect responses that no amount of latency optimization can fix.

The decision framework:

Prioritize TTFT when the voice interaction is command-and-control (short, simple inputs), total pipeline latency is the primary user experience constraint, and input content is predictable and high-confidence.
Prioritize WER when the voice agent handles names, numbers, addresses, dates, product codes, or any structured data where transcription accuracy directly determines response quality. For these use cases, a 25% WER is not an acceptable tradeoff for saving 568 ms.
Middle ground: AssemblyAI Universal Streaming (4.2% WER, +57 ms P50) comes closest to combining low latency with low WER: nearly as fast as the Deepgram models and meaningfully more accurate.
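One way to apply this framework is to state the application's latency and accuracy ceilings and filter the benchmark results against them. In the sketch below the model figures come from the tables in this article, while the thresholds passed to `candidates` are hypothetical and would need to come from your own product requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SttResult:
    model: str
    median_ttft_ms: int
    avg_wer_pct: float


# Median TTFT and average WER as reported in this article's tables.
RESULTS = [
    SttResult("Deepgram Nova 3", 992, 25.3),
    SttResult("Deepgram Nova 2", 992, 25.2),
    SttResult("AssemblyAI Universal Streaming", 1061, 4.2),
    SttResult("Gradium STT", 1560, 2.4),
    SttResult("ElevenLabs Scribe v2", 2080, 3.1),
]


def candidates(max_ttft_ms, max_wer_pct):
    """Models that satisfy both ceilings, fastest first."""
    fits = [
        r for r in RESULTS
        if r.median_ttft_ms <= max_ttft_ms and r.avg_wer_pct <= max_wer_pct
    ]
    return sorted(fits, key=lambda r: r.median_ttft_ms)


# Hypothetical requirement: structured data, so WER must stay at or below 3%.
strict_accuracy = [r.model for r in candidates(max_ttft_ms=2000, max_wer_pct=3.0)]
print(strict_accuracy)
```

Treating both metrics as hard ceilings, rather than blending them into a single score, keeps the choice tied to what the application can actually tolerate.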

What Is Each Model's Profile?

Gradium STT

TTFT: median 1,560 ms, Std Dev 409 ms, Delta IQR 8 ms
WER: 2.4% avg, Std Dev 5.1 (lowest of all models on both metrics)
Best for: voice agents where transcription accuracy on structured inputs (numbers, names, dates, addresses) is a hard requirement and total pipeline latency is manageable
Gradium STT also includes semantic voice activity detection (VAD) for natural turn-taking in voice agents, not measured in this benchmark

Deepgram Nova 3

TTFT: median 992 ms, Std Dev 125 ms, Delta IQR 0 (benchmark baseline)
WER: 25.3% avg, Std Dev 30.1 (highest of all models)
Best for: voice applications where speed is the primary constraint and input content is simple enough that high WER is acceptable (basic commands, keyword detection, navigation)

Deepgram Nova 2

TTFT: median 992 ms, Std Dev 210 ms, Delta IQR 6 ms
WER: 25.2% avg, Std Dev 29.1
Very similar profile to Nova 3 on both metrics. Marginally lower WER (25.2% vs 25.3%), slightly higher latency variance (Std 210 ms vs 125 ms)

AssemblyAI Universal Streaming

TTFT: median 1,061 ms, Std Dev 237 ms, Delta IQR 15 ms
WER: 4.2% avg, Std Dev 6.1
Best for: applications requiring a balance between acceptable latency and meaningfully better accuracy than Deepgram, where the 4.2% WER is sufficient for the use case

ElevenLabs Scribe v2

TTFT: median 2,080 ms, Std Dev 60 ms (most consistent latency of all models), Delta IQR 25 ms
WER: 3.1% avg, Std Dev 5.7
Second-most accurate model in the benchmark. Slowest on TTFT by a significant margin (median 1,088 ms above Nova 3's, with a P50 delta of +1,053 ms). The extremely low Std Dev (60 ms) makes its latency highly predictable, but 2,080 ms median TTFT makes it unsuitable for real-time conversational voice agents

What Does This Mean for STT API Selection in 2026?

This Coval benchmark surfaces a structural split in the current STT market for voice agents. There is no single model that leads on both TTFT and WER across the models tested.

For production voice agents handling structured data: Gradium STT (2.4% WER) and ElevenLabs Scribe v2 (3.1% WER) are the two most accurate models in this benchmark. Scribe v2's 2,080 ms median TTFT makes it unsuitable for real-time applications. Gradium STT at 1,560 ms median TTFT is slower than the Deepgram models but meaningfully faster than Scribe v2, and pairs with TTS and semantic VAD in the same platform.

For simple command-and-control voice applications: Deepgram Nova 3 (fastest, 992 ms median TTFT) is the right choice if the use case tolerates 25%+ WER. Nova 3 and Nova 2 are statistically similar on both metrics in this dataset.

For a balance between speed and accuracy: AssemblyAI Universal Streaming (1,061 ms median TTFT, 4.2% WER) is 69 ms slower than Nova 3 at median and records roughly one-sixth of its WER (4.2% vs 25.3%). It comes closest to combining low latency with low WER in this benchmark.

On consistency: ElevenLabs Scribe v2 has the most predictable latency (Std Dev 60 ms) but the slowest absolute TTFT. Gradium STT has the smallest WER standard deviation (5.1), meaning its transcription accuracy is the most consistent across varied inputs. Deepgram's WER Std Dev (29 to 30) indicates that its accuracy fluctuates significantly between runs.

Where Does Gradium STT Sit in a Full Voice Pipeline?

Gradium's STT is part of the same platform as its TTS and voice cloning APIs. For developers building a complete voice agent pipeline (STT + LLM + TTS), running TTS and STT on the same platform simplifies integration: one API key, one billing relationship, one WebSocket architecture for both directions. See the flagship voice-agent TTS guide for the matching TTS coverage.

Gradium STT includes semantic VAD, which is not measured in this TTFT/WER benchmark but is relevant for voice agent quality. Semantic VAD determines when a user has finished a complete thought rather than just stopped making sound, enabling natural turn-taking without silence-threshold cut-offs. See Turn-Taking in Voice Agents for a deeper treatment of the turn-taking problem.

Gradium STT pricing:

S plan: $43/month, 83 hours included (~$0.518/hr)
M plan: $340/month, 833 hours included (~$0.408/hr)
L plan: $1,615/month, 4,167 hours included (~$0.388/hr)

For full pricing details, see gradium.ai/pricing. STT and TTS share the same monthly plan credits.
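The per-hour figures in the plan list are the monthly price divided by the included hours. A quick check, assuming the full allowance is spent on STT:

```python
# plan name: (monthly price in USD, included STT hours), from the list above
PLANS = {
    "S": (43, 83),
    "M": (340, 833),
    "L": (1615, 4167),
}


def cost_per_hour(monthly_usd, included_hours):
    """Effective hourly STT cost when the whole allowance is used."""
    return monthly_usd / included_hours


hourly = {plan: round(cost_per_hour(*terms), 3) for plan, terms in PLANS.items()}
for plan, usd in hourly.items():
    print(f"{plan} plan: ${usd:.3f}/hr")
```

Because STT and TTS share the same credits, the effective STT rate stays the same when some credits go to TTS, but fewer STT hours are available.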

How Should You Frame the STT Choice in 2026?

The Coval STT benchmark on five models across 2,400 runs each reveals a clear split in the 2026 STT market for voice agents: the fastest models on TTFT (Deepgram Nova 3 and Nova 2, median 992 ms) record the highest WER (25.2 to 25.3%), while the most accurate model (Gradium STT, 2.4% WER) is slower on TTFT (median 1,560 ms).

For voice agents handling structured data, the WER gap between Deepgram (25%+) and Gradium (2.4%) represents a production failure rate difference that latency savings cannot compensate for. For simple voice applications where input is predictable and speed is the primary constraint, Deepgram Nova 3's 992 ms median TTFT and consistent distribution make it the right choice.

AssemblyAI Universal Streaming (1,061 ms median TTFT, 4.2% WER) occupies the middle of the latency-accuracy curve and represents a viable balance point for use cases where neither extreme is optimal. ElevenLabs Scribe v2 (3.1% WER, 2,080 ms TTFT) is the second-most accurate model but too slow for real-time conversational use.

For developers building a full voice pipeline, Gradium combines the lowest STT WER in this benchmark (2.4%), semantic VAD for natural turn-taking, and a TTS that records 155 ms TTFA P50 in the Coval TTS benchmark, all behind a single API. That gives both directions of the voice stack Coval-benchmarked production figures.