STT API Benchmark 2026: Latency and Accuracy for Voice Agents
Choosing a Speech-To-Text API for a production voice agent involves a tradeoff that is not visible in marketing materials: the fastest models on latency are often the least accurate, and the most accurate models are often the slowest. Understanding where each provider sits on that curve, under real production conditions, is what determines which STT model is the right fit for a specific use case.
This article covers an independent Coval benchmark comparing five Speech-To-Text APIs on Time to First Token (TTFT) and Word Error Rate (WER) for voice agent applications: Gradium STT, Deepgram Nova 3, Deepgram Nova 2, AssemblyAI Universal Streaming, and ElevenLabs Scribe v2. Each model was evaluated across 2,400 runs. The benchmark exposes a structural tradeoff in the 2026 STT market: the two fastest models on latency (Deepgram Nova 3 and Nova 2) record the highest WER by a significant margin (25.2 to 25.3%), while the most accurate model (Gradium STT, 2.4% WER) is markedly slower (median TTFT 1,560 ms vs 992 ms for Nova 3). No single provider leads on both dimensions.
Why Are STT Benchmarks for Voice Agents Different?
STT APIs for voice agents are evaluated differently from transcription APIs for batch processing. In batch transcription, the only metric that matters is accuracy: the job runs offline, latency is irrelevant, and the output is a text file reviewed by a human.
In a live voice agent, two metrics define production quality:
- Time to First Token (TTFT): how quickly does the STT model return the first transcribed tokens after the user finishes speaking? In a turn-based voice agent architecture (user speaks, STT transcribes, LLM generates, TTS synthesizes), TTFT is the first step in the response latency chain. A slow STT model adds directly to the time the user waits before hearing a response.
- Word Error Rate (WER): how accurately does the model transcribe what the user said? A misheard or otherwise incorrect transcription produces a bad LLM input, which produces a bad agent response. In voice agents handling names, numbers, addresses, or product codes, WER is a direct proxy for failure rate on structured data.
Both metrics matter. A model with low WER but 2-second TTFT will feel broken in conversation. A model with sub-second TTFT but 25% WER will misinterpret one in four words, making the agent unreliable in any context beyond simple commands.
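For readers who have not computed WER by hand, the following is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. The lowercase-and-split normalization is an assumption made for illustration; the source does not describe how Coval normalizes text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: two of eight words are misheard, so WER is 2 / 8.
reference = "my order number is four seven two nine"
hypothesis = "my order number is for seven to nine"
wer = word_error_rate(reference, hypothesis)
```

In this made-up utterance a 25% WER corrupts two of the four digits, which is why WER matters so much for structured data.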
What Is the Benchmark Setup?
The Coval benchmark measures STT performance for voice agent applications under production conditions. Key parameters:
- Models tested: Gradium STT, Deepgram Nova 3, Deepgram Nova 2, AssemblyAI Universal Streaming, ElevenLabs Scribe v2
- Runs per model: 2,400
- Primary latency metric: TTFT (Time to First Token), measured in milliseconds
- Primary accuracy metric: average WER (Word Error Rate), lower is better
- Latency ranking: relative delta from Deepgram Nova 3 (baseline, fastest model)
Note: statistical outliers above 2.3 seconds are excluded from the latency distribution charts. Because the Coval dashboard refreshes continuously, the number of excluded runs varies from snapshot to snapshot.
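The source does not describe Coval's measurement harness. As a rough sketch of how TTFT could be timed against any streaming client, the helper below measures the wall-clock time until the first non-empty token arrives. `fake_stt_stream` is a hypothetical stand-in, not a real provider API, and the clock starts when the caller begins reading the stream, which only approximates "after the user finishes speaking".

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def time_to_first_token(stream: Iterable[str]) -> Tuple[Optional[str], float]:
    """Return (first non-empty token, elapsed milliseconds) for a token stream."""
    start = time.perf_counter()
    for token in stream:
        if token.strip():
            return token, (time.perf_counter() - start) * 1000.0
    # The stream ended without producing any text.
    return None, (time.perf_counter() - start) * 1000.0


def fake_stt_stream(delay_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming STT client (assumption)."""
    time.sleep(delay_s)  # simulated network and model latency
    yield "hello"
    yield "world"


first_token, ttft_ms = time_to_first_token(fake_stt_stream())
```

Because the generator body only runs once iteration starts, the simulated delay falls inside the timed window.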
What Are the Latency Results?
Performance Rankings (Latency Delta from Nova 3 Baseline)
The latency table uses Deepgram Nova 3 as the baseline (position 1, fastest). All other models are expressed as deltas relative to Nova 3.
| Rank | Model | Provider | P25 Delta | P50 Delta | P75 Delta |
|---|---|---|---|---|---|
| 1 | Nova 3 | Deepgram | baseline | baseline | baseline |
| 2 | Nova 2 | Deepgram | +19 ms | +22 ms | +25 ms |
| 3 | Universal Streaming | AssemblyAI | +52 ms | +57 ms | +67 ms |
| 4 | Gradium STT | Gradium | +593 ms | +596 ms | +601 ms |
| 5 | Scribe v2 | ElevenLabs | +1,040 ms | +1,053 ms | +1,064 ms |
Source: Coval STT benchmark. n=2,400 runs per model. Latency deltas relative to Deepgram Nova 3 (fastest baseline).
Absolute TTFT Distribution (Latency Variation)
| Model | Provider | Mean TTFT | Median TTFT | Std Dev | Runs |
|---|---|---|---|---|---|
| Nova 3 | Deepgram | 1,020 ms | 992 ms | 125 ms | 2,400 |
| Nova 2 | Deepgram | 1,042 ms | 992 ms | 210 ms | 2,400 |
| Universal Streaming | AssemblyAI | 1,079 ms | 1,061 ms | 237 ms | 2,400 |
| Gradium STT | Gradium | 1,617 ms | 1,560 ms | 409 ms | 2,400 |
| Scribe v2 | ElevenLabs | 2,073 ms | 2,080 ms | 60 ms | 2,400 |
Source: Coval STT benchmark. Statistical outliers above 2.3 s excluded from distribution.
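The columns in the table above can be reproduced from raw per-run samples with a few lines of standard-library Python. This is a sketch under stated assumptions: the sample values are made up, and the source does not say whether Coval reports a sample or population standard deviation (the sketch uses the sample form).

```python
from statistics import mean, median, stdev

OUTLIER_CUTOFF_MS = 2300  # the 2.3 s cutoff mentioned in the benchmark note


def summarize_ttft(samples_ms, cutoff_ms=OUTLIER_CUTOFF_MS):
    """Drop samples above the cutoff, then report mean, median, std dev and counts."""
    kept = [s for s in samples_ms if s <= cutoff_ms]
    if len(kept) < 2:
        raise ValueError("need at least two samples after outlier removal")
    return {
        "mean_ms": mean(kept),
        "median_ms": median(kept),
        "std_dev_ms": stdev(kept),  # sample standard deviation (assumption)
        "kept": len(kept),
        "excluded": len(samples_ms) - len(kept),
    }


# Made-up samples for illustration only; these are not benchmark measurements.
samples = [980, 1010, 995, 1120, 3400, 1005]
summary = summarize_ttft(samples)
```

Reporting the excluded count alongside the statistics makes it easier to compare snapshots taken at different times.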
How Should You Read the Latency Data?
Deepgram Nova 3 and Nova 2 share the same median TTFT (992 ms) but differ in mean and spread: Nova 3 has a mean of 1,020 ms and a standard deviation of 125 ms, against 1,042 ms and 210 ms for Nova 2. Nova 3 is therefore slightly faster on average and more consistent in its latency distribution.
AssemblyAI Universal Streaming is 57 ms slower at P50 than Nova 3, with a 237 ms standard deviation, the highest of the three faster models. The tail latency is wider than either Deepgram model.
Gradium STT records a median TTFT of 1,560 ms and sits 596 ms behind Nova 3 on the P50 delta. Its standard deviation of 409 ms is the highest in the benchmark, indicating meaningful latency variation across runs. The Delta IQR (interquartile range of the latency delta) is only 8 ms, however, which reflects the core of the distribution rather than the tails.
ElevenLabs Scribe v2 is the slowest model on TTFT (median 2,080 ms, mean 2,073 ms) but records the lowest standard deviation of all five models (60 ms). Its latency is slow but highly predictable: the distribution is narrow and compact.
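The delta table and the absolute table do not subtract cleanly: Gradium's P50 delta is +596 ms, while its median TTFT is 568 ms above Nova 3's. The source does not explain how the deltas are computed. One possible reading, offered here purely as an assumption, is that deltas are taken per paired run and then summarized, in which case the median of the deltas need not equal the difference of the medians. The sketch below shows that effect on made-up numbers.

```python
from statistics import median


def median_of_paired_deltas(baseline_ms, candidate_ms):
    """Median of per-run differences (candidate minus baseline) for paired runs."""
    if len(baseline_ms) != len(candidate_ms):
        raise ValueError("paired runs must have the same length")
    return median(c - b for b, c in zip(baseline_ms, candidate_ms))


def difference_of_medians(baseline_ms, candidate_ms):
    """Candidate median minus baseline median, ignoring run pairing."""
    return median(candidate_ms) - median(baseline_ms)


# Made-up paired samples, chosen only to show the two statistics can disagree.
baseline = [900, 1000, 1100]
candidate = [1700, 1500, 1650]

paired = median_of_paired_deltas(baseline, candidate)   # deltas: 800, 500, 550
unpaired = difference_of_medians(baseline, candidate)   # medians: 1650 and 1000
```

Whichever definition is used, it is worth stating it next to the number, since the two can differ by tens of milliseconds.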
What Are the Accuracy Results?
Accuracy by Model
| Model | Provider | Avg WER | WER Std Dev |
|---|---|---|---|
| Gradium STT | Gradium | 2.4% | 5.1 |
| Scribe v2 | ElevenLabs | 3.1% | 5.7 |
| Universal Streaming | AssemblyAI | 4.2% | 6.1 |
| Nova 2 | Deepgram | 25.2% | 29.1 |
| Nova 3 | Deepgram | 25.3% | 30.1 |
Source: Coval STT benchmark. n=2,400 runs per model.
How Should You Read the Accuracy Data?
The WER ranking produces a different ordering than the latency ranking. Gradium STT records the lowest WER (2.4%) and ElevenLabs Scribe v2 the second lowest (3.1%). AssemblyAI Universal Streaming follows at 4.2%. The top three on accuracy are the bottom three on latency.
The most striking finding is the Deepgram WER: Nova 3 records 25.3% and Nova 2 records 25.2% in the same benchmark. Despite being the two fastest models, both produce roughly one word error for every four words transcribed. This tradeoff is invisible when evaluating TTFT alone.
The WER standard deviation also reveals consistency differences. Gradium records 5.1 WER standard deviation, the lowest of all five models. ElevenLabs Scribe v2 records 5.7. The Deepgram models record 29.1 and 30.1, indicating that their WER is not only high on average but highly variable across runs.
How Do All Five Models Compare in One Heatmap?
| Model | P25 Delta (ms) | P50 Delta (ms) | P75 Delta (ms) | Delta IQR (ms) | Avg WER | WER Std Dev |
|---|---|---|---|---|---|---|
| Gradium STT | 593 | 596 | 601 | 8 | 2.4% | 5.1 |
| Deepgram Nova 2 | 19 | 22 | 25 | 6 | 25.2% | 29.1 |
| Deepgram Nova 3 | 0 | 0 | 0 | 0 | 25.3% | 30.1 |
| ElevenLabs Scribe v2 | 1,040 | 1,053 | 1,064 | 25 | 3.1% | 5.7 |
| AssemblyAI Universal Streaming | 52 | 57 | 67 | 15 | 4.2% | 6.1 |
Source: Coval STT benchmark. Latency deltas relative to Deepgram Nova 3 (baseline). n=2,400 runs per model.
The heatmap makes the fundamental tradeoff visible in one view. No single model leads on both latency and accuracy. The fastest models (Deepgram Nova 3 and Nova 2) record the worst WER. The most accurate model (Gradium STT, 2.4% WER) has the second-highest latency delta (+596 ms P50 vs baseline), behind only ElevenLabs Scribe v2 (+1,053 ms). The middle-ground model (AssemblyAI Universal Streaming at 4.2% WER and +57 ms P50) represents a different balance point.
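One way to make "no single model leads on both" precise is a Pareto check: a model is dominated if another model is at least as good on both metrics and strictly better on one. The sketch below applies that test to the P50 delta and average WER columns from the table. It is an illustrative analysis, not part of the Coval benchmark, and it ignores the consistency metrics.

```python
# (P50 latency delta in ms, average WER in %) copied from the comparison table
MODELS = {
    "Gradium STT": (596, 2.4),
    "Deepgram Nova 2": (22, 25.2),
    "Deepgram Nova 3": (0, 25.3),
    "ElevenLabs Scribe v2": (1053, 3.1),
    "AssemblyAI Universal Streaming": (57, 4.2),
}


def dominates(a, b):
    """True if a is no worse than b on both metrics and strictly better on one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_frontier(models):
    """Return names of models that no other model dominates (lower is better)."""
    return [
        name
        for name, metrics in models.items()
        if not any(dominates(other, metrics) for other in models.values())
    ]


frontier = pareto_frontier(MODELS)
```

On these two columns alone, four of the five models sit on the frontier. Scribe v2 is the only dominated model, because Gradium STT is both faster and more accurate, although Scribe v2 still leads on latency consistency, which this check does not consider.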
How Does the Latency vs Accuracy Tradeoff Play Out in Practice?
For a voice agent where the user is waiting for a spoken response, the total perceived latency is the sum of: STT processing time, LLM response generation, and TTS time to first audio. The STT step is only one component of the full pipeline.
If the STT model contributes 992 ms (Nova 3) vs 1,560 ms (Gradium STT), the difference is 568 ms. Whether that difference is decisive depends on what the other pipeline components contribute. In a pipeline where TTS adds 155 ms (Coval TTS benchmark, Gradium TTS) and LLM adds 300 to 500 ms, the total latency gap from choosing Gradium STT over Deepgram Nova 3 represents a meaningful but not disqualifying share of total response time.
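The arithmetic behind that paragraph can be checked with a short sketch. It assumes a strictly sequential pipeline (STT, then LLM, then TTS first audio) with no overlap or streaming between stages, which is a simplification; the component figures are the ones quoted in the text.

```python
STT_MEDIAN_MS = {"Deepgram Nova 3": 992, "Gradium STT": 1560}  # from the TTFT table
TTS_MS = 155               # TTS figure quoted in the text
LLM_RANGE_MS = (300, 500)  # LLM range quoted in the text


def pipeline_total_ms(stt_ms, llm_ms, tts_ms=TTS_MS):
    """Sequential turn-based pipeline: STT, then LLM, then TTS first audio."""
    return stt_ms + llm_ms + tts_ms


gap_ms = STT_MEDIAN_MS["Gradium STT"] - STT_MEDIAN_MS["Deepgram Nova 3"]

for llm_ms in LLM_RANGE_MS:
    nova_total = pipeline_total_ms(STT_MEDIAN_MS["Deepgram Nova 3"], llm_ms)
    gradium_total = pipeline_total_ms(STT_MEDIAN_MS["Gradium STT"], llm_ms)
    share = gap_ms / gradium_total
    print(f"LLM {llm_ms} ms: Nova 3 {nova_total} ms, Gradium {gradium_total} ms, "
          f"gap is {share:.0%} of the Gradium total")
```

Under these assumptions the totals are 1,447 to 1,647 ms with Nova 3 and 2,015 to 2,215 ms with Gradium STT, so the 568 ms gap is roughly a quarter of the slower pipeline's response time.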
If the STT model contributes 25.3% WER (Nova 3) vs 2.4% WER (Gradium STT), the difference is 22.9 percentage points. In a voice agent handling real-world inputs, a 25% WER means roughly one word in four reaches the LLM wrong, so most multi-word utterances are likely to contain at least one error. For a customer support agent handling order numbers, appointment times, or billing questions, this produces incorrect responses that no amount of latency optimization can fix.
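To get a feel for what a per-word error rate means at the utterance level, the sketch below estimates the chance that a short utterance contains at least one error. It assumes each word fails independently with probability equal to the WER. That is a loud simplification: WER also counts insertions, and the high WER standard deviation reported for the Deepgram models suggests errors cluster in some runs rather than spreading evenly.

```python
def p_utterance_has_error(wer_percent: float, n_words: int) -> float:
    """Probability that at least one of n_words is wrong, assuming each word
    fails independently with probability equal to the WER (a simplification)."""
    p_word_ok = 1.0 - wer_percent / 100.0
    return 1.0 - p_word_ok ** n_words


# Average WER values from the accuracy table; the 6-word length is hypothetical.
for name, wer in [("Deepgram Nova 3", 25.3), ("Gradium STT", 2.4)]:
    p = p_utterance_has_error(wer, n_words=6)
    print(f"{name}: {p:.0%} chance a 6-word utterance contains an error")
```

Under this rough model a six-word utterance contains an error about 83% of the time at 25.3% WER, against about 14% at 2.4% WER.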
The decision framework:
- Prioritize TTFT when the voice interaction is command-and-control (short, simple inputs), total pipeline latency is the primary user experience constraint, and input content is predictable and high-confidence.
- Prioritize WER when the voice agent handles names, numbers, addresses, dates, product codes, or any structured data where transcription accuracy directly determines response quality. For these use cases, a 25% WER is not an acceptable tradeoff for saving 568 ms.
- Middle ground: AssemblyAI Universal Streaming (4.2% WER, +57 ms P50) sits closest to the low-latency, low-WER corner of the latency-accuracy tradeoff: nearly as fast as the Deepgram models and far more accurate.
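The framework above can be expressed as a simple rule: set the highest WER the use case can tolerate, then take the fastest model that meets it. The sketch below does this with the table values. The WER ceilings (30%, 5%, 3%) are hypothetical thresholds chosen for illustration, not recommendations from the benchmark.

```python
from typing import NamedTuple, Optional


class SttResult(NamedTuple):
    name: str
    p50_delta_ms: int  # P50 latency delta vs Nova 3, from the comparison table
    avg_wer: float     # average WER in percent


RESULTS = [
    SttResult("Deepgram Nova 3", 0, 25.3),
    SttResult("Deepgram Nova 2", 22, 25.2),
    SttResult("AssemblyAI Universal Streaming", 57, 4.2),
    SttResult("Gradium STT", 596, 2.4),
    SttResult("ElevenLabs Scribe v2", 1053, 3.1),
]


def fastest_within_wer(results, max_wer: float) -> Optional[SttResult]:
    """Pick the lowest-latency model whose average WER is at or below max_wer."""
    eligible = [r for r in results if r.avg_wer <= max_wer]
    return min(eligible, key=lambda r: r.p50_delta_ms) if eligible else None


# Hypothetical WER ceilings for three kinds of voice agent.
command_and_control = fastest_within_wer(RESULTS, max_wer=30.0)
balanced = fastest_within_wer(RESULTS, max_wer=5.0)
structured_data = fastest_within_wer(RESULTS, max_wer=3.0)
```

With these ceilings the rule returns Nova 3, Universal Streaming, and Gradium STT respectively, matching the three bullets. Treating WER as the hard constraint and latency as the objective reflects the point that latency tuning cannot repair a wrong transcript.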
What Is Each Model's Profile?
Gradium STT
- TTFT: median 1,560 ms, Std Dev 409 ms, Delta IQR 8 ms
- WER: 2.4% avg, Std Dev 5.1 (lowest of all models on both metrics)
- Best for: voice agents where transcription accuracy on structured inputs (numbers, names, dates, addresses) is a hard requirement and total pipeline latency is manageable
- Gradium STT also includes semantic voice activity detection (VAD) for natural turn-taking in voice agents, not measured in this benchmark
Deepgram Nova 3
- TTFT: median 992 ms, Std Dev 125 ms, Delta IQR 0 (benchmark baseline)
- WER: 25.3% avg, Std Dev 30.1 (highest of all models)
- Best for: voice applications where speed is the primary constraint and input content is simple enough that high WER is acceptable (basic commands, keyword detection, navigation)
Deepgram Nova 2
- TTFT: median 992 ms, Std Dev 210 ms, Delta IQR 6 ms
- WER: 25.2% avg, Std Dev 29.1
- Very similar profile to Nova 3 on both metrics: marginally lower WER (25.2% vs 25.3%) and a noticeably wider latency spread (Std Dev 210 ms vs 125 ms)
AssemblyAI Universal Streaming
- TTFT: median 1,061 ms, Std Dev 237 ms, Delta IQR 15 ms
- WER: 4.2% avg, Std Dev 6.1
- Best for: applications requiring a balance between acceptable latency and meaningfully better accuracy than Deepgram, where the 4.2% WER is sufficient for the use case
ElevenLabs Scribe v2
- TTFT: median 2,080 ms, Std Dev 60 ms (most consistent latency of all models), Delta IQR 25 ms
- WER: 3.1% avg, Std Dev 5.7
- Second-most accurate model in the benchmark. Slowest on TTFT by a significant margin (median 1,088 ms behind Nova 3; P50 delta +1,053 ms). The extremely low Std Dev (60 ms) makes its latency highly predictable, but 2,080 ms median TTFT makes it unsuitable for real-time conversational voice agents
What Does This Mean for STT API Selection in 2026?
This Coval benchmark surfaces a structural split in the current STT market for voice agents. There is no single model that leads on both TTFT and WER across the models tested.
For production voice agents handling structured data: Gradium STT (2.4% WER) and ElevenLabs Scribe v2 (3.1% WER) are the two most accurate models in this benchmark. Scribe v2's 2,080 ms median TTFT makes it unsuitable for real-time applications. Gradium STT at 1,560 ms median TTFT is slower than the Deepgram models but meaningfully faster than Scribe v2, and pairs with TTS and semantic VAD in the same platform.
For simple command-and-control voice applications: Deepgram Nova 3 (fastest, 992 ms median TTFT) is the right choice if the use case tolerates 25%+ WER. Nova 3 and Nova 2 are nearly indistinguishable on both metrics in this dataset.
For a balance between speed and accuracy: AssemblyAI Universal Streaming (1,061 ms median TTFT, 4.2% WER) is 69 ms slower than Nova 3 at the median and records roughly one sixth of its WER (4.2% vs 25.3%). It is the closest model to the low-latency, low-WER corner of the latency-accuracy tradeoff in this benchmark.
On consistency: ElevenLabs Scribe v2 has the most predictable latency (Std Dev 60 ms) but the slowest absolute TTFT. Gradium STT has the smallest WER standard deviation (5.1), meaning its transcription accuracy is the most consistent across varied inputs. Deepgram's WER Std Dev (29 to 30) indicates that its accuracy fluctuates significantly between runs.
Where Does Gradium STT Sit in a Full Voice Pipeline?
Gradium's STT is part of the same platform as its TTS and voice cloning APIs. For developers building a complete voice agent pipeline (STT + LLM + TTS), running TTS and STT on the same platform simplifies integration: one API key, one billing relationship, one WebSocket architecture for both directions. See the flagship voice-agent TTS guide for the matching TTS coverage.
Gradium STT includes semantic VAD, which is not measured in this TTFT/WER benchmark but is relevant for voice agent quality. Semantic VAD determines when a user has finished a complete thought rather than just stopped making sound, enabling natural turn-taking without silence-threshold cut-offs. See Turn-Taking in Voice Agents for a deeper treatment of the turn-taking problem.
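To show what a silence-threshold cut-off looks like in practice, here is a minimal sketch of a naive energy-based endpointer. It is not Gradium's implementation and says nothing about how semantic VAD works internally; the threshold, frame count, and energy values are all hypothetical.

```python
from typing import Optional, Sequence


def silence_endpoint(
    frame_energies: Sequence[float],
    energy_threshold: float = 0.1,
    min_silent_frames: int = 5,
) -> Optional[int]:
    """Return the index of the frame at which end of turn is declared, or None.

    The turn ends after min_silent_frames consecutive frames below the
    energy threshold, once some speech has been heard.
    """
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= min_silent_frames:
                return i
    return None


# Made-up energies: speech, a 5-frame mid-sentence pause, then more speech.
frames = [0.8, 0.7, 0.9] + [0.01] * 5 + [0.6, 0.7]
cut_at = silence_endpoint(frames)
```

Here the endpointer ends the turn at frame 7, during the pause, so the speech in frames 8 and 9 arrives after the agent has already started responding. This is the failure mode that the turn-taking discussion above is concerned with.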
Gradium STT pricing:
- S plan: $43/month, 83 hours included (~$0.518/hr)
- M plan: $340/month, 833 hours included (~$0.408/hr)
- L plan: $1,615/month, 4,167 hours included (~$0.388/hr)
For full pricing details, see gradium.ai/pricing. STT and TTS share the same monthly plan credits.
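The per-hour figures in the list follow directly from price divided by included hours. The sketch below reproduces them, assuming the full monthly allowance is spent on STT; because credits are shared with TTS, the real STT cost per hour depends on how a deployment splits its usage.

```python
# (monthly price in USD, included hours) as listed in the pricing bullets
PLANS = {"S": (43, 83), "M": (340, 833), "L": (1615, 4167)}


def effective_hourly_rate(price_usd: float, included_hours: float) -> float:
    """Cost per included hour, assuming the full allowance is used."""
    return price_usd / included_hours


rates = {plan: round(effective_hourly_rate(*terms), 3) for plan, terms in PLANS.items()}
```

The rounded results match the listed rates of about $0.518, $0.408, and $0.388 per hour.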
How Should You Frame the STT Choice in 2026?
The Coval STT benchmark on five models across 2,400 runs each reveals a clear split in the 2026 STT market for voice agents: the fastest models on TTFT (Deepgram Nova 3 and Nova 2, median 992 ms) record the highest WER (25.2 to 25.3%), while the most accurate model (Gradium STT, 2.4% WER) is slower on TTFT (median 1,560 ms).
For voice agents handling structured data, the WER gap between Deepgram (25%+) and Gradium (2.4%) represents a production failure rate difference that latency savings cannot compensate for. For simple voice applications where input is predictable and speed is the primary constraint, Deepgram Nova 3's 992 ms median TTFT and consistent distribution make it the right choice.
AssemblyAI Universal Streaming (1,061 ms median TTFT, 4.2% WER) occupies the middle of the latency-accuracy curve and represents a viable balance point for use cases where neither extreme is optimal. ElevenLabs Scribe v2 (3.1% WER, 2,080 ms TTFT) is the second-most accurate model but too slow for real-time conversational use.
For developers building a full voice pipeline, Gradium combines the lowest STT WER in this benchmark (2.4%), semantic VAD for natural turn-taking, and a TTS that records a 155 ms P50 TTFA (time to first audio) in the Coval TTS benchmark. Having all three behind a single API gives both directions of the voice stack a consistent set of production benchmarks.