Semantic VAD for Voice Agents: Turn Detection 2026
In natural conversation, the gap between one person finishing a sentence and the next starting to respond averages around 200 milliseconds [1]. A voice agent has to approximate that timing while making one decision over and over, roughly every 80 milliseconds: has the user actually stopped talking, or have they just paused?
Get that decision wrong and the failure is immediate. The user says "I'd like to cancel my flight from Boston to..." and pauses for a second to check a date on their phone. The agent jumps in: "Got it, cancelling your flight from Boston. Where to?" The user now has to interrupt the agent's interruption just to finish the sentence they were already saying. Semantic VAD is the approach to that decision that uses what the user is saying, not just whether there is sound in the audio stream. This article covers what semantic VAD actually does, how it differs from the acoustic VAD most systems still rely on, and how to configure it for a real voice agent.
Why voice agents interrupt users mid-sentence
The failure mode in practice
Every voice agent built on acoustic-only turn detection shares the same weak point. A 400 millisecond pause in the middle of a sentence, "I'd like to book a flight to... uh... Lisbon", is acoustically identical to a 400 millisecond pause at the end of a sentence. A system that only measures whether sound is present cannot tell the two apart. It fires after a fixed silence threshold regardless of what was actually said, which means it will sometimes interrupt a user who is still mid-thought and sometimes wait through dead air after a user has clearly finished.
Why a silence timer cannot fix this
The tempting fix is to raise the silence threshold, say from 400 milliseconds to 800 milliseconds. This reduces premature interruptions, but at a direct cost: every completed turn now carries an extra 400 milliseconds of dead air before the agent responds. Natural human turn-taking has a gap of around 200 milliseconds, so a threshold of 800 milliseconds or more feels noticeably slow even though it avoids cutting people off. No single silence threshold solves both problems at once, because silence duration alone does not contain the information needed to distinguish a hesitation from a completed thought.
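The trade-off can be made concrete with a toy fixed-silence endpointer. This is a minimal sketch, not any particular product's implementation: the 20 millisecond frame size, the `endpoint_ms` helper, and the synthetic utterance are all assumptions chosen for illustration.

```python
FRAME_MS = 20  # assumed frame size for this illustration


def endpoint_ms(speech_flags, threshold_ms, frame_ms=FRAME_MS):
    """Return the time (ms) at which a fixed silence timer declares end-of-turn.

    speech_flags is a per-frame list of booleans (True = speech detected).
    Returns None if the timer never fires.
    """
    silence_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the timer
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= threshold_ms:
                return (i + 1) * frame_ms
    return None


def frames(duration_ms, value):
    """Build duration_ms worth of identical frame labels."""
    return [value] * (duration_ms // FRAME_MS)


# 1 s of speech, a 500 ms hesitation, 600 ms more speech, then real silence.
utterance = (
    frames(1000, True) + frames(500, False) + frames(600, True) + frames(1000, False)
)

early = endpoint_ms(utterance, 400)  # fires inside the hesitation
late = endpoint_ms(utterance, 800)   # survives the hesitation, then waits 800 ms
print(early, late)  # 1400 2900
```

With a 400 millisecond threshold the timer fires at 1,400 milliseconds, while the speaker is still mid-sentence. With 800 milliseconds it fires at 2,900 milliseconds, a full 800 milliseconds after the speech actually ended at 2,100 milliseconds.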
Acoustic VAD vs semantic VAD
What acoustic VAD actually answers
Acoustic VAD classifies short audio frames as speech or non-speech based on signal properties: energy, spectral shape, harmonicity. It answers one specific question: is there a voice in this roughly 20 millisecond window? Nothing more. Widely used implementations like WebRTC VAD and Silero VAD work this way, and they do their intended job well: gating a microphone, rejecting background noise, deciding when to start transcribing audio at all.
The problem appears specifically when acoustic VAD gets repurposed as a turn detector, because "is there a voice right now" is a different question from "is the user done talking." Acoustic VAD has no mechanism to tell these apart, because both situations look identical at the signal level.
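To show how little information a frame-level decision carries, here is a deliberately crude energy gate. It is a toy stand-in, not how WebRTC VAD or Silero VAD work internally; the 16 kHz sample rate, the RMS threshold of 0.02, and the synthetic tone used as "voice" are assumptions made for the example.

```python
import math

SAMPLE_RATE = 16_000
FRAME_MS = 20
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # 320 samples per 20 ms frame


def rms(frame):
    """Root-mean-square level of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def classify_frames(samples, threshold=0.02):
    """Label each complete 20 ms frame True (voice-like energy) or False."""
    labels = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        labels.append(rms(samples[start:start + FRAME_LEN]) >= threshold)
    return labels


def tone(duration_ms, amplitude, freq_hz=200.0):
    """Synthetic sine wave used here as a stand-in for voiced audio."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
        for i in range(n)
    ]


# 100 ms of "voice", 60 ms of near-silence, 40 ms of "voice".
signal = tone(100, 0.3) + tone(60, 0.001) + tone(40, 0.3)
labels = classify_frames(signal)
print(labels)
```

The output is a row of per-frame booleans with a three-frame gap in the middle. Nothing in those labels says whether the gap was a hesitation or the end of a thought, which is exactly the information a turn detector needs.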
What semantic VAD adds
Semantic VAD adds language context to the decision. Instead of classifying whether audio contains speech, it predicts whether the speaker's utterance is meaningfully complete, using the lexical content, syntactic structure, intonation, and filler words actually present in what was said, not just the shape of the audio envelope. The two approaches are complementary rather than competing: acoustic VAD is the right tool for opening a microphone and rejecting noise, and semantic VAD is the right tool for deciding when a turn has actually ended.
A note on terminology: endpointing, EOT, EOU
This capability goes by several names depending on who is describing it, which matters for anyone searching for it. Endpointing, sometimes phrase endpointing, is the classic speech-recognition term for detecting where a turn ends, and historically it was just a silence timer layered on top of acoustic VAD. Semantic endpointing and semantic VAD are the two labels most commonly used today for making that same end-of-turn decision from linguistic content rather than silence duration alone.
OpenAI's Realtime API exposes this as a semantic_vad turn-detection mode. AssemblyAI calls the equivalent capability semantic endpointing. Other providers frame it as end-of-turn (EOT) detection or end-of-utterance (EOU) detection, or simply turn detection. All of these terms are answering the same underlying question: has the user actually finished their turn? This article uses semantic VAD throughout, but if you have encountered any of the other terms, they refer to the same problem and largely the same class of solution.
The three failure modes semantic VAD solves
Early interruption on hesitation
Users pause to think, particularly when answering open-ended questions or recalling specific details like account numbers, addresses, or names. Acoustic-only turn detection commits to "they're done" the instant silence crosses its threshold, with no awareness of what was actually being said. The user gets cut off mid-thought and the agent responds to an incomplete request. The recovery exchange that follows, in which the agent asks the user to repeat themselves, burns several seconds and visibly erodes trust in the system.
Sluggish responses on completed turns
The conservative fix, raising the silence threshold to something like 800 milliseconds, handles hesitation better, but it makes every completed turn carry an extra half-second of dead air before the agent responds. The natural conversational gap of around 200 milliseconds balloons toward a full second, and the agent starts to feel slow even when it never interrupts anyone.
Telephony and noisy channels
Acoustic VAD is fragile against background speech, low-bitrate codecs, and crosstalk, all common conditions on a real phone call. Semantic signal degrades more gracefully under these conditions because it is anchored in the words actually being transcribed, rather than in the raw shape of an audio signal that telephony compression and background noise can distort.
Real example: a German-language support agent
A Gradium customer running a German-language customer support voice agent ran into a sharper version of this problem. Callers reading their account number aloud, in German, would get cut off mid-sequence. The short pauses that naturally occur between spoken digits were enough to cross the system's silence threshold, so the agent finalized the transcript partway through the number and handed an incomplete value to its reasoning layer.
Semantic VAD addresses this by recognizing the trailing context of what was said. After a phrase that introduces a number, the model has learned that a partial digit sequence is not yet a complete utterance, so it waits rather than committing to end-of-turn. The underlying pattern is general: when someone is reading or saying a sequence of numbers, the natural pace and rhythm of speech changes, and a turn detector that only watches for silence has no way to account for that shift.
How Gradium's semantic VAD works
Multi-horizon inactivity prediction
Gradium STT emits turn-completion predictions directly from the audio model itself, every 80 milliseconds, as part of the same WebSocket stream that delivers transcripts. Each step message includes a vad array containing inactivity probabilities at multiple future horizons, rather than a single binary yes-or-no decision. In practice, this means the model produces a short forecast: how likely is the user to remain inactive over the next 0.5, 1, 2, or 3 seconds. An agent can then read whichever horizon best matches the reactivity it needs for its specific use case.
This design has a direct practical benefit: there is no second model running in the turn-detection loop. Turn detection latency is effectively 0 milliseconds above transcription latency, because the same forward pass that produces the transcript also produces the VAD forecast.
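Reading one horizon out of a step message might look like the sketch below. The message shape is an assumption: the text above only says that a step message carries a vad array, so the `"type"` field, the `"inactivity_prob"` key, and the ordering of horizons from shortest to longest are illustrative guesses that should be checked against the actual API reference.

```python
import json

# Assumed horizons in seconds, in the order the forecast windows are listed above.
HORIZONS_S = (0.5, 1.0, 2.0, 3.0)


def inactivity_at(step_message, horizon_index):
    """Return the inactivity probability at one horizon of a step message.

    Accepts vad entries that are either bare floats or dicts carrying an
    "inactivity_prob" key; both shapes are hypothetical.
    """
    entry = step_message["vad"][horizon_index]
    if isinstance(entry, dict):
        return float(entry["inactivity_prob"])
    return float(entry)


# A made-up step message, as it might arrive over the WebSocket.
raw = json.dumps({"type": "step", "vad": [0.05, 0.12, 0.41, 0.77]})

message = json.loads(raw)
if message.get("type") == "step":
    p = inactivity_at(message, 2)
    print(f"P(inactive for the next {HORIZONS_S[2]} s) = {p:.2f}")
```

Because the forecast arrives in the same message stream as the transcript, the agent needs no extra request or model call to obtain it.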
The delay_in_frames parameter
The model operates on two different time scales, and most configuration decisions come down to reconciling them. The VAD signal runs in real time: every 80 milliseconds, it reports how likely the user is to stay silent over the next few seconds based on everything heard so far. The transcript itself runs on a slower clock, because the model only emits text once it has accumulated a certain amount of audio context, controlled by the delay_in_frames parameter, so the transcript always trails the live audio by that delay.
Supported values for delay_in_frames are 7, 8, 10, 12, 14, 16, 20, 24, 32, 36, and 48 frames, where each frame represents 80 milliseconds of audio. A lower value makes the system more reactive at the cost of less context for the transcription model to work with. A higher value gives the model more context, which generally improves transcription accuracy, at the cost of added latency before text is emitted. Gradium's default is 10 frames (800 milliseconds), though 16 frames (1,280 milliseconds) is also a reasonable default for many conversational agents.
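The frame arithmetic is simple enough to encode directly. The helper names below (`delay_ms`, `largest_delay_within`) are hypothetical and exist only for this sketch; the supported values and the 80 millisecond frame size come from the paragraph above.

```python
FRAME_MS = 80  # each frame represents 80 ms of audio
SUPPORTED_DELAYS = (7, 8, 10, 12, 14, 16, 20, 24, 32, 36, 48)


def delay_ms(delay_in_frames):
    """Convert a supported delay_in_frames value to transcript lag in ms."""
    if delay_in_frames not in SUPPORTED_DELAYS:
        raise ValueError(
            f"delay_in_frames must be one of {SUPPORTED_DELAYS}, got {delay_in_frames}"
        )
    return delay_in_frames * FRAME_MS


def largest_delay_within(budget_ms):
    """Largest supported delay whose lag fits in budget_ms, or None if none fits."""
    fitting = [d for d in SUPPORTED_DELAYS if d * FRAME_MS <= budget_ms]
    return max(fitting) if fitting else None


print(delay_ms(10))                # 800
print(delay_ms(16))                # 1280
print(largest_delay_within(1000))  # 12, i.e. 960 ms
```

Picking the largest delay that fits a latency budget is one reasonable heuristic, since more context generally helps transcription accuracy; it is a design choice for this example rather than a documented recommendation.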
The flushing mechanism
The flushing mechanism exists specifically to reconcile the gap between the VAD's real-time signal and the transcript's delayed signal at the exact moment a turn ends. When the VAD indicates the turn is over, the transcript has not yet caught up: it still trails the audio by the full delay_in_frames window, so the model has not yet emitted the final words it actually heard.
Acting immediately on the VAD signal alone hands the next stage of the pipeline a truncated transcript. Waiting for the delay window to elapse naturally adds that same amount of dead air to every single turn. The send_flush() call resolves this directly: when the agent commits to end-of-turn, it forces the server to process any buffered audio and emit pending text immediately, rather than waiting for the delay window to run out on its own. This is what allows a higher delay_in_frames value, which improves transcription stability, to coexist with low turn-taking latency, and it is the reason an explicit flush paired with the end-of-turn decision is the recommended pattern rather than simply waiting for the transcript to catch up.
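The effect of a flush can be illustrated with a toy model of a delayed transcript. `ToyDelayedTranscriber` is a hypothetical stand-in written for this example, not Gradium's server logic: it assumes one word per frame and releases each word only after delay_in_frames further frames have arrived, unless `flush()` is called.

```python
from collections import deque


class ToyDelayedTranscriber:
    """Toy model of a streaming transcript that trails the audio by a fixed delay."""

    def __init__(self, delay_in_frames):
        self.delay = delay_in_frames
        self.pending = deque()  # (frame_index, word) pairs not yet emitted
        self.frame_index = 0

    def push_frame(self, word=None):
        """Feed one 80 ms frame; return words whose delay window has elapsed."""
        if word is not None:
            self.pending.append((self.frame_index, word))
        self.frame_index += 1
        ready = []
        while self.pending and self.frame_index - self.pending[0][0] > self.delay:
            ready.append(self.pending.popleft()[1])
        return ready

    def flush(self):
        """Emit everything still buffered, mimicking the effect of send_flush()."""
        ready = [word for _, word in self.pending]
        self.pending.clear()
        return ready


stt = ToyDelayedTranscriber(delay_in_frames=7)
transcript = []
for word in ["cancel", "my", "flight", "to", "Lisbon"]:
    transcript += stt.push_frame(word)
for _ in range(5):  # five silent frames (400 ms), then the VAD says "done"
    transcript += stt.push_frame()

before_flush = list(transcript)
transcript += stt.flush()
print(before_flush)  # ['cancel', 'my', 'flight']
print(transcript)    # ['cancel', 'my', 'flight', 'to', 'Lisbon']
```

Acting at the moment of the VAD decision without flushing would pass "cancel my flight" downstream and drop the destination. Flushing retrieves the remaining words immediately instead of waiting several more frames for the delay window to drain.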
How to configure turn detection for your use case
Three settings control the behavior of semantic VAD in practice: delay_in_frames (how much context the transcript waits for), which horizon in the vad array the agent reads (how far ahead it looks), and the threshold plus debounce applied to the inactivity probability (how confident the system needs to be before committing to end-of-turn).
| Use case | delay_in_frames | Horizon read | Threshold | Consecutive high steps |
|---|---|---|---|---|
| Snappy assistant, clean audio | 10 (800 ms) | vad[2] | 0.30 | 1 |
| Default conversational agent | 10 (800 ms) | vad[2] | 0.50 | 1 or 2 |
| Conversational agent with frequent thinking pauses | 10 (800 ms) | vad[3] | 0.50 | 1 or 2 |
| Phone IVR or noisy, codec-compressed input | 16 (1.28 s) | vad[3] | 0.60 | 3 |
| Long-form dictation, slow speakers | 24 (1.92 s) | vad[3] | 0.70 | 5+ |
A lower threshold or a shorter horizon makes the agent commit to end-of-turn sooner, which is faster but produces more false barge-ins. A higher delay_in_frames value gives the transcription model more audio context before it emits text, which stabilizes the input the LLM downstream receives, at the cost of added latency before the agent can act.
For noisy inputs or telephony specifically, the recommended pattern is to require several consecutive high-confidence steps before committing to end-of-turn, rather than acting on a single reading. This reduces the chance that a brief acoustic artifact or a misread step triggers a premature commitment.
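A debounce of this kind is a small amount of state. The sketch below is one possible implementation under stated assumptions: the `EndOfTurnDetector` class is hypothetical, the vad array is treated as a plain list of floats, and the comparison uses "greater than or equal" since the text does not specify which side of the threshold counts.

```python
from dataclasses import dataclass, field


@dataclass
class EndOfTurnDetector:
    """Commit to end-of-turn after N consecutive steps above a threshold."""

    horizon_index: int
    threshold: float
    required_steps: int
    _streak: int = field(default=0, init=False, repr=False)

    def update(self, vad_probs):
        """Feed one step's inactivity probabilities; return True on commit."""
        if vad_probs[self.horizon_index] >= self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # any low reading restarts the count
        if self._streak >= self.required_steps:
            self._streak = 0  # reset so the next turn starts clean
            return True
        return False


# Settings from the phone IVR row of the table: vad[3], 0.60, 3 consecutive steps.
detector = EndOfTurnDetector(horizon_index=3, threshold=0.60, required_steps=3)

steps = [
    [0.1, 0.1, 0.2, 0.3],
    [0.5, 0.6, 0.7, 0.8],  # a single high reading...
    [0.1, 0.1, 0.2, 0.2],  # ...followed by a drop, so the streak resets
    [0.4, 0.5, 0.6, 0.7],
    [0.5, 0.6, 0.7, 0.8],
    [0.6, 0.7, 0.8, 0.9],  # third consecutive high reading: commit
]
decisions = [detector.update(step) for step in steps]
print(decisions)  # [False, False, False, False, False, True]
```

Each required step adds 80 milliseconds to the worst-case commit time, so three consecutive steps cost up to 240 milliseconds in exchange for ignoring isolated spikes.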
A minimal working session looks like this in practice: open an STT realtime session with a delay_in_frames matching your latency budget, consume the resulting step messages, and read the inactivity probability at the horizon that matches the product feel you want. The longest available horizon (the last index in the vad array) gives the most confident "the user is done" signal; earlier indices are more reactive but produce more false positives. Once the chosen threshold and debounce condition are met, the agent calls send_flush() to retrieve the complete trailing transcript before acting.
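The loop described above can be sketched end to end against a stub. Everything about `FakeSession` is an assumption made so the example runs without a network connection: the `messages()` iterator, the `"text"` and `"step"` message types, and the idea that `send_flush()` returns the trailing text directly. Only the send_flush() name comes from the text; a real client may deliver the flushed text as further stream messages instead.

```python
import asyncio


class FakeSession:
    """Stand-in for a realtime STT session; the real interface may differ."""

    def __init__(self, scripted_messages, trailing_text):
        self._messages = list(scripted_messages)
        self._trailing_text = trailing_text
        self.flushed = False

    async def messages(self):
        for message in self._messages:
            await asyncio.sleep(0)  # yield control, as a network read would
            yield message

    async def send_flush(self):
        """Return text still buffered (simplified behaviour for this sketch)."""
        self.flushed = True
        return self._trailing_text


async def run_turn(session, horizon_index=2, threshold=0.5, required_steps=2):
    """Collect text until the debounce condition is met, then flush once."""
    words, streak = [], 0
    async for message in session.messages():
        if message["type"] == "text":
            words.append(message["text"])
        elif message["type"] == "step":
            high = message["vad"][horizon_index] >= threshold
            streak = streak + 1 if high else 0
            if streak >= required_steps:
                words.append(await session.send_flush())
                break
    return " ".join(words)


session = FakeSession(
    scripted_messages=[
        {"type": "step", "vad": [0.0, 0.1, 0.1, 0.1]},
        {"type": "text", "text": "cancel my flight"},
        {"type": "step", "vad": [0.2, 0.4, 0.6, 0.7]},
        {"type": "step", "vad": [0.3, 0.5, 0.7, 0.8]},
        {"type": "step", "vad": [0.3, 0.5, 0.7, 0.8]},  # never read: turn already ended
    ],
    trailing_text="to Lisbon",
)
result = asyncio.run(run_turn(session))
print(result)  # cancel my flight to Lisbon
```

The defaults here mirror the default conversational agent row of the table (vad[2], threshold 0.50, two consecutive steps); swapping in another row only changes the three keyword arguments.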
Get started
Gradium STT delivers semantic VAD as a multi-horizon inactivity forecast in the same WebSocket stream as the transcript, with no second model in the turn-detection loop. Start building at gradium.ai, or read the deeper write-up in Semantic VAD: turn detection that uses meaning, not silence.
Glossary
Acoustic VAD. Voice Activity Detection that classifies short audio frames as speech or non-speech based on signal properties like energy and spectral shape. Answers whether a voice is present in a given window, with no awareness of linguistic content. Used for tasks like microphone gating and noise rejection. Widely used implementations include WebRTC VAD and Silero VAD.
Semantic VAD. Voice Activity Detection that predicts whether a speaker's utterance is meaningfully complete, using cues such as lexical content, syntactic structure, intonation, and fillers, rather than silence duration alone. Used specifically for turn-taking decisions in voice agents. Also referred to as semantic endpointing, end-of-turn (EOT) detection, or end-of-utterance (EOU) detection depending on the provider.
Endpointing. The classic speech-recognition term for detecting where a spoken turn ends. Historically implemented as a fixed silence timer layered on top of acoustic VAD. Modern semantic endpointing replaces the silence timer with a linguistic-content-based prediction of turn completion.
delay_in_frames. A configuration parameter in Gradium STT controlling how much audio context, measured in 80 millisecond frames, the transcription model accumulates before emitting text. Supported values are 7, 8, 10, 12, 14, 16, 20, 24, 32, 36, and 48 frames. Lower values increase reactivity; higher values increase transcription accuracy at the cost of latency. Default is 10 frames (800 milliseconds).
Inactivity probability. A forecasted likelihood, produced by Gradium's STT model every 80 milliseconds, that the user will remain silent over a specific future time horizon (such as 0.5, 1, 2, or 3 seconds). Delivered as a multi-horizon array rather than a single binary decision, allowing an agent to choose the lookahead window matching its desired responsiveness.
Flushing. A mechanism that forces a speech-to-text server to process buffered audio and emit any pending transcript text immediately, rather than waiting for the model's configured delay window to elapse naturally. Used at the moment a turn-detection decision is made, to retrieve a complete transcript without incurring the full delay window as added latency.
Barge-in. As used in this article, a voice agent interrupting a user who has not finished speaking, typically because a turn detector misclassified a mid-sentence pause as the end of a turn. (Elsewhere the term more often describes a user interrupting the agent.) A primary failure mode that semantic VAD is designed to reduce compared to acoustic-only turn detection.
References
[1] Stivers, T., Enfield, N. J., Brown, P., et al. "Universals and cultural variation in turn-taking in conversation." PNAS 106(26), 2009, 10587–10592. doi.org/10.1073/pnas.0903616106.