
Semantic VAD: turn detection that uses meaning, not silence


Picture a common interaction. The user says "I'd like to cancel my flight from Boston to..." and pauses for a second to check the date on their phone. The agent jumps in: "Got it, cancelling your flight from Boston. Where to?" The user now has to interrupt the agent's interruption to finish the sentence they started.

Every voice agent built on acoustic-only turn detection has this failure mode, and it almost always traces back to one decision the system has to make every 80 milliseconds: has the user actually stopped talking?

This post covers what acoustic and semantic VAD actually do, why the distinction matters, what Gradium's STT does differently, and how to wire it into your agent loop.

Acoustic VAD vs semantic VAD

Acoustic VAD classifies short audio frames as speech or non-speech based on signal properties: energy, spectral shape, harmonicity. It answers "is there a voice in this 20 ms window?" and nothing beyond that. Classical implementations like WebRTC VAD and Silero VAD work this way, and they work well for the task they were designed for: gating audio, suppressing background noise, deciding when to start a transcription.

The failure mode shows up when acoustic VAD gets used as a turn detector. "Is there a voice right now?" is not the same question as "is the user done talking?" A 400 ms pause mid-sentence ("I'd like to book a flight to... uh... Lisbon") is acoustically identical to a 400 ms pause at the end of a turn. An energy-based detector cannot tell the two apart: set its silence timeout below the pause and it interrupts the first case; set it above and it adds that much dead air to the second.
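To make the trade-off concrete, here is a minimal sketch of a silence timer sitting on top of per-frame speech flags. The 20 ms frame size, the function names, and the toy utterance are illustrative assumptions, not any library's API.

```python
from typing import Iterable, List, Optional

FRAME_MS = 20  # assumed frame size for this toy example


def first_endpoint_ms(speech_flags: Iterable[bool], timeout_ms: int) -> Optional[int]:
    """Return the time (ms) at which a silence timer first declares end of turn.

    speech_flags holds one acoustic VAD decision per frame (True = speech).
    The timer only starts counting once some speech has been heard.
    """
    heard_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            silence_ms += FRAME_MS
            if silence_ms >= timeout_ms:
                return (i + 1) * FRAME_MS
    return None


def frames(duration_ms: int, is_speech: bool) -> List[bool]:
    """Build a run of identical per-frame flags covering duration_ms."""
    return [is_speech] * (duration_ms // FRAME_MS)


# 1.2 s of speech, a 400 ms hesitation, 0.5 s of speech, then 1 s of trailing silence.
utterance = (
    frames(1200, True) + frames(400, False) + frames(500, True) + frames(1000, False)
)

early = first_endpoint_ms(utterance, timeout_ms=300)  # fires inside the hesitation
late = first_endpoint_ms(utterance, timeout_ms=800)   # survives it, then waits 800 ms
```

With the short timeout the endpoint lands at 1500 ms, in the middle of the hesitation. With the long one it lands at 2900 ms, which is 800 ms after the speaker actually finished at 2100 ms.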

Semantic VAD adds language context. Instead of deciding whether audio contains speech, it predicts whether the speaker's utterance is meaningfully complete. The signal comes from what the user is saying, not just from the audio envelope: lexical content, syntactic completeness, intonation, fillers.

The two are complementary. Acoustic VAD is useful for opening the microphone and rejecting noise, semantic VAD for turn-taking.

A note on terminology. Endpointing (sometimes phrase endpointing) is the classic speech-recognition term for detecting where a turn ends; historically it was just a silence timer sitting on top of acoustic VAD. Semantic endpointing and semantic VAD are the two labels you will see for making that same end-of-turn decision from linguistic content rather than silence alone.

OpenAI's Realtime API exposes it as a semantic_vad turn-detection mode, AssemblyAI calls it semantic endpointing, and others frame the same capability as end-of-turn (EOT) or end-of-utterance (EOU) detection, or simply turn detection. They are all answering the one question that matters: has the user actually finished their turn? This post uses semantic VAD throughout.

Why this matters for voice agents

Three failure modes drop out of a turn detector that only sees acoustic signal:

Early interruption on hesitation. Users pause to think, especially when answering open-ended questions or recalling specifics (account numbers, addresses, names). Acoustic-only turn detection commits to "they're done" the moment silence crosses a threshold. The user gets cut off, the agent responds to half a request, and the recovery loop ("sorry, can you repeat that?") burns several seconds and erodes trust.

Sluggish responses on completed turns. The conservative fix is to raise the silence threshold (say, 800 ms). Now hesitation gets handled, but every completed turn carries an extra half-second of dead air before the agent speaks. The conversational gap balloons from 200 ms to 1000 ms and the agent feels slow.

Telephony and noisy channels. Acoustic VAD is fragile against background speech, low-bitrate codecs, and crosstalk. Semantic signal degrades more gracefully because it's anchored in the words actually being transcribed.

As an example, a Gradium customer running a German-language customer-support voice agent ran into a sharper version of this. Callers reading their account number ("Meine Kundennummer ist 1, 2, 3...") would get cut off mid-sequence: the short pauses between spoken digits crossed the silence threshold, the agent finalized the transcript at "Meine Kundennummer ist 1, 2," and the reasoning layer was handed an incomplete number.

Semantic VAD looks at the trailing tokens (a partial digit sequence after "ist") and waits, because the model has learned that a number sequence still in progress is not a complete utterance. More generally, people slow down and pause between items when reading out or reciting numbers, and semantic VAD accounts for that change in tempo instead of treating each pause as the end of a turn.

What Gradium does differently

Gradium STT emits turn-completion predictions directly from the audio model, every 80 ms, as part of the same WebSocket stream that delivers transcripts. Each step message contains a vad array with inactivity probabilities at multiple future horizons, not a single binary decision. You get a forecast: how likely is the user to remain inactive in 0.5s, 1s, 2s, 3s from now. The agent picks the horizon that matches its desired feel, depending on the use case and confidence level needed.
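As a sketch of what reading that forecast might look like, the snippet below works on a hand-written step message. Only the `type`, `vad`, and `inactivity_prob` fields appear in this post; the probability values are made up, and the assumption that the array is ordered from the shortest to the longest horizon should be checked against the STT WebSocket reference.

```python
# Horizons named in this post, assumed to match the order of the vad array.
HORIZONS_S = (0.5, 1.0, 2.0, 3.0)

# Hand-written example message; the probabilities are illustrative only.
step_msg = {
    "type": "step",
    "vad": [
        {"inactivity_prob": 0.82},
        {"inactivity_prob": 0.64},
        {"inactivity_prob": 0.41},
        {"inactivity_prob": 0.22},
    ],
}


def inactivity_at(msg: dict, horizon_s: float) -> float:
    """Return the inactivity probability forecast for the given horizon."""
    index = HORIZONS_S.index(horizon_s)  # raises ValueError for an unknown horizon
    return msg["vad"][index]["inactivity_prob"]


# Map each horizon (in seconds) to its forecast probability.
forecast = {h: inactivity_at(step_msg, h) for h in HORIZONS_S}
```

Keeping the horizon as a named value rather than a bare index makes it easier to see, when tuning, which lookahead a given threshold applies to.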

A few properties this gives you:

  • No second model in the loop. Turn detection latency is 0 ms above transcription latency, because the same forward pass produces both.
  • Tunable per use case. Pick the horizon and threshold that match how reactive vs. conservative you want the agent to be (see the next section).
  • Designed to pair with explicit flushing. When your agent decides the turn is over, send_flush() forces the server to process outstanding audio and emit any pending text before the agent runs its next stage. This eliminates the race between "I decided the turn ended" and "the transcript caught up".

Underneath these knobs, the model runs on two different time scales, and most of the tuning below comes down to reconciling them. The VAD runs in real time: every 80 ms it reports how likely the user is to stay silent over the next few seconds, given everything heard so far. The transcript runs on a slower clock, because the model only emits text once it has accumulated delay_in_frames of audio context, so it always trails the live audio by that delay. The VAD answers "is the user done?" for the present moment, while the words that confirm it arrive a beat later.

More specifically, the delay_in_frames parameter controls how much audio context the model accumulates before emitting text. Supported values are 7, 8, 10, 12, 14, 16, 20, 24, 32, 36, and 48 frames (each frame is 80 ms). Lower delay is more reactive, higher delay allows more context and thus higher transcription accuracy.
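The arithmetic is simple enough to encode directly. This is a small sketch using the supported values and the 80 ms frame size stated above; the helper names are hypothetical and not part of the Gradium client.

```python
FRAME_MS = 80
SUPPORTED_DELAYS = (7, 8, 10, 12, 14, 16, 20, 24, 32, 36, 48)


def delay_ms(delay_in_frames: int) -> int:
    """Convert a supported delay_in_frames value to milliseconds."""
    if delay_in_frames not in SUPPORTED_DELAYS:
        raise ValueError(f"unsupported delay_in_frames: {delay_in_frames}")
    return delay_in_frames * FRAME_MS


def largest_delay_within(budget_ms: int) -> int:
    """Pick the largest supported delay whose duration fits the budget."""
    fitting = [d for d in SUPPORTED_DELAYS if d * FRAME_MS <= budget_ms]
    if not fitting:
        raise ValueError(f"no supported delay fits within {budget_ms} ms")
    return max(fitting)
```

For example, `delay_ms(10)` is 800 and `delay_ms(16)` is 1280, while `largest_delay_within(1000)` returns 12, since 12 frames is 960 ms and 14 frames would be 1120 ms.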

How to use it

The full reference lives at docs.gradium.ai, and you can check some of our examples here.

Open an STT realtime session with a delay_in_frames that matches your latency budget. The default is 10 (800 ms of model context), but the right value depends on your use case. For many conversational agents, 16 (1280 ms) is also a reasonable starting point.

python
import gradium

client = gradium.client.GradiumClient(api_key="your-api-key")

async with client.stt_realtime(
    model_name="default",
    input_format="pcm",
    json_config={"language": "en", "delay_in_frames": 10},
) as stt:
    ...

Consume step messages and read the inactivity probability at the horizon that matches your product feel. msg["vad"][3] is the longest horizon (most confident "the user is done"); earlier indices are shorter horizons (more reactive, more false positives).

python
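def turn_has_probably_ended(msg, horizon_index=3, threshold=0.5):
    """Return True when a step message forecasts inactivity above the threshold."""
    vad = msg.get("vad") or []
    # Ignore non-step messages and steps that lack the requested horizon.
    if msg.get("type") != "step" or len(vad) <= horizon_index:
        return False
    return vad[horizon_index]["inactivity_prob"] > threshold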

For noisy inputs or telephony, require several consecutive high-confidence steps before committing to the end of turn. This is what the turn-taking documentation recommends.

python
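import itertools

flush_ids = itertools.count(1)  # local counter used to label each flush
transcript = []                 # text received so far in the current turn
high_vad_steps = 0
async for msg in stt:
    if msg["type"] == "text":   # assumption: transcript messages carry a "text" field
        transcript.append(msg["text"])
    elif msg["type"] == "step":
        inactivity = msg["vad"][3]["inactivity_prob"]
        high_vad_steps = high_vad_steps + 1 if inactivity > 0.5 else 0
        if high_vad_steps >= 3 and transcript:
            await stt.send_flush(flush_id=next(flush_ids))
            high_vad_steps = 0  # avoid sending another flush on every later step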

Three knobs drive the behavior: delay_in_frames (transcript stability), which horizon you read from the vad array (lookahead), and the threshold + debounce on the inactivity probability (commit policy). Starting points by use case:

| Use case | delay_in_frames | Horizon read | Threshold | Consecutive high steps |
|---|---|---|---|---|
| Snappy assistant, clean audio | 10 (800 ms) | vad[2] | 0.30 | 1 |
| Default conversational agent | 10 (800 ms) | vad[2] | 0.50 | 1 or 2 |
| Conversational agent, lots of thinking pauses (such as in the example audio) | 10 (800 ms) | vad[3] | 0.50 | 1 or 2 |
| Phone IVR or noisy/codec'd input | 16 (1.28 s) | vad[3] | 0.60 | 3 |
| Long-form dictation, slow speakers | 24 (1.92 s) | vad[3] | 0.70 | 5+ |

Lower threshold or shorter horizon means the agent commits to end-of-turn sooner (faster, more false barge-ins). Higher delay_in_frames gives the transcription model more audio context before emitting text, which stabilizes downstream LLM input but adds latency before the agent can act.
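One way to keep these starting points together is a small policy object plus a debouncer that can be unit-tested without a live session. This is a sketch: `TurnPolicy`, `PRESETS`, and `TurnEndDetector` are hypothetical names, the preset keys are our own labels, and where the starting points say "1 or 2" or "5+" the code simply picks 2 and 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnPolicy:
    delay_in_frames: int    # passed to the STT session config
    horizon_index: int      # which entry of the vad array to read
    threshold: float        # inactivity probability that counts as "high"
    consecutive_steps: int  # high steps required in a row before committing


PRESETS = {
    "snappy": TurnPolicy(10, 2, 0.30, 1),
    "default": TurnPolicy(10, 2, 0.50, 2),
    "thinking_pauses": TurnPolicy(10, 3, 0.50, 2),
    "telephony": TurnPolicy(16, 3, 0.60, 3),
    "dictation": TurnPolicy(24, 3, 0.70, 5),
}


class TurnEndDetector:
    """Debounce the inactivity probability according to a TurnPolicy."""

    def __init__(self, policy: TurnPolicy) -> None:
        self.policy = policy
        self.high_steps = 0

    def update(self, msg: dict) -> bool:
        """Feed one message; return True when the policy says the turn ended."""
        if msg.get("type") != "step":
            return False
        vad = msg.get("vad") or []
        if len(vad) <= self.policy.horizon_index:
            return False
        prob = vad[self.policy.horizon_index]["inactivity_prob"]
        self.high_steps = self.high_steps + 1 if prob > self.policy.threshold else 0
        if self.high_steps >= self.policy.consecutive_steps:
            self.high_steps = 0  # re-arm for the next turn
            return True
        return False


def step(prob: float) -> dict:
    """Build a synthetic step message with the same probability at every horizon."""
    return {"type": "step", "vad": [{"inactivity_prob": prob}] * 4}


detector = TurnEndDetector(PRESETS["telephony"])
decisions = [detector.update(step(p)) for p in (0.7, 0.9, 0.4, 0.8, 0.7, 0.9)]
```

In this run the dip to 0.4 resets the counter, so the detector only commits on the sixth step, after three consecutive values above 0.60.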

The flushing trick for better latency

The flushing trick is what reconciles those two time scales at a turn boundary. When the VAD says the turn is over, the transcript still trails the audio by delay_in_frames (800 ms at the default of 10), so the model has not yet emitted the last words it heard. Acting immediately hands the LLM a truncated transcript; waiting out the full delay adds that much dead air to every turn.

send_flush() resolves the conflict: when your agent commits to end-of-turn, it forces the server to process the buffered audio and emit any pending text right away, instead of waiting for the delay window to elapse on its own. You get the complete transcript at turn boundaries without paying the delay as latency.

This is what lets a higher delay_in_frames (better transcription stability) coexist with low turn-taking latency, and it is why we recommend pairing an explicit flush with your end-of-turn decision rather than relying on the transcript to catch up.
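The sequence at a turn boundary can be sketched as follows. Only `send_flush(flush_id=...)` comes from this post; the `text` message shape and the `flushed` acknowledgement carrying the same `flush_id` are assumptions to verify against the STT WebSocket reference, and `FakeSTT` is a stand-in that exists only so the example runs without a server.

```python
import asyncio


class FakeSTT:
    """Stand-in for a realtime STT session, used only to make this sketch runnable."""

    def __init__(self, pending_words):
        self._pending = list(pending_words)  # text the model has not emitted yet
        self._queue = asyncio.Queue()

    async def send_flush(self, flush_id):
        # Emit whatever text was still buffered, then acknowledge the flush.
        for word in self._pending:
            await self._queue.put({"type": "text", "text": word})
        self._pending.clear()
        await self._queue.put({"type": "flushed", "flush_id": flush_id})

    def __aiter__(self):
        return self

    async def __anext__(self):
        return await self._queue.get()


async def finalize_turn(stt, transcript, flush_id):
    """Flush, then collect trailing text until the matching acknowledgement arrives."""
    await stt.send_flush(flush_id=flush_id)
    async for msg in stt:
        if msg["type"] == "text":
            transcript.append(msg["text"])
        elif msg["type"] == "flushed" and msg["flush_id"] == flush_id:
            break
    return " ".join(transcript)


async def demo():
    # The session has already delivered "a flight"; "to Lisbon" is still buffered.
    stt = FakeSTT(["to", "Lisbon"])
    return await finalize_turn(stt, ["a", "flight"], flush_id=1)


final = asyncio.run(demo())
```

Waiting for the acknowledgement that matches your own `flush_id`, rather than for a fixed timeout, is what lets the agent hand the LLM a complete transcript as soon as it is available.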

Beyond turn-taking

Time to First Audio, voice cloning quality, and turn detection are the three places voice agent latency and feel get won or lost. See our posts on optimizing quality vs. latency in real-time TTS and Time to First Audio benchmarking for the other aspects of the pipeline.

Get started at gradium.ai or read the STT WebSocket reference. Questions: support@gradium.ai.

Frequently Asked Questions