What is semantic VAD?

Semantic VAD is a turn-detection technique that decides whether a speaker has finished a turn based on what was said, not just whether there is sound in the audio. It typically uses a language-aware model that consumes partial transcripts or audio tokens and outputs the probability that the utterance is complete.

How is semantic VAD different from acoustic VAD?

Acoustic VAD classifies whether short audio frames contain speech based on signal properties like energy and spectral shape: it answers "is there a voice right now?" Semantic VAD answers "is the user done talking?" Acoustic VAD is useful for gating audio and rejecting background noise. Semantic VAD is what you use for turn-taking in a voice agent.

Why does my voice agent cut users off mid-sentence?

The usual cause is acoustic-only turn detection with a silence threshold. Users pause to think or recall details like account numbers and addresses, the silence crosses the threshold, and the agent commits to "they're done." Switching to semantic VAD, or combining the two, fixes the common cases.

What is the recommended end-of-turn threshold in Gradium STT?

Start with the 2 s horizon at index 2 of the vad array and a threshold of 0.5: msg["vad"][2]["inactivity_prob"] > 0.5. Lower the threshold or read a shorter horizon for faster reactions. Raise the threshold or require several consecutive high-confidence steps for more conservative behavior.

What does delay_in_frames control in Gradium STT?

It sets how much audio context the STT model accumulates before emitting text. Each frame is 80 ms. Supported values are 7, 8, 10, 12, 14, 16, 20, 24, 32, 36, and 48. Lower delays are more reactive but less stable on noisy or accented speech; higher delays are the opposite.

Should I combine acoustic VAD and semantic VAD?

Often, yes. Acoustic VAD (Silero, WebRTC VAD) is still useful for opening the microphone, detecting start-of-speech, and rejecting background noise. Gradium's semantic VAD handles end-of-turn detection. Production voice agents commonly run this combination.

What is the flushing mechanism in real-time STT?

When your agent decides a turn has ended, calling send_flush() forces the Gradium server to process all buffered audio and emit any pending text before the agent runs its next stage. Without flushing, the agent risks acting on a transcript that has not yet caught up to the audio it just listened to.

Does semantic VAD work on noisy phone calls?

Yes, more reliably than acoustic-only VAD. Acoustic VAD is fragile against background speech, low-bitrate codecs, and non-speech transients (a door shutting, a cough). Semantic VAD is anchored in transcribed content, so transients without recognizable words do not register as turn boundaries.

Semantic VAD: turn detection that uses meaning, not silence

Picture a common interaction. The user says "I'd like to cancel my flight from Boston to..." and pauses for a second to check the date on their phone. The agent jumps in: "Got it, cancelling your flight from Boston. Where to?" The user now has to interrupt the agent's interruption to finish the sentence they started.

Every voice agent built on acoustic-only turn detection has this failure mode, and it almost always traces back to one decision the system has to make every 80 milliseconds: has the user actually stopped talking?

This post covers what acoustic and semantic VAD actually do, why the distinction matters, what Gradium's STT does differently, and how to wire it into your agent loop.

Acoustic VAD vs semantic VAD

Acoustic VAD classifies short audio frames as speech or non-speech based on signal properties: energy, spectral shape, harmonicity. It answers "is there a voice in this 20 ms window?" and nothing beyond that. Classical implementations like WebRTC VAD and Silero VAD work this way, and they work well for the task they were designed for: gating audio, suppressing background noise, deciding when to start a transcription.

The failure mode shows up when acoustic VAD gets used as a turn detector. "Is there a voice right now?" is not the same question as "is the user done talking?" A 400 ms pause mid-sentence ("I'd like to book a flight to... uh... Lisbon") is acoustically identical to a 400 ms pause at the end of a turn. A silence timer cannot tell them apart: set it shorter than the pause and it interrupts the first case; set it longer and it feels sluggish on the second.
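To make the ambiguity concrete, here is a minimal sketch of silence-timer endpointing layered on per-frame speech/non-speech decisions. The 20 ms frame size, the function names, and the toy frame sequences are assumptions for illustration; they do not come from any particular VAD library.

```python
FRAME_MS = 20  # assumed frame size for the acoustic VAD


def silence_timer_fires(speech_flags, silence_ms):
    """Return the frame index at which a silence timer declares end-of-turn,
    or None if it never fires.

    speech_flags: per-frame booleans from an acoustic VAD (True = speech).
    silence_ms:   continuous silence required to declare end-of-turn.
    """
    needed = silence_ms // FRAME_MS
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


def frames(ms, is_speech):
    """Build a run of identical frames covering `ms` milliseconds."""
    return [is_speech] * (ms // FRAME_MS)


# 1 s of speech, a 400 ms hesitation, 600 ms more speech, then silence.
hesitation = frames(1000, True) + frames(400, False) + frames(600, True) + frames(1000, False)
# Same opening, but the user really is done after the first phrase.
finished = frames(1000, True) + frames(1000, False)

for timer_ms in (300, 800):
    print(timer_ms, silence_timer_fires(hesitation, timer_ms), silence_timer_fires(finished, timer_ms))
```

With the 300 ms timer, the detector fires inside the hesitation (before the speaker resumes at frame 70). With the 800 ms timer it survives the hesitation, but the finished turn now waits the full 800 ms before anything happens.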

Semantic VAD adds language context. Instead of deciding whether audio contains speech, it predicts whether the speaker's utterance is meaningfully complete. The signal comes from what the user is saying, not just from the audio envelope: lexical content, syntactic completeness, intonation, fillers.

The two are complementary. Acoustic VAD is useful for opening the microphone and rejecting noise, semantic VAD for turn-taking.
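One way to picture that division of labor is a small state machine in which the acoustic detector opens a turn and the semantic detector closes it. This is a sketch under stated assumptions: the state names, the `next_state` function, and the 0.5 threshold are illustrative, and the two inputs stand in for whatever your acoustic and semantic detectors actually emit.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"              # waiting for the user to start speaking
    LISTENING = "listening"    # the user has the floor
    RESPONDING = "responding"  # the turn has been handed to the agent


def next_state(state, acoustic_speech, semantic_done_prob, threshold=0.5):
    """Advance a two-detector turn-taking loop by one step.

    acoustic_speech:    bool from an acoustic VAD ("is there a voice right now?")
    semantic_done_prob: float in [0, 1] from a semantic VAD ("is the user done?")
    """
    if state is State.IDLE:
        # The acoustic VAD opens the turn; background noise alone never starts one.
        return State.LISTENING if acoustic_speech else State.IDLE
    if state is State.LISTENING:
        # The semantic VAD closes the turn; silence by itself is not enough.
        return State.RESPONDING if semantic_done_prob > threshold else State.LISTENING
    return state


# (acoustic_speech, semantic_done_prob) per step: noise, speech, a mid-sentence
# pause with low completion probability, more speech, then a real end of turn.
turn = [(False, 0.0), (True, 0.0), (True, 0.1), (False, 0.2), (True, 0.1), (False, 0.9)]

state = State.IDLE
trace = []
for acoustic_speech, done_prob in turn:
    state = next_state(state, acoustic_speech, done_prob)
    trace.append(state.name)
print(trace)
```

The mid-sentence pause (fourth step) is acoustically silent but leaves the state at LISTENING, because the completion probability is still low.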

A note on terminology. Endpointing (sometimes phrase endpointing) is the classic speech-recognition term for detecting where a turn ends; historically it was just a silence timer sitting on top of acoustic VAD. Semantic endpointing and semantic VAD are the two labels you will see for making that same end-of-turn decision from linguistic content rather than silence alone.

OpenAI's Realtime API exposes it as a semantic_vad turn-detection mode, AssemblyAI calls it semantic endpointing, and others frame the same capability as end-of-turn (EOT) or end-of-utterance (EOU) detection, or simply turn detection. They are all answering the one question that matters: has the user actually finished their turn? This post uses semantic VAD throughout.

Why this matters for voice agents

Three failure modes drop out of a turn detector that only sees acoustic signal:

Early interruption on hesitation. Users pause to think, especially when answering open-ended questions or recalling specifics (account numbers, addresses, names). Acoustic-only turn detection commits to "they're done" the moment silence crosses a threshold. The user gets cut off, the agent responds to half a request, and the recovery loop ("sorry, can you repeat that?") burns several seconds and erodes trust.

Sluggish responses on completed turns. The conservative fix is to raise the silence threshold (say, to 800 ms). Now hesitation gets handled, but every completed turn carries the full 800 ms of dead air before the agent can respond. A conversational gap that would otherwise be around 200 ms balloons to 1000 ms, and the agent feels slow.

Telephony and noisy channels. Acoustic VAD is fragile against background speech, low-bitrate codecs, and crosstalk. Semantic signal degrades more gracefully because it's anchored in the words actually being transcribed.

As an example, a Gradium customer running a German-language customer-support voice agent ran into a sharper version of this. Callers reading their account number ("Meine Kundennummer ist 1, 2, 3...") would get cut off mid-sequence: the short pauses between spoken digits crossed the silence threshold, the agent finalized the transcript at "Meine Kundennummer ist 1, 2," and the reasoning layer was handed an incomplete number.

Semantic VAD looks at the trailing token (a partial digit sequence after "ist") and waits, because the model has learned that a number sequence after "ist" is not yet a complete utterance. More generally, people slow down and pause more often when reading out numbers, and semantic VAD accounts for that change in tempo instead of treating each gap as an end of turn.

What Gradium does differently

Gradium STT emits turn-completion predictions directly from the audio model, every 80 ms, as part of the same WebSocket stream that delivers transcripts. Each step message contains a vad array with inactivity probabilities at multiple future horizons rather than a single binary decision. You get a forecast: how likely the user is to remain inactive 0.5 s, 1 s, 2 s, and 3 s from now. The agent picks the horizon that matches the feel it wants, depending on the use case and the confidence level needed.
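As a rough illustration of reading that forecast, the sketch below maps each horizon to its probability. The `type`, `vad`, and `inactivity_prob` fields are the ones used elsewhere in this post; the helper name, the hand-written sample message, and the assumption that the array is ordered from shortest to longest horizon are ours, so check the reference before relying on them.

```python
# Horizons named in the text, assumed to match the order of the vad array
# (index 2 is the 2 s horizon, index 3 the longest).
HORIZONS_S = (0.5, 1.0, 2.0, 3.0)


def inactivity_forecast(msg):
    """Map each horizon (in seconds) to its inactivity probability.

    Returns an empty dict for non-step messages or steps without a vad array.
    """
    if msg.get("type") != "step" or not msg.get("vad"):
        return {}
    return {h: entry["inactivity_prob"] for h, entry in zip(HORIZONS_S, msg["vad"])}


# Hand-written sample; real messages may carry additional fields.
sample = {
    "type": "step",
    "vad": [
        {"inactivity_prob": 0.05},
        {"inactivity_prob": 0.20},
        {"inactivity_prob": 0.62},
        {"inactivity_prob": 0.81},
    ],
}

forecast = inactivity_forecast(sample)
for horizon, prob in forecast.items():
    print(f"P(still inactive in {horizon} s) = {prob:.2f}")
```

In this sample the two shorter horizons stay below 0.5 while the 2 s and 3 s horizons cross it, which is the kind of spread that lets you trade reaction speed against confidence.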

A few properties this gives you:

No second model in the loop. Turn detection adds no latency on top of transcription, because the same forward pass produces both.
Tunable per use case. Pick the horizon and threshold that match how reactive vs. conservative you want the agent to be (see the next section).
Designed to pair with explicit flushing. When your agent decides the turn is over, send_flush() forces the server to process outstanding audio and emit any pending text before the agent runs its next stage. This eliminates the race between "I decided the turn ended" and "the transcript caught up".

Underneath these knobs, the model runs on two different time scales, and most of the tuning below comes down to reconciling them. The VAD runs in real time: every 80 ms it reports how likely the user is to stay silent over the next few seconds, given everything heard so far. The transcript runs on a slower clock, because the model only emits text once it has accumulated delay_in_frames of audio context, so it always trails the live audio by that delay. The VAD answers "is the user done?" for the present moment, while the words that confirm it arrive a beat later.

More specifically, the delay_in_frames parameter controls how much audio context the model accumulates before emitting text. Supported values are 7, 8, 10, 12, 14, 16, 20, 24, 32, 36, and 48 frames (each frame is 80 ms), so the transcript lag ranges from 560 ms to 3840 ms. A lower delay is more reactive; a higher delay gives the model more context and thus higher transcription accuracy.
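The arithmetic is simple enough to keep in a helper. The sketch below converts a delay_in_frames value into milliseconds of transcript lag and picks the largest supported value that fits a latency budget. The function names are hypothetical; only the frame size and the list of supported values come from the text above.

```python
FRAME_MS = 80
SUPPORTED_DELAYS = (7, 8, 10, 12, 14, 16, 20, 24, 32, 36, 48)


def transcript_lag_ms(delay_in_frames):
    """Audio context accumulated before text is emitted, in milliseconds."""
    if delay_in_frames not in SUPPORTED_DELAYS:
        raise ValueError(f"unsupported delay_in_frames: {delay_in_frames}")
    return delay_in_frames * FRAME_MS


def largest_delay_within(budget_ms):
    """Largest supported delay whose lag fits the budget, else the smallest one."""
    fitting = [d for d in SUPPORTED_DELAYS if d * FRAME_MS <= budget_ms]
    return max(fitting) if fitting else min(SUPPORTED_DELAYS)


for d in SUPPORTED_DELAYS:
    print(f"delay_in_frames={d:>2} -> {transcript_lag_ms(d)} ms")
```

For example, a 1000 ms budget selects 12 frames (960 ms), since the next supported value, 14, would already cost 1120 ms.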

How to use it

The full reference lives at docs.gradium.ai, and you can check some of our examples here.

Open an STT realtime session with a delay_in_frames that matches your latency budget. Our default is 10 (800 ms), but the right value depends on your use case. For many conversational agents, 16 (1280 ms of model context) is also a reasonable starting point.

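```python
import asyncio

import gradium


async def main():
    client = gradium.client.GradiumClient(api_key="your-api-key")

    # delay_in_frames=10 means 800 ms of context (10 frames x 80 ms).
    async with client.stt_realtime(
        model_name="default",
        input_format="pcm",
        json_config={"language": "en", "delay_in_frames": 10},
    ) as stt:
        ...  # send audio and consume messages here


asyncio.run(main())
```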

Consume step messages and read the inactivity probability at the horizon that matches your product feel. msg["vad"][3] is the longest horizon (most confident "the user is done"); earlier indices are shorter horizons (more reactive, more false positives).

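```python
def turn_has_probably_ended(msg, horizon_index=3, threshold=0.5):
    """Return True when a step message reports a likely end of turn."""
    # Only step messages carry the vad array; ignore everything else.
    if msg.get("type") != "step" or not msg.get("vad"):
        return False
    return msg["vad"][horizon_index]["inactivity_prob"] > threshold
```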

For noisy inputs or telephony, require several consecutive high-confidence steps before committing to the end of turn. This is what the turn-taking documentation recommends.

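```python
# Runs inside the `async with ... as stt:` block. `transcript` (the text
# received so far) and `next_flush_id()` are supplied by your application.
high_vad_steps = 0  # consecutive steps above the threshold
async for msg in stt:
    if msg["type"] == "step" and msg["vad"]:
        inactivity = msg["vad"][3]["inactivity_prob"]
        # Debounce: any step at or below the threshold resets the streak.
        high_vad_steps = high_vad_steps + 1 if inactivity > 0.5 else 0
        if high_vad_steps >= 3 and transcript:
            await stt.send_flush(flush_id=next_flush_id())
            high_vad_steps = 0  # re-arm so later steps do not flush again
```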

Three knobs drive the behavior: delay_in_frames (transcript stability), which horizon you read from the vad array (lookahead), and the threshold + debounce on the inactivity probability (commit policy). Starting points by use case:

| Use case | `delay_in_frames` | Horizon read | Threshold | Consecutive high steps |
| --- | --- | --- | --- | --- |
| Snappy assistant, clean audio | `10` (800 ms) | `vad[2]` | 0.30 | 1 |
| Default conversational agent | `10` (800 ms) | `vad[2]` | 0.50 | 1 or 2 |
| Conversational agent, lots of thinking pauses (such as in the example audio) | `10` (800 ms) | `vad[3]` | 0.50 | 1 or 2 |
| Phone IVR or noisy/codec'd input | `16` (1.28 s) | `vad[3]` | 0.60 | 3 |
| Long-form dictation, slow speakers | `24` (1.92 s) | `vad[3]` | 0.70 | 5+ |

A lower threshold or a shorter horizon means the agent commits to end-of-turn sooner (faster, but with more false barge-ins). A higher delay_in_frames gives the transcription model more audio context before emitting text, which stabilizes downstream LLM input but adds latency before the agent can act.
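If you want these starting points in code, one option is to capture them as presets and drive a small debounced detector from them. The sketch below is an illustration, not part of the Gradium SDK: the `TurnPolicy` and `EndOfTurnDetector` names and the preset keys are made up, "1 or 2" is taken as 2, and "5+" as 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnPolicy:
    delay_in_frames: int
    horizon_index: int      # which entry of the vad array to read
    threshold: float        # inactivity probability that counts as "high"
    consecutive_steps: int  # how many high steps in a row before committing


# Starting points from the use-case table; the preset names are hypothetical.
PRESETS = {
    "snappy": TurnPolicy(10, 2, 0.30, 1),
    "default": TurnPolicy(10, 2, 0.50, 2),
    "thinking_pauses": TurnPolicy(10, 3, 0.50, 2),
    "telephony": TurnPolicy(16, 3, 0.60, 3),
    "dictation": TurnPolicy(24, 3, 0.70, 5),
}


class EndOfTurnDetector:
    """Debounced threshold on one horizon of the vad array."""

    def __init__(self, policy):
        self.policy = policy
        self.streak = 0

    def update(self, msg):
        """Feed one message; return True when the policy says the turn ended."""
        if msg.get("type") != "step" or not msg.get("vad"):
            return False
        prob = msg["vad"][self.policy.horizon_index]["inactivity_prob"]
        self.streak = self.streak + 1 if prob > self.policy.threshold else 0
        if self.streak >= self.policy.consecutive_steps:
            self.streak = 0  # re-arm for the next turn
            return True
        return False


def step(prob):
    """Build a hand-written step message with the same probability at every horizon."""
    return {"type": "step", "vad": [{"inactivity_prob": prob}] * 4}


detector = EndOfTurnDetector(PRESETS["telephony"])
decisions = [detector.update(step(p)) for p in (0.7, 0.7, 0.4, 0.7, 0.7, 0.7)]
print(decisions)
```

In the telephony run, the dip to 0.4 resets the streak, so the detector only commits after three further high steps. Keeping the policy in one frozen object also makes it easy to A/B different presets without touching the agent loop.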

The flushing trick for better latency

The flushing trick is what reconciles those two time scales at a turn boundary. When the VAD says the turn is over, the transcript still trails the audio by delay_in_frames (800 ms at the default of 10), so the model has not yet emitted the last words it heard. Acting immediately hands the LLM a truncated transcript; waiting out the full delay adds that much dead air to every turn.

send_flush() resolves the conflict: when your agent commits to end-of-turn, it forces the server to process the buffered audio and emit any pending text right away, instead of waiting for the delay window to elapse on its own. You get the complete transcript at turn boundaries without paying the delay as latency.

This is what lets a higher delay_in_frames (better transcription stability) coexist with low turn-taking latency, and it is why we recommend pairing an explicit flush with your end-of-turn decision rather than relying on the transcript to catch up.
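The sketch below shows the shape of that pairing: flush when the turn is committed, keep collecting text until the server signals that the flush has been processed, and only then hand the transcript to the next stage. It runs against a fake session so the control flow can be tested offline. Only send_flush(flush_id=...) comes from this post; the "text" and "flushed" message types, their fields, and the `FakeSTT` and `commit_turn` names are assumptions for illustration, so check the STT WebSocket reference for the real schema.

```python
import asyncio


class FakeSTT:
    """Stand-in for a realtime STT session, used only to exercise commit_turn.

    Assumption: after a flush the server emits any pending "text" messages,
    then a "flushed" acknowledgement that echoes the flush_id.
    """

    def __init__(self, pending_words):
        self._pending = list(pending_words)
        self._queue = asyncio.Queue()

    async def send_flush(self, flush_id):
        for word in self._pending:
            await self._queue.put({"type": "text", "text": word})
        self._pending.clear()
        await self._queue.put({"type": "flushed", "flush_id": flush_id})

    def __aiter__(self):
        return self

    async def __anext__(self):
        return await self._queue.get()


async def commit_turn(stt, transcript, flush_id):
    """Flush at end-of-turn and return the transcript once the tail has arrived."""
    await stt.send_flush(flush_id=flush_id)
    async for msg in stt:
        if msg["type"] == "text":
            transcript.append(msg["text"])
        elif msg["type"] == "flushed" and msg["flush_id"] == flush_id:
            return " ".join(transcript)


async def main():
    # The VAD fired while the last two words were still inside the delay window.
    stt = FakeSTT(pending_words=["to", "Lisbon"])
    heard_so_far = ["book", "a", "flight"]
    return await commit_turn(stt, heard_so_far, flush_id=1)


final_transcript = asyncio.run(main())
print(final_transcript)
```

Waiting for an acknowledgement tied to the flush id, rather than sleeping for a fixed interval, is what keeps the LLM from seeing a truncated transcript without adding the full delay to every turn.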

Beyond turn-taking

Time to First Audio, voice cloning quality, and turn detection are the three places voice agent latency and feel get won or lost. See our posts on optimizing quality vs. latency in real-time TTS and Time to First Audio benchmarking for the other aspects of the pipeline.

Get started at gradium.ai or read the STT WebSocket reference. Questions: support@gradium.ai.
