Turn-Taking in Voice Agents: Why Rule-Based VAD Is Broken and What Comes Next
Turn-taking is the mechanism voice agents use to decide when to listen and when to speak. In every cascaded architecture (speech-to-text, LLM, and text-to-speech, abbreviated STT + LLM + TTS), the system needs to determine in real time when the user has finished their turn so the agent can respond. The dominant approach to solving this problem is Voice Activity Detection (VAD): an algorithm that monitors the audio stream and uses silence thresholds and rule sets to trigger the end-of-turn decision.
VAD-based turn-taking is what powers most production voice agents in 2026. It is also, according to Neil Zeghidour, CEO of Gradium, "the worst part of voice AI" and "the archaic era of handmade rules." This article covers what VAD-based turn-taking is, why it fails in natural conversation, how semantic VAD improves on it within cascade architecture, and what full duplex conversation looks like as the research-grade alternative.
What Is Turn-Taking in a Voice Agent?
In a turn-based voice agent, the conversation is modeled as alternating turns: the user speaks, the agent listens; the agent speaks, the user listens. Each party waits for the other to finish before responding. This is the model used by every cascaded voice agent architecture: STT transcribes the user's speech, an LLM generates a text response, TTS synthesizes that response into audio.
For this to work, the system needs to know when the user has finished their turn. It needs to answer two questions in real time:
- Is the user currently speaking or silent?
- Has the user finished their turn, or are they just pausing mid-thought?
The answers to these two questions determine when the agent starts speaking. Get it wrong in one direction and the agent interrupts. Get it wrong in the other and it waits too long. Both failures make the interaction feel broken.
How Does Rule-Based VAD Work?
Voice Activity Detection is the algorithm used to answer those questions. In its basic form, VAD detects whether an audio signal contains speech or silence. A threshold of silence (500 milliseconds of no detected speech energy, for example) triggers the end-of-turn decision and hands control to the agent.
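A flat silence threshold can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the 20 ms frame size, the RMS energy cutoff, and the function names are assumptions made for the example, and only the 500 ms figure comes from the text above.

```python
from typing import Iterable, Optional, Sequence

FRAME_MS = 20            # assumed frame length
ENERGY_THRESHOLD = 0.01  # assumed RMS level separating speech from silence
SILENCE_MS = 500         # the example threshold mentioned in the text


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return (sum(s * s for s in frame) / len(frame)) ** 0.5


def end_of_turn_ms(frames: Iterable[Sequence[float]]) -> Optional[int]:
    """Return the time in ms at which end of turn fires, or None if it never does."""
    heard_speech = False
    silence_run_ms = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= ENERGY_THRESHOLD:
            heard_speech = True
            silence_run_ms = 0          # any speech resets the silence counter
        elif heard_speech:
            silence_run_ms += FRAME_MS
            if silence_run_ms >= SILENCE_MS:
                return (i + 1) * FRAME_MS
    return None


speech = [0.5] * 160   # one loud frame
silence = [0.0] * 160  # one silent frame

# 200 ms of speech followed by 500 ms of silence: the turn ends at 700 ms.
fires_at = end_of_turn_ms([speech] * 10 + [silence] * 25)
```

Note that the function has no way to tell a 500 ms thinking pause from a finished turn: both produce the same run of low-energy frames.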
The problem is that this simple threshold fails constantly in real conversation. Humans pause mid-sentence. They breathe. They say "um" and "uh" while formulating thoughts. They trail off and pick up again. A flat silence threshold cannot distinguish between "I finished my turn" and "I am thinking" or "I am about to continue."
So developers add rules. If silence exceeds threshold X and the last word was a question word, treat it as end of turn. If silence exceeds threshold X but follows a number, wait longer. If the agent was just speaking and silence occurs within Y milliseconds, treat it as an interruption. If the agent detected background noise above level Z, ignore short silences.
Neil Zeghidour, CEO of Gradium, describes it precisely: "You have an algorithm called the voice activity detection algorithm that just says whether it's silent or not. And if it's silent more than x amount of milliseconds, then this rule and this rule and this rule, then it's an interruption. But if this happened, it's not an interruption. And so you have rules on top of rules on top of rules to decide whether the model should talk or not."
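The layering described above might look like the following sketch. It models only a subset of the rules mentioned (the base threshold, the "follows a number" rule, and the background-noise rule), and every threshold value and name here is a hypothetical placeholder rather than a setting from a real system.

```python
from dataclasses import dataclass

X_MS = 500                   # assumed base silence threshold
NUMBER_WAIT_MS = 1200        # assumed longer wait after a number
Z_NOISE = 0.3                # assumed background-noise level cutoff
NOISY_MIN_SILENCE_MS = 900   # assumed minimum silence in noisy conditions


@dataclass
class TurnState:
    silence_ms: int     # how long the user has been silent
    last_word: str      # last word from the transcript so far
    noise_level: float  # estimated background noise, 0.0 to 1.0


def should_end_turn(s: TurnState) -> bool:
    """Hand-written rules stacked on top of a flat silence threshold."""
    required = X_MS
    if s.last_word.isdigit():
        # The user may be dictating a longer number: wait longer.
        required = max(required, NUMBER_WAIT_MS)
    if s.noise_level > Z_NOISE:
        # Short silences are unreliable in noise: ignore them.
        required = max(required, NOISY_MIN_SILENCE_MS)
    return s.silence_ms >= required


# Boundary case created by the number rule: "I was born in 1990" is a
# finished sentence, but the agent now waits because it ends in a number.
finished_but_waiting = should_end_turn(TurnState(600, "1990", 0.0))
```

Each rule fixes one pattern and introduces a new case where the decision is wrong, which is the dynamic the quote describes.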
The structural parallel with early computer vision is direct. Before deep learning, image recognition relied on handcrafted feature detectors: algorithms designed to find edges at specific angles and textures at specific frequencies. These worked in constrained conditions and failed outside them. Machine learning replaced them because no finite set of rules could cover the variability of real-world inputs. Turn-taking is in the same position today. The variability of natural speech (pauses, breathing, trailing sentences, mid-thought silences) exceeds what a rule system can handle reliably, and each new rule added to extend coverage creates new boundary cases.
Why Does Rule-Based VAD Make Voice Agents Feel Unnatural?
The practical result of rule-based VAD in production is that the user must adapt to the system rather than the system adapting to the user. Users of cascaded voice agents learn to pace their speech artificially, avoid pauses, and clip their sentences to prevent the agent from interrupting or waiting at the wrong time. This behavioral adaptation is a direct product of VAD architecture: the system cannot distinguish between a user who has finished speaking and a user who is mid-thought but briefly silent.
Adding more rules does not resolve this. Natural speech contains pauses, restarts, thinking silences, trailing sentences, and overlapping emphasis. Each rule introduced to handle one of these patterns creates new boundary cases, so the coverage problem grows with every rule added.
What Is Semantic VAD and How Does It Improve Cascade Turn-Taking?
Between rule-based VAD and full duplex lies an intermediate approach: semantic voice activity detection. Instead of making the end-of-turn decision based purely on silence duration, semantic VAD uses the content of what the user is saying to determine whether their turn is complete. A model that understands language can recognize that a sentence ending with a question implies an expected response, while a sentence that trails mid-clause likely continues.
Semantic VAD reduces the most common failure modes of rule-based VAD: premature interruptions when a user pauses to think, and missed end-of-turns when a user's final sentence ends without trailing silence. It does not require silence as a primary signal. The agent can wait for a semantically complete thought rather than a long enough gap.
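One way to picture the idea is a completeness score that shortens or lengthens the silence the agent is willing to wait for. The sketch below is a toy stand-in: a production semantic VAD would use a trained model over acoustic and linguistic context, whereas the word list, scores, and wait times here are hand-written assumptions chosen only to make the mechanism visible. It also assumes the transcript carries punctuation.

```python
MIN_WAIT_MS = 200    # assumed wait when the utterance looks complete
MAX_WAIT_MS = 1500   # assumed wait when the utterance looks unfinished

# Hypothetical list of words that rarely end a finished thought.
TRAILING_WORDS = {"and", "but", "so", "because", "or", "to", "of", "the", "a", "um", "uh"}


def completeness(transcript: str) -> float:
    """Toy stand-in for a learned model: 0.0 means unfinished, 1.0 means complete."""
    text = transcript.strip()
    words = text.lower().rstrip(".?!").split()
    if not words:
        return 0.0
    if words[-1].strip(",") in TRAILING_WORDS:
        return 0.1   # trails off mid-clause
    if text.endswith((".", "?", "!")):
        return 0.9   # reads as a finished sentence or question
    return 0.5       # no strong signal either way


def required_silence_ms(score: float) -> int:
    """Interpolate between MAX_WAIT_MS (score 0) and MIN_WAIT_MS (score 1)."""
    return round(MAX_WAIT_MS - score * (MAX_WAIT_MS - MIN_WAIT_MS))


def end_of_turn(transcript: str, silence_ms: int) -> bool:
    return silence_ms >= required_silence_ms(completeness(transcript))


# The same 400 ms pause is treated differently depending on content.
question_done = end_of_turn("What time do you open?", 400)
mid_thought = end_of_turn("I want to book a table for two and", 400)
```

The design point is that silence becomes a secondary signal: the content decides how much silence is enough.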
Gradium's streaming STT API includes semantic VAD as part of its architecture. Rather than triggering end-of-turn on silence alone, the model uses both acoustic and linguistic context to determine when the user has finished a complete thought. This reduces the most disruptive category of turn-taking failures in cascaded systems: the agent interrupting a user who is mid-sentence but briefly silent.
Semantic VAD is an improvement over rule-based VAD within the cascade architecture. It is not full duplex.
What Does Full Duplex Conversation Look Like? How Moshi Works
The research-grade solution to turn-taking is to remove turns entirely. Full duplex conversation means the model is always listening and always capable of speaking simultaneously, without a gate that decides when one party is allowed to begin.
Neil Zeghidour's team at Kyutai (the research lab that preceded Gradium and continues to run in parallel with it) built and published Moshi, the first fully open full duplex conversational model. The core architectural idea, which Zeghidour describes as "an extremely simple thing", is to model the conversation as two simultaneous token streams rather than one:
"We took the audio language model, instead of having it modeling one stream of tokens, we called it multistream: just two streams of tokens, one for the user, one for the AI. Both can be active at the same time and there is no turn-taking anymore. You train it on stereo data: people talking on the phone, one person on the left channel, one on the right. Your model models both at the same time."
— Neil Zeghidour, CEO of Gradium
The result is that the model can speak at any moment and continue listening while speaking. A user can pause for several seconds mid-thought without triggering an interruption, and the agent does not wait for a silence threshold to respond. In a full duplex system, the traditional definition of TTFT (time from end-of-turn to first response token) no longer applies, because turns no longer exist as a structure. The agent responds when it has something to say, not when a rule tells it the user has finished.
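The two-stream layout from the quote can be illustrated with a small data-structure sketch. This is not Moshi's code: the integer token ids, the `SILENCE` id, and the function names are hypothetical, and the example only shows how stereo data gives one time-aligned stream per speaker with both allowed to be active at the same step.

```python
from typing import List, Tuple

SILENCE = 0  # hypothetical token id meaning "this speaker is silent at this step"


def split_streams(stereo_tokens: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Split (left, right) token pairs into a user stream and an agent stream."""
    user = [left for left, _ in stereo_tokens]
    agent = [right for _, right in stereo_tokens]
    return user, agent


def overlapping_steps(user: List[int], agent: List[int]) -> List[int]:
    """Time steps at which both speakers are active at once."""
    return [
        t for t, (u, a) in enumerate(zip(user, agent))
        if u != SILENCE and a != SILENCE
    ]


# One pair per time step: left channel is the user, right channel is the agent.
conversation = [(17, 0), (23, 0), (0, 0), (0, 41), (8, 42), (9, 0)]
user, agent = split_streams(conversation)
overlap = overlapping_steps(user, agent)  # step 4: both are speaking
```

In this layout there is no field that says whose turn it is. Step 2 (both silent) and step 4 (both speaking) are ordinary states rather than errors to be resolved by a gate.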
Moshi is open source, available at moshi.chat, and was built by a core team of four to six people in approximately six months.
Why Do Production Systems Still Use Cascade in 2026?
Despite Moshi's existence, virtually every production voice agent deployed today, including those built on Gradium's API, uses a cascaded architecture. Neil Zeghidour is direct about the reason:
"At the moment what we do is cascaded systems because that's where the market is right now. People are still iterating a lot on the underlying text models that they want to use, on tool use and so on, and there is so much progress on the text side."
The core constraint is modularity. In a cascade architecture, the LLM, STT, and TTS components are independent: each can be updated or replaced without retraining the others. In a speech-to-speech model, all three are integrated in a single trained system. Switching the underlying LLM requires fine-tuning the entire model again on speech data, which in a period of rapid LLM development (multiple major model releases per year) creates an unacceptable operational cost. Developers building in 2026 cannot afford to re-train a speech model every time a better LLM base becomes available.
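The modularity argument can be made concrete with a sketch of a cascade whose stages share only a narrow interface. The class and method names below are illustrative assumptions, and the stand-in components do trivial string work so the example stays self-contained.

```python
from dataclasses import dataclass, replace
from typing import Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def generate(self, prompt: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass(frozen=True)
class Cascade:
    stt: STT
    llm: LLM
    tts: TTS

    def respond(self, audio: bytes) -> bytes:
        # Each stage only sees the previous stage's output.
        return self.tts.synthesize(self.llm.generate(self.stt.transcribe(audio)))


# Trivial stand-ins so the sketch runs without any external service.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode()


class UpperLLM:
    def generate(self, prompt: str) -> str:
        return prompt.upper()


class ReverseLLM:
    def generate(self, prompt: str) -> str:
        return prompt[::-1]


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode()


pipeline = Cascade(EchoSTT(), UpperLLM(), BytesTTS())
upgraded = replace(pipeline, llm=ReverseLLM())  # swap the LLM, keep STT and TTS
```

Swapping the LLM is a one-line change because nothing else in the pipeline depends on it. A single speech-to-speech model has no equivalent seam, which is why changing its language backbone means training again.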
This is a practical constraint, not a fundamental limitation of full duplex architecture. For the full architecture comparison, see Cascaded Voice Agents vs Speech-to-Speech.
Where Is the Architecture Going?
The direction is clear: turn-based conversation is a transitional architecture. The question is not whether full duplex will replace it, but when and how the integration constraints are resolved.
The remaining blocker is not modeling capability. Moshi demonstrated that full duplex conversation is technically achievable with a small team and modest compute. The blocker is integration: building a full duplex model that can accept backend LLM updates without requiring a full re-training cycle. Once that modular connection layer exists, full duplex can offer the same plug-and-play flexibility as cascade systems today, while removing the turn-taking constraint entirely.
Gradium's position is that cascade with semantic VAD is the right production architecture today.
What Are the Practical Implications for Voice Agent Developers in 2026?
For developers building voice agents today, the turn-taking architecture they choose has concrete consequences.
If using a cascade system with rule-based VAD: the agent will interrupt users who pause to think and require users to adapt their speaking style. The quality of turn-taking depends heavily on VAD configuration and is difficult to tune for natural conversation. This is the current baseline for most production deployments.
If using a cascade system with semantic VAD: end-of-turn detection uses linguistic context in addition to silence. The most disruptive failure mode (interrupting a user who is mid-thought) is reduced. Users do not need to adapt their cadence as much. The interaction remains turn-based but functions more naturally within that constraint.
If using a full duplex model: the turn-based structure disappears. Latency in the traditional sense (time from end of turn to first response token) no longer applies, because there is no end of turn to measure from. The interaction behaves more like a phone conversation between humans. The current tradeoff is modularity: changing the underlying intelligence requires re-training the model.
The latency data from the Coval STT benchmark (May 2026) gives context for the cascade pipeline in production: the STT step alone contributes between 992 ms and 2,080 ms of TTFT depending on the model chosen, before the LLM and TTS steps are added. Semantic VAD reduces the end-of-turn detection error rate within that pipeline; it does not reduce the pipeline's fundamental latency floor.
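A rough budget shows why the STT range matters. The sketch assumes the stages run strictly in sequence, so their first-output latencies add. The STT bounds are the figures quoted above, while the LLM and TTS values are hypothetical placeholders, not measurements.

```python
STT_TTFT_MS = (992, 2080)  # range quoted from the benchmark cited in the text

LLM_MS = 300  # hypothetical placeholder for LLM time to first token
TTS_MS = 150  # hypothetical placeholder for TTS time to first audio


def pipeline_ttft_ms(stt_ms: int, llm_ms: int, tts_ms: int) -> int:
    """End-to-end time to first audio when stages run one after another."""
    return stt_ms + llm_ms + tts_ms


best_case = pipeline_ttft_ms(STT_TTFT_MS[0], LLM_MS, TTS_MS)
worst_case = pipeline_ttft_ms(STT_TTFT_MS[1], LLM_MS, TTS_MS)
stt_share_best = STT_TTFT_MS[0] / best_case  # fraction of the budget spent in STT
```

Under these placeholder values the STT stage accounts for most of the total in both cases. Better end-of-turn detection changes when the pipeline starts, not how long it takes once started.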
How Should You Frame the Turn-Taking Choice?
Turn-taking is the mechanism through which most voice agents decide when to listen and when to speak. The dominant approach, Voice Activity Detection with rule-based thresholds, produces an interaction that requires the user to adapt to the agent. This is not a configuration issue. It is an architectural constraint: applying handcrafted rules to a problem that scales better with learned models trained on the right data.
Semantic VAD reduces the most disruptive failure modes within cascade architecture by adding linguistic context to the end-of-turn decision. Full duplex removes the concept of turns entirely, enabling a conversation that behaves like a phone call between two humans. Both approaches represent progress on the same underlying problem.
The research demonstrating that full duplex is achievable already exists in the form of Moshi, built by the founding team of Gradium at Kyutai.
For developers building voice agents today, the choice among rule-based VAD, semantic VAD, and eventual full duplex is a choice among three points on a spectrum from rule-governed to learned conversation dynamics. Each has different cost, latency, naturalness, and modularity tradeoffs. The direction of travel is clear.