What is VAD in the context of voice agents?

VAD (Voice Activity Detection) is the algorithm that determines whether audio input contains speech or silence. In cascaded voice agents, VAD is used to detect when the user has stopped speaking and to trigger the agent's response. Basic VAD combines silence duration thresholds with rule sets to distinguish mid-sentence pauses from actual end-of-turn signals. The limitation, as described by Gradium CEO Neil Zeghidour, is that this produces "rules on top of rules on top of rules" that cannot cover the full range of natural speech patterns.

What is semantic VAD and how is it different from standard VAD?

Semantic VAD uses the meaning and structure of what the user is saying to determine end-of-turn, rather than relying on silence duration alone. A model using semantic VAD can recognize that a question implies an expected response, or that a sentence ending mid-clause is likely to continue. This reduces the most common failure mode of rule-based VAD: interrupting a user who has paused briefly while still mid-thought. Gradium's streaming STT API includes semantic VAD as part of its architecture.

What is full duplex conversation and how is it different from turn-based?

In a turn-based system, only one party speaks at a time. The agent waits for the user to finish before speaking. In a full duplex system, both parties can speak simultaneously. The model is always listening and always capable of generating output. There is no end-of-turn detection because turns no longer exist as a concept. Neil Zeghidour describes the full duplex Moshi model: a user can take five seconds to think about what they are going to say and then talk, and the model will not be lost.

What is Moshi and who built it?

Moshi is a full duplex conversational AI model built by Kyutai, the research lab co-founded by Neil Zeghidour (now CEO of Gradium) and Alexandre Défossez (now Chief Science Officer of Gradium). It was developed by a core team of four to six people in approximately six months and released as open source. It models conversation as two simultaneous streams of tokens (one for the user, one for the AI) trained on stereo phone conversation data.

Why do production voice agents still use cascade architecture in 2026 if full duplex exists?

The primary reason is modularity. In a cascade architecture, the LLM, STT, and TTS components can each be updated independently. In a speech-to-speech model, updating the underlying LLM requires re-training the entire model from scratch on speech data, which is time-consuming and expensive in a period of rapid LLM development. Full duplex becomes practical in production when the modular flexibility of cascade systems can be replicated within a full duplex architecture.

What does full duplex mean for voice agent latency?

In a turn-based system, latency is defined as the time between when the user finishes speaking and when the agent starts speaking. In a full duplex system, this definition no longer applies because the agent can respond at any moment. Neil Zeghidour explains that latency only makes sense in a turn-based conversation as the time between two turns. In a full duplex context, there is no real latency anymore: the model can talk at any time, it can talk over the user and the user can talk over it, which makes the conversation natural.

What is the timeline for full duplex voice agents in production?

Neil Zeghidour does not give a specific timeline, but he states that in the long run it is impossible to imagine the industry sticking with turn-taking. The blocker is not modeling capability but integration: providing the same flexibility as cascaded systems so developers can change the backend LLM on demand, with the same customizability as text models but with full duplex. Moshi already demonstrates that full duplex conversation is technically achievable. The remaining work is the engineering integration layer that makes it as modular as cascade systems are today.

Is rule-based VAD still in use in production voice agents in 2026?

Yes. The majority of production voice agents in 2026 still use rule-based VAD with silence thresholds and rule sets to determine end-of-turn. Semantic VAD is an improvement available from a smaller set of STT providers including Gradium. Full duplex is a research-grade alternative that is not yet widely available as a production API.

Where can I find the Moshi open-source model?

Moshi is available at moshi.chat and on the Kyutai GitHub. It was released as open source and remains the reference implementation for full duplex speech-to-speech conversational AI. Kyutai is the open research lab co-founded by Gradium's CEO and Chief Science Officer.

Gradium was co-founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, who previously co-founded Kyutai. Kyutai released world-first open systems including Moshi (real-time speech-to-speech) and Hibiki (live speech-to-speech translation).

Turn-Taking in Voice Agents: Why Rule-Based VAD Is Broken and What Comes Next

Q: Why does rule-based VAD make voice agents feel unnatural?

Rule-based VAD cannot reliably distinguish between a user pausing to think and a user finishing their turn. It also cannot handle interruptions, overlapping speech, or sentences that trail off naturally. The result is that users must speak in a constrained, unnatural way to avoid confusing the agent. Neil Zeghidour describes it as requiring discipline when talking to AI: users adapt to the agent's flow rather than the other way around.

Q: Does Gradium's STT include semantic VAD?

Yes. Gradium's STT API includes semantic voice activity detection as part of its architecture. Rather than triggering end-of-turn on silence alone, the model uses both acoustic and linguistic context to determine when the user has finished a complete thought. This reduces the most disruptive category of turn-taking failures in cascaded systems: the agent interrupting a user who is mid-sentence but briefly silent.

Turn-taking is the mechanism voice agents use to decide when to listen and when to speak. In every cascaded architecture (STT + LLM + TTS), the system needs to determine in real time when the user has finished their turn so the agent can respond. The dominant approach to solving this problem is Voice Activity Detection (VAD): an algorithm that monitors the audio stream and uses silence thresholds and rule sets to trigger the end-of-turn decision.

VAD-based turn-taking is what powers most production voice agents in 2026. It is also, according to Neil Zeghidour, CEO of Gradium, "the worst part of voice AI" and "the archaic era of handmade rules." This article covers what VAD-based turn-taking is, why it fails in natural conversation, how semantic VAD improves on it within cascade architecture, and what full duplex conversation looks like as the research-grade alternative.

What Is Turn-Taking in a Voice Agent?

In a turn-based voice agent, the conversation is modeled as alternating turns: the user speaks, the agent listens; the agent speaks, the user listens. Each party waits for the other to finish before responding. This is the model used by every cascaded voice agent architecture: STT transcribes the user's speech, an LLM generates a text response, TTS synthesizes that response into audio.

For this to work, the system needs to know when the user has finished their turn. It needs to answer two questions in real time:

Is the user currently speaking or silent?
Has the user finished their turn, or are they just pausing mid-thought?

The answers to these two questions determine when the agent starts speaking. Get it wrong in one direction and the agent interrupts. Get it wrong in the other and it waits too long. Both failures make the interaction feel broken.
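The two failure directions can be made concrete with a small sketch. It assumes you have a labeled timestamp for when the user actually finished speaking; the function name and the 800 ms tolerance are hypothetical choices for illustration, not values from any particular system.

```python
def classify_decision(decision_ms: int, actual_end_ms: int, max_wait_ms: int = 800) -> str:
    """Label one end-of-turn decision against a known ground truth.

    decision_ms   -- when the system decided the user's turn was over
    actual_end_ms -- when the user really finished (labeled by hand, hypothetical)
    max_wait_ms   -- longest acceptable gap before the agent feels slow (assumed)
    """
    if decision_ms < actual_end_ms:
        # The agent started while the user still had more to say.
        return "interrupted"
    if decision_ms - actual_end_ms > max_wait_ms:
        # The agent left the user waiting in silence.
        return "too_slow"
    return "ok"


print(classify_decision(decision_ms=1200, actual_end_ms=3000))
print(classify_decision(decision_ms=4500, actual_end_ms=3000))
print(classify_decision(decision_ms=3400, actual_end_ms=3000))
```

Counting the "interrupted" and "too_slow" labels over a set of recorded conversations is one simple way to compare turn-taking configurations.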

How Does Rule-Based VAD Work?

Voice Activity Detection is the algorithm used to answer those questions. In its basic form, VAD detects whether an audio signal contains speech or silence. A threshold of silence (500 milliseconds of no detected speech energy, for example) triggers the end-of-turn decision and hands control to the agent.
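A minimal sketch of this basic form, assuming 20 ms frames of samples in the range -1.0 to 1.0 and a fixed energy level that separates speech from silence. The frame length, the energy level, and the function names are assumptions for illustration; only the 500 ms silence threshold comes from the example above.

```python
import math
from typing import Iterable, List, Optional

FRAME_MS = 20              # assumed frame length
ENERGY_THRESHOLD = 0.01    # assumed RMS level separating speech from silence
SILENCE_TIMEOUT_MS = 500   # the example silence threshold from the text


def rms(frame: List[float]) -> float:
    """Root-mean-square energy of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def end_of_turn_ms(frames: Iterable[List[float]]) -> Optional[int]:
    """Return the time in ms at which end-of-turn fires, or None if it never does."""
    heard_speech = False
    silence_ms = 0
    for index, frame in enumerate(frames):
        if rms(frame) >= ENERGY_THRESHOLD:
            heard_speech = True
            silence_ms = 0            # any speech resets the silence timer
        elif heard_speech:
            silence_ms += FRAME_MS    # only count silence after speech has started
            if silence_ms >= SILENCE_TIMEOUT_MS:
                return (index + 1) * FRAME_MS
    return None


speech = [0.5] * 160
silence = [0.0] * 160

# 200 ms of speech followed by a long silence.
print(end_of_turn_ms([speech] * 10 + [silence] * 30))
# The same audio, but the user resumes after a 600 ms thinking pause.
print(end_of_turn_ms([speech] * 10 + [silence] * 30 + [speech] * 10))
```

Both calls fire at 700 ms. In the second case the user resumes at 800 ms, so the detector has already handed control to the agent in the middle of the user's thought.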

The problem is that this simple threshold fails constantly in real conversation. Humans pause mid-sentence. They breathe. They say "um" and "uh" while formulating thoughts. They trail off and pick up again. A flat silence threshold cannot distinguish between "I finished my turn" and "I am thinking" or "I am about to continue."

So developers add rules. If silence exceeds threshold X and the last word was a question word, treat it as end of turn. If silence exceeds threshold X but follows a number, wait longer. If the agent was just speaking and silence occurs within Y milliseconds, treat it as an interruption. If the agent detected background noise above level Z, ignore short silences.
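A sketch of what such a rule stack might look like in code. Every threshold value, the `Snapshot` fields, and the `decide` function are hypothetical; this is one possible reading of the four rules above, not any vendor's actual logic.

```python
from dataclasses import dataclass

BASE_SILENCE_MS = 500        # "threshold X" (hypothetical value)
DEFAULT_SILENCE_MS = 800     # wait used when no special rule applies (hypothetical)
EXTENDED_SILENCE_MS = 1200   # longer wait after a number (hypothetical)
BARGE_IN_WINDOW_MS = 300     # "Y milliseconds" (hypothetical value)
NOISE_LEVEL_Z = 0.3          # "level Z" (hypothetical value)
QUESTION_WORDS = {"what", "why", "how", "when", "where", "who"}


@dataclass
class Snapshot:
    silence_ms: int            # how long the user has been silent
    last_word: str             # last word transcribed so far
    ms_since_agent_spoke: int  # time since the agent stopped talking
    noise_level: float         # background noise estimate, 0.0 to 1.0


def decide(s: Snapshot) -> str:
    # Rule 4: in a noisy room, ignore short silences.
    if s.noise_level > NOISE_LEVEL_Z and s.silence_ms < EXTENDED_SILENCE_MS:
        return "wait"
    # Rule 3: silence right after the agent spoke is treated as an interruption.
    if s.ms_since_agent_spoke < BARGE_IN_WINDOW_MS and s.silence_ms > 0:
        return "interruption"
    if s.silence_ms < BASE_SILENCE_MS:
        return "wait"
    # Rule 2: after a number, wait longer (the user may be dictating digits).
    if s.last_word.isdigit():
        return "end_of_turn" if s.silence_ms >= EXTENDED_SILENCE_MS else "wait"
    # Rule 1: a question word followed by silence ends the turn.
    if s.last_word.lower() in QUESTION_WORDS:
        return "end_of_turn"
    return "end_of_turn" if s.silence_ms >= DEFAULT_SILENCE_MS else "wait"


print(decide(Snapshot(silence_ms=600, last_word="what", ms_since_agent_spoke=5000, noise_level=0.0)))
print(decide(Snapshot(silence_ms=600, last_word="42", ms_since_agent_spoke=5000, noise_level=0.0)))
```

Even this small stack has boundary cases. A user who says "I was wondering what" and pauses for 600 ms to think is cut off by rule 1, and the order in which the rules are checked silently changes the outcome.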

Neil Zeghidour, CEO of Gradium, describes it precisely: "You have an algorithm called the voice activity detection algorithm that just says whether it's silent or not. And if it's silent more than x amount of milliseconds, then this rule and this rule and this rule, then it's an interruption. But if this happened, it's not an interruption. And so you have rules on top of rules on top of rules to decide whether the model should talk or not."

The structural parallel with early computer vision is direct. Before deep learning, image recognition relied on handcrafted feature detectors: algorithms designed to find edges at specific angles and textures at specific frequencies. These worked in constrained conditions and failed outside them. Machine learning replaced them because no finite set of rules could cover the variability of real-world inputs. Turn-taking is in the same position today. The variability of natural speech (pauses, breathing, trailing sentences, mid-thought silences) exceeds what a rule system can handle reliably, and each new rule added to extend coverage creates new boundary cases.

Why Does Rule-Based VAD Make Voice Agents Feel Unnatural?

The practical result of rule-based VAD in production is that the user must adapt to the system rather than the system adapting to the user. Users of cascaded voice agents learn to pace their speech artificially, avoid pauses, and clip their sentences to prevent the agent from interrupting or waiting at the wrong time. This behavioral adaptation is a direct product of VAD architecture: the system cannot distinguish between a user who has finished speaking and a user who is mid-thought but briefly silent.

Adding more rules does not resolve this. Natural speech contains pauses, restarts, thinking silences, trailing sentences, and overlapping emphasis. Each new rule introduced to handle one of these patterns creates new boundary cases, so the coverage problem grows with each rule added.

What Is Semantic VAD and How Does It Improve Cascade Turn-Taking?

Between rule-based VAD and full duplex lies an intermediate approach: semantic voice activity detection. Instead of making the end-of-turn decision based purely on silence duration, semantic VAD uses the content of what the user is saying to determine whether their turn is complete. A model that understands language can recognize that a sentence ending with a question implies an expected response, while a sentence that trails mid-clause likely continues.

Semantic VAD reduces the most common failure modes of rule-based VAD: premature interruptions when a user pauses to think, and missed end-of-turns when a user's final sentence ends without trailing silence. It does not require silence as a primary signal. The agent can wait for a semantically complete thought rather than a long enough gap.
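The idea can be illustrated with a toy sketch in which the amount of silence required shrinks as the utterance looks more complete. A real semantic VAD would use a learned model over acoustic and linguistic context; the keyword heuristic in `completeness_score` is only a hypothetical stand-in for such a model, and all names and numbers here are assumptions.

```python
DANGLING_WORDS = {"and", "but", "or", "so", "because", "the", "a", "to", "of", "um", "uh"}


def completeness_score(transcript: str) -> float:
    """Stand-in for a learned model: 0.0 = clearly unfinished, 1.0 = clearly finished."""
    words = transcript.strip().lower().split()
    if not words:
        return 0.0
    last = words[-1]
    if last in DANGLING_WORDS:
        return 0.1    # ends on a connective or filler: probably continues
    if last.endswith("?"):
        return 0.95   # a question expects a response
    if last.endswith((".", "!")):
        return 0.8
    return 0.5        # no strong signal either way


def required_silence_ms(score: float, min_ms: int = 200, max_ms: int = 1500) -> int:
    """The more complete the utterance looks, the less silence is needed."""
    return round(max_ms - score * (max_ms - min_ms))


def is_end_of_turn(transcript: str, silence_ms: int) -> bool:
    return silence_ms >= required_silence_ms(completeness_score(transcript))


print(is_end_of_turn("What time do you open?", silence_ms=300))
print(is_end_of_turn("I'd like to book a table for um", silence_ms=1000))
```

With these assumed values, the finished question ends the turn after 300 ms, while the unfinished request is still waiting after a full second of silence.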

Gradium's streaming STT API includes semantic VAD as part of its architecture. Rather than triggering end-of-turn on silence alone, the model uses both acoustic and linguistic context to determine when the user has finished a complete thought. This reduces the most disruptive category of turn-taking failures in cascaded systems: the agent interrupting a user who is mid-sentence but briefly silent.

Semantic VAD is an improvement over rule-based VAD within the cascade architecture. It is not full duplex.

What Does Full Duplex Conversation Look Like? How Moshi Works

The research-grade solution to turn-taking is to remove turns entirely. Full duplex conversation means the model is always listening and always capable of speaking simultaneously, without a gate that decides when one party is allowed to begin.

Neil Zeghidour's team at Kyutai (the research lab that preceded Gradium and now runs in parallel with it) built and published Moshi, the first fully open full duplex conversational model. The core architectural idea, which Neil Zeghidour describes as "an extremely simple thing", is to model the conversation as two simultaneous token streams rather than one:

"We took the audio language model, instead of having it modeling one stream of tokens, we called it multistream: just two streams of tokens, one for the user, one for the AI. Both can be active at the same time and there is no turn-taking anymore. You train it on stereo data: people talking on the phone, one person on the left channel, one on the right. Your model models both at the same time."

— Neil Zeghidour, CEO of Gradium

The result is that the model can speak at any moment and continue listening while speaking. A user can pause for several seconds mid-thought without triggering an interruption, and the agent does not wait for a silence threshold to respond. In a full duplex system, the traditional definition of TTFT (time from end-of-turn to first response token) no longer applies, because turns no longer exist as a structure. The agent responds when it has something to say, not when a rule tells it the user has finished.
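The two-stream idea can be pictured as a data structure in which every time step holds one token per speaker. The sketch below is not Moshi's code: the token ids are made up, `SILENCE` is a hypothetical placeholder id, and in a real system the tokens would come from an audio tokenizer rather than hand-written lists.

```python
from typing import List, NamedTuple

SILENCE = 0  # hypothetical token id meaning "no speech in this frame"


class Step(NamedTuple):
    user: int    # token from the user's channel (e.g. left stereo channel)
    agent: int   # token from the AI's channel (e.g. right stereo channel)


def to_multistream(left_channel: List[int], right_channel: List[int]) -> List[Step]:
    """Pair two time-aligned token sequences into one sequence of two-stream steps."""
    if len(left_channel) != len(right_channel):
        raise ValueError("channels must be time-aligned and equally long")
    return [Step(u, a) for u, a in zip(left_channel, right_channel)]


def overlapping_steps(steps: List[Step]) -> List[int]:
    """Indices of the time steps where both parties are speaking at once."""
    return [t for t, s in enumerate(steps) if s.user != SILENCE and s.agent != SILENCE]


user_tokens = [5, 7, 0, 0, 9, 3]
agent_tokens = [0, 0, 0, 4, 8, 0]
steps = to_multistream(user_tokens, agent_tokens)
print(steps[4])
print(overlapping_steps(steps))
```

Step 2 is silent on both channels and step 4 has both parties talking. Neither case needs a special rule, because nothing in the representation says whose turn it is.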

Moshi is open source, available at moshi.chat, and was built by a core team of four to six people in approximately six months.

Why Do Production Systems Still Use Cascade in 2026?

Despite Moshi's existence, virtually every production voice agent deployed today, including those built on Gradium's API, uses a cascaded architecture. Neil Zeghidour is direct about the reason:

"At the moment what we do is cascaded systems because that's where the market is right now. People are still iterating a lot on the underlying text models that they want to use, on tool use and so on, and there is so much progress on the text side."

The core constraint is modularity. In a cascade architecture, the LLM, STT, and TTS components are independent: each can be updated or replaced without retraining the others. In a speech-to-speech model, all three are integrated in a single trained system. Switching the underlying LLM requires fine-tuning the entire model again on speech data, which in a period of rapid LLM development (multiple major model releases per year) creates an unacceptable operational cost. Developers building in 2026 cannot afford to re-train a speech model every time a better LLM base becomes available.
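The modularity argument can be sketched with three small interfaces. All class and method names below are hypothetical and the components are trivial fakes; the point is only that the LLM can be replaced without touching the STT or TTS stages.

```python
from typing import Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class Cascade:
    """STT -> LLM -> TTS, with each stage held behind its own interface."""

    def __init__(self, stt: STT, llm: LLM, tts: TTS) -> None:
        self.stt, self.llm, self.tts = stt, llm, tts

    def respond(self, audio: bytes) -> bytes:
        return self.tts.synthesize(self.llm.reply(self.stt.transcribe(audio)))


# Trivial fake components so the sketch runs without any external service.
class FakeSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


class OldLLM:
    def reply(self, text: str) -> str:
        return "old: " + text


class NewLLM:
    def reply(self, text: str) -> str:
        return "new: " + text


agent = Cascade(FakeSTT(), OldLLM(), FakeTTS())
print(agent.respond(b"hello"))
agent.llm = NewLLM()   # swap the LLM; STT and TTS are untouched
print(agent.respond(b"hello"))
```

In a single speech-to-speech model there is no equivalent of the `agent.llm = NewLLM()` line, which is the operational cost the paragraph above describes.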

This is a practical constraint, not a fundamental limitation of full duplex architecture. For the full architecture comparison, see Cascaded Voice Agents vs Speech-to-Speech.

Where Is the Architecture Going?

The direction is clear: turn-based conversation is a transitional architecture. The question is not whether full duplex will replace it, but when and how the integration constraints are resolved.

The remaining blocker is not modeling capability. Moshi demonstrated that full duplex conversation is technically achievable with a small team and modest compute. The blocker is integration: building a full duplex model that can accept backend LLM updates without requiring a full re-training cycle. Once that modular connection layer exists, full duplex can offer the same plug-and-play flexibility as cascade systems today, while removing the turn-taking constraint entirely.

Gradium's position is that cascade with semantic VAD is the right production architecture today.

What Are the Practical Implications for Voice Agent Developers in 2026?

For developers building voice agents today, the turn-taking architecture they choose has concrete consequences.

If using a cascade system with rule-based VAD: the agent will interrupt users who pause to think and require users to adapt their speaking style. The quality of turn-taking depends heavily on VAD configuration and is difficult to tune for natural conversation. This is the current baseline for most production deployments.

If using a cascade system with semantic VAD: end-of-turn detection uses linguistic context in addition to silence. The most disruptive failure mode (interrupting a user who is mid-thought) is reduced. Users do not need to adapt their cadence as much. The interaction remains turn-based but functions more naturally within that constraint.

If using a full duplex model: the turn-based structure disappears. Latency in the traditional sense no longer applies. The interaction behaves more like a phone conversation between humans. The current tradeoff is modularity: changing the underlying intelligence requires re-training the model.

The latency data from the Coval STT benchmark (May 2026) gives context for the cascade pipeline in production: the STT step alone contributes between 992 ms and 2,080 ms of TTFT depending on the model chosen, before the LLM and TTS steps are added. Semantic VAD reduces the end-of-turn detection error rate within that pipeline; it does not reduce the pipeline's fundamental latency floor.
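A back-of-the-envelope budget shows how the STT range feeds into the whole pipeline. Only the 992 ms and 2,080 ms figures come from the text above; the LLM and TTS numbers are hypothetical placeholders, and the sum assumes the stages run strictly one after another with no streaming overlap, which real pipelines may not.

```python
STT_TTFT_RANGE_MS = (992, 2080)   # range quoted in the text
ASSUMED_LLM_MS = 400              # hypothetical time to first LLM token
ASSUMED_TTS_MS = 150              # hypothetical time to first audio from TTS


def pipeline_latency_ms(stt_ms: int,
                        llm_ms: int = ASSUMED_LLM_MS,
                        tts_ms: int = ASSUMED_TTS_MS) -> int:
    """Total delay if each stage waits for the previous one to produce output."""
    return stt_ms + llm_ms + tts_ms


best_case = pipeline_latency_ms(STT_TTFT_RANGE_MS[0])
worst_case = pipeline_latency_ms(STT_TTFT_RANGE_MS[1])
print(best_case, worst_case)
```

Under these assumptions the total ranges from 1,542 ms to 2,630 ms. Improving end-of-turn accuracy changes when that clock starts, not how long the stages take.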

How Should You Frame the Turn-Taking Choice?

Turn-taking is the mechanism through which most voice agents decide when to listen and when to speak. The dominant approach, Voice Activity Detection with rule-based thresholds, produces an interaction that requires the user to adapt to the agent. This is not a configuration issue. It is an architectural constraint: applying handcrafted rules to a problem that scales better with learned models trained on the right data.

Semantic VAD reduces the most disruptive failure modes within cascade architecture by adding linguistic context to the end-of-turn decision. Full duplex removes the concept of turns entirely, enabling a conversation that behaves like a phone call between two humans. Both approaches represent progress on the same underlying problem.

The research demonstrating that full duplex is achievable already exists in the form of Moshi, built by the founding team of Gradium at Kyutai.

For developers building voice agents today, the choice among rule-based VAD, semantic VAD, and eventual full duplex is a choice among three points on a spectrum from rule-governed to learned conversation dynamics. Each has different cost, latency, naturalness, and modularity tradeoffs. The direction of travel is clear.