Cascaded Voice Agents vs Speech-to-Speech: Architecture Tradeoffs in 2026
Two architectures exist for building a voice agent today. The first is the cascade: a pipeline of three independent models (STT, LLM, TTS) that converts audio to text, generates a response, and converts text back to audio. The second is speech-to-speech: a single model that processes audio input and generates audio output directly, without text as an intermediate representation.
Both architectures are in active use. The cascade dominates production deployments in 2026. Speech-to-speech exists at the research and prototype stage. Understanding the tradeoffs between the two is relevant for any developer choosing an architecture for a voice agent today, or planning for the next generation of systems. This article covers how each architecture works, the specific advantages and limitations of each, and why the cascade is the current production standard despite the theoretical advantages of speech-to-speech.
How Does the Cascade Architecture Work?
In a cascaded voice agent, the conversation pipeline runs through three separate models in sequence:
- STT (Speech-To-Text): the user's audio is transcribed into text in real time. The STT model outputs a text stream as the user speaks.
- LLM (Large Language Model): the transcribed text is passed to a language model, which generates a text response. The LLM can be prompted, given tool access, and instructed just like any text-based AI application.
- TTS (Text-To-Speech): the LLM's text output is converted to audio and streamed back to the user.
Each model runs independently. The STT model does not know what the LLM will say. The LLM does not process audio. The TTS model receives text only.
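The three-stage flow above can be sketched as three functions called in order. This is a minimal illustration, not any vendor's API: the type aliases, `run_turn`, and the `fake_*` stubs are hypothetical names, and the stubs treat bytes as text only so the example runs without an external service.

```python
from typing import Callable

# Hypothetical stage signatures; real STT/LLM/TTS clients would replace the stubs below.
Transcribe = Callable[[bytes], str]
Generate = Callable[[str], str]
Synthesize = Callable[[str], bytes]


def run_turn(audio_in: bytes, stt: Transcribe, llm: Generate, tts: Synthesize) -> bytes:
    """Run one conversational turn through the three stages in order."""
    transcript = stt(audio_in)    # audio -> text
    reply_text = llm(transcript)  # text  -> text
    return tts(reply_text)        # text  -> audio


# Stub stages so the sketch runs on its own.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the bytes are already the spoken words


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")   # pretend the encoded text is synthesized audio


reply_audio = run_turn(b"hello", fake_stt, fake_llm, fake_tts)
```

Each stage sees only its own input type, which is the independence described above: the LLM stub never touches audio, and the TTS stub never sees the original recording.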
Neil Zeghidour, CEO of Gradium, describes the core appeal of this architecture: it lets you "plug in the LLM and customize its behavior through its prompt, like you would do with a text model." Any prompt engineering, tool integration, or RAG setup that works with a text LLM works directly in a cascade voice agent without modification.
What Are the Advantages of the Cascade Architecture?
Modularity and LLM Flexibility
The most significant advantage of cascade is modularity. Each of the three components can be updated or replaced independently. Switching from GPT-4 to Claude, upgrading the STT model, or changing the TTS voice does not require touching the other components. In 2026, LLM development is moving fast: new base models and fine-tuned variants are released continuously. The cascade architecture allows developers to stay current with LLM improvements without re-training any speech model.
This is the primary reason cascade dominates production today. As Neil Zeghidour states: "At the moment what we do is cascaded systems because that's where the market is right now. People are still iterating a lot on the underlying text models that they want to use, on tool use and so on, and there is so much progress on the text side."
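One way to picture this modularity is an agent that depends only on a small text-in, text-out interface. The sketch below is an assumption-laden illustration: `TextModel`, `EchoModel`, and `VoiceAgent` are hypothetical names, and the model names are placeholders rather than real products.

```python
from dataclasses import dataclass
from typing import Protocol


class TextModel(Protocol):
    """Anything that turns a system prompt plus user text into reply text."""

    def complete(self, system_prompt: str, user_text: str) -> str: ...


@dataclass
class EchoModel:
    """Stand-in for a real LLM client; it only tags the text with its name."""

    name: str

    def complete(self, system_prompt: str, user_text: str) -> str:
        return f"[{self.name}] {user_text}"


@dataclass
class VoiceAgent:
    llm: TextModel
    system_prompt: str

    def respond_to_transcript(self, transcript: str) -> str:
        return self.llm.complete(self.system_prompt, transcript)


agent = VoiceAgent(llm=EchoModel("model-a"), system_prompt="Be brief.")
first = agent.respond_to_transcript("book a table")

# Swap the language model; the STT and TTS stages would be left untouched.
agent.llm = EchoModel("model-b")
second = agent.respond_to_transcript("book a table")
```

The swap is a one-line change because the speech components never depend on which text model sits in the middle.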
Prompt Customization
Because the LLM receives text and outputs text, every existing technique for LLM customization applies directly: system prompts, few-shot examples, function calling, retrieval-augmented generation, fine-tuning on domain-specific data. The voice interface is a wrapper around a standard LLM interaction. This makes cascade accessible to any team with LLM experience, without requiring expertise in speech model training.
Tool Use and Agent Behavior
Cascade voice agents can integrate with any tool or API that a text-based LLM agent can use: databases, calendars, CRM systems, knowledge bases. The text bottleneck is also an integration point: the LLM's text output can be parsed, logged, and processed by downstream systems using standard text tooling.
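As a rough illustration of the text bottleneck acting as an integration point, the sketch below logs the LLM's text output and dispatches it to a tool when it parses as a tool call. The JSON shape, the `TOOLS` registry, and `check_calendar` are all hypothetical; real function-calling formats differ by provider.

```python
import json
from typing import Callable

# Hypothetical tool; a real one would query a calendar service.
def check_calendar(date: str) -> str:
    return f"No meetings on {date}"


TOOLS: dict[str, Callable[..., str]] = {"check_calendar": check_calendar}


def handle_llm_output(llm_text: str, log: list[str]) -> str:
    """Log the LLM text, run a tool if it is a tool call, else pass it through."""
    log.append(llm_text)  # plain text is easy to log and audit
    try:
        call = json.loads(llm_text)
    except json.JSONDecodeError:
        return llm_text   # an ordinary spoken reply
    if not isinstance(call, dict) or call.get("tool") not in TOOLS:
        return llm_text
    return TOOLS[call["tool"]](**call.get("args", {}))


log: list[str] = []
tool_reply = handle_llm_output(
    '{"tool": "check_calendar", "args": {"date": "2026-05-01"}}', log
)
plain_reply = handle_llm_output("Hello there", log)
```

Whatever the function returns is still text, so it can go straight to the TTS stage.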
Streaming Compatibility
Cascade architecture is compatible with streaming at each stage. Modern STT APIs return transcription tokens in real time as the user speaks. Modern TTS APIs synthesize and stream audio as the LLM generates tokens. This allows the pipeline to minimize time-to-first-audio even though three models are running in sequence.
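A minimal sketch of this streaming handoff, using Python generators in place of real APIs: tokens are grouped into sentences and each sentence is handed to synthesis as soon as it is complete. The function names are hypothetical, the token list is made up, and sentence-level chunking is one possible choice, not a statement of how any particular TTS API buffers text.

```python
from typing import Iterable, Iterator


def llm_tokens() -> Iterator[str]:
    """Stand-in for a streaming LLM: yields tokens one at a time."""
    yield from ["Sure", ".", " Your", " order", " shipped", " today", "."]


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Emit a chunk as soon as the buffered text ends a sentence."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing text without punctuation


def tts_stream(chunks: Iterable[str]) -> Iterator[bytes]:
    """Stand-in for streaming TTS: 'synthesizes' each chunk as it arrives."""
    for chunk in chunks:
        yield chunk.encode("utf-8")


# The first audio chunk is available after only two LLM tokens.
first_audio = next(tts_stream(sentence_chunks(llm_tokens())))
all_audio = list(tts_stream(sentence_chunks(llm_tokens())))
```

Because every stage is lazy, the first audio chunk is produced before the LLM stand-in has finished its second sentence.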
What Are the Limitations of the Cascade Architecture?
Cumulative Latency
The cascade runs three models sequentially. Each adds latency to the pipeline. The STT model must transcribe enough of the user's speech before the LLM can start generating. The LLM must generate enough tokens before TTS can start synthesizing. Streaming lets the stages overlap, but it cannot remove each stage's minimum delay before its first output, so the end-to-end latency of the cascade is at least the sum of those three latency floors.
The Coval STT benchmark (May 2026) measures the STT component alone: median TTFT ranges from 992 ms (Deepgram Nova 3) to 2,080 ms (ElevenLabs Scribe v2) depending on the model. LLM generation and TTS synthesis add further latency on top. In production, the total pipeline latency of a cascade voice agent is typically 1.5 to 3 seconds from the end of user speech to the start of agent audio, depending on model choices and infrastructure.
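A back-of-the-envelope budget shows how the stages add up. Only the 992 ms STT figure comes from the benchmark cited above; the LLM and TTS numbers are assumptions chosen for illustration, and the sketch treats each figure as that stage's delay before its first output.

```python
def time_to_first_audio_ms(stage_latencies_ms: dict[str, int]) -> int:
    """Lower bound on pipeline latency: the sum of each stage's first-output delay."""
    return sum(stage_latencies_ms.values())


budget = {
    "stt": 992,              # lowest median TTFT cited above (Coval benchmark)
    "llm_first_token": 400,  # assumed, for illustration only
    "tts_first_audio": 200,  # assumed, for illustration only
}

total_ms = time_to_first_audio_ms(budget)
stt_share = budget["stt"] / total_ms
```

With these assumed values the total is 1,592 ms, at the low end of the 1.5 to 3 second range, and STT accounts for more than half of it.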
Loss of Paralinguistic Information
Text is a lossy representation of speech. When audio is converted to text, everything the voice carries beyond the words themselves is discarded. This is paralinguistic information: the speaker's emotional tone, whether they sound confident or uncertain, frustrated or calm, sincere or ironic.
Neil Zeghidour explains this directly: "By going through the bottleneck of text you lose what we call paralinguistic information, which is all the information we convey when we speak on top of what we say. Emotional states, irony... a lot of information is conveyed that is not in what to say."
For customer service voice agents, this means the agent has no access to the user's emotional state from the audio signal. It cannot detect that the user is getting frustrated, confused, or upset unless that state is expressed in the words themselves. In many cases, emotional information is in the voice, not the words.
Turn-Taking Constraints
The cascade architecture relies on Voice Activity Detection (VAD) to determine when the user has finished speaking. As covered in Turn-Taking in Voice Agents, rule-based VAD produces a system where the user must adapt to the agent rather than the agent adapting to the user. This is a structural limitation of the cascade: because the LLM processes text and responds in turns, the conversation must be structured as alternating turns, enforced by a silence-detection algorithm.
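A silence-based end-of-turn rule can be sketched in a few lines. This is a toy version under stated assumptions: the per-frame energies, the threshold, and the `end_of_turn_frame` name are all hypothetical, and production VADs are more sophisticated than an energy cutoff.

```python
from typing import Optional, Sequence


def end_of_turn_frame(
    frame_energies: Sequence[float],
    energy_threshold: float = 0.1,
    silence_frames_needed: int = 3,
) -> Optional[int]:
    """Return the frame index at which the turn is declared over, or None."""
    silent_run = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0          # speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i            # enough consecutive silence: end of turn
    return None


# Speech, a short pause, more speech, then a long pause.
energies = [0.0, 0.5, 0.6, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0]
patient = end_of_turn_frame(energies)                             # waits for the long pause
impatient = end_of_turn_frame(energies, silence_frames_needed=2)  # cuts in at the short pause
```

The second call shows the tradeoff: a shorter silence window responds sooner but ends the turn during a mid-sentence pause, so the user has to adapt their pacing to the rule.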
How Does Speech-to-Speech Architecture Work?
In a speech-to-speech model, a single neural network processes audio input and generates audio output directly. There is no STT step, no intermediate text representation, and no TTS step. The model has been trained end-to-end on audio data and learns to map audio inputs to audio outputs.
The architectural research foundation for modern speech-to-speech models is the audio language model (ALM) framework, developed originally through work at Google Brain and later at Kyutai. The core idea is to compress audio into discrete tokens using a neural codec, then train a language model to predict the next audio token, the same approach that works for text in LLMs. The audio codec (SoundStream, developed by Neil Zeghidour's team at Google) compresses audio efficiently enough that the token sequence becomes manageable for a language model.
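The two-step idea (turn audio into discrete tokens, then predict the next token) can be shown with a deliberately tiny toy. Everything here is a stand-in: the fixed four-entry codebook replaces a learned neural codec, and the bigram counter replaces a neural language model. None of it reflects how SoundStream or Moshi are actually implemented.

```python
from collections import Counter, defaultdict
from typing import Optional

CODEBOOK = [-0.75, -0.25, 0.25, 0.75]  # toy codebook; a neural codec learns its own


def encode(samples: list[float]) -> list[int]:
    """Map each sample to the index of its nearest codebook entry (a discrete token)."""
    return [
        min(range(len(CODEBOOK)), key=lambda k: abs(CODEBOOK[k] - s))
        for s in samples
    ]


def fit_bigram(tokens: list[int]) -> dict[int, Counter]:
    """Count which token follows which: the simplest possible next-token model."""
    counts: dict[int, Counter] = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts: dict[int, Counter], token: int) -> Optional[int]:
    """Return the most frequent successor of `token`, or None if it was never seen."""
    followers = counts.get(token)
    return followers.most_common(1)[0][0] if followers else None


samples = [-0.8, -0.2, 0.3, 0.7, -0.8, -0.2, 0.3, 0.7]  # a repeating toy waveform
tokens = encode(samples)
model = fit_bigram(tokens)
next_after_2 = predict_next(model, 2)
```

Once audio is a token sequence, the prediction step has the same shape as next-token prediction over text, which is the point the paragraph above makes.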
Kyutai's Moshi model, released as open source in 2024, is the first fully open full duplex speech-to-speech conversational model. It processes two simultaneous audio token streams (one for the user, one for the model) and requires no turn-taking mechanism because both streams can be active at the same time.
What Are the Advantages of Speech-to-Speech Architecture?
No Cumulative Pipeline Latency
Because no separate STT, LLM, and TTS stages run in sequence, the latency model is fundamentally different. The speech-to-speech model processes audio and generates audio as a continuous stream. The time from user input to model output is a single model's inference time, not the combined latency of three models.
Paralinguistic Information Preserved
A speech-to-speech model processes audio directly. It has access to the full audio signal including tone, emotion, pace, and all paralinguistic cues. It can modulate its own output voice to match the emotional context of the conversation. This enables a level of conversational naturalness that a text-based intermediate representation cannot achieve.
Full Duplex Conversation
Speech-to-speech models trained with a multistream architecture (like Moshi's two-channel approach) do not require turn-taking because both parties' audio streams are modeled simultaneously. The model can respond at any moment, speak while the user speaks, and handle overlapping speech naturally. The user does not need to adapt their cadence or pace to the system.
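The two-stream idea can be illustrated with a toy loop in which both channels advance one frame at a time and the model emits something (possibly silence) on every frame. The policy, the `SILENCE` marker, and the use of words instead of audio tokens are simplifications for readability; this is not Moshi's architecture.

```python
SILENCE = "_"


def toy_duplex_model(user_history: list[str], model_history: list[str]) -> str:
    """Hypothetical policy: backchannel right after hearing 'and', else stay quiet."""
    if user_history and user_history[-1] == "and":
        return "mm-hm"
    return SILENCE


def run_duplex(user_stream: list[str]) -> list[tuple[str, str]]:
    """Advance both streams frame by frame; no end-of-turn decision is ever made."""
    user_hist: list[str] = []
    model_hist: list[str] = []
    for user_token in user_stream:
        # The model's output for this frame depends on everything heard so far.
        model_token = toy_duplex_model(user_hist, model_hist)
        user_hist.append(user_token)
        model_hist.append(model_token)
    return list(zip(user_hist, model_hist))


frames = run_duplex(["I", "went", "and", "then", "left"])
```

In the fourth frame the user says "then" while the model says "mm-hm": the two streams overlap, and nothing in the loop waits for silence before responding.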
What Are the Limitations of Speech-to-Speech Architecture?
LLM Lock-In
The core limitation of speech-to-speech models in production is the lack of modularity. Because the speech model is trained end-to-end, the LLM component is not a plug-in: it is baked into the model weights. Updating the underlying language model requires re-training or fine-tuning the entire speech model on speech data. Neil Zeghidour explains: "One drawback of speech-to-speech models is that since everything is integrated, when you go from a text model to the speech-to-speech model you need to fine-tune it on speech data. So the cost to switch the underlying text model is extremely high because you will need to re-fine-tune everything from scratch. People want something that is modular, plug and play."
In 2026, when significant new LLM releases happen multiple times per year, this is a major operational constraint. A production voice agent built on a speech-to-speech model cannot easily switch to a better LLM when one becomes available.
Prompt Engineering Limitations
Because there is no explicit text prompt passed to a language model, the standard toolkit of prompt engineering (system prompts, few-shot examples, RAG, function calling) does not apply directly. Controlling the behavior of a speech-to-speech model requires different techniques, and the ecosystem of tools and practices for text LLM customization does not transfer.
Limited Production Availability
As of 2026, no speech-to-speech model is widely available as a production API. Moshi is open source and accessible but was designed as a research prototype. The production tooling, infrastructure, uptime guarantees, and enterprise support that exist for cascade components (STT APIs, LLM APIs, TTS APIs) do not yet exist for end-to-end speech-to-speech systems.
How Do Cascade and Speech-to-Speech Compare in One Table?
| Dimension | Cascade (STT + LLM + TTS) | Speech-to-Speech |
|---|---|---|
| Architecture | 3 independent models in sequence | 1 integrated model |
| Pipeline latency | Sum of 3 model latencies | Single model inference |
| LLM flexibility | Swap LLM without retraining | LLM change requires retraining |
| Prompt engineering | Full LLM prompt toolkit applies | Not directly applicable |
| Tool use / function calling | Supported via LLM | Not standard |
| Paralinguistic information | Lost at STT step | Preserved throughout |
| Turn-taking | Requires VAD rules | Full duplex possible |
| Production availability | Widely available (multiple providers) | Research/prototype stage |
| Customization | Via LLM prompt and fine-tuning | Via end-to-end training |
| Current Gradium offering | Streaming STT + streaming TTS | Research (Moshi/Kyutai) |
Where Does Gradium Stand Today?
Gradium provides cascade components: a streaming Speech-To-Text API and a streaming Text-To-Speech API, designed to work together in a single pipeline with shared billing and a single WebSocket architecture. The STT includes semantic VAD, which reduces the most common turn-taking failures of rule-based VAD by using linguistic context to determine end-of-turn rather than silence alone.
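To make the idea of using linguistic context concrete, here is a deliberately simple heuristic. It is not Gradium's implementation, which the article does not describe: the word list, the thresholds, and both function names are assumptions, and a real semantic VAD would use a learned model rather than a lookup.

```python
# Hypothetical list of words that suggest the speaker is not finished.
DANGLING_WORDS = {"and", "but", "so", "because", "um", "uh", "the", "to"}


def looks_complete(transcript: str) -> bool:
    """Crude linguistic check: the utterance does not end on a dangling word."""
    words = transcript.lower().rstrip(".?!").split()
    return bool(words) and words[-1] not in DANGLING_WORDS


def is_end_of_turn(
    transcript: str,
    silence_ms: int,
    short_pause_ms: int = 300,
    long_pause_ms: int = 1500,
) -> bool:
    """End the turn on a short pause only if the words also sound finished."""
    if silence_ms >= long_pause_ms:
        return True  # a long silence ends the turn regardless of wording
    return silence_ms >= short_pause_ms and looks_complete(transcript)


mid_sentence = is_end_of_turn("I want to book a flight and", silence_ms=500)
finished = is_end_of_turn("I want to book a flight", silence_ms=500)
```

The same 500 ms pause gives two different answers depending on the words, which is the behavior silence alone cannot provide.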
The speech-to-speech research is active at Kyutai, the open research lab co-founded by Neil Zeghidour (CEO of Gradium) and Alexandre Défossez (Chief Science Officer of Gradium). Moshi, released as open source, is the reference implementation of full duplex speech-to-speech.
The path Neil Zeghidour describes is a convergence: "What will solve everything is providing the same flexibility as cascaded systems so that you can change the backend on demand, and you get the same customizability that you have with text models but with the full duplex." The goal is a system that combines the modularity of cascade (swap the LLM, use standard prompt engineering) with the conversational quality of speech-to-speech (no turn-taking, paralinguistic information preserved).
Until that convergence is achieved, cascade with semantic VAD is the right production architecture.
Which Architecture Is Right for Your Voice Agent?
The choice between cascade and speech-to-speech in 2026 is largely determined by availability and operational requirements, not by preference.
Use cascade when:
- You need a production-ready system with defined uptime, SLAs, and vendor support
- You are iterating on the LLM layer and need to swap models without retraining speech components
- Your use case requires tool use, function calling, RAG, or structured LLM output
- Your team has LLM engineering experience and needs to apply it directly
- You need multilingual support with defined quality levels per language
Speech-to-speech may be relevant when:
- You are prototyping or researching next-generation conversation dynamics
- The specific quality of natural turn-taking and paralinguistic awareness is a hard requirement
- You have the capacity to train and maintain a custom end-to-end model
- The LLM layer is stable and does not need frequent updates
For most production voice agents in 2026, cascade is the only viable option. The components that make cascade work well (low-latency STT, expressive TTS, semantic VAD for turn-taking) are where providers differentiate themselves. For a closer look at TTS for voice agents, see Best Text-To-Speech API for Voice Agents.
How Should You Frame the Architecture Choice?
Cascade and speech-to-speech represent two different approaches to the same problem: building a voice agent that can understand speech and respond naturally. Cascade wins on modularity, LLM flexibility, production availability, and ecosystem maturity. Speech-to-speech wins on conversational naturalness, paralinguistic awareness, and full duplex turn-taking.
In 2026, the production choice is cascade. The speech-to-speech advantages are real but not yet accessible in a production-grade, operationally viable form. The active research question, articulated by Gradium's founding team through their work on Moshi and through the Gradium API, is how to bring the modularity of cascade to a full duplex architecture: the ability to swap the LLM, use standard prompt engineering, and maintain uptime SLAs, while removing the turn-based constraint and preserving paralinguistic information.
For developers building voice agents today, understanding both architectures matters: cascade is the implementation choice now, and speech-to-speech defines the direction of the next generation.