What is a cascaded voice agent?

A cascaded voice agent connects three independent models in sequence: a Speech-To-Text model (STT) that transcribes the user's audio, a large language model (LLM) that generates a text response, and a Text-To-Speech model (TTS) that converts that response to audio. Each model runs independently and can be updated or replaced without affecting the others. The cascade is the dominant architecture for production voice agents in 2026.

What is a speech-to-speech voice agent?

A speech-to-speech voice agent uses a single integrated model that processes audio input and generates audio output directly, without converting to text as an intermediate step. The model is trained end-to-end on audio data. Moshi, built by Kyutai (the research lab of Gradium's founding team) and released as open source, is the first fully open full duplex speech-to-speech conversational model.

What is paralinguistic information and why does cascade lose it?

Paralinguistic information is everything communicated through speech beyond the words themselves: emotional tone, confidence, uncertainty, frustration, irony, pacing. When audio is converted to text in a cascade's STT step, this information is discarded. The LLM only sees the words. A speech-to-speech model processes the raw audio and can access and respond to these cues throughout the conversation. Neil Zeghidour identifies this as a core structural limitation of the cascade.

Why is cascade more flexible for LLM iteration than speech-to-speech?

In a cascade, the LLM is a standalone component. Switching to a new LLM requires only updating that component. In a speech-to-speech model, the language model's capabilities are baked into the end-to-end trained weights. Switching the underlying language model requires fine-tuning the entire speech model again on speech data, which is expensive and time-consuming. In a period of rapid LLM development, this makes speech-to-speech difficult to maintain in production.

Why does cascade add more latency than speech-to-speech?

In a cascade, each model adds latency to the pipeline. The STT model must receive and process enough audio before the LLM can start generating. The LLM must generate enough tokens before TTS can start synthesizing. Even with streaming at each stage, the total pipeline latency cannot drop below the sum of each model's minimum contribution. In a speech-to-speech model, there is no sequential pipeline: a single model processes input and generates output in one pass.

What is the total latency of a cascade voice agent in practice?

Based on the Coval STT benchmark (May 2026), the STT step alone contributes between 992 ms (Deepgram Nova 3) and 2,080 ms (ElevenLabs Scribe v2) at median TTFT. LLM first-token latency and TTS time to first audio (Gradium records 155 ms TTFA P50 in the Coval TTS benchmark) add on top. Total pipeline latency depends on model choices and infrastructure configuration across all three components, and typically sits between 1.5 and 3 seconds in production.

Does Gradium offer a speech-to-speech API?

Not today. Gradium's production offering is cascade components: a streaming Speech-To-Text API and a streaming Text-To-Speech API designed to work together. The speech-to-speech research is being developed at Kyutai, the open research lab co-founded by Gradium's CEO and Chief Science Officer.

What is semantic VAD and how does it improve cascade turn-taking?

Semantic VAD (Voice Activity Detection) uses linguistic context to determine when the user has finished their turn, rather than relying on silence duration thresholds alone. Standard VAD fails when users pause mid-thought: it either interrupts them or waits too long. Semantic VAD can recognize that a sentence is semantically incomplete and wait for the user to finish. Gradium's streaming STT API includes semantic VAD, which reduces the most disruptive category of turn-taking failures in cascade systems.

Can a cascade voice agent achieve sub-500 ms end-to-end latency?

In principle yes, but it requires careful component selection, and the benchmark figures cited here do not get there on their own. With Deepgram Nova 3 STT at 992 ms median TTFT and Gradium TTS at 155 ms TTFA, the combined STT plus TTS contribution is already around 1.1 seconds before LLM latency is added. Sub-500 ms end-to-end is achievable only with streaming optimization across all three components and aggressive LLM token streaming. In practice, most production cascade voice agents sit between 1.5 and 3 seconds end-to-end.

Which voice agent architecture is right for my use case in 2026?

For production voice agents requiring vendor support, SLAs, LLM flexibility, and tool use, cascade is the right choice. The components (STT, LLM, TTS) are independently swappable and the architecture supports the full toolkit of LLM customization. For research, prototypes, or use cases where paralinguistic awareness and full duplex turn-taking are hard requirements, speech-to-speech is relevant, with the tradeoff that no production-grade API yet exists.

Gradium was co-founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, who previously co-founded Kyutai. Kyutai released world-first open systems including Moshi (real-time speech-to-speech) and Hibiki (live speech-to-speech translation).

Cascaded Voice Agents vs Speech-to-Speech: Architecture Tradeoffs in 2026

Two architectures exist for building a voice agent today. The first is the cascade: a pipeline of three independent models (STT, LLM, TTS) that converts audio to text, generates a response, and converts text back to audio. The second is speech-to-speech: a single model that processes audio input and generates audio output directly, without text as an intermediate representation.

Both architectures are in active use. The cascade dominates production deployments in 2026. Speech-to-speech exists at the research and prototype stage. Understanding the tradeoffs between the two is relevant for any developer choosing an architecture for a voice agent today, or planning for the next generation of systems. This article covers how each architecture works, the specific advantages and limitations of each, and why the cascade is the current production standard despite the theoretical advantages of speech-to-speech.

How Does the Cascade Architecture Work?

In a cascaded voice agent, the conversation pipeline runs through three separate models in sequence:

STT (Speech-To-Text): the user's audio is transcribed into text in real time. The STT model outputs a text stream as the user speaks.
LLM (Large Language Model): the transcribed text is passed to a language model, which generates a text response. The LLM can be prompted, given tool access, and instructed just like any text-based AI application.
TTS (Text-To-Speech): the LLM's text output is converted to audio and streamed back to the user.

Each model runs independently. The STT model does not know what the LLM will say. The LLM does not process audio. The TTS model receives text only.
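The three-stage flow above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's SDK: the `SpeechToText`, `LanguageModel`, and `TextToSpeech` interfaces, the `CascadeAgent` class, and the stand-in stages are all hypothetical names invented for this example, and the "audio" is just UTF-8 bytes so the sketch runs without a real model.

```python
from dataclasses import dataclass, replace
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def respond(self, transcript: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class CascadeAgent:
    """Hypothetical cascade: each stage only sees the previous stage's output."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, audio: bytes) -> bytes:
        transcript = self.stt.transcribe(audio)  # audio -> text
        reply = self.llm.respond(transcript)     # text -> text
        return self.tts.synthesize(reply)        # text -> audio


# Stand-in stages so the sketch runs without real models.
class FakeSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class EchoLLM:
    def respond(self, transcript: str) -> str:
        return f"You said: {transcript}"


class ShoutingLLM:
    def respond(self, transcript: str) -> str:
        return transcript.upper() + "!"


class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


agent = CascadeAgent(stt=FakeSTT(), llm=EchoLLM(), tts=FakeTTS())
first = agent.handle_turn(b"hello")

# Swapping the LLM leaves the STT and TTS stages untouched.
shouting_agent = replace(agent, llm=ShoutingLLM())
second = shouting_agent.handle_turn(b"hello")
```

Because each stage is only an interface, replacing one component is a one-line change, which is the property the rest of this article refers to as modularity.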

Neil Zeghidour, CEO of Gradium, describes the core appeal of this architecture: it lets you "plug in the LLM and customize its behavior through its prompt, like you would do with a text model." Any prompt engineering, tool integration, or RAG setup that works with a text LLM works directly in a cascade voice agent without modification.

What Are the Advantages of the Cascade Architecture?

Modularity and LLM Flexibility

The most significant advantage of cascade is modularity. Each of the three components can be updated or replaced independently. Switching from GPT-4 to Claude, upgrading the STT model, or changing the TTS voice does not require touching the other components. In 2026, LLM development is moving fast: new base models and fine-tuned variants are released continuously. The cascade architecture allows developers to stay current with LLM improvements without re-training any speech model.

This is the primary reason cascade dominates production today. As Neil Zeghidour states: "At the moment what we do is cascaded systems because that's where the market is right now. People are still iterating a lot on the underlying text models that they want to use, on tool use and so on, and there is so much progress on the text side."

Prompt Customization

Because the LLM receives text and outputs text, every existing technique for LLM customization applies directly: system prompts, few-shot examples, function calling, retrieval-augmented generation, fine-tuning on domain-specific data. The voice interface is a wrapper around a standard LLM interaction. This makes cascade accessible to any team with LLM experience, without requiring expertise in speech model training.

Tool Use and Agent Behavior

Cascade voice agents can integrate with any tool or API that a text-based LLM agent can use: databases, calendars, CRM systems, knowledge bases. The text bottleneck is also an integration point: the LLM's text output can be parsed, logged, and processed by downstream systems using standard text tooling.
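As a rough sketch of the text layer acting as an integration point, the snippet below inspects the LLM's text output before it reaches TTS and runs a tool if the output is a tool call. The JSON call format, the `lookup_order` tool, and the `dispatch` helper are assumptions made for this example; real LLM APIs each define their own function-calling format.

```python
import json
from typing import Callable, Dict


def lookup_order(order_id: str) -> str:
    """Hypothetical tool; a real one would query a database or CRM."""
    orders = {"A17": "shipped"}
    return orders.get(order_id, "unknown")


# Registry of tools the agent is allowed to call (illustrative only).
TOOLS: Dict[str, Callable[..., str]] = {"lookup_order": lookup_order}


def dispatch(llm_output: str) -> str:
    """Run a tool if the LLM emitted a JSON tool call; otherwise pass the text through."""
    try:
        call = json.loads(llm_output)
    except json.JSONDecodeError:
        return llm_output  # plain text, ready for TTS
    if not isinstance(call, dict) or call.get("tool") not in TOOLS:
        return llm_output
    return TOOLS[call["tool"]](**call.get("args", {}))


tool_result = dispatch('{"tool": "lookup_order", "args": {"order_id": "A17"}}')
plain_result = dispatch("Your order is on its way.")
```

The same hook is a natural place to log or audit every turn, since everything passing through it is ordinary text.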

Streaming Compatibility

Cascade architecture is compatible with streaming at each stage. Modern STT APIs return transcription tokens in real time as the user speaks. Modern TTS APIs synthesize and stream audio as the LLM generates tokens. This allows the pipeline to minimize time-to-first-audio even though three models are running in sequence.
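One common way to overlap LLM generation with TTS synthesis is to forward text to the synthesizer as soon as a sentence boundary appears, rather than waiting for the full reply. The sketch below shows that idea with plain Python generators. The token list and the sentence-boundary rule are simplifications chosen for this example, and a real pipeline would use the streaming interfaces of its actual LLM and TTS providers.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield a chunk as soon as the buffered tokens end a sentence."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence


def fake_llm_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM response (illustrative tokens)."""
    yield from ["Sure", ".", " Your", " order", " shipped", " today", ".",
                " Anything", " else", "?"]


# Each chunk could be handed to TTS while the LLM is still generating.
chunks = list(sentences_from_tokens(fake_llm_stream()))
```

With this input the first chunk, "Sure.", is available after two tokens, so synthesis can begin long before the reply is complete.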

What Are the Limitations of the Cascade Architecture?

Cumulative Latency

The cascade runs three models sequentially. Each adds latency to the pipeline. The STT model must transcribe enough of the user's speech before the LLM can start generating. The LLM must generate enough tokens before TTS can start synthesizing. Even with streaming optimization at each stage, the latency of the full cascade is the sum of the latency floors of each component.

The Coval STT benchmark (May 2026) measures the STT component alone: median TTFT ranges from 992 ms (Deepgram Nova 3) to 2,080 ms (ElevenLabs Scribe v2) depending on the model. LLM generation and TTS synthesis add additional latency on top. The practical total pipeline latency for a cascade voice agent in production is typically 1.5 to 3 seconds from end of user speech to start of agent audio, depending on model choices and infrastructure.
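A back-of-the-envelope budget makes the arithmetic concrete. The STT and TTS figures below are the benchmark numbers quoted in this article; the 400 ms LLM first-token latency is an assumed placeholder, not a measured value, and the simple additive model ignores network hops and any overlap between stages.

```python
def time_to_first_audio_ms(stt_ms: float, llm_first_token_ms: float,
                           tts_first_audio_ms: float) -> float:
    """Additive latency model: each stage waits for the previous one to start."""
    return stt_ms + llm_first_token_ms + tts_first_audio_ms


STT_MS = 992  # from the article: Deepgram Nova 3 median TTFT (Coval STT benchmark)
TTS_MS = 155  # from the article: Gradium TTFA P50 (Coval TTS benchmark)
LLM_MS = 400  # ASSUMPTION: placeholder LLM first-token latency

speech_stages_ms = STT_MS + TTS_MS
total_ms = time_to_first_audio_ms(STT_MS, LLM_MS, TTS_MS)
print(f"STT + TTS alone: {speech_stages_ms} ms, with LLM: {total_ms} ms")
```

Under these assumptions the total comes to 1,547 ms, at the lower end of the 1.5 to 3 second range, and the two speech stages alone account for 1,147 ms of it.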

Loss of Paralinguistic Information

Text is a lossy representation of speech. When audio is converted to text, all information carried in the voice beyond the words themselves is discarded. This information is called paralinguistic information: the emotional tone of the speaker, whether they sound confident or uncertain, whether they are frustrated or calm, whether they are being ironic.

Neil Zeghidour explains this directly: "By going through the bottleneck of text you lose what we call paralinguistic information, which is all the information we convey when we speak on top of what we say. Emotional states, irony... a lot of information is conveyed that is not in what to say."

For customer service voice agents, this means the agent has no access to the user's emotional state from the audio signal. It cannot detect that the user is getting frustrated, confused, or upset unless that state is expressed in the words themselves. In many cases, emotional information is in the voice, not the words.

Turn-Taking Constraints

The cascade architecture relies on Voice Activity Detection (VAD) to determine when the user has finished speaking. As covered in Turn-Taking in Voice Agents, rule-based VAD produces a system where the user must adapt to the agent rather than the agent adapting to the user. This is a structural limitation of the cascade: because the LLM processes text and responds in turns, the conversation must be structured as alternating turns, enforced by a silence-detection algorithm.
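To see why a silence rule forces the user to adapt, consider a minimal silence-threshold endpointer. The frame energies, the threshold, and the function name are made up for this sketch; production VAD operates on real audio features, but the tradeoff is the same.

```python
from typing import Optional, Sequence


def end_of_turn_frame(energies: Sequence[float], threshold: float,
                      min_silent_frames: int) -> Optional[int]:
    """Return the frame index where the turn is declared over, or None."""
    silent_run = 0
    for i, energy in enumerate(energies):
        silent_run = silent_run + 1 if energy < threshold else 0
        if silent_run >= min_silent_frames:
            return i
    return None


# Speech, a three-frame pause mid-thought, more speech, then final silence.
energies = [0.8, 0.9, 0.7, 0.05, 0.04, 0.03, 0.8, 0.9, 0.02, 0.01, 0.02, 0.01]

eager = end_of_turn_frame(energies, threshold=0.1, min_silent_frames=3)
patient = end_of_turn_frame(energies, threshold=0.1, min_silent_frames=4)
```

The eager setting ends the turn at frame 5, in the middle of the user's pause, while the patient setting waits until frame 11 and so adds a frame of delay to every turn. No single threshold avoids both failure modes.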

How Does Speech-to-Speech Architecture Work?

In a speech-to-speech model, a single neural network processes audio input and generates audio output directly. There is no STT step, no intermediate text representation, and no TTS step. The model has been trained end-to-end on audio data and learns to map audio inputs to audio outputs.

The research foundation for modern speech-to-speech models is the audio language model (ALM) framework, developed originally at Google Brain and later at Kyutai. The core idea is to compress audio into discrete tokens using a neural codec, then train a language model to predict the next audio token, which is the same approach that works for text in LLMs. The audio codec (SoundStream, developed by Neil Zeghidour's team at Google) compresses audio efficiently enough that the token sequence becomes manageable for a language model.
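The tokenization idea can be illustrated with a toy vector quantizer. This is not SoundStream, which is a learned neural codec. Here the codebook is hand-written, the "frames" are two-dimensional points, and all names are hypothetical. The sketch only shows how continuous frames become discrete tokens and how those tokens form next-token prediction pairs.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector."""
    def squared_distance(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def next_token_pairs(tokens: Sequence[int]) -> List[Tuple[int, int]]:
    """(current, next) pairs: the targets a language model would learn to predict."""
    return list(zip(tokens[:-1], tokens[1:]))


# Toy codebook and frames; a real codec learns these from audio.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.95, 0.9)]

tokens = quantize(frames, codebook)
pairs = next_token_pairs(tokens)
```

Once audio is a sequence of integers like `tokens`, the same next-token training objective used for text can be applied to it.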

Kyutai's Moshi model, released as open source in 2024, is the first fully open full duplex speech-to-speech conversational model. It processes two simultaneous audio token streams (one for the user, one for the model) and requires no turn-taking mechanism because both streams can be active at the same time.

What Are the Advantages of Speech-to-Speech Architecture?

No Cumulative Pipeline Latency

Because no separate STT, LLM, and TTS stages run sequentially, the latency model is fundamentally different. The speech-to-speech model processes audio and generates audio as a continuous stream. The time from user input to model output is a single model's inference time, not three models' combined latency.

Paralinguistic Information Preserved

A speech-to-speech model processes audio directly. It has access to the full audio signal including tone, emotion, pace, and all paralinguistic cues. It can modulate its own output voice to match the emotional context of the conversation. This enables a level of conversational naturalness that a text-based intermediate representation cannot achieve.

Full Duplex Conversation

Speech-to-speech models trained with a multistream architecture (like Moshi's two-channel approach) do not require turn-taking because both parties' audio streams are modeled simultaneously. The model can respond at any moment, speak while the user speaks, and handle overlapping speech naturally. The user does not need to adapt their cadence or pace to the system.
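A simplified picture of a two-stream conversation: every time step carries one slot for the user and one for the agent, so nothing in the representation forces the speakers to alternate. The sketch below is a data-layout illustration only, with invented token values and names, and it does not reproduce Moshi's actual tokenization or model.

```python
from typing import List, Optional, Tuple

# One time step: (user_token, agent_token). None means that channel is silent.
Frame = Tuple[Optional[int], Optional[int]]


def overlapping_steps(frames: List[Frame]) -> List[int]:
    """Time steps where both channels carry speech at once."""
    return [t for t, (user, agent) in enumerate(frames)
            if user is not None and agent is not None]


conversation: List[Frame] = [
    (12, None),    # user speaking
    (40, None),
    (7, 93),       # agent backchannels while the user is still talking
    (None, 51),    # agent speaking
    (18, 64),      # user interrupts; both streams stay active
    (22, None),
]

overlaps = overlapping_steps(conversation)
```

In a turn-based cascade, steps 2 and 4 would have to be resolved by an endpointing or interruption rule. In a two-stream layout they are simply frames where both slots are filled.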

What Are the Limitations of Speech-to-Speech Architecture?

LLM Lock-In

The core limitation of speech-to-speech models in production is the lack of modularity. Because the speech model is trained end-to-end, the LLM component is not a plug-in: it is baked into the model weights. Updating the underlying language model requires re-training or fine-tuning the entire speech model on speech data. Neil Zeghidour explains: "One drawback of speech-to-speech models is that since everything is integrated, when you go from a text model to the speech-to-speech model you need to fine-tune it on speech data. So the cost to switch the underlying text model is extremely high because you will need to re-fine-tune everything from scratch. People want something that is modular, plug and play."

In 2026, when significant new LLM releases happen multiple times per year, this is a major operational constraint. A production voice agent built on a speech-to-speech model cannot easily switch to a better LLM when one becomes available.

Prompt Engineering Limitations

Because there is no explicit text prompt passed to a language model, the standard toolkit of prompt engineering (system prompts, few-shot examples, RAG, function calling) does not apply directly. Controlling the behavior of a speech-to-speech model requires different techniques, and the ecosystem of tools and practices for text LLM customization does not transfer.

Limited Production Availability

As of 2026, no speech-to-speech model is widely available as a production API. Moshi is open source and accessible but was designed as a research prototype. The production tooling, infrastructure, uptime guarantees, and enterprise support that exist for cascade components (STT APIs, LLM APIs, TTS APIs) do not yet exist for end-to-end speech-to-speech systems.

How Do Cascade and Speech-to-Speech Compare in One Table?

Dimension	Cascade (STT + LLM + TTS)	Speech-to-Speech
Architecture	3 independent models in sequence	1 integrated model
Pipeline latency	Sum of 3 model latencies	Single model inference
LLM flexibility	Swap LLM without retraining	LLM change requires retraining
Prompt engineering	Full LLM prompt toolkit applies	Not directly applicable
Tool use / function calling	Supported via LLM	Not standard
Paralinguistic information	Lost at STT step	Preserved throughout
Turn-taking	Requires VAD rules	Full duplex possible
Production availability	Widely available (multiple providers)	Research/prototype stage
Customization	Via LLM prompt and fine-tuning	Via end-to-end training
Current Gradium offering	Streaming STT + streaming TTS	Research (Moshi/Kyutai)

Where Does Gradium Stand Today?

Gradium provides cascade components: a streaming Speech-To-Text API and a streaming Text-To-Speech API, designed to work together in a single pipeline with shared billing and a single WebSocket architecture. The STT includes semantic VAD, which reduces the most common turn-taking failures of rule-based VAD by using linguistic context to determine end-of-turn rather than silence alone.
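The general idea of using linguistic context for end-of-turn decisions can be shown with a deliberately crude heuristic. This is not Gradium's implementation, which this article does not describe. The word list, the two silence thresholds, and the function name are assumptions for illustration, and a production system would more plausibly rely on a learned model than on a word list.

```python
# Words that usually signal an unfinished thought (illustrative, far from complete).
TRAILING_INCOMPLETE = {"and", "but", "or", "so", "because", "to", "the", "a",
                       "my", "um", "uh"}


def should_end_turn(partial_transcript: str, silence_ms: int,
                    base_silence_ms: int = 500,
                    extended_silence_ms: int = 2000) -> bool:
    """Require a longer silence when the transcript looks unfinished."""
    words = partial_transcript.lower().rstrip(".,!? ").split()
    looks_incomplete = bool(words) and words[-1] in TRAILING_INCOMPLETE
    required_ms = extended_silence_ms if looks_incomplete else base_silence_ms
    return silence_ms >= required_ms


mid_thought = should_end_turn("I want to cancel my", silence_ms=700)
finished = should_end_turn("I want to cancel my order.", silence_ms=700)
gave_up = should_end_turn("I want to cancel my", silence_ms=2500)
```

With the same 700 ms pause, the unfinished sentence keeps the turn open while the finished one ends it, which is the behavior a pure silence threshold cannot express.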

The speech-to-speech research is active at Kyutai, the open research lab co-founded by Neil Zeghidour (CEO of Gradium) and Alexandre Défossez (Chief Science Officer of Gradium). Moshi, released as open source, is the reference implementation of full duplex speech-to-speech.

The path Neil Zeghidour describes is a convergence: "What will solve everything is providing the same flexibility as cascaded systems so that you can change the backend on demand, and you get the same customizability that you have with text models but with the full duplex." The goal is a system that combines the modularity of cascade (swap the LLM, use standard prompt engineering) with the conversational quality of speech-to-speech (no turn-taking, paralinguistic information preserved).

Until that convergence is achieved, cascade with semantic VAD is the right production architecture.

Which Architecture Is Right for Your Voice Agent?

The choice between cascade and speech-to-speech in 2026 is largely determined by availability and operational requirements, not by preference.

Use cascade when:

You need a production-ready system with defined uptime, SLAs, and vendor support
You are iterating on the LLM layer and need to swap models without retraining speech components
Your use case requires tool use, function calling, RAG, or structured LLM output
Your team has LLM engineering experience and needs to apply it directly
You need multilingual support with defined quality levels per language

Speech-to-speech may be relevant when:

You are prototyping or researching next-generation conversation dynamics
The specific quality of natural turn-taking and paralinguistic awareness is a hard requirement
You have the capacity to train and maintain a custom end-to-end model
The LLM layer is stable and does not need frequent updates

For most production voice agents in 2026, cascade is the only viable option. The components that make cascade work well (low-latency STT, expressive TTS, semantic VAD for turn-taking) are where the differentiation between providers lies. For more on TTS for voice agents, see Best Text-To-Speech API for Voice Agents.

How Should You Frame the Architecture Choice?

Cascade and speech-to-speech represent two different approaches to the same problem: building a voice agent that can understand speech and respond naturally. Cascade wins on modularity, LLM flexibility, production availability, and ecosystem maturity. Speech-to-speech wins on conversational naturalness, paralinguistic awareness, and full duplex turn-taking.

In 2026, the production choice is cascade. The speech-to-speech advantages are real but not yet accessible in a production-grade, operationally viable form. The active research question, articulated by Gradium's founding team through their work on Moshi and through the Gradium API, is how to bring the modularity of cascade to a full duplex architecture: the ability to swap the LLM, use standard prompt engineering, and maintain uptime SLAs, while removing the turn-based constraint and preserving paralinguistic information.

For developers building voice agents today, understanding both architectures matters: cascade is the implementation choice now, and speech-to-speech defines the direction of the next generation.