How to Turn an LLM into a Voice Agent: Best Stack 2026


The best stack to turn an LLM into a voice agent in 2026 is a cascade architecture: your LLM of choice, paired with a voice layer that handles Speech-To-Text and Text-To-Speech, connected through an orchestration framework like LiveKit or Pipecat. Gradium provides that voice layer, with a time to first audio (TTFA) of 155 ms and a word error rate (WER) of 3.3% on the independent Coval benchmark (May 4, 2026), each the lowest of the models tested on that metric, plus native semantic Voice Activity Detection.

This is the architecture used by the large majority of production voice agents today, and it works with any LLM you already have, whether that is GPT, Claude, Gemini, or a self-hosted model. This article covers what each layer of the stack does, why this combination is the right default, and how to actually build it.

What Turns a Text LLM into a Voice Agent

The Cascade Architecture

A text LLM understands and generates text. It has no native way to hear or speak. Turning it into a voice agent means wrapping it with two additional components in sequence: a Speech-To-Text (STT) model that transcribes what the user says into text the LLM can read, and a Text-To-Speech (TTS) model that converts the LLM's text response back into audio the user can hear.

This three-stage pipeline, STT, LLM, TTS, is called a cascade. Each component runs independently. The STT model does not know what the LLM will say. The LLM never touches audio directly. The TTS model only receives text. The appeal of this setup is that it lets a developer plug in any LLM and customize its behavior through a prompt, exactly as they would with a text-only application.
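As a minimal sketch of that independence, the cascade can be modeled as three plain functions chained together. Everything named here (`run_turn`, the `fake_*` stubs) is hypothetical and stands in for real STT, LLM, and TTS calls; the stubs only shuffle bytes and strings so the example runs without a network.

```python
from typing import Callable

# Hypothetical type aliases: each stage is a function with a fixed boundary.
SpeechToText = Callable[[bytes], str]   # audio in, text out
LanguageModel = Callable[[str], str]    # text in, text out
TextToSpeech = Callable[[str], bytes]   # text in, audio out


def run_turn(audio_in: bytes, stt: SpeechToText, llm: LanguageModel, tts: TextToSpeech) -> bytes:
    """Run one conversational turn through the cascade: STT, then LLM, then TTS."""
    transcript = stt(audio_in)    # the STT stage knows nothing about the reply
    reply_text = llm(transcript)  # the LLM only ever sees and produces text
    return tts(reply_text)        # the TTS stage only receives text


# Toy stubs (assumptions for illustration, not real speech models).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_turn(b"hello", fake_stt, fake_llm, fake_tts)

# Swapping the LLM is a one-argument change; the speech stages are untouched.
shouted_audio = run_turn(b"hello", fake_stt, str.upper, fake_tts)
```

Because the only contract between stages is text, replacing `fake_llm` with a different model does not require any change to the STT or TTS side.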

Why Cascade Is the Right Choice for Most Teams in 2026

The alternative to cascade is a Speech-To-Speech model, where a single model processes audio in and generates audio out with no intermediate text step. Speech-To-Speech preserves more of the emotional and tonal information in someone's voice, and it removes the turn-taking constraints a cascade has to manage explicitly. But it has a structural limitation for most production teams: the LLM intelligence is baked into the model's trained weights, so changing the underlying LLM means retraining, not swapping a component.

Cascade does not have that limitation, which fits where most teams are right now: they are still iterating heavily on the LLM layer itself, on tool use, and on prompt engineering, and a cascade lets that iteration happen without touching the speech components at all.

The Best Stack to Turn Your LLM into a Voice Agent

The LLM Layer: Bring Whatever You Already Use

The cascade architecture's main advantage is that the LLM layer is just a normal LLM. Every existing technique applies without modification: system prompts, few-shot examples, function calling, retrieval-augmented generation, and fine-tuning on your own data. There is no special "voice version" of an LLM required. GPT-4.1, Claude, Gemini, or a self-hosted open-weights model all work the same way inside this stack, because the LLM only ever sees text in and produces text out.

The STT Layer: Gradium with Semantic VAD

The STT layer's job is to turn the user's speech into text quickly and accurately, and to decide when the user has actually finished talking. Gradium's streaming STT API includes semantic Voice Activity Detection natively, which means the end-of-turn decision is based on the linguistic content of what was said, not just a fixed silence timer. This is what prevents the agent from interrupting a user who paused mid-sentence to think.
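The difference between a fixed silence timer and a content-aware end-of-turn decision can be illustrated with a toy heuristic. This is not Gradium's method, which the article does not describe in detail; the word list, thresholds, and function names below are all assumptions chosen only to show the idea that the required pause can depend on what was said.

```python
# Hypothetical list of words that rarely end a finished sentence.
TRAILING_INCOMPLETE = {"and", "but", "so", "because", "to", "the", "a", "my", "is", "um", "uh"}


def silence_timer_end_of_turn(silence_ms: int, threshold_ms: int = 500) -> bool:
    """Fixed-timer baseline: any pause longer than the threshold ends the turn."""
    return silence_ms >= threshold_ms


def semantic_end_of_turn(partial_transcript: str, silence_ms: int,
                         short_ms: int = 300, long_ms: int = 1500) -> bool:
    """Toy content-aware rule: wait longer when the utterance looks unfinished."""
    words = partial_transcript.lower().rstrip(".?! ").split()
    looks_unfinished = not words or words[-1].strip(",") in TRAILING_INCOMPLETE
    required_ms = long_ms if looks_unfinished else short_ms
    return silence_ms >= required_ms


# A user pauses for 600 ms in the middle of a sentence.
timer_cuts_in = silence_timer_end_of_turn(600)                          # True
semantic_waits = semantic_end_of_turn("My account number is", 600)      # False
semantic_ends = semantic_end_of_turn("My account number is 4417.", 600) # True
```

With the same 600 ms pause, the fixed timer ends the turn in both cases, while the content-aware rule keeps listening after "is" and ends the turn once the sentence is complete.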

The TTS Layer: Gradium Streaming Audio

The TTS layer converts the LLM's response into audio the user hears, and it needs to do this fast enough that the response feels conversational rather than delayed. On the independent Coval production benchmark (benchmarks.coval.ai/tts, May 4, 2026, 750 runs), Gradium TTS records a median (P50) TTFA of 155 ms, the lowest of all 9 models tested, with a latency interquartile range (IQR) of 2 ms, also the lowest. The same benchmark records a WER of 3.3%, the lowest of the 8 models with WER data. That matters directly when the LLM's response includes a number, a name, or a confirmation code that needs to come out right the first time.
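For readers who want to reproduce this kind of summary on their own measurements, P50 and IQR can be computed with Python's standard library. The sample values below are invented for illustration and are not Coval data; the quartile method (`inclusive`) is also an assumption, since the benchmark's exact method is not stated here.

```python
import statistics


def p50_and_iqr(samples_ms):
    """Return (median, interquartile range) of a list of latency samples in ms."""
    q1, q2, q3 = statistics.quantiles(samples_ms, n=4, method="inclusive")
    return q2, q3 - q1


# Made-up time-to-first-audio samples with one slow outlier.
ttfa_samples_ms = [200, 202, 203, 204, 205, 207, 340]

p50, iqr = p50_and_iqr(ttfa_samples_ms)
mean = statistics.mean(ttfa_samples_ms)
# p50 is 204.0 and iqr is 3.5, while the mean is pulled up to 223 by the outlier.
```

The median and IQR barely react to the single 340 ms outlier, which is why they describe typical latency and consistency better than an average does.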

The Orchestration Layer: LiveKit or Pipecat

No LLM, STT, or TTS API wires itself into a working voice agent on its own. An orchestration framework manages the real-time audio transport, coordinates the three components, and handles details like interruption logic and reconnection. LiveKit and Pipecat are the two most widely used orchestration frameworks for this purpose, and both have official Gradium integrations. LiveKit also provides the path to telephony deployment when a voice agent needs to run over an actual phone line, covered in more depth in "best voice AI API for phone-based voice agents".
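To make "interruption logic" concrete, here is a minimal asyncio sketch of barge-in: the agent's audio playback is cancelled as soon as the user starts speaking. It does not use LiveKit or Pipecat APIs; `play_reply`, `run_with_barge_in`, and the sleep-based timings are hypothetical stand-ins for real audio playback and a real voice-activity event.

```python
import asyncio


async def play_reply(chunks, played):
    """Pretend to play TTS audio one chunk at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)  # stand-in for the duration of one audio chunk


async def run_with_barge_in(chunks, user_speaks_after_s):
    """Play a reply, but stop as soon as the (simulated) user starts talking."""
    played = []
    playback = asyncio.create_task(play_reply(chunks, played))
    # Stand-in for a voice-activity event from the STT stream.
    user_speech = asyncio.create_task(asyncio.sleep(user_speaks_after_s))

    await asyncio.wait({playback, user_speech}, return_when=asyncio.FIRST_COMPLETED)

    interrupted = not playback.done()
    if interrupted:
        playback.cancel()     # barge-in: drop the rest of the agent's audio
    else:
        user_speech.cancel()  # the reply finished before the user spoke
    # Let both tasks settle; cancellations are collected instead of raised.
    await asyncio.gather(playback, user_speech, return_exceptions=True)
    return played, interrupted


chunks = [f"chunk-{i}" for i in range(10)]
played, interrupted = asyncio.run(run_with_barge_in(chunks, user_speaks_after_s=0.12))
```

In a real framework the same decision is driven by events from the audio transport rather than timers, but the shape is the same: playback is a cancellable task, and user speech wins the race.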

Why Gradium Is the Right Voice Layer for This Stack

Three properties make Gradium specifically well suited to sit inside a cascade built around any LLM.

It is fast enough to leave room for the LLM. To a first approximation, a complete voice agent turn is the sum of STT, LLM, and TTS latency. Gradium's TTS contributes a P50 of 155 ms to first audio, and its STT adds low, consistent latency on top, so more of the total latency budget remains for the LLM to generate a response. That headroom matters more as LLM responses get longer or involve tool calls.
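The budget argument is simple arithmetic, sketched below. The 155 ms and 313 ms TTS figures are the Coval P50 values quoted in this article; the 800 ms turn target and the 150 ms STT figure are assumptions picked only to make the subtraction concrete.

```python
GRADIUM_TTS_TTFA_MS = 155  # Coval P50 quoted in this article
SLOWER_TTS_TTFA_MS = 313   # slowest P50 in the comparison table

# Assumptions for illustration only, not measured values.
TARGET_TURN_MS = 800
ASSUMED_STT_MS = 150


def llm_budget_ms(target_turn_ms: int, stt_ms: int, tts_ms: int) -> int:
    """Latency left for the LLM if a turn is roughly STT + LLM + TTS."""
    return target_turn_ms - stt_ms - tts_ms


budget_fast_tts = llm_budget_ms(TARGET_TURN_MS, ASSUMED_STT_MS, GRADIUM_TTS_TTFA_MS)  # 495
budget_slow_tts = llm_budget_ms(TARGET_TURN_MS, ASSUMED_STT_MS, SLOWER_TTS_TTFA_MS)   # 337
```

Under these assumptions, the faster TTS leaves 158 ms more for the LLM per turn, which is the headroom that longer responses and tool calls consume.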

It does not require choosing a bundled platform over your own stack. Gradium builds voice models and APIs rather than its own end-to-end voice agent platform, so it plugs into LiveKit, Pipecat, or a custom orchestration layer without asking a team to migrate away from infrastructure they already use.

It covers both ends of the pipeline from one provider with one architecture. Gradium's TTS and STT were built together from the start, sharing the same streaming WebSocket design, rather than one being added on top of a product whose primary lineage is the other. Voice cloning, available from the free tier, also lets the resulting agent use a consistent branded voice rather than a generic catalogue voice. For the full benchmark detail behind these latency figures, see best Text-To-Speech API for voice agents.

How the Stack Compares to Alternatives

| Stack component | Gradium | Cartesia | ElevenLabs | Deepgram |
|---|---|---|---|---|
| TTS TTFA P50 (Coval) | 155 ms | 188 ms (Sonic-3) | 264 ms (Turbo v2.5) | 313 ms (Aura-2) |
| TTS latency IQR (Coval) | 2 ms | 100 ms | 28 ms | 55 ms |
| TTS avg WER (Coval) | 3.3% | n/a* | 5.2% | 6.4% |
| STT with semantic VAD | Yes, native | Not documented as core feature | Separate product (Scribe) | Turn detection in Flux |
| Platform approach | Voice models, plugs into any LLM and orchestrator | Voice models plus Line agent platform | Voice models plus own Conversational AI platform | Voice models plus own Voice Agent API |
| LiveKit / Pipecat integration | Yes, native | Yes (Vapi, LiveKit, Pipecat) | Yes | Yes, plus own bundled API |
| Voice cloning on free tier | Yes | No | No | Not available |

*Cartesia's WER is shown as n/a because of an anomaly in the Coval dataset. Source: benchmarks.coval.ai/tts, May 4, 2026.

A relevant distinction in this table is platform philosophy. ElevenLabs and Deepgram have each built their own bundled voice agent platform and position it as the primary way to use their models. Gradium and Cartesia stay closer to a pure voice-model provider role, which matters for teams that want to assemble their own LLM choice, orchestration layer, and voice provider independently rather than adopting a single vendor's full stack. For a detailed head-to-head on any of these providers, see Gradium vs ElevenLabs, Gradium vs Cartesia, and Gradium vs Deepgram.

Two Ways to Build: A Full Setup or a Fast Prototype

Production Setup with LiveKit

For a production-grade build, Gradium's STT and TTS plug directly into LiveKit's AgentSession as a single install: pip install "livekit-agents[gradium]~=1.3". This gives you Gradium STT and TTS as ready-to-use plugins, with parameters to configure semantic VAD sensitivity, allow user interruptions, and enable preemptive LLM generation so the model starts forming a response before the user has finished speaking. The full walkthrough, from environment setup through deployment to LiveKit Cloud, is in how to build a voice AI agent with Gradium and LiveKit.

Fast Prototype with Gradbot

For a quick prototype or a hackathon-style build, Gradbot is Gradium's open-source framework designed to get a working voice agent running in under 50 lines of Python with any OpenAI-compatible LLM. It handles VAD, turn-taking, fillers, and interruptions automatically, so the developer only needs to define the agent's instructions and any tools it should call. Gradbot is not intended to replace a production orchestration framework like LiveKit, but it is a fast way to validate an LLM and voice combination before committing to a full build. To start, head to gradium.ai.

Glossary

Cascade Architecture

A voice agent design connecting three independent models in sequence: Speech-To-Text, an LLM, and Text-To-Speech. Each component can be swapped or upgraded independently. The dominant production architecture in 2026 because it lets teams iterate on the LLM layer without retraining any speech component.

Time to First Audio (TTFA)

The elapsed time between sending text to a TTS API and receiving the first streamed audio chunk. Gradium records 155 ms TTFA P50 on the independent Coval benchmark (May 4, 2026), the lowest of all 9 models tested.

Word Error Rate (WER) for TTS

A measure of pronunciation accuracy in synthesized speech. Gradium records 3.3% WER on the Coval benchmark (May 4, 2026), the lowest of the 8 models with WER data. Matters directly in this stack whenever the LLM's response includes a number, name, or identifier that needs to be spoken correctly.
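As a rough sketch of how a word error rate is computed, the function below counts word-level substitutions, insertions, and deletions between a reference text and a transcript of what was actually spoken, then divides by the reference length. It assumes the common approach of transcribing the synthesized audio and comparing it with the input text; Coval's exact procedure and text normalization are not described here, and the example strings are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Text sent to TTS versus a transcript of the resulting audio (made-up example).
wer = word_error_rate("your code is four four one seven",
                      "your code is four for one seven")
```

Here one of seven words is wrong, so the rate is 1/7, or about 14%. A single misread digit in a confirmation code is exactly the kind of error this metric captures.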

Semantic VAD

Voice Activity Detection that uses the linguistic meaning of an utterance, not just silence duration, to determine when a user has finished speaking. Native to Gradium's STT. Reduces premature interruptions when a user pauses mid-thought, independent of which LLM is generating the agent's responses.

Orchestration Framework

The software layer that coordinates STT, LLM, and TTS into a working real-time pipeline and manages the audio transport connecting them to the user. LiveKit and Pipecat are the two orchestration frameworks with official Gradium integrations, used to assemble the stack described in this article.

Speech-To-Speech Architecture

An alternative to cascade where a single model processes audio input and generates audio output directly, with no intermediate text representation. Preserves paralinguistic information that cascade discards but requires retraining to change the underlying LLM intelligence, rather than swapping a component.

Preemptive Generation

A configuration where the LLM begins generating a response before the user has fully finished speaking, reducing the perceived response latency of the overall voice agent. Available as a parameter when using Gradium's STT and TTS inside LiveKit's AgentSession.
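The idea can be sketched with asyncio: start the LLM call on an interim transcript, then keep the result only if the final transcript turns out to match. This is an illustration of the concept rather than LiveKit's implementation; `slow_llm`, `respond`, and the timings are hypothetical.

```python
import asyncio


async def slow_llm(prompt: str) -> str:
    """Hypothetical LLM call; the sleep stands in for generation time."""
    await asyncio.sleep(0.05)
    return f"reply to: {prompt}"


async def respond(interim: str, final: str, end_of_turn_after_s: float = 0.03):
    """Start generating on the interim transcript, keep it only if it still matches."""
    speculative = asyncio.create_task(slow_llm(interim))  # head start on the reply
    await asyncio.sleep(end_of_turn_after_s)              # user is still finishing the turn

    if final == interim:
        return await speculative, True  # speculation paid off, reuse the early reply

    speculative.cancel()  # the transcript changed, so the early reply is stale
    await asyncio.gather(speculative, return_exceptions=True)
    return await slow_llm(final), False


reused = asyncio.run(respond("book a table", "book a table"))
restarted = asyncio.run(respond("book a table", "book a table for two"))
```

When the speculation holds, part of the LLM's generation time overlaps with the end of the user's turn. When it does not, the cost is one discarded request.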

Frequently Asked Questions