Best Voice AI API for Phone-Based Voice Agents in 2026
The best voice AI API for phone-based voice agents in 2026 is Gradium. On the independent Coval benchmark (May 4, 2026), it delivers a TTFA of 155 ms, the lowest of the 9 models tested, and a WER of 3.3%, the lowest of the 8 models with WER data. Its configurable audio output (16-bit PCM at 48 kHz, 24 kHz, or 16 kHz) fits the sample rate constraints telephony pipelines actually impose.
A phone call is a harder environment for a voice AI API than a web or app interaction. The audio is more compressed, the latency budget is tighter because there is no visual feedback to mask a pause, and every spoken number gets scrutinized because the caller cannot read along on a screen. This article covers what phone-based voice agents specifically require, and how Gradium performs against that bar.
What Phone-Based Voice Agents Require That Web Voice Agents Do Not
Audio Format and Sample Rate Compatibility
A voice agent running inside a web app or mobile app typically streams high-quality audio over WebRTC, often at 24 kHz or 48 kHz. A phone call running over the public telephone network does not. Traditional telephony codecs like G.711 operate at 8 kHz, and many SIP trunking and PSTN gateways still expect narrowband audio at that rate.
This means a voice AI API's output format flexibility is not a minor technical detail for phone deployments. An API that only outputs a single fixed high sample rate forces an extra resampling step somewhere in the pipeline, adding both latency and a small but real quality loss. An API that can output multiple sample rates natively removes that step entirely.
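To make the resampling step concrete, here is a minimal sketch of what converting 16 kHz PCM into 8 kHz G.711 mu-law involves. It assumes the input is 16-bit little-endian mono PCM at 16 kHz; the function names are illustrative and not part of any vendor SDK. The pair-averaging filter is deliberately crude, and a production pipeline would normally use a proper anti-aliasing resampler.

```python
import array
import sys

BIAS = 0x84   # standard G.711 mu-law bias (132)
CLIP = 32635  # largest magnitude that still fits after adding the bias


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def downsample_2x(pcm16: bytes) -> array.array:
    """Halve the sample rate by averaging adjacent sample pairs (crude low-pass)."""
    samples = array.array("h")
    samples.frombytes(pcm16)  # raises ValueError if the length is odd
    if sys.byteorder == "big":
        samples.byteswap()    # input is assumed little-endian
    return array.array(
        "h",
        ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples) - 1, 2)),
    )


def pcm16_16k_to_ulaw_8k(pcm16: bytes) -> bytes:
    """Convert 16 kHz 16-bit PCM into 8 kHz mu-law bytes."""
    return bytes(linear_to_ulaw(s) for s in downsample_2x(pcm16))


if __name__ == "__main__":
    frame_16k = bytes(640)  # 20 ms of silence at 16 kHz (320 samples)
    frame_8k = pcm16_16k_to_ulaw_8k(frame_16k)
    print(len(frame_8k), "mu-law bytes for one 20 ms frame")
```

Every extra stage like this adds a small amount of buffering and filtering to the call path, which is why it pays to know exactly where the conversion happens.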
Latency Budget Inside an Already-Constrained Call
Phone calls already carry network latency that a local web app does not: the call has to traverse the carrier network, the SIP trunk, and the orchestration layer before it ever reaches the voice AI API. That overhead eats into the roughly 200 to 300 millisecond window that determines whether a response feels conversational.
This makes the TTFA of the TTS component itself a tighter constraint on a phone call than on a web interaction, not a looser one. A voice AI API that already sits at the edge of acceptable latency in ideal lab conditions has very little room left once real telephony network overhead is added on top.
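The budget arithmetic can be sketched in a few lines. The overhead values below are hypothetical placeholders, not measurements, and should be replaced with figures from your own carrier, SIP trunk, and orchestration layer. The two TTFA values are the lowest and highest Coval P50 figures quoted in this article's comparison table.

```python
def remaining_headroom_ms(budget_ms: float, tts_ttfa_ms: float,
                          overheads_ms: dict[str, float]) -> float:
    """Return what is left of the latency budget after TTS and network overhead."""
    return budget_ms - tts_ttfa_ms - sum(overheads_ms.values())


# Hypothetical per-hop overheads, for illustration only.
ASSUMED_OVERHEADS_MS = {
    "carrier_network": 40.0,
    "sip_trunk": 20.0,
    "orchestration": 15.0,
}

if __name__ == "__main__":
    budget = 300.0  # upper end of the conversational window
    for label, ttfa in {"155 ms TTFA": 155.0, "313 ms TTFA": 313.0}.items():
        left = remaining_headroom_ms(budget, ttfa, ASSUMED_OVERHEADS_MS)
        status = "within budget" if left >= 0 else "over budget"
        print(f"{label}: {left:+.0f} ms ({status})")
```

Under these assumed overheads, a TTS component that is already near the edge of the window on its own ends up over budget once the telephony hops are added.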
Pronunciation Accuracy on Numbers and Confirmation Codes
Phone-based voice agents spend much of their time on conversations that involve spoken numbers: account numbers, confirmation codes, order references, dates, and amounts. Unlike a chat interface, the caller has no way to glance at a screen to check what was said. If the agent mispronounces a digit, the caller has no fallback except to ask the agent to repeat itself, which directly damages the perceived quality of the call.
This makes Word Error Rate (WER) on structured content a more consequential metric for phone-based agents than for almost any other voice AI use case. A model that sounds excellent reading a paragraph but stumbles on a ten-digit confirmation number fails exactly the task most phone-based agents exist to perform.
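A common way to estimate TTS WER is to transcribe the synthesized audio with an STT system and compare the transcript against the input text. The sketch below shows only the comparison step, using word-level edit distance. The function name and the example strings are illustrative, and this is not a description of how any particular benchmark computes its figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    spoken_code = "four seven one nine zero two"
    heard_code = "four seven one five zero two"  # one digit misheard
    print(f"WER on the code: {word_error_rate(spoken_code, heard_code):.3f}")
```

A single wrong digit in a six-digit code is a WER of about 17% on that utterance, and for the caller the whole code is unusable, which is why structured content deserves its own test set.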
Telephony Integration Through an Orchestration Layer
No TTS or STT API connects directly to the public telephone network on its own. Reaching an actual phone call requires an orchestration layer, typically LiveKit or Pipecat, paired with a telephony provider like Twilio, that handles SIP trunking, call routing, and the bridge between the phone network and the voice AI components. The voice AI API's job inside that pipeline is to be fast, accurate, and compatible with whatever audio format the orchestration layer needs.
This means evaluating a voice AI API for phone-based deployment is really evaluating two things together: the API's own latency and accuracy, and how cleanly it plugs into the orchestration layer that will actually carry the call.
Gradium: A Voice AI API Built for Phone-Based Deployment
Flexible Audio Output for Telephony Pipelines
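Gradium's TTS API delivers 16-bit PCM audio with a default sample rate of 48 kHz, with 24 kHz and 16 kHz available as configurable options. This range covers both the higher fidelity needed for WebRTC-based voice agents and the lower sample rates that telephony-oriented pipelines commonly work with, without forcing a separate resampling step in most configurations. The 16 kHz option lines up with wideband telephony codecs such as G.722, which sample at 16 kHz. Where a call leg is narrowband G.711 at 8 kHz, the audio still has to be converted, but from 16 kHz that is a simple 2:1 downsample.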
Latency and Accuracy on Independent Benchmarks
On the Coval production benchmark (benchmarks.coval.ai/tts, May 4, 2026, 750 runs), Gradium TTS records a TTFA P50 of 155 ms, the lowest of all 9 models tested, with a latency IQR of 2 ms, also the lowest. For a phone call already carrying network overhead before it reaches the TTS component, that 155 ms figure is the headroom available for everything else in the pipeline, not a number that can be spent loosely.
The same benchmark records a WER of 3.3% for Gradium, the lowest of the 8 models with WER data. For comparison, ElevenLabs Turbo v2.5 and Flash v2.5 both record 5.2%, and Deepgram Aura-2 records 6.4%. On a phone call where a caller is listening for a confirmation code with no visual backup, the gap between 3.3% and 6.4% translates into fewer calls where the caller has to ask the agent to repeat the code.
Semantic VAD for Natural Call Turn-Taking
Gradium's streaming STT API includes semantic Voice Activity Detection natively. On a phone call, this matters more than it might on a video call with visual cues, because audio is the only signal available to judge when a caller has finished speaking. Semantic VAD uses the linguistic content of what the caller said, not just a silence timer, to decide whether their turn is complete, which reduces the chance of the agent talking over a caller who paused mid-sentence to recall an account number.
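The difference between a silence timer and a content-aware decision can be illustrated with a toy example. This is not Gradium's implementation: the word list, the timeouts, and the function name are all hypothetical, and a real semantic VAD would use a trained model rather than a lookup of trailing words.

```python
# Hypothetical words that suggest the caller has not finished the sentence.
CONTINUATION_WORDS = {"is", "and", "the", "my", "to", "of", "um", "uh"}


def turn_is_complete(transcript: str, silence_ms: int,
                     base_timeout_ms: int = 500,
                     extended_timeout_ms: int = 2000) -> bool:
    """Decide end-of-turn from silence length plus a crude check of the last word."""
    words = transcript.lower().strip().rstrip(".?!,").split()
    if not words:
        return False  # nothing said yet, keep listening
    looks_unfinished = words[-1] in CONTINUATION_WORDS
    # Wait longer when the utterance appears to stop mid-sentence.
    timeout_ms = extended_timeout_ms if looks_unfinished else base_timeout_ms
    return silence_ms >= timeout_ms


if __name__ == "__main__":
    # Caller pauses to look up the number: a plain 500 ms timer would cut in here.
    print(turn_is_complete("my account number is", silence_ms=700))
    # Caller has finished reading the digits.
    print(turn_is_complete("my account number is 4 7 1 9", silence_ms=700))
```

A fixed timer treats both 700 ms pauses identically. The content-aware version holds back in the first case, which is the behaviour that avoids interrupting a caller who is still looking for their account number.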
Reaching the Phone Network Through LiveKit and Pipecat
Like every TTS or STT API, Gradium does not connect directly to the telephone network on its own. It integrates natively with LiveKit and Pipecat, the two orchestration frameworks most commonly paired with telephony providers like Twilio to bridge SIP trunking and the phone network into a voice agent pipeline. Official Python and Rust SDKs are available for teams building or extending that integration. For a complete walkthrough of wiring Gradium's STT and TTS into LiveKit's AgentSession, see how to build a voice AI agent with Gradium and LiveKit.
Voice cloning, available from Gradium's free tier, lets a phone-based deployment use a consistent branded voice on every call rather than a generic catalogue voice. Deployment options include cloud, private cloud, and on-premise with HIPAA-compliant configurations, relevant for phone-based agents handling healthcare or financial conversations.
How Gradium Compares for Phone-Based Voice Agents
| Requirement | Gradium | Cartesia Sonic-3 | ElevenLabs Turbo v2.5 | Deepgram Aura-2 |
|---|---|---|---|---|
| TTFA P50 (Coval) | 155 ms | 188 ms | 264 ms | 313 ms |
| Latency IQR (Coval) | 2 ms | 100 ms | 28 ms | 55 ms |
| Avg WER (Coval) | 3.3% | n/a* | 5.2% | 6.4% |
| Configurable output sample rate | Yes (48, 24, 16 kHz) | Not documented | Not documented as primary feature | Not documented |
| STT with semantic VAD | Yes, native | Not documented as core feature | Separate product | Turn detection in Flux |
| LiveKit / Pipecat integration | Yes, native | Yes (Vapi, LiveKit, Pipecat) | Yes | Yes, plus own Voice Agent API |
| On-premise / HIPAA option | Yes | Enterprise SOC 2, HIPAA, PCI Level 1 | Enterprise data-handling commitments | SOC 2 Type II, HIPAA, GDPR, CCPA, PCI DSS |
| Voice cloning on free tier | Yes | No | No | Not available |
*Cartesia's WER is omitted because of an anomaly in the Coval dataset. Source: benchmarks.coval.ai/tts, May 4, 2026.
For phone-based agents specifically, the combination of low TTFA, low and consistent WER, and configurable sample rate output is what determines whether the API holds up once real telephony overhead and call-quality constraints are added to the pipeline.
How to Evaluate a Voice AI API for Phone-Based Voice Agents
Four checks separate a voice AI API that performs well in a clean demo from one that holds up on an actual phone call.
- Confirm the output sample rates the API supports natively, and whether any match what your telephony or orchestration layer expects without a separate resampling step. This detail rarely shows up in a marketing comparison but directly affects both latency and audio quality once a call is routed through SIP trunking.
- Check TTFA and latency IQR together, not just the median. A phone call already carries network overhead before the TTS component is reached, so the headroom an API leaves matters more than it would for a web interaction with a faster, more predictable connection.
- Check WER specifically on structured content like phone numbers and confirmation codes, not on clean narration. This is the content category phone-based agents handle constantly, and where small WER differences become audible failures a caller cannot work around without a screen.
- Confirm which orchestration frameworks the API integrates with natively, since no TTS or STT API reaches a phone call on its own. LiveKit and Pipecat paired with a telephony provider like Twilio are the standard path, and official, maintained integrations into both reduce the engineering work required to put a phone-based agent into production.
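For the second check, the median and the IQR can be computed from your own TTFA measurements with the Python standard library. The sketch below assumes you have already collected per-request TTFA values in milliseconds; the sample lists are made up for illustration and are not benchmark data.

```python
import statistics


def ttfa_summary(samples_ms: list[float]) -> dict[str, float]:
    """Return the median (P50) and interquartile range (P75 - P25) of TTFA samples."""
    p25, p50, p75 = statistics.quantiles(samples_ms, n=4, method="inclusive")
    return {"p50": p50, "iqr": p75 - p25}


if __name__ == "__main__":
    # Two made-up series with the same median but very different spread.
    steady = [153.0, 154.0, 155.0, 156.0, 157.0]
    jittery = [95.0, 120.0, 155.0, 210.0, 260.0]
    print("steady :", ttfa_summary(steady))
    print("jittery:", ttfa_summary(jittery))
```

Both series report the same P50, so a comparison based on the median alone would rate them equally. The IQR is what shows that one of them will produce noticeably slow responses on a meaningful share of calls.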
To build a phone-based agent on Gradium, head to gradium.ai. Related reading: how to turn an LLM into a voice agent.
Glossary
Time to First Audio (TTFA)
The elapsed time between sending text to a TTS API and receiving the first streamed audio chunk. Gradium records 155 ms TTFA P50 on the independent Coval benchmark (May 4, 2026). On a phone call, this figure represents only the TTS component's contribution to total perceived latency, since network and telephony overhead add further delay on top.
Latency IQR
The Interquartile Range between P25 and P75 TTFA values across production requests. Gradium records a 2 ms IQR, the lowest of all 9 models in the Coval benchmark. For phone-based agents handling high call volume, a low IQR means response time stays consistent across calls rather than varying unpredictably.
Word Error Rate (WER) for TTS
A measure of pronunciation accuracy in synthesized speech. Gradium records 3.3% WER on the Coval benchmark, the lowest of the 8 models with WER data. On phone calls, WER on structured content like phone numbers and confirmation codes is especially consequential, since the caller has no screen to verify what was said.
SIP Trunking
A method of routing phone calls over the internet using the Session Initiation Protocol, used by telephony providers to connect a voice agent orchestration layer to the public telephone network. Voice AI APIs like Gradium do not handle SIP trunking directly; this is managed by an orchestration framework and telephony provider sitting upstream of the voice AI components.
G.711
A telephony audio codec standard operating at an 8 kHz sample rate, widely used in traditional PSTN and many SIP trunking implementations. Voice AI APIs that only output higher fixed sample rates may require a resampling step to interoperate cleanly with G.711-based telephony infrastructure.
Semantic VAD
Voice Activity Detection that uses the linguistic meaning of an utterance, not just silence duration, to determine when a caller has finished speaking. Native to Gradium's STT. Particularly relevant on phone calls, where audio is the only available signal for judging turn completion.
Orchestration Layer
The software framework that coordinates STT, LLM, and TTS components into a working voice agent and connects that pipeline to a transport layer such as WebRTC or a telephony provider. LiveKit and Pipecat are the two orchestration frameworks Gradium integrates with natively for building phone-based and web-based voice agents.