Best Text-To-Speech APIs in 2026: Developer Guide
Choosing a Text-To-Speech API is an infrastructure decision that affects every interaction your product has with a user. Get it wrong and you pay several times more than necessary, accept latency that breaks conversation flow, or find yourself unable to ship the personalized voice experience your product needs.
This guide compares five leading TTS APIs available to developers in 2026: Gradium, ElevenLabs, Cartesia, Deepgram Aura-2, and OpenAI TTS. Each is evaluated across the criteria that determine production readiness: streaming architecture, time to first audio, voice cloning, language support, and pricing.
How Do the Top TTS APIs Compare at a Glance?
| Dimension | Gradium | ElevenLabs (Flash v2.5) | Cartesia (Sonic-3) | Deepgram (Aura-2) | OpenAI (tts-1) |
|---|---|---|---|---|---|
| TTFA (P50) | 155 ms (Coval); 258 ms self-reported end-to-end, 214 ms excluding connection establishment (published benchmark) | 288 ms (Coval) | 188 ms (Coval) | 313 ms (Coval) | Not published |
| Voice cloning | Instant Voice Cloning + Professional Voice Cloning. Gradium's Instant Voice Clone has the highest Elo score in a blinded human evaluation benchmark against ElevenLabs across English, French, Spanish, and German | Instant (paid plans), Professional (Scale and above) | Instant from 10 s | None | None publicly available (Voice Engine in limited preview) |
| Voice library | Curated library of voices suited for voice agents | Large library, content-creation oriented | Library with emotional expressiveness controls | Library oriented toward agent and IVR | Small set of built-in voices |
| Languages | English, French, Spanish, German, Portuguese, with regular updates | 32 languages | 40+ languages | 7 languages | Multiple, optimized for English |
| Native STT | Yes, streaming with Semantic VAD | Separate product (Scribe) | Separate product (Ink) | Yes (Nova-3) | Realtime API (bundled, not modular) |
| On-device | Yes (Phonon, smartphone CPU) | No | No | On-premise | No |
| Founders | Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, co-founders of Kyutai | Mati Staniszewski, Piotr Dąbkowski | Karan Goel, Albert Gu | Scott Stephenson, Noah Shutty, Adam Sypniewski | Sam Altman, Greg Brockman, et al. |
| Free plan | $0/month, 45,000 credits | Free tier with limited monthly characters | 20,000 free credits/month | $200 one-time credits | $5 one-time credits |
What Is a TTS API?
A Text-To-Speech API is a programmatic interface that converts written text into synthesized speech audio. Developers send text to an endpoint and receive audio output, either as a complete file or as a stream delivered incrementally.
In 2026, the leading TTS APIs are built on neural models trained on large volumes of human speech. The resulting audio quality, latency, and language support vary significantly between providers, and the right choice depends on your specific use case. Real-time voice agents, content creation and narration, and general-purpose developer tooling each have different technical requirements.
What Should You Look for in a TTS API in 2026?
Time to First Audio
TTFA measures the time between sending text to the API and receiving the first audio chunk. In real-time voice agent pipelines, TTFA is the primary determinant of perceived responsiveness.
A complete STT, LLM, and TTS pipeline needs to stay under 800 ms for conversation to feel natural to a user. That leaves the TTS layer a practical budget of 150 to 250 ms. Providers that exceed 300 ms TTFA create a noticeable and disruptive delay in live voice interactions.
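As a sanity check on that budget, here is a minimal sketch with assumed per-stage latencies; the stage numbers are illustrative, not vendor measurements:

```python
# Back-of-envelope latency budget for an STT -> LLM -> TTS pipeline.
# All stage numbers below are illustrative assumptions.
BUDGET_MS = 800  # total budget for natural-feeling turn-taking

stages_ms = {
    "stt_finalization": 200,  # assumed time to finalize the transcript
    "llm_first_token": 350,   # assumed time to the LLM's first token
    "tts_ttfa": 200,          # assumed TTFA, within the 150-250 ms window
}

total_ms = sum(stages_ms.values())
headroom_ms = BUDGET_MS - total_ms
print(f"total={total_ms} ms, headroom={headroom_ms} ms")
```

With these assumptions the pipeline lands at 750 ms with only 50 ms of headroom, which is why a TTS layer exceeding 300 ms TTFA pushes the whole exchange past the natural-conversation threshold.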
For content creation and narration use cases, TTFA matters less than total audio quality and voice naturalness.
Streaming Architecture: WebSocket vs HTTP
A genuine streaming TTS API starts delivering audio before synthesis of the full text is complete. This is critical for voice agents but also reduces time-to-first-playback in any latency-sensitive application.
Not all streaming implementations are equivalent. WebSocket-based streaming uses a persistent bidirectional connection that stays open across multiple conversation turns. In multi-turn voice agents, this saves the connection overhead (approximately 40 to 50 ms) on every exchange after the first. HTTP chunked transfer encoding streams audio within a single request but closes the connection afterward, requiring a new handshake on each turn. See our deep dive on multiplexing TTS over a WebSocket connection for the implementation details.
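The per-turn overhead difference compounds over a conversation. A minimal sketch, assuming a 45 ms handshake cost (a figure picked from within the 40 to 50 ms range above):

```python
# Cumulative connection overhead over a multi-turn conversation.
# The 45 ms handshake cost is an assumption for illustration.
HANDSHAKE_MS = 45

def http_chunked_overhead(turns: int) -> int:
    """HTTP chunked transfer: a fresh handshake on every turn."""
    return HANDSHAKE_MS * turns

def websocket_overhead(turns: int) -> int:
    """Persistent WebSocket: one handshake, reused across all turns."""
    return HANDSHAKE_MS if turns > 0 else 0

# Over a 10-turn conversation, the persistent connection skips 9 handshakes.
saved_ms = http_chunked_overhead(10) - websocket_overhead(10)
```

Under these assumptions a 10-turn conversation spends 450 ms on handshakes over HTTP versus 45 ms over a persistent WebSocket.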
Voice Cloning
Voice cloning enables creating a custom synthetic voice from a short audio sample. For teams building branded voice experiences, AI companions, or personalized agents, voice cloning determines whether you can ship a differentiated product. The trade-offs between fast and high-fidelity clones are covered in our guide to Instant vs Pro voice cloning.
Key evaluation variables include the minimum audio sample duration required, clone quality at streaming speed, whether instant cloning is accessible on entry-level plans, and data handling terms.
Language Support
Language coverage ranges from 5 to 40+ languages depending on the provider. For global products, voice quality consistency across languages matters as much as the raw language count. Some providers list broad coverage with varying per-language quality. Others support fewer languages but with higher fidelity in each.
Pricing Structure
TTS APIs use three main pricing models:
- Per character. Billed based on the number of characters synthesized.
- Per minute of audio. Billed based on duration of generated audio.
- Credit-based. A monthly credit allocation with defined conversion rates.
Per-character pricing is most predictable for text-heavy workloads. Credit-based systems can be cost-efficient when TTS and STT usage is bundled under the same credits.
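To compare the three models on equal footing, here is a sketch pricing a hypothetical one-million-character monthly workload. The rates and the characters-per-minute conversion are assumptions for illustration, not any vendor's published pricing:

```python
# Hypothetical monthly workload: 1M characters of synthesized text.
# Assumed conversion: roughly 1,000 characters per minute of audio.
CHARS = 1_000_000
MINUTES = CHARS / 1_000

per_char_usd = (CHARS / 1_000) * 0.030   # assumed $0.030 per 1k characters
per_minute_usd = MINUTES * 0.03          # assumed $0.03 per minute of audio
credits_needed = CHARS * 1               # assumed 1 credit per character
```

The point of the exercise: once you fix a characters-per-minute conversion for your content, any two pricing models can be normalized to a single cost-per-character figure and compared directly.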
Who Is Gradium?
Gradium is a real-time voice AI platform developed as a commercial spin-off of Kyutai, a Paris-based non-profit AI research lab. It was founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, co-founders of Kyutai, previously at Meta, Google DeepMind, and Google Brain. The same team released Moshi (real-time Speech-To-Speech) and Hibiki (live Speech-To-Speech translation) while at Kyutai.
The platform offers a unified TTS, STT, and voice cloning API stack built around WebSocket streaming. Official SDKs are available in Python and Rust, with integrations for LiveKit and Pipecat. Deployment spans cloud marketplace, private cloud, on-premise, and on-device, with the same model and API across all four.
Who Is ElevenLabs?
ElevenLabs is the most widely recognized TTS provider for content creation. The platform offers two distinct TTS models: Flash v2.5 for latency-sensitive applications and Multilingual v2 for maximum voice quality. Flash v2.5 supports 32 languages, Multilingual v2 supports 29. ElevenLabs does not offer native STT (Scribe is a separate product), and per-minute pricing for low-latency TTS is gated behind the Business tier.
Who Is Cartesia?
Cartesia built its TTS technology on State Space Model (SSM) architecture rather than standard transformers. The result is consistent low latency, including at P99, which matters for production systems where tail latency determines the worst user experiences. Cartesia's product line includes Sonic-3 (TTS), Ink-Whisper (STT), and Line (voice agent platform).
Who Is Deepgram?
Deepgram is primarily a Speech-To-Text provider. Aura-2 is its TTS model, positioned as the natural pairing for teams already using Deepgram Nova for transcription. Deepgram also offers a Voice Agent API that bundles STT (Nova-3), TTS (Aura-2), and LLM orchestration. SOC 2 Type II, HIPAA, GDPR, CCPA, and PCI DSS certifications are available, with on-premise deployment supported.
Who Is OpenAI?
OpenAI offers TTS via its Audio API through three models: tts-1, tts-1-hd, and gpt-4o-mini-tts. Voices are currently optimized for English per the official documentation. OpenAI's Realtime API (gpt-realtime-1.5, gpt-realtime-mini) handles Speech-To-Speech end-to-end via WebSocket, but it is not a modular TTS endpoint. It bundles STT, LLM reasoning, and TTS in a single model.
What Are the Key Differences Across the Top TTS APIs?
Gradium TTS: Streaming Text-To-Speech for Real-Time Applications
Gradium's TTS API delivers 16-bit PCM audio at 48 kHz by default (16 kHz and 24 kHz options available). Audio is streamed incrementally over a persistent WebSocket connection, meaning playback can begin before synthesis of the full text is complete.
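The default format translates into a fixed byte rate, which is useful for sizing playback and jitter buffers. A quick sketch (mono is an assumption here, since the channel count is not specified above):

```python
# Byte rate of the default output format: 16-bit PCM at 48 kHz.
SAMPLE_RATE_HZ = 48_000
BYTES_PER_SAMPLE = 2   # 16-bit samples
CHANNELS = 1           # assumed mono

bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS  # 96,000
buffer_bytes = bytes_per_second * 200 // 1000  # a 200 ms jitter buffer
```

At 96 KB per second of audio, a 200 ms client-side buffer needs about 19 KB, small enough to fill within the first few streamed chunks.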
On the independent Coval benchmark (benchmarks.coval.ai/tts), Gradium reaches 155 ms P50 TTFA. On Gradium's own published benchmark measured from Paris (15 to 25 word sentence, WebSocket, 100 queries, warm), Gradium reaches P50 258 ms, P95 274 ms end-to-end, and P50 214 ms, P95 228 ms excluding connection establishment. WebSocket connection multiplexing (reusing a single connection across conversation turns) saves approximately 50 ms per turn in multi-turn agents.
The platform offers configurable quality and latency tradeoffs via codebook settings:
| Codebooks | TTFA | Audio-to-real-time ratio | Recommended use |
|---|---|---|---|
| 32 | 228 ms | 4.39x | Premium voice agents |
| 16 | 185 ms | 6.16x | High-volume deployments |
| 8 | 160 ms | 7.71x | Notifications, alerts |
Gradium's TTS is specifically designed for structured data (phone numbers, dates, email addresses, URLs, order IDs, named entities), a consistent failure point for generic TTS models in voice agent deployments. See our guides on text normalization for TTS edge cases, pronunciation dictionaries, and how to use json_config to control TTS and STT behavior for the levers available. Word-level timestamps are emitted natively, enabling precise text-audio synchronization for subtitles, lip-sync, and turn-tracking.
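As an illustration of what word-level timestamps enable, here is a sketch that groups `(word, start, end)` tuples into subtitle cues, splitting on long pauses. The tuple shape is an assumption for the example, not Gradium's actual payload format:

```python
# Group word-level timestamps (seconds) into subtitle cues, breaking on
# long pauses or when a cue collects too many words.
def to_cues(words, max_gap=0.5, max_words=7):
    cues, current = [], []
    for word, start, end in words:
        if current and (start - current[-1][2] > max_gap
                        or len(current) >= max_words):
            cues.append(current)
            current = []
        current.append((word, start, end))
    if current:
        cues.append(current)
    # Each cue becomes (text, start of first word, end of last word).
    return [(" ".join(w for w, _, _ in cue), cue[0][1], cue[-1][2])
            for cue in cues]

cues = to_cues([("hello", 0.0, 0.3), ("world", 0.4, 0.7),
                ("goodbye", 1.5, 1.8)])
# -> [("hello world", 0.0, 0.7), ("goodbye", 1.5, 1.8)]
```

The same grouping logic works for lip-sync keyframes or turn-tracking: the per-word boundaries are the primitive, and the application decides how to aggregate them.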
Gradium STT: Speech-To-Text with Semantic VAD
Gradium ships streaming STT alongside its TTS, with Semantic Voice Activity Detection included. Semantic VAD determines turn-taking based on whether the speaker has finished a thought, rather than acoustic silence alone, which removes the need for a separate VAD service in the voice agent pipeline. This is the foundation for the unified TTS, STT, and voice cloning stack covered in our voice agent guide with LiveKit and the audiobook agent with Pipecat.
ElevenLabs Flash v2.5 and Multilingual v2
Flash v2.5 reaches 288 ms P50 TTFA on the Coval benchmark and supports 32 languages, priced at 0.5 credits per character (credit-based, with rates varying by plan). It is designed for real-time conversational AI. Multilingual v2 reaches 1,232 ms P50 TTFA on the same benchmark with 29-language coverage, at 1 credit per character under the same credit model. Multilingual v2 prioritizes voice quality over latency and is not recommended for real-time voice agents. Word-level timestamps are available on both. For a head-to-head with Gradium, see our ElevenLabs alternative comparison.
Cartesia Sonic-3
Sonic-3 reaches 188 ms P50 TTFA on the Coval benchmark with support for 40+ languages and regional accents. The API is available via REST and WebSocket. Cartesia includes emotional expressiveness controls and Instant voice cloning from 10 seconds of audio. Plans are credit-based: Pro ($4/month, 100K credits, annual billing), Startup ($39/month, 1.25M credits, annual billing), Scale ($239/month, 8M credits, annual billing). For a head-to-head with Gradium, see our Cartesia alternative comparison.
Deepgram Aura-2
Aura-2 reaches 313 ms P50 TTFA on the Coval benchmark with support for 7 languages (English, Spanish, French, German, Dutch, Italian, Japanese) over WebSocket streaming. Aura-2 does not offer voice cloning. Pricing is $0.030 per 1,000 characters ($0.027 at Growth tier). For a head-to-head with Gradium, see our Deepgram alternative comparison.
OpenAI TTS
OpenAI delivers audio via HTTP chunked transfer encoding, not a persistent WebSocket, for the standard TTS endpoint. Multiple built-in voices are available, currently optimized for English per official documentation. There is no publicly available voice cloning (Voice Engine remains in limited preview as of 2026). Pricing: tts-1 at $15 per 1M characters, tts-1-hd at $30 per 1M characters, gpt-4o-mini-tts on token-based pricing ($0.60 per 1M text input tokens plus $12 per 1M audio output tokens).
Voice Cloning: Instant and Pro
Gradium offers two voice cloning tiers. Instant Voice Cloning creates a custom voice from as little as 10 seconds of audio, immediately available for TTS generation, with 5 clones on the free tier and up to 1,000/month on paid plans. Professional Voice Cloning is a fine-tuned model for higher speaker fidelity, available from the M plan (5 included) and L plan (20 included).
In a blinded human evaluation benchmark of 3,220 voice pairs (890 sentences per language, 20 voices per language, 10-second source, live Elo ranking), Gradium achieved the highest Elo score across English, French, Spanish, and German. The full methodology is in why your voice cloning sounds fake (and how to fix it).
Among the providers in this comparison, Gradium, ElevenLabs, and Cartesia all support voice cloning. Deepgram Aura-2 and OpenAI standard TTS do not. Always obtain the explicit consent of the person whose voice you are cloning before using any voice cloning API.
Languages: Native Fluency Across Five Languages
Gradium supports five languages with native fluency: English, French, Spanish, German, and Portuguese, with regular updates. Mid-sentence code-switching is supported across all five languages without latency penalty, which matters for multilingual users who switch languages within a single sentence.
Cartesia Sonic-3 offers the broadest coverage in this comparison at 40+ languages. ElevenLabs supports 32 languages across Flash v2.5 and Multilingual v2. Deepgram Aura-2 supports 7. OpenAI TTS supports multiple languages but documentation states voices are currently optimized for English. Raw language count is not the only consideration: per-language quality and consistency vary significantly across providers.
Deployment Options
Gradium offers four deployment paths with the same model and API across all of them: cloud marketplace, private cloud, on-premise, and on-device. The on-device option uses the Phonon model running on a standard smartphone CPU without a server dependency, designed for edge deployments in gaming, mobile applications, and robotics. Among the providers in this comparison, only Deepgram offers on-premise deployment, and none of ElevenLabs, Cartesia, or OpenAI ships an on-device TTS in its standard developer offering.
For LiveKit and Pipecat users, Gradium ships official integrations. See how to build a voice AI agent with Gradium and LiveKit and building an audiobook agent with Gradium and Pipecat.
Gradium Pricing
Gradium uses a credit-based system: 1 character of TTS equals 1 credit, 1 second of STT equals 3 credits. The free plan includes 45,000 credits per month (approximately 1 hour of TTS or 4 hours of STT) with 5 Instant Voice Clones and no credit card required. Paid plans start at $13/month (XS) and scale to $1,615/month (L), with pay-as-you-go credits available on all paid plans. Full plan details, including bundled Pro clones and credit conversion rates, are at gradium.ai/pricing.
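The stated conversion rates make credit budgeting straightforward to compute; a minimal sketch using the rates above:

```python
# Credit math per the stated rates: 1 TTS character = 1 credit,
# 1 second of STT = 3 credits.
FREE_CREDITS = 45_000

def credits_used(tts_chars: int, stt_seconds: int) -> int:
    """Credits consumed by a mixed TTS + STT workload."""
    return tts_chars * 1 + stt_seconds * 3

# Spending the entire free allocation on STT:
stt_hours = FREE_CREDITS / 3 / 3600  # ~4.17 hours, matching "~4 hours"
```

Because TTS and STT draw from the same pool, a workload of, say, 10,000 synthesized characters plus 5,000 seconds of transcription consumes 25,000 credits, comfortably within the free allocation.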
Gradium also runs a Startup Program offering $2,000+ in free credits and 6 months of full API access for qualifying seed-funded teams (M plan equivalent: 1,200 hours of TTS or 4,998 hours of STT).
How Do You Choose the Right TTS API in 2026?
The right TTS API depends on your use case, stack, and scale.
- Choose Gradium if you need a unified TTS, STT, and voice cloning stack from a single API, require voice cloning accessible from entry-level plans, need pronunciation robustness for structured data in voice agent pipelines, or are building for edge and on-device deployment.
- Choose ElevenLabs Flash v2.5 if voice naturalness and voice library breadth are the primary criteria and your volume does not yet justify enterprise contracts.
- Choose Cartesia Sonic-3 if consistent low latency at P99 is the top priority or if you need TTS coverage across 40+ languages.
- Choose Deepgram Aura-2 if you are already using Deepgram Nova for STT and want to simplify your vendor stack, or if on-premise deployment with HIPAA compliance is a requirement.
- Choose OpenAI TTS if you need simple narration or TTS within an existing OpenAI-native stack and do not require voice cloning or WebSocket streaming.
For a deep dive specifically on real-time voice agents (latency-driven selection criteria, end-to-end pipeline budgets, WebSocket session reuse) see our guide to the best Text-To-Speech APIs for voice agents. Also comparing specific vendors? See Cartesia alternative with Gradium, ElevenLabs alternative with Gradium, and Deepgram alternative with Gradium.