What Is the Best Text-to-Speech API in 2026 to Build Voice Agents? Complete Developer Comparison

25 min read

TL;DR: Gradium is the best overall TTS API for building voice agents in 2026. It achieves 258ms P50 Time To First Audio (214ms with WebSocket multiplexing), the highest voice cloning ELO scores across four languages in blind evaluations, native pronunciation of structured data, a 19ms P25-P95 latency spread under concurrent load, and five deployment options from cloud API to full on-premises. Cartesia Sonic 3 is faster on raw latency. ElevenLabs Flash v2.5 supports more languages. But no other provider matches Gradium across all five criteria that matter for production voice agents: latency, voice quality, pronunciation robustness, stability under load, and deployment flexibility.

How to choose a TTS API for voice agents

Gradium achieves 258ms P50 Time To First Audio in benchmarks against ElevenLabs and OpenAI, with the highest voice cloning ELO scores across four languages. But latency and quality alone do not determine whether a TTS API works for production voice agents. A TTS API that performs well in isolated tests can fail under real workloads. For voice agents handling real conversations at scale, five criteria separate a viable engine from a liability:

  1. Latency. Time To First Audio (TTFA) determines whether a voice agent feels conversational or sluggish. In natural dialogue, the gap between turns averages around 200ms. TTS is the last stage in a cascaded STT → LLM → TTS pipeline, so every millisecond it adds is a millisecond the user waits. The target is sub-300ms TTFA.

  2. Voice quality. Users judge a voice agent within the first few seconds. The TTS needs to produce expressive, natural-sounding speech with accurate prosody, emotion, and pacing across different accents, speaking styles, and languages. A catalog of diverse preset voices and high-fidelity voice cloning from short audio samples are both important: preset voices cover common use cases quickly, while cloning enables branded or personalized agents.

  3. Pronunciation robustness. Production voice agents encounter dates, email addresses, phone numbers, currency amounts, alphanumeric codes, and domain-specific terminology. A TTS that stumbles on a confirmation number or mispronounces a street name erodes trust. Reliable pronunciation across edge cases, without requiring manual preprocessing, is essential.

  4. Stability under load. A single-request benchmark is not a production benchmark. Voice agents run hundreds or thousands of concurrent calls. The TTS must maintain consistent TTFA and audio quality at scale, with tight P95 latency and no degradation as concurrency increases.

  5. Flexible deployment. Regulatory requirements, data sovereignty, and infrastructure constraints vary by customer. The TTS provider needs to offer cloud, private cloud, and on-premises options so the same engine works for a startup prototype and a regulated enterprise deployment.

This post evaluates Gradium against ElevenLabs, OpenAI, and Cartesia across all five criteria, using published benchmarks, blind listening tests, and production deployment data.

How to measure TTS latency correctly

Most TTS latency benchmarks report Time To First Byte (TTFB). This is misleading. The first bytes a streaming TTS API returns are container metadata (WAV headers, Ogg identification pages, MP3 ID3 tags), not playable audio. A server might respond with headers in 50ms while the first actual audio samples arrive 200ms later. A naive benchmark reports 50ms. The user experiences 250ms.

Time To First Audio (TTFA) measures the delay from request submission to the first chunk of playable audio reaching the client. This is the metric that correlates with perceived responsiveness.
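The distinction can be illustrated with a toy simulated stream whose first chunk is container metadata rather than audio. This is a sketch, not a real TTS client: the chunk contents and the 200ms generation delay are made up for illustration.

```python
import time

# Toy model: a streaming TTS response whose first chunk is container
# metadata (a WAV-style header) and whose playable audio arrives later.
def simulated_stream():
    yield b"RIFF....WAVEfmt "     # header bytes arrive quickly (sets TTFB)
    time.sleep(0.2)               # model still generating the first samples
    yield b"\x00\x01" * 480       # first playable PCM chunk (sets TTFA)

def measure(stream):
    t0 = time.monotonic()
    ttfb = ttfa = None
    for chunk in stream:
        if ttfb is None:
            ttfb = time.monotonic() - t0   # first byte of anything
        if ttfa is None and chunk[:4] != b"RIFF":
            ttfa = time.monotonic() - t0   # first byte of actual audio
    return ttfb, ttfa

ttfb, ttfa = measure(simulated_stream())
print(f"TTFB {ttfb*1000:.0f}ms, TTFA {ttfa*1000:.0f}ms")
```

A benchmark reporting only the first value would claim near-zero latency here, while the user waits roughly 200ms longer for sound.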

TTS API performance comparison (2026)

The table below shows TTFA measurements across Gradium, ElevenLabs, and OpenAI TTS models. All measurements were taken from Paris using WebSocket APIs where available (POST for OpenAI, which does not offer WebSocket on its TTS API). Input: standardized 15-25 word sentence, same output format and sample rate across providers. 100 queries per model, first 5 discarded (warm state). Network ping: ~5ms to Gradium and ElevenLabs endpoints, ~3ms to OpenAI. Full methodology in Time to First Audio: Measuring and reducing TTS latency in voice agents.

| Model | P25 | P50 | P75 | P95 | Instant voice cloning | Languages |
| --- | --- | --- | --- | --- | --- | --- |
| Gradium | 255ms | 258ms | 263ms | 274ms | Yes (10s sample) | EN, FR, DE, ES, PT |
| ElevenLabs Turbo v2.5 | 294ms | 304ms | 311ms | 324ms | Yes | 32 languages |
| ElevenLabs Flash v2.5 | 317ms | 324ms | 333ms | 351ms | Yes | 32 languages |
| OpenAI GPT-4o Mini TTS | 400ms | 420ms | 439ms | 483ms | Limited (Custom Voices, restricted access) | Multiple |
| ElevenLabs Multilingual v2 | 690ms | 706ms | 720ms | 742ms | Yes | 29 languages |
| OpenAI TTS-1 | 722ms | 969ms | 1,232ms | 1,807ms | Limited (Custom Voices, restricted access) | Multiple |
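Percentile columns like those above can be reproduced from raw samples. The sketch below uses synthetic latencies and follows the stated methodology (discard the first five warm-up queries before computing percentiles); the numbers it prints are illustrative, not a re-run of the benchmark.

```python
import random
import statistics

random.seed(7)
# Synthetic TTFA samples (ms): 100 queries with a Gradium-like spread.
samples = [random.gauss(259, 6) for _ in range(100)]

warm = samples[5:]  # discard the first 5 queries (cold-start state)

# statistics.quantiles with n=100 returns the 1st..99th percentile cuts.
q = statistics.quantiles(warm, n=100)
p25, p50, p75, p95 = q[24], q[49], q[74], q[94]
print(f"P25 {p25:.0f}ms  P50 {p50:.0f}ms  P75 {p75:.0f}ms  P95 {p95:.0f}ms")
```

The P25-P95 spread of this distribution is the "stability" number discussed later in the post: the tighter it is, the more predictable each request feels.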

With WebSocket multiplexing (persistent connection, no per-turn connection overhead), latency drops further:

| Model | P25 | P50 | P75 | P95 |
| --- | --- | --- | --- | --- |
| Gradium | 212ms | 214ms | 219ms | 228ms |
| ElevenLabs Turbo v2.5 | 248ms | 257ms | 263ms | 278ms |
| ElevenLabs Flash v2.5 | 271ms | 277ms | 284ms | 302ms |
| ElevenLabs Multilingual v2 | 643ms | 657ms | 672ms | 688ms |

Two things stand out. First, Gradium's P50 at 258ms sits under the 300ms conversational threshold, and its P95 at 274ms stays there too: latency is predictable at the tail, not just at the median. With multiplexing, P50 drops to 214ms. Second, Gradium is the only provider in this group that combines sub-300ms TTFA with the highest measured voice cloning fidelity across four languages, word-level streaming, robust pronunciation, and native multilingual support without a latency penalty.

Streaming architecture built for voice agent pipelines

Gradium's TTS generates audio incrementally as text tokens arrive. In a typical voice agent pipeline, the LLM streams tokens to the TTS, and the TTS begins generating audio from the first tokens without waiting for the full sentence.

This matters in practice because:

  • Audio playback starts before the LLM has finished generating its response, reducing perceived end-to-end latency.
  • Word-level streaming allows precise synchronization between text and audio, which is necessary for features like real-time captions, lip sync, and barge-in detection.
  • The tight spread between P25 (255ms) and P95 (274ms) in the benchmarks above indicates consistent generation times per request. For concurrent workloads, Gradium's architecture uses batched generation with CUDA graph optimization, which is designed to maintain this profile as concurrency scales.
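The incremental handoff described above can be sketched as a toy simulation. Everything here is synthetic (no real LLM or TTS is called); the point is only that audio chunks are emitted per word boundary, before the token stream is exhausted.

```python
# Toy model: the LLM streams tokens; the TTS emits an audio chunk per
# completed word instead of waiting for the full sentence.
def llm_tokens():
    for tok in ["Your", " appoint", "ment", " is", " confirmed", "."]:
        yield tok

def tts_stream(tokens):
    """Yield (word, fake_audio) as soon as each word boundary arrives."""
    buf = ""
    for tok in tokens:
        buf += tok
        while " " in buf:
            word, buf = buf.split(" ", 1)
            if word:
                yield word, b"\x00" * 320   # placeholder audio chunk
    if buf:
        yield buf, b"\x00" * 320            # flush the final word

events = [word for word, _audio in tts_stream(llm_tokens())]
print(events)  # first chunk is ready long before the sentence completes
```

In a real pipeline the per-word audio chunks go straight to the playback buffer, which is why the user hears speech before the LLM has finished its turn.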

For developers integrating Gradium into agent frameworks, the API supports WebSocket streaming and is compatible with LiveKit, Pipecat, and other major orchestration frameworks out of the box.

How Gradium compares on streaming: Gradium, ElevenLabs, and Cartesia all provide WebSocket streaming on their TTS APIs. OpenAI's TTS API uses POST-based streaming only, with no WebSocket option. Cartesia streams via WebSocket with the lowest raw latency in this group (Sonic 3 at ~90ms TTFA). Where Gradium differentiates is the combination of word-level timestamps and connection multiplexing (reusing a single WebSocket across conversation turns, saving ~50ms per turn). Word-level sync enables real-time captions, lip sync, and precise barge-in detection. Multiplexing matters in multi-turn voice agents where connection overhead adds up across dozens of exchanges.
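The per-turn saving from multiplexing can be sanity-checked with back-of-envelope arithmetic. The ~50ms handshake figure and the 214ms multiplexed P50 come from the text above; the turn count is illustrative.

```python
# Latency cost of reconnecting every turn vs. one persistent WebSocket.
HANDSHAKE_MS = 50      # approx. per-connection overhead (from the text)
TTFA_MS = 214          # Gradium P50 with multiplexing
turns = 30             # illustrative multi-turn support call

reconnect_total = turns * (HANDSHAKE_MS + TTFA_MS)   # new socket each turn
multiplexed_total = HANDSHAKE_MS + turns * TTFA_MS   # one socket, reused

print(f"reconnect each turn: {reconnect_total} ms of cumulative TTS wait")
print(f"multiplexed:         {multiplexed_total} ms of cumulative TTS wait")
print(f"saved:               {reconnect_total - multiplexed_total} ms")
```

The saving grows linearly with turn count, which is why multiplexing matters most in long conversations rather than one-shot synthesis.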

Voice cloning: highest speaker similarity from 10 seconds of audio

Voice agents need two things from their TTS: a voice that sounds natural, and a voice that sounds like the right person. The first is about the model's ability to produce expressive, human-like speech across a wide range of accents, speaking styles, emotional registers, and pacing patterns. The second is about identity: can the system reproduce a specific speaker's characteristics from a short audio sample, and maintain that identity consistently across sessions, languages, and content types?

These two capabilities are closely related. A model that cannot handle diverse accents and speaking styles in its base generation will also produce flat, generic voice clones. The quality of cloned voices reflects the model's underlying ability to represent the full range of human speech variation, not just its ability to copy a waveform.

Gradium evaluated this through a live ELO ranking system: 3,220 blind A/B listening tests across four languages (EN, FR, DE, ES), where human evaluators judged overall voice quality, naturalness, and speaker similarity. The results:

| Language | Gradium ELO | ElevenLabs Flash 2.5 ELO | Gap |
| --- | --- | --- | --- |
| English | ~1950 | ~1880 | +70 |
| French | ~2040 | ~1780 | +260 |
| German | ~2170 | ~1790 | +380 |
| Spanish | ~2030 | ~1880 | +150 |

Gradium leads in all four languages, with the gap widening significantly outside English. In French, German, and Spanish, the ELO difference ranges from +150 to +380 points, reflecting a substantial perceptual quality advantage in non-English voice cloning. Gradium's cloning preserves micro-traits that other systems tend to smooth out: vocal fry, breathiness, pitch dynamics, accent characteristics, and speaking style. This holds across languages, meaning a voice cloned from an English sample maintains its identity when generating French or German speech.

The minimum audio requirement is 10 seconds for instant cloning. No fine-tuning step, no per-voice training job. Clone a voice via the API, and it is available for synthesis immediately across all supported languages.
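As a sketch of what an instant-clone call might look like: the endpoint path, field names, and response shape below are guesses for illustration, not the documented Gradium API. Check docs.gradium.ai for the real request format before using this.

```python
import base64
import json
import urllib.request

API_KEY = "YOUR_API_KEY"
# HYPOTHETICAL endpoint and payload shape -- see docs.gradium.ai for
# the actual API. Only the workflow (one call, no training job) is
# taken from the text above.
CLONE_URL = "https://api.gradium.ai/v1/voices/clone"

def clone_voice(sample_path: str, name: str) -> str:
    """Upload a >=10s audio sample and return the new voice id."""
    with open(sample_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    req = urllib.request.Request(
        CLONE_URL,
        data=json.dumps({"name": name, "audio": audio_b64}).encode(),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:   # single network call
        return json.load(resp)["voice_id"]      # usable immediately
```

The key property, per the text above, is the absence of a fine-tuning step: the returned voice id is immediately valid for synthesis in all supported languages.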

For use cases that require the highest possible fidelity, Gradium offers Pro Voice Clones: a higher-tier cloning option that produces even more accurate voice reproduction, suitable for branded agents, public-facing products, or any application where the cloned voice needs to be indistinguishable from the original speaker.

Gradium also provides a catalog of pre-built voices spanning a range of ages, accents, tones, and genders, so developers can match a voice to their brand or audience without recording custom audio. For cases where the catalog does not fit, instant cloning fills the gap: supply 10 seconds of any target voice, and the system produces a clone that can be used across all languages and sessions. Between the catalog and cloning, there is no constraint on voice identity. Customer service agents, in-game NPCs, branded assistants, and accessibility tools can each use a voice that matches their context.

How Gradium compares on voice cloning: ElevenLabs, Cartesia, and Gradium all offer voice cloning from short audio samples (10-15 seconds). OpenAI has announced Custom Voices but access remains restricted to selected partners as of early 2026. The difference among available options is measurable quality. In blind ELO evaluations across four languages, Gradium leads ElevenLabs Flash 2.5 by +70 (English) to +380 (German). The gap is largest in non-English languages, where accent preservation and speaker identity are hardest. Cartesia's Sonic 3 supports cloning but independent per-language blind evaluations at this scale are not publicly available for comparison. OpenAI's Custom Voices remain in restricted access and lack published cross-language ELO data. For applications where the cloned voice needs to sound right across accents and languages, Gradium's lead in blind tests is the clearest signal available.

Multilingual support: same latency, same quality, every language

Gradium natively supports five languages within a single model architecture: English, French, German, Spanish, and Portuguese. This is not a separate model per language. The same model handles all five while maintaining speaker identity.

Most TTS providers now support multiple languages, but language coverage and localization quality are different things. A model can generate speech in 30+ languages while still sounding flat or generic in non-English accents. For voice agents, the goal is not just generating French or German speech, but generating it in a way that sounds local: correct accent, natural prosody, and consistent speaker identity. Gradium's architecture handles all five languages at the same latency with no speed penalty for switching between them.

The ELO voice cloning benchmarks above reinforce this. Gradium's quality advantage is largest in non-English languages, which means voice agents operating across European markets get natural-sounding output in every supported language, not just English.

This is particularly relevant for:

  • Voice agents serving multilingual users. A single deployment handles callers in any supported language without routing to separate TTS instances or accepting higher latency.
  • Live translation workflows. Gradium has partnered with Acolad, a global language services provider, to integrate real-time voice into multilingual enterprise workflows.
  • Pan-European and Latin American deployments at scale. A contact center operating across France, Germany, Spain, Portugal, and Brazil uses one API, one model, one consistent voice identity, with no per-language latency tradeoff. This simplifies infrastructure and reduces operational cost compared to maintaining separate TTS providers or model variants per region.

How Gradium compares on multilingual: ElevenLabs Flash v2.5 supports 32 languages, Cartesia covers 40+, and OpenAI handles multiple languages (Custom Voices announced but access is restricted). Raw language count is not the differentiator. What matters for localization is how well the cloned voice preserves accent, identity, and naturalness in each target language. Gradium's ELO advantage is largest in non-English languages (+260 French, +380 German, +150 Spanish), which means a voice agent localized for European markets sounds more natural and more like the original speaker. Other providers support the same languages on paper, but the perceptual quality gap widens outside English. For teams building voice agents that need to sound local in every market they serve, that gap is what determines whether users trust the voice.

Pronunciation accuracy: handling real-world voice agent inputs

Most TTS demos use clean, well-formed sentences. Production voice agents read back confirmation numbers, email addresses, dates in regional formats, currency amounts, phone numbers, URLs, and mixed alphanumeric strings. A single mispronunciation in a booking confirmation or account number erodes user trust.

Gradium handles these cases natively without requiring text preprocessing or special formatting. The API also accepts a custom pronunciation dictionary for domain-specific terms (medical terminology, brand names, product codes, internal acronyms) that a general model would not encounter in training data.

This matters because some competing providers require developers to manually format inputs for correct pronunciation (inserting spaces in phone numbers, spelling out abbreviations, reformatting dates). With Gradium, the same raw text the LLM generates can be sent directly to the TTS without an intermediate normalization step. In testing, pronunciation remains consistent across long-form generations without degradation.
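For contrast, this is the kind of normalization shim teams end up writing for engines that stumble on structured inputs. It is a minimal illustrative example, not a complete normalizer; with native handling, this layer disappears entirely.

```python
import re

def normalize_for_fragile_tts(text: str) -> str:
    """The preprocessing shim some engines need before reading back
    structured data. A TTS with native handling takes raw LLM output."""
    # Space out long digit runs so "84613027" is read digit by digit.
    text = re.sub(r"\d{7,}", lambda m: " ".join(m.group()), text)
    # Spell out '@' so email addresses are not mangled.
    text = text.replace("@", " at ")
    return text

raw = "Confirmation 84613027, reach us at help@example.com"
print(normalize_for_fragile_tts(raw))
```

Every rule added to a shim like this is another edge case to maintain and another place where the read-back can silently go wrong.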

How Gradium compares on pronunciation: ElevenLabs and OpenAI handle standard text well but can require preprocessing for structured data like phone numbers, alphanumeric codes, and mixed-format strings. Cartesia's pronunciation handling is less documented. Gradium handles these edge cases natively, plus offers a custom pronunciation dictionary API for domain-specific terms. For voice agents that read back confirmation numbers, addresses, and account details, this eliminates a preprocessing step and reduces the surface area for errors.

TTS API pricing: free tier to enterprise

Gradium offers tiered pricing designed to scale from prototyping through production:

| Plan | Price | TTS hours (approx.) | Concurrent requests | Voice clones |
| --- | --- | --- | --- | --- |
| Free | $0/month | ~1 hour | 3 | Limited |
| XS | $13/month | ~5 hours | 5 | 1,000 |
| S | $43/month | ~20 hours | 5 | 1,000 |
| M | $340/month | ~200 hours | 10 | 1,000 |
| L | $1,615/month | ~1,000 hours | 15 | Unlimited |
| Enterprise | Custom | Custom | Unlimited | Unlimited |

Gradium prices by millions of characters processed, not by audio duration. The "TTS hours" column above is an approximation based on typical speaking rates (~150 words per minute, ~5 characters per word). Actual hours vary depending on the content: dense technical text with abbreviations and numbers yields fewer audio hours per character than conversational dialogue.

At the L tier, the effective rate is approximately $1.62/hour of synthesized speech. Pay-as-you-go overflow is available at $3.80 per 100k additional credits.
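The hours approximation works out as follows, using the speaking-rate assumptions stated above (~150 words per minute, ~5 characters per word):

```python
# Rough conversion from character-based pricing to audio hours.
CHARS_PER_MIN = 150 * 5               # ~750 characters of text per minute
CHARS_PER_HOUR = CHARS_PER_MIN * 60   # ~45,000 characters per hour

def est_hours(characters: int) -> float:
    """Estimate synthesized audio hours for a character count."""
    return characters / CHARS_PER_HOUR

print(f"{CHARS_PER_HOUR:,} chars ~= 1 hour of speech")
print(f"45M chars ~= {est_hours(45_000_000):.0f} hours")
print(f"L tier effective rate: ${1_615 / 1_000}/hour")   # ~1,000 hours/month
```

As the text notes, dense content (numbers, abbreviations, codes) takes longer to speak per character, so treat these figures as planning estimates rather than billing math.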

The Free tier provides enough usage for meaningful evaluation (~1 hour of TTS with 3 concurrent requests), which is more generous than most competing real-time TTS APIs at the free level.

Voice cloning is included in all paid plans. There are no separate charges for streaming, WebSocket access, or API features.

For high-volume production deployments, enterprise pricing includes custom concurrency limits, SLA guarantees, and on-premises deployment options. Contact Gradium for details.

How Gradium compares on pricing: ElevenLabs' pricing starts higher for equivalent usage tiers and charges separately for some features (voice cloning quality tiers, certain API access modes). OpenAI's TTS pricing is competitive per-character but deployment is cloud-only. Cartesia offers competitive pricing with fast latency but fewer deployment options. Gradium includes voice cloning, WebSocket streaming, and multiplexing in all paid plans with no feature gating. The Free tier (3 concurrent requests, ~1 hour of TTS) is sufficient for meaningful evaluation before committing.

Who is using Gradium in production

Voice agents and conversational AI. This is Gradium's primary design target. Production applications include customer service bots, AI receptionists, appointment booking systems, market research and survey calls, sales automation, outbound calling, and IVR systems. Wonderful, which builds AI voice agents for real-world conversational use cases, runs on Gradium's streaming TTS infrastructure. In regulated industries (healthcare, finance), on-premises deployment ensures audio data never leaves the customer's environment.

Gaming and interactive entertainment. Sub-300ms latency enables dynamic NPC dialogue that responds to player input without breaking immersion. Ten-second voice cloning means hundreds of unique character voices can be generated programmatically rather than recorded in a studio. InteractionLabs uses Gradium to bring expressive, real-time voice AI to its Ongo robot.

Live translation and interpretation. Native multilingual support with voice identity preservation across languages makes Gradium suitable for real-time speech translation and conference interpretation. Acolad, a global language services provider, has partnered with Gradium to integrate real-time voice into multilingual enterprise workflows.

Accessibility. Gradium powers Invincible Voice, an open-source assistive system helping people with ALS and speech loss communicate in real time using their own cloned voice.

Framework integrations. Gradium has close partnerships with Pipecat and LiveKit, the two most widely used open-source frameworks for building real-time voice agent pipelines. Both offer native Gradium plugins maintained in collaboration with the Gradium team. Beyond Pipecat and LiveKit, Gradium integrates with any voice agent framework that supports WebSocket or REST-based TTS. The API is framework-agnostic: if your orchestration layer can send text and receive streaming audio, Gradium works with it.

TTS deployment options: cloud to on-premises

Gradium supports five deployment models, depending on latency requirements, data constraints, and infrastructure:

  • Cloud API. The fastest way to get started. Hosted API with endpoints in multiple regions. Suitable for prototyping and production workloads where data residency is not a constraint.
  • Inference partner deployments. Gradium deploys its API on infrastructure partners in multiple locations worldwide, allowing colocation with existing LLM providers to minimize inter-service latency.
  • Dedicated instances. Reserved compute with guaranteed capacity and deterministic latency. No shared-infrastructure variance.
  • Private cloud. Self-hosted inference on your own GPU infrastructure. You manage the deployment; Gradium provides the model and support.
  • On-premises. Full deployment within your environment for strict data sovereignty or regulatory constraints (healthcare, financial services). Audio data never leaves your infrastructure.

All deployment models support the same API surface, the same models, and the same voice cloning capabilities. Enterprise plans include unlimited concurrent requests, multi-region deployment, auto-scaling, 99.9% uptime SLA, and direct engineering support.

Developer tooling includes a REST API, official Python and Rust SDKs, comprehensive documentation at docs.gradium.ai, and native integrations with LiveKit and Pipecat.

How Gradium compares on deployment: ElevenLabs and OpenAI are cloud-only. Cartesia offers cloud plus some self-hosted options. Gradium provides five deployment models (cloud, inference partners, dedicated instances, private cloud, on-premises), all with the same API surface and model capabilities. For regulated industries (healthcare, finance, government) or teams with strict data sovereignty requirements, the range from shared cloud to full on-prem under one provider is often the deciding factor.

Overall comparison: Gradium vs the field

No single metric tells the full story. A TTS API can win on latency but lack voice cloning. It can offer broad language support but at 3x the latency. The table below summarizes how each provider performs across the five criteria that matter for production voice agents.

| Criteria | Gradium | ElevenLabs | OpenAI | Cartesia |
| --- | --- | --- | --- | --- |
| Latency (TTFA) | 258ms P50 / 274ms P95 (214ms / 228ms multiplexed) | 304ms P50 / 324ms P95 (Turbo v2.5) | 420ms P50 / 483ms P95 (GPT-4o Mini TTS) | ~90ms P50 (Sonic 3, self-reported) |
| Voice quality (cloning) | Highest ELO in 4/4 languages (3,220 blind tests) | Good cloning, 32 languages (lower ELO in non-EN) | 13 presets; Custom Voices announced (restricted access) | Cloning (15s), 40+ languages |
| Pronunciation | Native handling + custom dictionary API | Good, some preprocessing for edge cases | Standard | Good |
| Stability under load | 19ms P25-P95 spread (255-274ms), CUDA graph optimization | 30ms spread (294-324ms, Turbo) | 83ms spread (400-483ms) | Not independently benchmarked |
| Deployment flexibility | Cloud, partners, dedicated, private cloud, on-prem | Cloud only | Cloud only | Cloud + limited self-hosted |

Note: Gradium, ElevenLabs, and OpenAI latency numbers are from Gradium's published benchmarks under identical conditions. Cartesia's ~90ms figure is from Cartesia's own published claims and was not tested in the same benchmark. Independent head-to-head testing under identical methodology would be needed for a direct comparison.

Every provider in this table is production-capable. The question is where each one forces a tradeoff.

Gradium vs Cartesia

Cartesia Sonic 3 reports the lowest raw latency in the market (~90ms TTFA). For applications where latency is the only priority, Cartesia is a strong choice. Where Gradium differentiates: the highest measured voice cloning fidelity across four languages in blind evaluations, a custom pronunciation dictionary API, and five deployment models including full on-premises. Cartesia's voice cloning quality across non-English languages has not been independently benchmarked at the same scale.

Gradium vs ElevenLabs

ElevenLabs is the closest competitor across all five criteria. Flash v2.5 supports 32 languages at good latency (324ms P50). The tradeoff: in blind ELO evaluations, ElevenLabs Flash trails Gradium by 70 (English) to 380 (German) ELO points on voice cloning fidelity. The gap is largest in non-English languages, which matters for localized voice agents. ElevenLabs also has a wider P25-P95 spread (30ms vs Gradium's 19ms) and is cloud-only, with no on-premises or private cloud deployment option.

Gradium vs OpenAI

OpenAI offers the simplest integration for teams already using the OpenAI ecosystem, with 13 preset voices and Custom Voices announced but still in restricted access. The tradeoffs: higher latency (420ms P50), a wider P25-P95 spread (83ms), POST-only streaming (no WebSocket), cloud-only deployment, and no published cross-language voice cloning benchmarks. For teams that need sub-300ms TTFA, voice cloning available today, WebSocket streaming, or deployment flexibility, Gradium fills the gaps OpenAI leaves.

Gradium's differentiator is not winning any single metric in isolation. It is the only provider where sub-300ms latency with a 19ms P25-P95 spread, the highest measured cloning fidelity across languages, robust pronunciation, and five deployment options all ship in the same API. That combination is what matters when the voice agent goes to production.

This post focused on choosing a TTS API for voice agents.

Getting started

Gradium offers a free tier for evaluation. Sign up at gradium.ai, generate an API key, and start streaming TTS in minutes. Documentation and quickstart guides are available at docs.gradium.ai.
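A minimal streaming sketch of what a first call might look like. The endpoint URL, parameter names, and response framing are assumptions for illustration; the authoritative request shape is in the documentation at docs.gradium.ai.

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"
# HYPOTHETICAL endpoint and payload -- consult docs.gradium.ai for the
# real streaming API (the WebSocket interface is the recommended path
# for voice agents; this HTTP sketch is only a starting point).
TTS_URL = "https://api.gradium.ai/v1/tts/stream"

def stream_tts(text: str, voice_id: str = "default"):
    """Yield audio chunks as the server streams them back."""
    req = urllib.request.Request(
        TTS_URL,
        data=json.dumps({"text": text, "voice_id": voice_id}).encode(),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        while chunk := resp.read(4096):
            yield chunk   # feed into an audio player as chunks arrive

# Usage sketch (requires a valid key and endpoint):
# for chunk in stream_tts("Hello from Gradium"):
#     player.write(chunk)
```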

For enterprise evaluations or technical questions, reach out at contact@gradium.ai or visit gradium.ai.

Frequently Asked Questions

What is the best text-to-speech API for voice agents in 2026?
Gradium is the best overall TTS API for production voice agents in 2026. It achieves 258ms P50 TTFA (214ms with WebSocket multiplexing), the highest voice cloning ELO scores across four languages in blind evaluations, native pronunciation handling for structured data, a 19ms P25-P95 latency spread under load, and five deployment options from cloud to on-premises.
What is TTFA?
Time To First Audio (TTFA) measures the delay between sending text to a TTS API and receiving the first chunk of playable audio. Unlike TTFB (Time To First Byte), TTFA excludes container headers and measures actual audio arrival. For voice agents, TTFA directly determines how responsive the agent feels in conversation. The target for natural-sounding dialogue is sub-300ms.
Which TTS API has the lowest latency?
Cartesia Sonic 3 reports the lowest raw TTFA (~90ms P50, per Cartesia's published numbers). In Gradium's controlled benchmarks, Gradium achieves 258ms P50 TTFA (214ms with multiplexing), the lowest among Gradium, ElevenLabs, and OpenAI under identical test conditions.
Gradium vs ElevenLabs: which TTS is better for voice agents?
Gradium outperforms ElevenLabs on voice cloning quality (higher ELO scores in all four tested languages, with the gap widening to +380 in German), latency (258ms vs 304ms P50), latency consistency (19ms P25-P95 spread vs 30ms), and deployment flexibility (five options vs cloud-only). ElevenLabs supports more languages (32 vs 5).
Gradium vs Cartesia: which TTS is faster?
Cartesia Sonic 3 reports lower raw latency (~90ms P50 vs Gradium's 258ms P50). Gradium differentiates with the highest measured voice cloning fidelity across four languages in blind evaluations, a custom pronunciation dictionary API, WebSocket multiplexing, and five deployment models including full on-premises.
Gradium vs OpenAI TTS: which should I use?
Gradium is faster (258ms vs 420ms P50), more consistent under load (19ms P25-P95 spread vs 83ms), and offers five deployment models vs cloud-only. OpenAI TTS is simpler for teams already in the OpenAI ecosystem with 13 preset voices.
How does Gradium voice cloning work?
Gradium uses cross-attention neural networks that attend to a speaker recording during generation. From 10 seconds of audio, the system produces a voice clone usable immediately across all five supported languages. In blind A/B tests (3,220 evaluations), Gradium achieved the highest ELO speaker similarity scores in all four tested languages.
How much does Gradium TTS cost?
Gradium offers a free tier ($0/month, ~1 hour of TTS, 3 concurrent requests) and paid plans from $13/month (XS, ~5 hours) to $1,615/month (L, ~1,000 hours). Voice cloning, WebSocket streaming, and multiplexing are included in all paid plans with no feature gating.
What languages does Gradium support?
Gradium supports English, French, German, Spanish, and Portuguese, all handled by a single model at the same latency. The system maintains voice identity across language switches with the highest ELO scores in non-English voice cloning.
Which TTS API supports on-premises deployment?
Gradium is the only major TTS provider offering full on-premises deployment alongside cloud, inference partner, dedicated instance, and private cloud options. ElevenLabs and OpenAI are cloud-only. Cartesia offers cloud plus limited self-hosted options.
Is Gradium suitable for production voice agents at scale?
Yes. Gradium's streaming architecture maintains stable TTFA under high concurrency, with a P25-P95 spread of just 19ms. The architecture uses batched generation with CUDA graph optimization. Enterprise plans support unlimited concurrent requests with 99.9% uptime SLA guarantees.
Does Gradium work with Pipecat and LiveKit?
Yes. Gradium has official partnerships with both Pipecat and LiveKit, with native plugins maintained in collaboration with the Gradium team. Gradium also integrates with any voice agent framework that supports WebSocket or REST-based TTS.