What is the best multilingual TTS API in 2026?

There is no single best multilingual TTS API for every use case. Cartesia Sonic-3 supports 40+ languages with regional accents and is the best fit for maximum language breadth. ElevenLabs supports 32 languages and is strongest for content creation. Gradium supports 5 languages with native fluency, mid-sentence code-switching, and the best-documented voice cloning benchmark in English, French, German, and Spanish. Deepgram Aura-2 supports 7 languages for teams already on the Deepgram platform.

What is code-switching in TTS and which APIs support it?

Code-switching in TTS is the ability to synthesize text that contains multiple languages within a single utterance, without issuing separate API calls or experiencing quality degradation at the language transition point. Gradium supports native mid-sentence code-switching across its 5 supported languages: English, French, German, Spanish, and Portuguese. Most other TTS providers require separate requests per language.

Does language breadth affect TTS quality?

Yes, in practice. Models that support a very large number of languages often train on less data per language than models focused on fewer languages. This can result in lower fidelity, accented output, or inconsistent prosody in less-represented languages. For the specific languages your product requires, the relevant evaluation is per-language quality, not the total language count.

Which TTS API supports voice cloning in multiple languages?

Gradium supports voice cloning with cross-lingual synthesis in English, French, German, and Spanish. In a benchmark of 3,220 blinded human evaluations, Gradium's Instant Voice Clone achieved the highest Elo score and an 8-11% speaker similarity advantage across all four languages. ElevenLabs supports cross-lingual voice cloning across 32 languages. Cartesia supports it across 40+ languages. Deepgram Aura-2 does not support voice cloning.

What is cross-lingual voice cloning?

Cross-lingual voice cloning means a voice cloned from an audio sample in one language can synthesize text in a different language. The output retains the speaker's characteristic voice while adapting to the phoneme inventory and prosodic patterns of the target language. This is useful for building multilingual products where a consistent brand voice is required across all language versions.

Does Gradium support regional accents?

Gradium supports 5 languages with native fluency: English, French, German, Spanish, and Portuguese. Regional accent variants are documented for these languages; contact Gradium directly for specific regional variant requirements.

What is the difference between multilingual TTS and language-switched TTS?

Multilingual TTS refers to a model trained to synthesize speech in multiple languages, handling each language's phoneme inventory and prosodic patterns natively. Language-switched TTS refers to switching between separate monolingual models per request. The distinction matters for code-switching: a multilingual model can handle language transitions within a single utterance, while a language-switching system requires separate API calls per language segment.

Which TTS API has the lowest latency for multilingual synthesis?

Based on the independent Coval TTS benchmark, Gradium achieves the lowest TTFA at 155 ms P50 across all 5 supported languages. Cartesia Sonic-3 follows at 188 ms P50 across 40+ languages, and ElevenLabs Flash v2.5 at 288 ms P50 across 32 languages. Gradium's own measurements report a P50 of 258 ms end-to-end and 214 ms when connection establishment is excluded. All three are suitable for real-time multilingual voice agents.

How many languages does Gradium support?

Gradium supports 5 languages with native fluency: English, French, German, Spanish, and Portuguese, with regular updates. All 5 languages are available on every plan, including the free tier, and credits are consumed identically regardless of language.

Can Gradium handle mid-sentence language transitions?

Yes. Gradium supports native mid-sentence code-switching across all 5 supported languages without quality degradation or a latency penalty. A single API call can synthesize an utterance that transitions between languages, which is useful for voice agents serving bilingual speakers or markets where code-switching is common in spoken communication, such as US Spanish-English or Swiss French-German.

Is voice cloning available across multiple languages on Gradium?

Yes. Gradium supports cross-lingual voice cloning across English, French, German, and Spanish for both Instant Voice Cloning and Professional Voice Cloning. A voice cloned in one of these languages can synthesize text in any of the others, preserving the speaker's characteristic timbre while adapting to the target language's phonology.

Gradium was founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, co-founders of Kyutai. They previously released Moshi, a real-time speech-to-speech model, and Hibiki, a live speech-to-speech translation model.

What pricing plans does Gradium offer for multilingual TTS?

Gradium offers a free plan at $0/month with 45,000 credits (~1 hour of TTS or 4 hours of STT). Paid plans start at $13/month (XS) and go up to $1,615/month (L). All 5 languages are included on every plan, and credits are consumed identically regardless of language. See gradium.ai/pricing for the current breakdown.

Best Multilingual TTS APIs in 2026: Coverage, Quality, Code-Switching

Gradium is a streaming Text-To-Speech and Speech-To-Text API built for voice agents, with native fluency in English, French, German, Spanish, and Portuguese and mid-sentence code-switching across all five. Most multilingual TTS providers advertise their language count first. That number tells only part of the story. The relevant questions for production deployments are different: how natural does each language sound at streaming speed, can the model switch languages mid-sentence without degrading quality or increasing latency, does voice cloning carry across languages, and what happens to regional accents.

This guide is for teams choosing a multilingual TTS API for voice agents, customer support, or content creation in 2026. It compares the leading providers, Gradium, ElevenLabs, Cartesia Sonic-3, and Deepgram Aura-2, on the dimensions that determine real-world multilingual performance: language depth (not just breadth), code-switching, regional accent support, cross-lingual voice cloning, and latency per language.

Is Language Breadth or Language Depth More Important?

Language breadth is the number of languages a TTS API supports. It is the most visible metric and often the first criterion in provider comparisons.

Language depth is the quality and consistency of synthesis within each supported language. A model trained on less data per language may support 40 languages but produce noticeably lower quality in 30 of them. A model trained intensively on 5 languages may produce near-native naturalness in all five.

Within a given language, depth also covers the ability to reproduce different regional accents and speaking styles. A "Spanish" voice that only produces neutral Castilian is structurally limited compared to a model that can reproduce Mexican, Argentinian, Caribbean, or Colombian accents, and that can shift between conversational, narrative, broadcast, customer-service, and expressive styles. For voice agents and branded experiences, accent and style coverage often matters as much as raw per-language fluency.

For global consumer products where broad language coverage is required, breadth matters. For products targeting specific language markets where voice quality is a differentiator (customer support, voice agents, branded experiences), depth matters as much as or more than breadth.

Neither dimension alone is sufficient. The right evaluation question is: at what quality level, with what accent and style coverage, does this API support the specific languages your product needs?

What Makes a TTS API Truly Multilingual?

Native Fluency

A TTS model trained primarily on English data and extended to other languages often produces accented output in non-English languages. The synthesized voice sounds like a non-native speaker of the target language. This is technically multilingual but may be unsuitable for consumer-facing applications in non-English markets. Native fluency in a language requires training on substantial native-speaker data, with attention to phoneme inventory, prosody patterns, and natural speech rhythm specific to the language.

Code-Switching

Code-switching is the ability to handle text that mixes two or more languages within a single utterance. For example, an agent might say "Thank you for calling. Votre numéro de dossier est le 4872." in a single turn, without pausing to reinitialize a different language model. Code-switching without quality degradation is technically difficult: the model must detect language transitions, apply the correct phoneme inventory for each segment, and maintain prosodic continuity across the switch. Most TTS APIs require separate API calls per language; few handle in-sentence transitions natively.
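To make the difference in request counts concrete, here is a minimal sketch. It assumes the utterance has already been split into language-tagged segments; the `Segment` type and the `plan_requests` helper are hypothetical names used for illustration and do not correspond to any provider's SDK.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Segment:
    lang: str  # language tag such as "en" or "fr" (assumed to be known upfront)
    text: str


def plan_requests(segments: list[Segment], native_code_switching: bool) -> list[str]:
    """Return the text payloads a client would send for one utterance."""
    if native_code_switching:
        # A multilingual model takes the whole mixed-language utterance at once.
        return [" ".join(s.text for s in segments)]
    # Otherwise, merge adjacent segments that share a language and
    # send one payload per contiguous language run.
    return [
        " ".join(s.text for s in run)
        for _, run in groupby(segments, key=lambda s: s.lang)
    ]


utterance = [
    Segment("en", "Thank you for calling."),
    Segment("fr", "Votre numéro de dossier est le 4872."),
]

single = plan_requests(utterance, native_code_switching=True)   # one payload
split = plan_requests(utterance, native_code_switching=False)   # one payload per language run
```

In the split case, each extra payload adds a round trip, and the two audio results have to be stitched together by the client, which is where prosodic discontinuities at the language boundary tend to appear.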

Regional Accents

Languages like Spanish, English, French, and Portuguese are spoken across multiple continents with distinct regional phonologies. An API that supports "Spanish" may produce Castilian Spanish, Latin American Spanish, or an undifferentiated blend. For products targeting specific regions, regional accent variants matter.

Cross-Lingual Voice Cloning

For products using voice cloning, cross-lingual cloning means a voice cloned in one language can synthesize text in another language. The cloned voice retains the speaker's characteristic timbre and prosody while adapting to the phoneme inventory of the target language. This is technically challenging and not universally supported.

Latency Per Language

Some TTS APIs route requests to language-specific models, which may have different infrastructure, latency profiles, or throughput limits. For consistent real-time performance across languages, Time-To-First-Audio (TTFA) should be consistent regardless of the language being synthesized.
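One way to check this yourself is to time the first audio chunk per language and compare medians. The sketch below is provider-agnostic: `open_stream` stands for any callable that starts a synthesis request and yields audio chunks, and `fake_stream` is a stand-in that simulates a delay. Both names are assumptions for illustration, not part of any vendor API.

```python
import statistics
import time
from typing import Callable, Iterable


def time_to_first_audio(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from issuing the request until the first non-empty audio chunk."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


def p50_by_language(
    streams: dict[str, Callable[[], Iterable[bytes]]], runs: int = 20
) -> dict[str, float]:
    """Median TTFA in milliseconds for each language."""
    return {
        lang: statistics.median(
            time_to_first_audio(open_stream) * 1000 for _ in range(runs)
        )
        for lang, open_stream in streams.items()
    }


def fake_stream(delay_s: float) -> Callable[[], Iterable[bytes]]:
    """Stand-in for a real TTS call: waits, then yields two audio chunks."""
    def open_stream() -> Iterable[bytes]:
        time.sleep(delay_s)
        yield b"\x00" * 320
        yield b"\x00" * 320
    return open_stream


results = p50_by_language({"en": fake_stream(0.01), "fr": fake_stream(0.02)}, runs=5)
# Gap between the slowest and fastest language, in milliseconds.
spread_ms = max(results.values()) - min(results.values())
```

A small spread across languages suggests a shared model or comparable infrastructure; a large one is a hint that some languages are routed differently. Replace `fake_stream` with real requests from the region where your users are.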

How Do the Leading Multilingual TTS APIs Compare in 2026?

Dimension	Gradium	ElevenLabs	Cartesia Sonic-3	Deepgram Aura-2
Languages	5 (EN, FR, DE, ES, PT)	32	40+	7 (EN, ES, FR, DE, NL, IT, JA)
Native fluency	Yes, all 5 languages	Varies by language	Varies by language	Varies by language
Code-switching	Native, mid-sentence	Limited	Not publicly documented	Not documented
Regional accents	Documented for supported languages	Some languages	Regional variants listed	Not documented
Cross-lingual voice cloning	EN, FR, DE, ES (highest blinded-eval Elo)	32 languages (fidelity varies)	40+ languages	Not supported
TTFA (Coval P50)	155 ms	288 ms (Flash v2.5)	188 ms	313 ms
Pricing entry point	$0/month, 45,000 credits	0.5 credits/char (Flash v2.5, credit-based)	$4/month (Pro, annual)	$0.030 per 1,000 chars

P50 figures are from the independent Coval TTS leaderboard. For Gradium's own end-to-end measurements, see the published Time-To-First-Audio benchmark.

Who Is Gradium?

Gradium supports 5 languages with native fluency: English, French, German, Spanish, and Portuguese. Rather than optimizing for breadth, Gradium's multilingual capability is built around depth in each supported language and the ability to handle mixed-language content natively. The company was founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, co-founders of Kyutai, where they previously released Moshi, a real-time speech-to-speech model, and Hibiki, a live speech-to-speech translation model.

Native Fluency Across All Five Languages

Gradium's TTS produces native-fluency output in each of its 5 supported languages. The synthesized voice in French, German, Spanish, and Portuguese reflects the phoneme inventory and prosodic patterns of the target language, not an accented English baseline extended to other markets.

Native Mid-Sentence Code-Switching

Gradium supports mid-sentence code-switching across all 5 languages without a quality or latency penalty. A single API call can synthesize text that transitions between languages within the same utterance. For example, a voice agent serving bilingual speakers can move naturally between English and Spanish within a single turn without restructuring the pipeline or issuing separate TTS requests. This is particularly relevant for markets where code-switching is natural in spoken communication, such as US Spanish-English.

Cross-Lingual Voice Cloning

Gradium's voice cloning supports cross-lingual synthesis in English, French, Spanish, and German. A voice cloned in one of these languages can synthesize text in any of the others, maintaining the speaker's characteristic timbre while adapting to the target language phonology. In a blinded human evaluation benchmark of 3,220 voice pairs (890 sentences per language, 20 voices per language, 10-second source clips), Gradium's Instant Voice Clone achieved the highest Elo score across English, French, Spanish, and German, with an 8-11% speaker similarity advantage over comparison providers. For more on why most voice clones still sound fake and how this evaluation was constructed, see Why most voice clones sound fake.
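The benchmark's exact scoring procedure is not described here. As a rough illustration of how pairwise preference votes can be turned into an Elo ranking, the sketch below applies the standard Elo update. The K-factor of 32, the starting rating of 1000, and the provider names are assumptions for illustration, not values from the benchmark.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one pairwise vote in place."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)  # a surprising win moves the ratings more
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical votes: each tuple is (preferred clone, other clone).
ratings = {"provider_a": 1000.0, "provider_b": 1000.0}
votes = [("provider_a", "provider_b")] * 3 + [("provider_b", "provider_a")]
for winner, loser in votes:
    update(ratings, winner, loser)
```

Because each update adds to one rating exactly what it subtracts from the other, the total stays constant and only the relative ordering carries meaning.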

Latency Independent of Language

Latency is consistent across all 5 supported languages. On the independent Coval benchmark, Gradium achieves a P50 TTFA of 155 ms. Gradium's own end-to-end measurements (P50 258 ms end-to-end, P50 214 ms excluding connection establishment, Paris, 15-25 word sentence, WebSocket, 100 queries, warm) are published in the Time-To-First-Audio benchmark. Routing does not select per-language models: the same model serves all 5 languages. This keeps the latency profile stable regardless of which language a given utterance contains, including across in-sentence code-switches.

Voice Library

Gradium ships a curated library of voices designed for voice agents, with consistent character across the 5 supported languages. Each voice supports cross-lingual synthesis across EN/FR/DE/ES/PT, so a single brand voice can answer in any of those languages without re-cloning, with a relevant accent in the target language.

Pricing and Plans

All 5 languages are included on every Gradium plan. Credits are consumed identically regardless of language. The free plan is $0/month with 45,000 credits (~1 hour of TTS or 4 hours of STT). Paid plans start at $13/month (XS). See gradium.ai/pricing for the current breakdown.

Best For

Gradium is the strongest fit for products targeting English, French, German, Spanish, or Portuguese markets where per-language voice quality, real-time streaming latency, and consistent code-switching behavior are required together. Validated use cases include multilingual customer support, international call translation, and bilingual voice agents where language transitions occur within a single conversational turn. On pricing, Gradium is approximately 3-4x less expensive than ElevenLabs for comparable TTS volume, which compounds as multilingual deployments scale.

Who Is ElevenLabs?

ElevenLabs supports 32 languages across its TTS models, the second-widest coverage in this comparison after Cartesia. Two models are relevant for multilingual deployments.

Flash v2.5 (Multilingual)

32 languages with broad coverage across European, Asian, and Latin American markets.
TTFA P50 288 ms on the Coval benchmark, suited for real-time voice agents requiring wide language coverage.
0.5 credits per character (credit-based pricing; effective per-character cost varies by plan tier).

Multilingual v2

29 languages with the highest per-language voice quality in ElevenLabs' catalogue.
TTFA P50 1,232 ms on Coval, not suited for real-time voice agents, but appropriate for batch content creation.
1 credit per character (credit-based pricing; effective per-character cost varies by plan tier).
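As a worked example of the two credit rates quoted above, the sketch below estimates credits for a piece of text. The dictionary keys are illustrative labels rather than API model identifiers, and counting every character (including spaces) and rounding up are assumptions; check ElevenLabs' billing documentation for the exact rules.

```python
import math

# Credits per character, taken from the rates quoted above.
CREDITS_PER_CHAR = {
    "flash_v2_5": 0.5,
    "multilingual_v2": 1.0,
}


def credits_for(text: str, model: str) -> int:
    """Estimated credits for synthesizing `text` (assumes every character counts)."""
    return math.ceil(len(text) * CREDITS_PER_CHAR[model])


script = "a" * 1000  # placeholder for a 1,000-character script
flash_cost = credits_for(script, "flash_v2_5")
v2_cost = credits_for(script, "multilingual_v2")
```

Under these assumptions, the same script costs twice as many credits on Multilingual v2 as on Flash v2.5, which is the trade-off to weigh against the quality difference for batch content.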

Cross-Lingual Voice Cloning

ElevenLabs supports cross-lingual voice cloning: a voice cloned in one language can be used to synthesize text in any of the 32 supported languages. Quality of cross-lingual cloning varies by language pair and is generally higher for languages with larger training-data representation.

Best For

ElevenLabs is the strongest choice when broad language coverage (32 languages) is the primary requirement, particularly for content creation use cases (narration, dubbing, localization) where voice naturalness and access to a wide voice library matter more than real-time latency. Flash v2.5 is the option for multilingual voice agents at scale. See the full ElevenLabs alternative comparison for a deeper look at the trade-offs against Gradium.

Who Is Cartesia?

Cartesia supports 40+ languages with regional accent variants, the widest language coverage in this comparison.

Language and Accent Coverage

Cartesia's Sonic-3 model is documented for 40+ languages with explicit regional accent support, including Latin American Spanish alongside Castilian, Brazilian Portuguese alongside European Portuguese, American and British English, and multiple French variants. For products requiring consistent regional voice characteristics, not just general language support, Cartesia's regional variant documentation is more explicit than most providers.

Cross-Lingual Voice Cloning

Cartesia supports instant voice cloning (from 10 seconds of audio), with synthesis available in any of the 40+ supported languages.

Latency

Cartesia Sonic-3 delivers a P50 TTFA of 188 ms on the Coval benchmark. The State Space Model (SSM) architecture produces consistent latency across languages, including at P99.

Pricing

Cartesia's plans are Pro at $4/month (100K credits, annual billing), Startup at $39/month (1.25M credits, annual billing), and Scale at $239/month (8M credits, annual billing), or approximately $0.03 per minute.

Best For

Cartesia is the strongest choice when the widest possible language and regional accent coverage is required, combined with low-latency streaming. Particularly suited to international products that must serve 10+ distinct language markets simultaneously with consistent voice quality. See the Cartesia alternative comparison for a side-by-side breakdown against Gradium.

Who Is Deepgram?

Deepgram Aura-2 supports 7 languages: English, Spanish, French, German, Dutch, Italian, and Japanese. No voice cloning is available.

Language Coverage

Aura-2's 7-language coverage is narrower than that of ElevenLabs and Cartesia, but it covers the major European languages plus Japanese. For products targeting these specific markets and already using Deepgram Nova for STT, Aura-2 provides a consistent multilingual TTS layer without adding a new vendor.

Latency and Streaming

P50 TTFA of 313 ms on the Coval benchmark, with WebSocket streaming.

Pricing

$0.030 per 1,000 characters.

Best For

Deepgram Aura-2 is suited to teams already on the Deepgram platform who need TTS in up to 7 languages and do not require voice cloning or code-switching. See the Deepgram alternative comparison for a fuller breakdown.

Which Multilingual TTS API Should You Choose by Use Case?

Bilingual or Code-Switching Voice Agents

Gradium is the only provider in this comparison with documented mid-sentence code-switching support. For agents that serve bilingual speakers, such as US Spanish-English, or that operate in linguistically mixed environments, this is a structural requirement rather than a nice-to-have.

Global Product With 10+ Language Markets

Cartesia Sonic-3 at 40+ languages with regional accent variants is the strongest fit for maximum language breadth. ElevenLabs at 32 languages is the alternative if voice library depth and quality per language are more important than coverage count.

Multilingual Content Creation

ElevenLabs Multilingual v2 offers the highest per-language voice quality in this comparison and the broadest cross-lingual cloning library. Batch rendering and content creation workflows are better suited to Multilingual v2's quality profile, which trades TTFA for output fidelity.

Five Specific Languages With the Highest Per-Language Quality

Gradium's depth-first approach (5 languages, native fluency, full feature parity across all 5) is the right choice when the target market is entirely within English, French, German, Spanish, or Portuguese.

Seven-Language Coverage on an Existing Deepgram Stack

Deepgram Aura-2 is the fit for teams already using Deepgram Nova who want to add TTS without vendor fragmentation.

How Does Gradium Handle Multilingual Voice Cloning?

Gradium ships two voice cloning modes: Instant and Professional. Instant Voice Cloning produces a usable voice from 10 seconds of source audio and supports cross-lingual synthesis across English, French, Spanish, and German, with up to 1,000 clones per month on paid plans. Professional Voice Cloning is fine-tuned on larger source datasets and is available from the M plan (5 Pro voices) and L plan (20 Pro voices) upwards. Both modes preserve speaker timbre across language boundaries: the same cloned voice can answer in English on one turn and Spanish on the next with consistent identity.

How Do You Configure Gradium TTS for Multilingual Streaming?

Gradium TTS streams over WebSocket, with the language detected automatically from the input text. No language flag is required, including for mid-sentence transitions. Behavior is tuned through json_config, which controls codebook depth, voice selection, pronunciation overrides, and text normalization. For long-form multilingual outputs (audiobooks, narration), pair json_config with pronunciation dictionaries and text normalization rules to handle proper nouns, code-mixed entities, and language-specific number formats. Multiple language streams can share a single connection via WebSocket multiplexing.
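The json_config schema is not reproduced here, so the following is not Gradium's API. It is a generic, client-side sketch of what a pronunciation dictionary and a language-specific number rule can look like when applied to text before it is sent to any TTS service. The override entries, the `DECIMAL_WORD` table, and both function names are hypothetical.

```python
import re

# Hypothetical pronunciation dictionary: written form -> respelling.
OVERRIDES = {"SQL": "sequel", "Nguyen": "win"}

# Spoken word for the decimal comma in languages that use one.
DECIMAL_WORD = {"fr": "virgule", "de": "Komma", "es": "coma"}


def apply_overrides(text: str, overrides: dict[str, str]) -> str:
    """Replace whole-word matches with their respelling."""
    if not overrides:
        return text
    # Longest keys first so a longer entry wins over a shorter one it contains.
    keys = sorted(overrides, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: overrides[m.group(1)], text)


def spell_decimal_comma(text: str, lang: str) -> str:
    """Turn '3,5' into '3 virgule 5' (or the equivalent) for comma-decimal languages."""
    word = DECIMAL_WORD.get(lang)
    if word is None:
        return text  # leave text unchanged for languages without a rule here
    return re.sub(r"(\d+),(\d+)", rf"\1 {word} \2", text)


normalized = spell_decimal_comma("Le taux est de 3,5 %", "fr")
respelled = apply_overrides("Ask Nguyen about SQL.", OVERRIDES)
```

The number rule is deliberately naive: it needs a language tag per segment and would misread strings with several commas, so treat it as a starting point. Where the service offers its own normalization and override settings, prefer those over client-side rewriting.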

Deployment Options

Gradium offers four deployment surfaces with the same model and API across all four: cloud marketplace, private cloud, on-premise, and on-device. Multilingual behavior is identical across deployment modes: code-switching, native fluency, and cross-lingual voice cloning are not features that disappear in self-hosted environments. Teams with data residency or regulatory requirements (financial services, healthcare, public sector) can run the same multilingual TTS in their own infrastructure.

Gradium Pricing

All 5 languages are included on every Gradium plan, including the free tier. The free plan is $0/month with 45,000 credits (~1 hour of TTS or 4 hours of STT), 5 Instant Voice Clones, no commercial use. Paid plans start at $13/month (XS); plan M is $340/month, plan L is $1,615/month. The Startup Program offers seed-funded companies $2,000+ in free credits and 6 months of full API access at M-plan capacity (1,200 hours of TTS or 4,998 hours of STT). See gradium.ai/pricing for the current breakdown.
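The free-tier figures imply rough per-minute credit rates, which can help with budgeting. The sketch below derives them from the approximate "45,000 credits for about 1 hour of TTS or 4 hours of STT" figure. It assumes credits scale linearly with audio duration, which the pricing page may define differently, and the call profile is a made-up example.

```python
FREE_CREDITS = 45_000
TTS_HOURS = 1  # approximate, from the free-plan description
STT_HOURS = 4  # approximate, from the free-plan description

# Implied credits per minute of audio (linear scaling is an assumption).
tts_credits_per_min = FREE_CREDITS / (TTS_HOURS * 60)
stt_credits_per_min = FREE_CREDITS / (STT_HOURS * 60)


def credits_needed(tts_minutes: float, stt_minutes: float) -> float:
    """Estimated credits for a call with the given synthesis and transcription time."""
    return tts_minutes * tts_credits_per_min + stt_minutes * stt_credits_per_min


# Hypothetical 10-minute call: STT runs for the whole call, the agent speaks for 4 minutes.
per_call = credits_needed(tts_minutes=4, stt_minutes=10)
calls_on_free_tier = int(FREE_CREDITS // per_call)
```

Under these assumptions, TTS works out to about 750 credits per minute and STT to about 187.5, so the example call costs roughly 4,875 credits and the free tier covers about 9 such calls.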

Getting Started

Gradium ships official Python and Rust SDKs, with first-class integrations for LiveKit and Pipecat. For multilingual voice agents, the standard pattern is to combine Gradium STT (with semantic VAD) and Gradium TTS over a single WebSocket session, letting the model handle in-sentence language transitions automatically. For broader context on selecting a Text-To-Speech API for voice agents, see the best Text-To-Speech API for voice agents. Create an account at gradium.ai to use the free tier across all 5 supported languages.
