Best Multilingual TTS APIs in 2026: Coverage, Quality, Code-Switching
Gradium is a streaming Text-To-Speech and Speech-To-Text API built for voice agents, with native fluency in English, French, German, Spanish, and Portuguese and mid-sentence code-switching across all five. Most multilingual TTS providers advertise their language count first, but that number tells only part of the story. The questions that matter for production deployments are different: how natural does each language sound at streaming speed, can the model switch languages mid-sentence without degrading quality or increasing latency, does voice cloning carry across languages, and are regional accents reproduced or flattened into a single generic variant?
This guide is for teams choosing a multilingual TTS API for voice agents, customer support, or content creation in 2026. It compares four leading providers (Gradium, ElevenLabs, Cartesia Sonic-3, and Deepgram Aura-2) on the dimensions that determine real-world multilingual performance: language depth (not just breadth), code-switching, regional accent support, cross-lingual voice cloning, and latency per language.
Is Language Breadth or Language Depth More Important?
Language breadth is the number of languages a TTS API supports. It is the most visible metric and often the first criterion in provider comparisons.
Language depth is the quality and consistency of synthesis within each supported language. A model trained on less data per language may support 40 languages but produce noticeably lower quality in 30 of them. A model trained intensively on 5 languages may produce near-native naturalness in all five.
Within a given language, depth also covers the ability to reproduce different regional accents and speaking styles. A "Spanish" voice that only produces neutral Castilian is structurally limited compared to a model that can reproduce Mexican, Argentinian, Caribbean, or Colombian accents, and that can shift between conversational, narrative, broadcast, customer-service, and expressive styles. For voice agents and branded experiences, accent and style coverage often matters as much as raw per-language fluency.
For global consumer products where broad language coverage is required, breadth matters. For products targeting specific language markets where voice quality is a differentiator (customer support, voice agents, branded experiences), depth matters as much as or more than breadth.
Neither dimension alone is sufficient. The right evaluation question is: at what quality level, with what accent and style coverage, does this API support the specific languages your product needs?
What Makes a TTS API Truly Multilingual?
Native Fluency
A TTS model trained primarily on English data and extended to other languages often produces accented output in non-English languages. The synthesized voice sounds like a non-native speaker of the target language. This is technically multilingual but may be unsuitable for consumer-facing applications in non-English markets. Native fluency in a language requires training on substantial native-speaker data, with attention to phoneme inventory, prosody patterns, and natural speech rhythm specific to the language.
Code-Switching
Code-switching is the ability to handle text that mixes two or more languages within a single utterance. For example, an agent might say "Thank you for calling. Votre numéro de dossier est le 4872." in a single turn, without pausing to reinitialize a different language model. Code-switching without quality degradation is technically difficult: the model must detect language transitions, apply the correct phoneme inventory for each segment, and maintain prosodic continuity across the switch. Most TTS APIs require separate API calls per language; few handle in-sentence transitions natively.
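The pipeline difference described above can be sketched as follows. This is a minimal illustration, not any provider's API: the `Segment` type, the `plan_requests` helper, and the request dictionaries are all hypothetical, and the language tags are assumed to be known upstream.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Segment:
    lang: str  # language tag such as "en" or "fr" (assumed to be known upstream)
    text: str


def plan_requests(segments: list[Segment], native_code_switching: bool) -> list[dict]:
    """Return the synthesis requests needed to speak one mixed-language turn."""
    if native_code_switching:
        # One request carries the whole utterance; the model handles transitions.
        return [{"text": " ".join(s.text for s in segments)}]
    # Otherwise merge adjacent same-language segments and issue one request per run.
    requests = []
    for lang, run in groupby(segments, key=lambda s: s.lang):
        requests.append({"lang": lang, "text": " ".join(s.text for s in run)})
    return requests


turn = [
    Segment("en", "Thank you for calling."),
    Segment("fr", "Votre numéro de dossier est le 4872."),
]
print(len(plan_requests(turn, native_code_switching=True)))   # a single request
print(len(plan_requests(turn, native_code_switching=False)))  # one request per language run
```

With per-language requests, each boundary is also a point where prosody can reset and where an extra round trip adds latency, which is why native handling matters for voice agents.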
Regional Accents
Languages like Spanish, English, French, and Portuguese are spoken across multiple continents with distinct regional phonologies. An API that supports "Spanish" may produce Castilian Spanish, Latin American Spanish, or an undifferentiated blend. For products targeting specific regions, regional accent variants matter.
Cross-Lingual Voice Cloning
For products using voice cloning, cross-lingual cloning means a voice cloned in one language can synthesize text in another language. The cloned voice retains the speaker's characteristic timbre and prosody while adapting to the phoneme inventory of the target language. This is technically challenging and not universally supported.
Latency Per Language
Some TTS APIs route requests to language-specific models, which may have different infrastructure, latency profiles, or throughput limits. For predictable real-time performance, Time-To-First-Audio (TTFA) should stay roughly the same regardless of the language being synthesized.
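One way to check this during an evaluation is to collect TTFA samples per language and compare the medians. The sketch below assumes you already have such samples in milliseconds; the function names are illustrative and the numbers are made up to show a language that lags, not measurements of any provider.

```python
from statistics import median


def p50_by_language(samples_ms: dict[str, list[float]]) -> dict[str, float]:
    """Median (P50) TTFA per language, in milliseconds."""
    return {lang: median(values) for lang, values in samples_ms.items()}


def p50_spread(samples_ms: dict[str, list[float]]) -> float:
    """Gap between the slowest and fastest per-language P50."""
    p50s = p50_by_language(samples_ms)
    return max(p50s.values()) - min(p50s.values())


# Illustrative numbers only, not measurements of any provider.
samples = {
    "en": [150.0, 160.0, 155.0],
    "fr": [158.0, 149.0, 162.0],
    "de": [240.0, 231.0, 236.0],
}
print(p50_by_language(samples))
print(p50_spread(samples))
```

A large spread suggests per-language routing or uneven capacity, and is worth raising with the provider before committing to a multilingual rollout.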
How Do the Leading Multilingual TTS APIs Compare in 2026?
| Dimension | Gradium | ElevenLabs | Cartesia Sonic-3 | Deepgram Aura-2 |
|---|---|---|---|---|
| Languages | 5 (EN, FR, DE, ES, PT) | 32 | 40+ | 7 (EN, ES, FR, DE, NL, IT, JA) |
| Native fluency | Yes, all 5 languages | Varies by language | Varies by language | Varies by language |
| Code-switching | Native, mid-sentence | Limited | Not publicly documented | Not documented |
| Regional accents | Documented for supported languages | Some languages | Regional variants listed | Not documented |
| Cross-lingual voice cloning | EN, FR, DE, ES (highest blinded-eval Elo) | 32 languages (fidelity varies) | 40+ languages | Not supported |
| TTFA (Coval P50) | 155 ms | 288 ms (Flash v2.5) | 188 ms | 313 ms |
| Pricing entry point | $0/month, 45,000 credits | 0.5 credits/char (Flash v2.5, credit-based) | $4/month (Pro, annual) | $0.030 per 1,000 chars |
P50 figures are from the independent Coval TTS leaderboard. For Gradium's own end-to-end measurements, see the published Time-To-First-Audio benchmark.
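For a quick shortlist, the table can be encoded and filtered against a product's requirements. The sketch below copies the TTFA, language-count, and code-switching values from the table; the `PROVIDERS` structure and `shortlist` function are illustrative, "40+" is stored as its lower bound of 40, and both "Not publicly documented" and "Not documented" are collapsed to "undocumented".

```python
# Values transcribed from the comparison table above.
PROVIDERS = {
    "Gradium": {"ttfa_p50_ms": 155, "languages": 5, "code_switching": "native"},
    "ElevenLabs Flash v2.5": {"ttfa_p50_ms": 288, "languages": 32, "code_switching": "limited"},
    "Cartesia Sonic-3": {"ttfa_p50_ms": 188, "languages": 40, "code_switching": "undocumented"},
    "Deepgram Aura-2": {"ttfa_p50_ms": 313, "languages": 7, "code_switching": "undocumented"},
}


def shortlist(max_ttfa_ms: int, min_languages: int = 1,
              need_code_switching: bool = False) -> list[str]:
    """Providers that meet the constraints, fastest P50 TTFA first."""
    names = []
    for name, row in PROVIDERS.items():
        if row["ttfa_p50_ms"] > max_ttfa_ms:
            continue
        if row["languages"] < min_languages:
            continue
        if need_code_switching and row["code_switching"] != "native":
            continue
        names.append(name)
    return sorted(names, key=lambda n: PROVIDERS[n]["ttfa_p50_ms"])


print(shortlist(max_ttfa_ms=200))
print(shortlist(max_ttfa_ms=300, min_languages=10))
print(shortlist(max_ttfa_ms=400, need_code_switching=True))
```

A language count is only a first filter: it says nothing about whether your specific languages are covered or how they sound, so any shortlist still needs a listening test.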
Who Is Gradium?
Gradium supports 5 languages with native fluency: English, French, German, Spanish, and Portuguese. Rather than optimizing for breadth, Gradium's multilingual capability is built around depth in each supported language and the ability to handle mixed-language content natively. The company was founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, co-founders of Kyutai, where they previously released Moshi, a real-time speech-to-speech model, and Hibiki, a live speech-to-speech translation model.
Native Fluency Across All Five Languages
Gradium's TTS produces native-fluency output in each of its 5 supported languages. The synthesized voice in French, German, Spanish, and Portuguese reflects the phoneme inventory and prosodic patterns of the target language, not an accented English baseline extended to other markets.
Native Mid-Sentence Code-Switching
Gradium supports mid-sentence code-switching across all 5 languages without a quality or latency penalty. A single API call can synthesize text that transitions between languages within the same utterance. For example, a voice agent serving bilingual speakers can move naturally between English and Spanish within a single turn without restructuring the pipeline or issuing separate TTS requests. This is particularly relevant for markets where code-switching is natural in spoken communication, such as US Spanish-English.
Cross-Lingual Voice Cloning
Gradium's voice cloning supports cross-lingual synthesis in English, French, Spanish, and German. A voice cloned in one of these languages can synthesize text in any of the others, maintaining the speaker's characteristic timbre while adapting to the target language phonology. In a blinded human evaluation benchmark of 3,220 voice pairs (890 sentences per language, 20 voices per language, 10-second source clips), Gradium's Instant Voice Clone achieved the highest Elo score across English, French, Spanish, and German, with an 8-11% speaker similarity advantage over comparison providers. For more on why most voice clones still sound fake and how this evaluation was constructed, see Why most voice clones sound fake.
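For readers unfamiliar with Elo in this context: each blinded comparison is treated as a match between two systems, and ratings move according to how surprising the listener's preference was. The sketch below uses the standard Elo update with an assumed K-factor of 32 and a starting rating of 1000; the source does not describe the exact rating procedure used in the benchmark, so treat this as background rather than a reproduction.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one pairwise preference in place (K-factor is an assumed value)."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


# Hypothetical systems and votes, for illustration only.
ratings = {"system_a": 1000.0, "system_b": 1000.0}
votes = [("system_a", "system_b"), ("system_a", "system_b"), ("system_b", "system_a")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)
print(ratings)
```

Because every update adds to one rating exactly what it subtracts from the other, the total stays constant and only the relative ordering carries meaning.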
Latency Independent of Language
Latency is consistent across all 5 supported languages. On the independent Coval benchmark, Gradium achieves a P50 TTFA of 155 ms. Gradium's own end-to-end measurements (P50 258 ms end-to-end, P50 214 ms excluding connection establishment, Paris, 15-25 word sentence, WebSocket, 100 queries, warm) are published in the Time-To-First-Audio benchmark. Routing does not select per-language models: the same model serves all 5 languages, which keeps the latency profile stable regardless of which language a given utterance contains, including across in-sentence code-switches.
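The distinction between end-to-end TTFA and TTFA excluding connection establishment can be reproduced in your own harness by timing the two phases separately. The sketch below is provider-agnostic: `connect` and `stream` are hypothetical callables you would wrap around a real client, and the `fake_*` functions are in-memory stand-ins so the example runs without a network.

```python
import time
from typing import Callable, Iterator


def measure_ttfa_ms(connect: Callable[[], object],
                    stream: Callable[[object, str], Iterator[bytes]],
                    text: str) -> dict[str, float]:
    """Time to the first audio chunk, with and without connection setup."""
    t_start = time.perf_counter()
    conn = connect()
    t_connected = time.perf_counter()
    chunks = stream(conn, text)
    next(chunks)  # blocks until the first audio chunk arrives
    t_first_audio = time.perf_counter()
    return {
        "end_to_end_ms": (t_first_audio - t_start) * 1000.0,
        "excluding_connect_ms": (t_first_audio - t_connected) * 1000.0,
    }


def fake_connect() -> object:
    time.sleep(0.02)  # stand-in for connection establishment
    return object()


def fake_stream(conn: object, text: str) -> Iterator[bytes]:
    time.sleep(0.01)  # stand-in for model latency before the first chunk
    yield b"\x00" * 320
    yield b"\x00" * 320


result = measure_ttfa_ms(fake_connect, fake_stream, "Hola, thanks for calling.")
print(result)
```

Running this over many warm requests and taking the median of each field gives figures comparable in shape to the two P50 values quoted above, though absolute numbers depend on client location and network path.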
Voice Library
Gradium ships a curated library of voices designed for voice agents, with consistent character across the 5 supported languages. Each voice supports cross-lingual synthesis across EN/FR/DE/ES/PT, so a single brand voice can answer in any of those languages without re-cloning, with a relevant accent in the target language.
Pricing and Plans
All 5 languages are included on every Gradium plan. Credits are consumed identically regardless of language. The free plan is $0/month with 45,000 credits (~1 hour of TTS or 4 hours of STT). Paid plans start at $13/month (XS). See gradium.ai/pricing for the current breakdown.
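The free-plan figures give a rough way to budget credits for a mixed TTS and STT workload. The rates below are derived only from the approximate "45,000 credits is about 1 hour of TTS or 4 hours of STT" statement above; actual metering may use a different unit (for example characters rather than minutes), so treat the output as an estimate.

```python
FREE_PLAN_CREDITS = 45_000

# Approximate rates implied by the free-plan figures quoted above.
TTS_CREDITS_PER_MINUTE = FREE_PLAN_CREDITS / 60        # ~1 hour of TTS
STT_CREDITS_PER_MINUTE = FREE_PLAN_CREDITS / (4 * 60)  # ~4 hours of STT


def estimate_credits(tts_minutes: float, stt_minutes: float) -> float:
    """Rough credit usage for a workload of synthesized and transcribed audio."""
    return tts_minutes * TTS_CREDITS_PER_MINUTE + stt_minutes * STT_CREDITS_PER_MINUTE


# Example: 20 minutes of agent speech and 60 minutes of caller audio.
usage = estimate_credits(tts_minutes=20, stt_minutes=60)
print(usage, usage <= FREE_PLAN_CREDITS)
```

Because credits are consumed identically across languages, the same estimate applies whether the 20 minutes of speech are in one language or split across all five.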
Best For
Gradium is the strongest fit for products targeting English, French, German, Spanish, or Portuguese markets where per-language voice quality, real-time streaming latency, and consistent code-switching behavior are required together. Validated use cases include multilingual customer support, international call translation, and bilingual voice agents where language transitions occur within a single conversational turn. On pricing, Gradium is approximately 3-4x less expensive than ElevenLabs for comparable TTS volume, which compounds as multilingual deployments scale.
Who Is ElevenLabs?
ElevenLabs supports 32 languages across its TTS models, the second-widest coverage in this comparison after Cartesia. Two models are relevant for multilingual deployments.
Flash v2.5 (Multilingual)
- 32 languages with broad coverage across European, Asian, and Latin American markets.
- TTFA P50 288 ms on the Coval benchmark, suited for real-time voice agents requiring wide language coverage.
- 0.5 credits per character (credit-based pricing; effective per-character cost varies by plan tier).
Multilingual v2
- 29 languages with the highest per-language voice quality in ElevenLabs' catalogue.
- TTFA P50 1,232 ms on Coval, not suited for real-time voice agents, but appropriate for batch content creation.
- 1 credit per character (credit-based pricing; effective per-character cost varies by plan tier).
Cross-Lingual Voice Cloning
ElevenLabs supports cross-lingual voice cloning: a voice cloned in one language can be used to synthesize text in any of the 32 supported languages. Quality of cross-lingual cloning varies by language pair and is generally higher for languages with larger training-data representation.
Best For
ElevenLabs is the strongest choice when broad language coverage (32 languages) is the primary requirement, particularly for content creation use cases (narration, dubbing, localization) where voice naturalness and access to a wide voice library matter more than real-time latency. Flash v2.5 is the option for multilingual voice agents at scale. See the full ElevenLabs alternative comparison for a deeper look at the trade-offs against Gradium.
Who Is Cartesia?
Cartesia supports 40+ languages with regional accent variants, the widest language coverage in this comparison.
Language and Accent Coverage
Cartesia's Sonic-3 model is documented for 40+ languages with explicit regional accent support, including Latin American Spanish alongside Castilian, Brazilian Portuguese alongside European Portuguese, American and British English, and multiple French variants. For products requiring consistent regional voice characteristics, not just general language support, Cartesia's regional variant documentation is more explicit than most providers.
Cross-Lingual Voice Cloning
Cartesia supports instant voice cloning (from 10 seconds of audio), with synthesis available in any of the 40+ supported languages.
Latency
Cartesia Sonic-3 delivers a P50 TTFA of 188 ms on the Coval benchmark. The State Space Model (SSM) architecture produces consistent latency across languages, including at P99.
Pricing
Cartesia's plans are Pro at $4/month (100K credits), Startup at $39/month (1.25M credits), and Scale at $239/month (8M credits), all on annual billing. The effective cost is approximately $0.03 per minute.
Best For
Cartesia is the strongest choice when the widest possible language and regional accent coverage is required, combined with low-latency streaming. It is particularly suited to international products that must serve 10+ distinct language markets simultaneously with consistent voice quality. See the Cartesia alternative comparison for a side-by-side breakdown against Gradium.
Who Is Deepgram?
Deepgram Aura-2 supports 7 languages: English, Spanish, French, German, Dutch, Italian, and Japanese. No voice cloning is available.
Language Coverage
Aura-2's 7-language coverage is the narrowest in this comparison but covers the major European languages plus Japanese. For products targeting these specific markets and already using Deepgram Nova for STT, Aura-2 provides a consistent multilingual TTS layer without adding a new vendor.
Latency and Streaming
Aura-2 records a P50 TTFA of 313 ms on the Coval benchmark and supports WebSocket streaming.
Pricing
Aura-2 is priced at $0.030 per 1,000 characters.
Best For
Deepgram Aura-2 is suited to teams already on the Deepgram platform who need TTS in up to 7 languages and do not require voice cloning or code-switching. See the Deepgram alternative comparison for a fuller breakdown.
Which Multilingual TTS API Should You Choose by Use Case?
Bilingual or Code-Switching Voice Agents
Gradium is the only provider in this comparison with documented mid-sentence code-switching support. For agents that serve bilingual speakers, such as US Spanish-English, or that operate in linguistically mixed environments, this is a structural requirement rather than a nice-to-have.
Global Product With 10+ Language Markets
Cartesia Sonic-3 at 40+ languages with regional accent variants is the strongest fit for maximum language breadth. ElevenLabs at 32 languages is the alternative if voice library depth and quality per language are more important than coverage count.
Multilingual Content Creation
ElevenLabs Multilingual v2 offers the highest per-language voice quality in this comparison and the broadest cross-lingual cloning library. Batch rendering and content creation workflows are better suited to Multilingual v2's quality profile, which trades TTFA for output fidelity.
Five Specific Languages With the Highest Per-Language Quality
Gradium's depth-first approach (5 languages, native fluency, and the same streaming and code-switching behavior in each) is the right choice when the target market is entirely within English, French, German, Spanish, or Portuguese.
Seven-Language Coverage on an Existing Deepgram Stack
Deepgram Aura-2 is the natural fit for teams already using Deepgram Nova who want to add TTS without vendor fragmentation.
How Does Gradium Handle Multilingual Voice Cloning?
Gradium ships two voice cloning modes: Instant and Pro. Instant Voice Cloning produces a usable voice from 10 seconds of source audio and supports cross-lingual synthesis across English, French, Spanish, and German, with up to 1,000 clones per month on paid plans. Professional Voice Cloning is fine-tuned on larger source datasets and is available from the M plan (5 Pro voices) and L plan (20 Pro voices) upwards. Both modes preserve speaker timbre across language boundaries, so the same cloned voice can answer in English on one turn and Spanish on the next with consistent identity.
How Do You Configure Gradium TTS for Multilingual Streaming?
Gradium TTS streams over WebSocket, with the language detected automatically from the input text. No language flag is required, including for mid-sentence transitions. Behavior is tuned through json_config, which controls codebook depth, voice selection, pronunciation overrides, and text normalization. For long-form multilingual outputs (audiobooks, narration), pair json_config with pronunciation dictionaries and text normalization rules to handle proper nouns, code-mixed entities, and language-specific number formats. Multiple language streams can share a single connection via WebSocket multiplexing.
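As a rough picture of what assembling such a configuration might look like on the client side, the sketch below builds a JSON string from a voice name and a pronunciation dictionary. None of the key names (`voice`, `normalize_text`, `pronunciations`, `text`, `say_as`) or the voice name come from Gradium's documentation; they are placeholders chosen for illustration, and the real json_config schema should be taken from the official API reference.

```python
import json


def build_json_config(voice: str, pronunciations: dict[str, str],
                      normalize_text: bool = True) -> str:
    """Serialize an illustrative TTS configuration (all key names are hypothetical)."""
    config = {
        "voice": voice,                    # hypothetical key: which voice to use
        "normalize_text": normalize_text,  # hypothetical key: expand numbers, dates, etc.
        "pronunciations": [                # hypothetical key: per-term overrides
            {"text": term, "say_as": spoken}
            for term, spoken in sorted(pronunciations.items())
        ],
    }
    # ensure_ascii=False keeps accented characters readable in the payload.
    return json.dumps(config, ensure_ascii=False)


payload = build_json_config(
    voice="example-voice",  # placeholder, not a real voice identifier
    pronunciations={"Défossez": "day-foh-SEZ", "SQL": "sequel"},
)
print(payload)
```

Note that the sketch contains no language field, which matches the automatic detection described above; proper nouns and code-mixed entities are the cases where an explicit override is most likely to help.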
Deployment Options
Gradium offers four deployment surfaces with the same model and API across all four: cloud marketplace, private cloud, on-premise, and on-device. Multilingual behavior is identical across deployment modes: code-switching, native fluency, and cross-lingual voice cloning do not disappear in self-hosted environments. Teams with data residency or regulatory requirements (financial services, healthcare, public sector) can run the same multilingual TTS in their own infrastructure.
Gradium Pricing
All 5 languages are included on every Gradium plan, including the free tier. The free plan is $0/month with 45,000 credits (~1 hour of TTS or 4 hours of STT), 5 Instant Voice Clones, no commercial use. Paid plans start at $13/month (XS); plan M is $340/month, plan L is $1,615/month. The Startup Program offers seed-funded companies $2,000+ in free credits and 6 months of full API access at M-plan capacity (1,200 hours of TTS or 4,998 hours of STT). See gradium.ai/pricing for the current breakdown.
Getting Started
Gradium ships official Python and Rust SDKs, with first-class integrations for LiveKit and Pipecat. For multilingual voice agents, the standard pattern is to combine Gradium STT (with semantic VAD) and Gradium TTS over a single WebSocket session, letting the model handle in-sentence language transitions automatically. For broader context on selecting a Text-To-Speech API for voice agents, see the best Text-To-Speech API for voice agents. Create an account at gradium.ai to use the free tier across all 5 supported languages.