How much audio is needed to clone a voice?

The best instant voice cloning APIs in 2026 require as little as 10 seconds of clean audio. Gradium and Cartesia both specify a 10-second minimum for instant clones. Professional or fine-tuned cloning typically requires significantly more audio (often hours) for higher fidelity results.

Is voice cloning legal?

Voice cloning is generally lawful when performed with the explicit consent of the person whose voice is being cloned. Without consent, cloning a voice may violate privacy laws (GDPR, CCPA), biometric data regulations such as Illinois BIPA, or intellectual property rights. Always obtain explicit, documented consent before collecting or processing voice data for cloning purposes.

Which voice cloning API has the best speaker similarity?

Gradium published a benchmark of 3,220 blinded human evaluations across English, French, Spanish, and German, achieving the highest Elo score in all four languages. Elo-based benchmarks derived from head-to-head human comparisons are a robust measure of speaker similarity because they rest on relative judgments rather than absolute scores, which makes them less sensitive to differences in how individual evaluators use a rating scale.

Can cloned voices be used for real-time voice agents?

Yes. Gradium's Instant Voice Clone is integrated directly into its WebSocket Text-To-Speech API: cloned voices are used in the same API call as standard voices, with the same time to first audio of P50 258 ms end-to-end (P50 214 ms excluding connection establishment) in Gradium's published Paris benchmark. Cartesia Sonic-3 also supports cloned voices in real-time streaming. Both are suitable for real-time voice agent deployments with personalized or branded voices.

Is voice cloning available on free plans?

Gradium is the only provider in this comparison to offer voice cloning on a free tier: the free plan includes 5 Instant Voice Clones and 45,000 credits per month, requires no credit card, and does not permit commercial use. Cartesia includes instant voice cloning on its entry-level paid Pro plan. ElevenLabs does not include voice cloning on its free tier.

Does voice cloning work across multiple languages?

Yes, to varying degrees. Gradium supports cross-lingual synthesis in five languages (English, French, German, Spanish, Portuguese) with mid-sentence code-switching and no latency penalty. ElevenLabs supports 32 languages for cross-lingual voice cloning. Cartesia supports 40+ languages. Cross-lingual quality (a voice cloned in one language synthesizing text in another) varies by provider and by language pair.

How are cloned voices stored and protected?

Voice clone data storage and handling policies vary by provider. Gradium, ElevenLabs, and Cartesia all operate under acceptable use policies that restrict unauthorized cloning. For products handling user voice data, developers should review each provider's data processing agreement and ensure compliance with applicable privacy regulations for their target markets, including GDPR, CCPA, and biometric-specific laws such as Illinois BIPA.

What SDKs and integrations does Gradium offer for voice cloning?

Gradium provides official Python and Rust SDKs and ships first-class integrations with LiveKit and Pipecat. Cloned voices are used through the same WebSocket Text-To-Speech endpoint as standard voices, so no separate cloning SDK is required.

Where can Gradium voice cloning be deployed?

Gradium offers four deployment options with the same model and the same API surface: cloud marketplace, private cloud, on-premise, and on-device. Voice clones are usable across all four deployment models, which matters for regulated industries that need to keep biometric voice data inside their own infrastructure.

How much does Gradium voice cloning cost?

Voice cloning is included on every Gradium plan. The free plan ($0 per month, 45,000 credits) includes 5 Instant Voice Clones with no commercial use. Paid plans start at $13 per month (XS) and include up to 1,000 Instant Voice Clones per month. Pro Voice Clone is available from the M plan ($340 per month, 5 Pro clones) and the L plan ($1,615 per month, 20 Pro clones). See gradium.ai/pricing for current allocations.

How do I get started with Gradium voice cloning?

Sign up on gradium.ai, create an Instant Voice Clone by uploading a 10-second audio sample with documented consent from the speaker, then pass the resulting clone ID to the streaming Text-To-Speech WebSocket. Gradium's free tier includes 5 Instant Voice Clones with no credit card required, which is enough to validate an end-to-end integration before adopting a paid plan.

Best Voice Cloning APIs in 2026: Instant Cloning, Fine-Tuning, Benchmarks

Q: What is a voice cloning API?

A voice cloning API creates a synthetic voice that replicates the characteristics of a real speaker from an audio sample. The resulting clone can then be used as the voice model for Text-To-Speech synthesis, generating any text in the target speaker's voice. The two main types are instant cloning (available immediately from a short sample) and professional cloning (fine-tuned from more data, higher fidelity).

Q: What is the difference between instant voice cloning and professional voice cloning?

Instant voice cloning creates a usable clone from a short audio sample (10 seconds to a few minutes) without model retraining and is available immediately after upload. Professional voice cloning trains a dedicated model on more audio data, producing higher speaker similarity. Instant cloning is suited to personalized agents and user-generated voices, while professional cloning is suited to flagship brand voices and high-fidelity consumer products.

Q: Who founded Gradium?

Gradium was founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez. They are co-founders of Kyutai, the Paris-based non-profit AI research lab, where they shipped Moshi (real-time Speech-To-Speech) and Hibiki (live Speech-To-Speech translation).

Voice cloning lets developers create a synthetic replica of a real human voice from a short audio sample. In 2026, the best voice cloning APIs can produce convincing clones from as little as 10 seconds of audio and make the resulting voice available for real-time Text-To-Speech streaming inside a voice agent, an audiobook pipeline, or a personalized assistant.

This guide compares the leading voice cloning APIs available to developers: Gradium, ElevenLabs, and Cartesia, with notes on providers that do not yet expose public voice cloning (Deepgram and OpenAI). Each is evaluated on clone quality, minimum sample requirements, integration with streaming Text-To-Speech, pricing, language coverage, and data handling. The guide also covers the ethical obligations that apply to any production use of voice cloning technology.

What Is a Voice Cloning API?

A voice cloning API creates a custom synthetic voice that replicates the characteristics of a real speaker: timbre, accent, cadence, and prosody. Once created, the cloned voice can be used as the voice model for a Text-To-Speech API, synthesizing arbitrary text in the target speaker's voice.

In practice, voice cloning APIs fall into two categories:

Instant voice cloning. A zero-shot or few-shot approach that creates a usable clone from a short audio sample (typically 10 seconds to a few minutes) without model retraining. Available immediately after upload.
Professional voice cloning (also called fine-tuned cloning). A higher-fidelity approach that trains a model specifically on a larger audio dataset from the target speaker. Produces better speaker similarity at the cost of more time and a higher price point.

A deeper walkthrough of the trade-offs between the two approaches is available in our guide to instant vs. professional voice cloning.

How Do Instant and Professional Voice Cloning Compare?

Dimension	Instant cloning	Professional cloning
Sample required	10 seconds to a few minutes	Typically hours of audio
Availability	Immediate (seconds after upload)	Hours to days (training time)
Speaker similarity	Good to very good	Very good to excellent
Cost	Low (included in standard plans)	Higher (dedicated plan tier)
Best for	Prototyping, personalized agents, scale	Brand voices, public-facing products

For most developer use cases (personalized AI assistants, voice agents, gaming characters, user-specific avatars), instant cloning provides sufficient quality. Professional cloning is reserved for flagship brand voices or consumer products where speaker accuracy is the primary differentiator.

What Should You Look for in a Voice Cloning API?

Minimum Sample Duration

The amount of audio required to create an acceptable clone varies between providers. A shorter minimum sample reduces the friction of voice collection, which matters for products where users provide their own voice.
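A product that collects user voice samples can reject clips that are too short before uploading them. The sketch below is a minimal illustration using Python's standard `wave` module; the 10-second threshold mirrors the minimum cited in this guide, the function names are our own, and it assumes uncompressed PCM WAV input rather than any provider-specific format.

```python
import wave

# Assumed threshold: the 10-second minimum cited in this guide.
MIN_SECONDS = 10.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, min_seconds: float = MIN_SECONDS) -> bool:
    """True when the sample meets the minimum duration."""
    return wav_duration_seconds(path) >= min_seconds
```

Duration is only a first gate: a clip can be long enough and still be unusable because of noise, clipping, or long silences, so a real pipeline would likely add further checks.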

Speaker Similarity Quality

Speaker similarity measures how closely the synthesized voice matches the original speaker. Common evaluation methods include:

Elo rating. Derived from blinded human comparisons where evaluators choose the more similar voice between two candidates. More robust than single-score metrics because it accounts for relative quality across many pairs.
MOS (Mean Opinion Score). A 1–5 human rating of overall audio quality. Useful but does not isolate speaker similarity.
SMOS (Speaker Mean Opinion Score). A variant of MOS specifically measuring how well the synthesized voice matches the target speaker.
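To make the Elo approach concrete, the sketch below applies the standard Elo update to a list of pairwise "A sounded more similar than B" judgments. It is an illustration only: the starting rating of 1000 and the K-factor of 32 are assumptions, and it is not necessarily the exact scoring procedure used in any provider's published benchmark.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift rating points from loser to winner after one comparison."""
    surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * surprise
    ratings[loser] -= k * surprise


def rank(comparisons, initial: float = 1000.0, k: float = 32.0) -> dict:
    """comparisons: iterable of (winner, loser) pairs from blinded evaluators."""
    ratings = {}
    for winner, loser in comparisons:
        ratings.setdefault(winner, initial)
        ratings.setdefault(loser, initial)
        update_elo(ratings, winner, loser, k)
    return ratings
```

Because each update depends only on which of two candidates was preferred, evaluators never have to agree on what a "4 out of 5" means, which is the main practical difference from MOS-style scoring.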

We explore why some clones sound fake even when the underlying model is strong in a separate post.

How Fast Is the Clone Available?

For applications where users create their own voice profiles (personalized agents, interactive products, in-game characters), the time between sample upload and clone availability directly impacts the end-user experience.

Integration with Streaming Text-To-Speech

A voice clone is only useful if it can be driven by a low-latency streaming Text-To-Speech API. Providers that offer cloning and Text-To-Speech in the same platform eliminate the integration overhead, authentication complexity, and latency budget cost of stitching separate services together.

Pricing and Clone Limits

Some providers charge per clone created; others include a number of clones in plan tiers. For products that generate many user-specific voices, per-clone pricing can become a significant cost driver at scale.

Data Privacy and Consent

Voice cloning APIs process biometric data. Responsible providers require explicit consent from the person whose voice is being cloned and offer clear data handling policies. Depending on jurisdiction, voice data may be subject to GDPR, CCPA, or biometric privacy laws such as Illinois BIPA.

How Do the Best Voice Cloning APIs Compare in 2026?

Provider	Instant cloning	Min. sample	Professional cloning	Clones on free/entry tier	Integrated streaming TTS	Languages
Gradium	Yes	10 seconds	Yes (M plan and above)	5 (free tier, no card)	Yes, WebSocket, real-time	5 (EN, FR, DE, ES, PT)
ElevenLabs	Yes (paid plans only)	Not published	Yes (Creator plan and above)	None on free tier (paid plans only)	Yes	32
Cartesia	Yes	10 seconds	Not publicly documented	Included (all paid plans)	Yes, WebSocket and REST	40+
Deepgram Aura-2	No	N/A	No	N/A	Yes (built-in voices only)	7
OpenAI TTS	No (public)	N/A	Limited preview only	N/A	Yes (HTTP streaming)	Multiple (EN-optimized)

Who Is Gradium?

Gradium is a real-time voice AI platform built on the Delayed Streams Modeling research family (arXiv:2509.08753). The company was founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, co-founders of Kyutai, the Paris-based non-profit AI research lab that shipped Moshi (real-time Speech-To-Speech) and Hibiki (live Speech-To-Speech translation). Voice cloning is integrated directly into Gradium's TTS and STT stack, accessible through the same WebSocket API used for synthesis.

Instant Voice Clone

Gradium's Instant Voice Clone creates a custom voice from a minimum of 10 seconds of audio. The clone is available within seconds of upload and can immediately drive real-time Text-To-Speech streaming. The cloning engine preserves fine-grained vocal micro-traits including vocal fry, rasp, breathiness, pitch dynamics, and accent characteristics, the details that generic-sounding clones tend to lose.

Instant Voice Clones are available on every Gradium plan:

Free tier. 5 Instant Voice Clones, no commercial use, no credit card required.
XS, S, M, L plans. Up to 1,000 Instant Voice Clones per month.

This makes Gradium the only provider in this comparison to offer voice cloning access on a free tier.

Pro Voice Clone

Gradium's Pro Voice Clone is a fine-tuned model trained specifically on the target speaker's audio data, producing higher speaker fidelity than instant cloning. Pro Voice Clone is available from the M plan (5 Pro clones included) and the L plan (20 Pro clones included). See Gradium pricing for current allocations.

Speaker Similarity Benchmark

Gradium's voice cloning was evaluated in a benchmark of 3,220 blinded human evaluations across four languages: English, French, Spanish, and German. Each language used 890 sentences and 20 voices, with a 10-second source sample per voice, scored under a live Elo ranking. Gradium achieved the highest Elo score in all four languages, representing an 8–11% advantage in speaker similarity over the comparison providers included in the benchmark.

Integration

Cloned voices are used directly inside Gradium's streaming Text-To-Speech API over WebSocket. There is no separate cloning runtime to call: the clone ID is passed as a parameter in the TTS request, typically through the json_config field on the WebSocket connection. Time to first audio with a cloned voice is the same as with a standard voice: P50 258 ms, P95 274 ms end-to-end; P50 214 ms, P95 228 ms excluding connection establishment (published benchmark, Paris, 15–25 word sentence, WebSocket, 100 queries, warm). High-throughput agents can multiplex multiple TTS streams on a single WebSocket for further efficiency, and runtime behaviour around tricky inputs is controlled through pronunciation dictionaries and text normalization rules.
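For teams reproducing this kind of measurement on their own traffic, the sketch below shows one way to derive P50 and P95 time to first audio, both end-to-end and excluding connection establishment. It assumes you have already logged two timings per query (connection setup, and request sent to first audio byte); the function names and the nearest-rank percentile method are our own choices, not a description of Gradium's benchmark tooling.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_ttfa(connect_ms, synth_ms):
    """connect_ms[i]: connection setup; synth_ms[i]: request sent to first audio byte."""
    end_to_end = [c + s for c, s in zip(connect_ms, synth_ms)]
    return {
        "p50_end_to_end": percentile(end_to_end, 50),
        "p95_end_to_end": percentile(end_to_end, 95),
        "p50_excluding_connect": percentile(synth_ms, 50),
        "p95_excluding_connect": percentile(synth_ms, 95),
    }
```

Reporting both views matters for agents that keep a WebSocket open across turns: after the first request, the figure excluding connection establishment is the one users actually experience.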

Best For

Gradium is a strong choice for developers who need voice cloning accessible from the free tier, want cloning and Text-To-Speech integrated in a single API, require real-time streaming with cloned voices, or are building products where users create their own voice profiles at scale. The combination of per-language clone quality, real-time streaming latency, and pricing makes it well suited to production voice agent deployments. Dedicated head-to-heads are available for Gradium vs. ElevenLabs, Gradium vs. Cartesia, and Gradium vs. Deepgram.

Who Is ElevenLabs?

ElevenLabs is the most widely recognized provider for voice cloning in the content creation market. Its instant and professional cloning capabilities are part of the same platform as its Text-To-Speech models.

Instant Voice Cloning

ElevenLabs Instant Voice Cloning is available on all paid plans (not the free tier). A short audio sample is sufficient to create a usable clone. The clone can then be used with any ElevenLabs Text-To-Speech model, including Turbo v2.5 (P50 304 ms), Flash v2.5 (P50 324 ms), and Multilingual v2 (P50 706 ms) when benchmarked under Gradium's published methodology.

Professional Voice Cloning

ElevenLabs Professional Voice Cloning (also referred to as fine-tuned or high-fidelity cloning) is available from the Creator plan and above. This tier trains a dedicated model on more audio data, producing higher speaker similarity for flagship use cases such as branded audiobooks, narrator continuity, or signature consumer voices.

Language Support

32 languages. ElevenLabs' multilingual voice cloning supports cross-lingual synthesis: a voice cloned in one language can synthesize text in other supported languages, though fidelity varies by source language and target language.

Best For

ElevenLabs is the strongest choice for content creation use cases (audiobooks, dubbing, narration) where voice naturalness and a broad language catalogue are the primary requirements. The Creator plan and above provide professional cloning for high-fidelity brand voice applications. Teams evaluating ElevenLabs specifically for voice agents will want to compare against the Gradium ElevenLabs alternative analysis.

Who Is Cartesia?

Cartesia offers instant voice cloning as part of its Sonic-3 Text-To-Speech platform. Cloning is included in all paid plan tiers, starting with the entry-level Pro plan.

Instant Voice Cloning

Cartesia's instant voice cloning creates a clone from a minimum of 10 seconds of audio. The clone is available immediately for use with Sonic-3, Cartesia's State Space Model–based Text-To-Speech engine with low time to first audio (vendor claim).

The SSM architecture is designed to maintain consistent low latency, including with cloned voices, which Cartesia positions as a P99 advantage over transformer-based providers.

Language Support

40+ languages. Cloned voices can synthesize in any supported language, with regional accent variants available for several languages.

Pricing

Voice cloning is included in all Cartesia paid plans (Pro, Startup, Scale). See the provider for current credit allocations per tier.

Best For

Cartesia is a strong choice for teams who need instant voice cloning with low Text-To-Speech latency and broad language coverage (40+), with cloning included from entry-level pricing. A side-by-side breakdown is available in our Cartesia alternative analysis.

Which Providers Do Not Yet Offer Public Voice Cloning?

Deepgram Aura-2

Deepgram Aura-2 does not expose voice cloning. The platform provides a fixed set of built-in voices for Text-To-Speech synthesis. Teams running on Deepgram who require voice cloning need to use a separate provider for that layer, which adds an integration boundary, an extra latency budget, and an extra vendor relationship. See our Deepgram alternative analysis for the full comparison.

OpenAI TTS

OpenAI's Voice Engine (its voice cloning technology) remains in a limited preview program as of 2026 and is not available as a standard developer API. OpenAI TTS (tts-1, tts-1-hd, gpt-4o-mini-tts) uses a fixed set of built-in voices. Custom voice creation is not publicly accessible.

What Are the Ethical Considerations for Voice Cloning?

Voice cloning creates a synthetic model of a real person's voice. That carries responsibilities distinct from other AI capabilities.

Consent Is Non-Negotiable

No voice should be cloned without the explicit, informed consent of the person it belongs to. This applies regardless of whether the voice is publicly available (e.g., from a podcast, public speech, or social media). Responsible voice cloning APIs enforce consent requirements contractually and, where feasible, technically.

Legal Obligations

Depending on jurisdiction, voice data may qualify as biometric data under privacy regulations including GDPR (EU), CCPA (California), and biometric-specific laws such as Illinois BIPA. Developers building products that collect and process user voice data for cloning should obtain legal guidance specific to their target markets.

Misuse Prevention

Voice cloning technology can be misused to create synthetic audio that impersonates real individuals. Responsible use requires:

Recording explicit consent at the time of audio collection, with a documented record kept on file.
Limiting clone use strictly to the purposes disclosed to, and authorized by, the consenting individual.
Avoiding endorsement-like content that the original speaker would not approve, especially political, commercial, or sensitive contexts.
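The first two requirements above can be enforced in application code by storing a consent record alongside each clone and checking it before every synthesis request. The sketch below is one possible shape for that record; every field name, the purpose strings, and the `evidence_uri` pointer are hypothetical, and it is not a substitute for legal review.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConsentRecord:
    """Documented consent captured at the time of audio collection (hypothetical schema)."""
    speaker_id: str
    recorded_at: datetime
    allowed_purposes: frozenset  # purposes disclosed to and authorized by the speaker
    evidence_uri: str            # pointer to the signed form or recorded statement


def is_use_permitted(record: ConsentRecord, purpose: str) -> bool:
    """Allow a clone to be used only for purposes the speaker authorized."""
    return purpose in record.allowed_purposes


# Example: consent covers a support agent, not marketing.
record = ConsentRecord(
    speaker_id="speaker-001",
    recorded_at=datetime.now(timezone.utc),
    allowed_purposes=frozenset({"support_agent"}),
    evidence_uri="consent/speaker-001.pdf",
)
```

Making the record immutable (`frozen=True`) means a change in scope requires a new record, which keeps an audit trail of what the speaker agreed to and when.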

Gradium, ElevenLabs, and Cartesia all require users to agree to acceptable use policies that prohibit unauthorized voice cloning.

Gradium Pricing

Gradium's free tier includes 45,000 credits per month (about 1 hour of TTS or 4 hours of STT) and 5 Instant Voice Clones, no credit card required and no commercial use. Paid plans start at $13 per month (XS) and scale up to enterprise tiers. Pro Voice Clone is available from the M plan ($340 per month, 5 Pro clones) and the L plan ($1,615 per month, 20 Pro clones). See gradium.ai/pricing for current credit allocations, clone limits, and feature availability per tier. Seed-funded teams can also apply to the Gradium Startup Program for $2,000+ in free credits and six months of full API access.

How Should You Choose the Right Voice Cloning API?

Choose Gradium if you need voice cloning accessible from a free tier, want cloning and Text-To-Speech integrated in a single API and a single WebSocket, are building products where users create their own voice profiles at scale, require cloning across English, French, German, Spanish, or Portuguese, or want a well-documented speaker similarity benchmark (3,220 blinded evaluations, highest Elo in EN/FR/ES/DE). Gradium is also the right pick when you need to ship a voice agent on top of LiveKit or an audio pipeline on top of Pipecat without stitching together separate cloning and synthesis vendors.

Choose ElevenLabs if voice naturalness for content creation (audiobooks, narration, dubbing) is the primary use case, you need cross-lingual cloning across 32 languages, or you require professional-grade fine-tuned cloning for flagship brand voices.

Choose Cartesia if you need low Text-To-Speech latency with cloned voices, require broad language coverage (40+), and want cloning included from entry-level pricing.

For a broader view of the Text-To-Speech market beyond voice cloning, see our overview of the best Text-To-Speech API for voice agents.
