ElevenLabs Alternative: Why Developers Choose Gradium for Real-Time Voice AI


Gradium is a real-time voice AI platform built by the co-founders of Kyutai. It offers streaming Text-To-Speech and streaming Speech-To-Text with semantic voice activity detection (VAD) over WebSocket, plus a REST voice cloning API that produces an instant clone from 10 seconds of audio. It is built specifically for voice agents rather than content creation, which is the main reason teams evaluate it against ElevenLabs.

Who is this for? Developers and technical teams who have built on ElevenLabs (Turbo v2.5, Flash v2.5, Multilingual v2, Scribe, Conversational AI) and are evaluating a provider whose stack is optimized for voice agents from the ground up: voice-agent-tuned TTS, STT with semantic VAD for turn-taking, and accent-preserving voice cloning, all served with low latency, as shown by Gradium's published time-to-first-audio (TTFA) benchmark against ElevenLabs' own models.

How Do Gradium and ElevenLabs Compare at a Glance?

Dimension | Gradium | ElevenLabs
Primary use case | Real-time voice agents and developer voice APIs | Content creation (audiobooks, dubbing, podcasts, voiceovers), expanding into voice agents with its own platform
TTS model | TTS (streaming), designed for voice agents with robust handling of complex inputs (phone numbers, URLs, email addresses, dates, times, street addresses, order IDs, named entities) | Turbo v2.5, Flash v2.5 (streaming), Multilingual v2 (studio-grade)
TTS latency (TTFA, matched-methodology benchmark) | P50 258 ms end-to-end, P50 214 ms excluding connection establishment (published benchmark) | Turbo v2.5 P50 304 ms; Flash v2.5 P50 324 ms; Multilingual v2 P50 706 ms (measured in Gradium's same benchmark)
Word-level timestamps (TTS) | Yes, high-precision | Yes
STT model | STT (streaming) | Scribe
Semantic VAD | Yes, included in the STT | Not documented as a core feature
Voice cloning | Instant Voice Cloning + Professional Voice Cloning. Gradium's Instant Voice Clone has the highest Elo score in a blinded human evaluation benchmark against ElevenLabs across English, French, Spanish, and German | Instant Voice Cloning + Professional Voice Cloning
Voice library | Curated library of voices suited for voice agents | Very large voice library, a flagship ElevenLabs strength
Languages | English, French, Spanish, German, Portuguese, with regular updates | 70+ languages
Mid-sentence code-switching | Yes, no latency penalty | Not documented
Agent framework integrations | Platform-neutral: built to plug into any voice agent stack (LiveKit, Pipecat, and others) without preference | Conversational AI is ElevenLabs' own native agent platform; third-party integrations available
SDKs | Python, Rust (official) | Multiple official SDKs (see ElevenLabs docs)
Deployment options | Cloud marketplace, private cloud, on-premise, on-device, from the same model and API | Cloud SaaS; enterprise deployment options available; SOC 2, GDPR, HIPAA-ready (enterprise)
Enterprise data control | Zero data retention, SLA commitments (enterprise plans) | Enterprise plans with data-handling commitments
Free plan | $0/month, 45,000 credits (approximately 3x more audio than ElevenLabs' free tier) | Free tier available
Founders | Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, co-founders of Kyutai | Piotr Dąbkowski and Mati Staniszewski (founded 2022)

Who Is ElevenLabs?

ElevenLabs is a voice AI company founded in 2022 by Piotr Dąbkowski and Mati Staniszewski, originally known for studio-grade Text-To-Speech used in audiobooks, dubbing, podcasting, and voiceover work. Its TTS line includes Multilingual v2 for high-fidelity production use, Turbo v2.5 for lower-latency real-time use, and Flash v2.5 for the lowest-latency streaming. ElevenLabs has since expanded into Speech-To-Text (Scribe), voice cloning (Instant and Professional), and a Conversational AI platform for building voice agents. Language coverage is very broad (70+ languages) and the voice library is one of the largest in the market.

Who Is Gradium?

Gradium is a real-time voice AI platform for developers and companies deploying voice agents, built by researchers and co-founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez. Its product surface is:

  • A streaming Text-To-Speech API over WebSocket
  • A streaming Speech-To-Text API with semantic voice activity detection
  • A voice cloning API, available as Instant (zero-shot, 10 seconds of audio) and Pro (fine-tuned model)

The TTS and STT APIs share a streaming-first WebSocket architecture, suitable for bidirectional, low-latency communication in production. Voice cloning is exposed via REST.

The founding team previously co-founded Kyutai, a research lab with peer-reviewed work on audio language models. Kyutai released world-first open systems including Moshi (real-time Speech-To-Speech) and Hibiki (live Speech-To-Speech translation). Gradium translates that research into production infrastructure.

What Should You Look for in an ElevenLabs Alternative?

TTS

  • Designed for voice agents, not content creation, and available across voice agent stacks. ElevenLabs' Text-To-Speech lineage is studio-grade content creation (audiobooks, dubbing, podcasts, voiceovers), with real-time models layered on top. Gradium's Text-To-Speech is built for real-time voice agents, with first-class integrations into LiveKit and Pipecat, rather than being focused on a single proprietary agent platform.
  • Pronunciation robustness on complex inputs. Most TTS models, including the ones tuned for audiobook and narration quality, fall short on the inputs that matter most in voice agents: phone numbers, URLs, email addresses, dates, times, street addresses, order IDs, named entities. Gradium's TTS is specifically tailored for these cases. You can further fine-tune pronunciation using pronunciation dictionaries and text normalization rules.
  • TTFA latency at or below 300 ms, delivered over streaming. For real-time voice agents, a TTFA under 300 ms combined with a true streaming API is the practical threshold for natural-feeling turn-taking. Gradium's published benchmark measures P50 258 ms end-to-end (214 ms excluding connection establishment). ElevenLabs Turbo v2.5 and Flash v2.5 are also streaming-capable; their measured P50 in the same benchmark is 304 ms and 324 ms respectively.
  • Streaming API. TTS should stream over WebSocket so audio can be delivered incrementally as it is generated. Both vendors offer WebSocket streaming, with different default consumption patterns (Gradium is streaming-first; ElevenLabs' core surface is request-response with streaming available on specific models). For high-throughput pipelines, Gradium also supports multiplexing multiple TTS requests over a single WebSocket connection.
  • Voice naturalness and quality, with a ready-to-use voice library. Production teams usually want expressive, natural-sounding voices available out of the box, without having to clone or tune for every project. Both Gradium and ElevenLabs ship curated voice libraries; ElevenLabs' library is among the largest in the market.

STT

  • Semantic voice activity detection. Knowing when a user has stopped speaking is not the same as knowing when they have finished a thought. Semantic VAD is the layer that enables natural turn-taking. It is native to Gradium's STT. ElevenLabs Scribe is positioned primarily as a transcription model rather than a turn-taking engine.
  • Streaming STT. Transcription should be available incrementally over WebSocket for real-time pipelines. Gradium's STT streams over WebSocket by default.

Voice Cloning

  • Clone fidelity, accent and style preservation. Sample audio requirements, clone fidelity, and accent or speaking style handling vary between providers.

Platform

  • Integrated stack. A production voice pipeline benefits from TTS, STT, and VAD in a single integrated provider.
  • Enterprise and deployment readiness. Private cloud, on-premise, on-device options, zero data retention, SLA commitments, and concurrency guarantees matter at production scale.
  • Pricing. On public pricing, Gradium is on average 30% cheaper than ElevenLabs. Enterprise tiers are custom on both sides.

Switching from ElevenLabs. Gradium's WebSocket TTS follows the same streaming pattern as ElevenLabs' WebSocket streaming: open a connection, send text, receive audio chunks. If you already use ElevenLabs' WebSocket streaming, switching to Gradium means updating the endpoint, voice ID, and authentication; the streaming flow itself stays the same. After migration, the json_config parameter gives you additional control over pronunciation, speed, and expressiveness. For high-throughput pipelines, Gradium also supports multiplexing multiple requests over a single WebSocket connection.
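
As a rough illustration of how small the migration surface can be, the sketch below keeps the three provider-specific values (endpoint, voice ID, authentication) in one configuration object, so the streaming loop around it does not need to change. Every URL, header name, key, and voice ID here is a placeholder invented for the example, not a real ElevenLabs or Gradium value; take the actual ones from each provider's documentation.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass(frozen=True)
class TTSProviderConfig:
    """Provider-specific settings; the rest of the pipeline is unchanged."""
    endpoint: str      # WebSocket URL (placeholder values below)
    voice_id: str      # provider-specific voice identifier
    auth_header: str   # header carrying the API key (assumed mechanism)
    api_key: str

    def headers(self) -> Dict[str, str]:
        # Headers to send when opening the WebSocket connection.
        return {self.auth_header: self.api_key}


# Placeholder values only, not real endpoints or credentials.
OLD_PROVIDER = TTSProviderConfig(
    endpoint="wss://old-provider.example/tts",
    voice_id="old-voice-id",
    auth_header="x-old-api-key",
    api_key="OLD_KEY",
)
NEW_PROVIDER = TTSProviderConfig(
    endpoint="wss://new-provider.example/tts",
    voice_id="new-voice-id",
    auth_header="x-new-api-key",
    api_key="NEW_KEY",
)


def changed_fields(old: TTSProviderConfig, new: TTSProviderConfig) -> List[str]:
    """List which settings differ between two provider configs."""
    return [f.name for f in fields(old)
            if getattr(old, f.name) != getattr(new, f.name)]


print(changed_fields(OLD_PROVIDER, NEW_PROVIDER))
```

Keeping these values in one object makes it easy to run both providers side by side during an evaluation and switch with a single assignment.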

What Are the Key Differences Between Gradium and ElevenLabs?

How Does Gradium's TTS Compare to ElevenLabs for Voice Agents?

ElevenLabs' TTS lineage comes from studio-grade content creation: Multilingual v2 was built for high-fidelity audiobook and dubbing quality, and Turbo v2.5 and Flash v2.5 were introduced later to bring latency down for real-time use.

Voice agents rarely speak in clean prose. They have to say phone numbers, dates, times, URLs, email addresses, order IDs, confirmation codes, street addresses, dollar amounts, and named entities, and they have to say them correctly on the first attempt. A mispronounced confirmation code or a skipped digit in a phone number is a failure mode that surfaces directly to the user.

Gradium's TTS is tuned specifically for these cases. Phone numbers are grouped and intoned naturally instead of read as a flat digit string. URLs and email addresses are spelled out with correct handling of domains, dots, dashes, slashes, and special characters. Dates and times are pronounced in the regional convention of the target language. Complex named entities (company names, product names, abbreviations) are pronounced consistently across a session.

On latency, Gradium publishes a full TTFA benchmark with matched methodology across providers, measured from Paris on a 15-25 word sentence over WebSocket, 100 queries, warm state:

Model | P50 TTFA (end-to-end) | P50 TTFA (excluding connection establishment)
Gradium | 258 ms | 214 ms
ElevenLabs Turbo v2.5 | 304 ms | 257 ms
ElevenLabs Flash v2.5 | 324 ms | 277 ms
ElevenLabs Multilingual v2 | 706 ms | 657 ms

Gradium is 46 ms faster at P50 than Turbo v2.5 and 66 ms faster than Flash v2.5 end-to-end, and clears the 300 ms streaming threshold with headroom. Multilingual v2 is a studio-grade model and is not intended for real-time use.
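
For readers who want to check the arithmetic or apply the same summary to their own measurements, here is a minimal sketch. The dictionary holds the published end-to-end P50 figures from the table above; the per-query sample list is made up purely to show that P50 is the median of the individual TTFA measurements.

```python
from statistics import median
from typing import Sequence

# Published end-to-end P50 TTFA figures from the benchmark table, in ms.
P50_TTFA_MS = {
    "Gradium": 258,
    "ElevenLabs Turbo v2.5": 304,
    "ElevenLabs Flash v2.5": 324,
    "ElevenLabs Multilingual v2": 706,
}

STREAMING_THRESHOLD_MS = 300  # threshold cited in this article


def p50(samples_ms: Sequence[float]) -> float:
    """P50 is the median of the per-query TTFA samples."""
    return median(samples_ms)


def gap_ms(baseline: str, other: str) -> int:
    """How many ms slower `other` is than `baseline` at P50."""
    return P50_TTFA_MS[other] - P50_TTFA_MS[baseline]


for model in ("ElevenLabs Turbo v2.5", "ElevenLabs Flash v2.5"):
    print(model, "is", gap_ms("Gradium", model), "ms slower than Gradium")

# Headroom below the 300 ms threshold.
print("Gradium headroom:", STREAMING_THRESHOLD_MS - P50_TTFA_MS["Gradium"], "ms")

# Made-up per-query samples (ms), only to illustrate the P50 computation.
example_samples = [250, 240, 270, 300, 258]
print("P50 of example samples:", p50(example_samples))
```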

How Does Gradium's STT with Semantic VAD Compare to ElevenLabs Scribe?

In a real-time voice pipeline, turn-taking quality depends on detecting when the user has finished a complete thought, not just when they have gone silent. Semantic VAD is the mechanism that makes this possible. Without it, voice agents fall back on silence thresholds, which produce premature cut-offs or unnatural pauses.

Gradium's STT ships semantic VAD natively. ElevenLabs' Scribe, by contrast, is positioned primarily as a transcription model, leaving turn-taking to the surrounding agent pipeline or to ElevenLabs' Conversational AI platform.

How Does Voice Cloning Compare Between Gradium and ElevenLabs?

Gradium's voice cloning preserves the accent of the reference speaker from a single 10-second sample, rather than defaulting to a single standard pronunciation. Coverage spans the major regional accents of each supported language. In English, that includes American, British (RP and regional), Australian, Indian, Irish, Scottish, and South African. In French, it covers Metropolitan French, Quebecois, Belgian, Swiss French, and African French. In Spanish, Castilian, Mexican, Argentinian (including Rioplatense pronunciation), Colombian, and Caribbean. In German, High German, Austrian, Swiss German, and Bavarian. In Portuguese, European and Brazilian.

The same cloning pipeline captures speaking style as well as accent: conversational, narrative or audiobook, broadcast, customer-service, expressive and emotional, and whispered delivery. Whatever is in the 10-second sample is what the cloned voice will reproduce. Read more about Instant vs Pro Voice Cloning in Gradium.

ElevenLabs also offers Instant Voice Cloning and Professional Voice Cloning, with Professional tuned for high-fidelity content creation use. For voice-agent deployments where the clone needs to preserve a specific accent or delivery style, Gradium's cloning pipeline is tuned for that outcome.

Gradium has published a voice cloning benchmark comparing Instant Voice Cloning quality against ElevenLabs across English, French, Spanish, and German. The benchmark uses 890 sentences per language spanning three complexity levels (from simple conversational questions to sentences with rare named entities, URLs, email addresses, alphanumeric codes), 20 unique voices per language, and 10 seconds of source audio per clone. Human evaluators ran blinded A/B listening tests, comparing anonymized clones with the original recordings. 3,220 voice pairs were evaluated, feeding a live Elo ranking. Gradium's Instant Voice Clone achieved the highest Elo score in every language evaluated.
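
To show how pairwise listening votes can feed an Elo ranking, here is a minimal sketch of the standard Elo update. The benchmark's actual parameters are not stated in this article, so the K-factor, the starting rating, and the example votes below are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

K_FACTOR = 32.0         # assumed; the benchmark's K-factor is not stated
START_RATING = 1000.0   # assumed starting rating for every system


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a: float, r_b: float, a_wins: bool,
           k: float = K_FACTOR) -> Tuple[float, float]:
    """Apply one A/B vote and return the new (rating_a, rating_b)."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta  # zero-sum: B loses what A gains


def rank(votes: List[Tuple[str, str, bool]]) -> Dict[str, float]:
    """Fold a sequence of (system_a, system_b, a_wins) votes into ratings."""
    ratings: Dict[str, float] = {}
    for a, b, a_wins in votes:
        r_a = ratings.get(a, START_RATING)
        r_b = ratings.get(b, START_RATING)
        ratings[a], ratings[b] = update(r_a, r_b, a_wins)
    return ratings


# Made-up votes between two hypothetical systems.
votes = [
    ("system_x", "system_y", True),
    ("system_x", "system_y", True),
    ("system_x", "system_y", False),
]
ratings = rank(votes)
print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```

Because each vote moves the two ratings by equal and opposite amounts, the ranking can be updated live as new listening tests come in.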

How Do Gradium and ElevenLabs Compare on Language Support?

ElevenLabs supports 70+ languages, among the broadest coverage in the market, which is a natural fit for content creation workflows that need to render audio in many languages. Gradium supports five with native fluency: English, French, Spanish, German, and Portuguese. It also handles mid-sentence code-switching across all five: a speaker can shift language within a single sentence with no latency penalty and no quality degradation.

For broad multilingual content production across 70+ languages, ElevenLabs is the stronger fit. For voice agents in the five languages Gradium supports, Gradium goes deeper on accent preservation, mid-sentence code-switching, and voice-agent-tuned pronunciation.

How Does Gradium's Platform Neutrality Compare to ElevenLabs' Conversational AI?

Gradium builds voice models and APIs; it does not build its own voice agent platform. The focus is to enable every voice platform equally well, with first-class integrations into LiveKit, Pipecat, and the other orchestration layers teams are already using in production. ElevenLabs, by contrast, has launched its own Conversational AI platform and prioritizes it as the native surface for voice agents, with third-party integrations positioned alongside. Teams that want to compose their own voice agent stack, or that already depend on an orchestration layer they do not want to migrate away from, tend to prefer a platform-neutral voice provider.

How Do Deployment Options Compare?

Gradium is available across four deployment surfaces from the same model and API:

  • Cloud marketplace for standard SaaS consumption.
  • Private cloud for tenant-isolated deployments inside a customer's account.
  • On-premise for teams with strict data-residency or air-gapped requirements.
  • On-device for latency-critical or offline use cases where the model runs locally.

A pilot that ships on the cloud can move to on-premise or on-device without re-architecting the pipeline. ElevenLabs offers enterprise deployment options with data-handling commitments and relevant certifications. Gradium's differentiator is the breadth of explicitly supported surfaces, in particular on-device, delivered from a single model and API.

Gradium TTS: Streaming Text-To-Speech for Real-Time Applications

Gradium's TTS is built for streaming delivery. It connects via WebSocket and supports bidirectional communication, which is the architecture required for voice agents that must speak while staying ready for the next user input.
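
To make the streaming model concrete, the sketch below measures time to first audio (TTFA) on any asynchronous stream of audio chunks. It does not call Gradium or any other provider: `fake_tts_stream` is a local stand-in that sleeps and then yields dummy bytes, and its delays and chunk sizes are arbitrary. In a real test you would pass in the chunk iterator from your WebSocket client instead.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def fake_tts_stream(first_chunk_delay_s: float,
                          n_chunks: int = 3) -> AsyncIterator[bytes]:
    """Stand-in for a streaming TTS connection (no network involved)."""
    await asyncio.sleep(first_chunk_delay_s)  # simulated synthesis delay
    for _ in range(n_chunks):
        yield b"\x00" * 320                   # dummy audio chunk
        await asyncio.sleep(0.01)             # simulated inter-chunk gap


async def measure_ttfa(stream: AsyncIterator[bytes]) -> Tuple[Optional[float], int]:
    """Consume a chunk stream; return (TTFA in ms, total bytes received)."""
    start = time.perf_counter()
    ttfa_ms: Optional[float] = None
    total_bytes = 0
    async for chunk in stream:
        if ttfa_ms is None:
            # The first chunk has arrived: playback could start here.
            ttfa_ms = (time.perf_counter() - start) * 1000
        total_bytes += len(chunk)
    return ttfa_ms, total_bytes


ttfa_ms, total_bytes = asyncio.run(measure_ttfa(fake_tts_stream(0.05)))
print(f"TTFA: {ttfa_ms:.0f} ms, received {total_bytes} bytes")
```

The point of the pattern is that playback can begin at the first chunk rather than after the last one, which is why TTFA, not total synthesis time, is the latency figure that matters for turn-taking.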

Capabilities:

  • Real-time streaming via WebSocket. Audio is delivered incrementally as it is generated, not after the full sentence is complete.
  • Expressive speech with robust pronunciation. Designed to handle phone numbers, URLs, email addresses, dates, times, and named entities, the inputs that break most TTS models in agent pipelines.
  • Published TTFA benchmark. P50 258 ms end-to-end, P50 214 ms excluding connection establishment, measured on matched methodology against ElevenLabs Turbo v2.5, Flash v2.5, Multilingual v2, Mistral Voxtral TTS, and OpenAI GPT-4o Mini.
  • High-precision word-level timestamps. Useful for subtitling, lipsync, and interactive transcript display.
  • Multiple output formats for different integration surfaces.
  • Advanced configuration via json_config for controlling speed, expressiveness, voice similarity, and text normalization.
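
As one example of what the word-level timestamps listed above enable, the sketch below groups timed words into SRT-style subtitle cues. The `(text, start_s, end_s)` tuple shape is an assumption made for this example, not Gradium's documented response format, so adapt the parsing to the actual payload.

```python
from typing import List, Tuple

# Assumed shape for one timed word: (text, start in seconds, end in seconds).
Word = Tuple[str, float, float]


def srt_time(seconds: float) -> str:
    """Format a time offset as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 4) -> str:
    """Group consecutive words into cues of at most `max_words` words."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start, end = group[0][1], group[-1][2]
        text = " ".join(w[0] for w in group)
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)


# Made-up timings for illustration.
example_words: List[Word] = [
    ("Your", 0.00, 0.20), ("order", 0.20, 0.55), ("has", 0.55, 0.70),
    ("shipped", 0.70, 1.20), ("today", 1.30, 1.75),
]
print(words_to_srt(example_words))
```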

The TTS API is available via Python SDK, Rust SDK, and direct WebSocket integration, and is compatible with LiveKit and Pipecat.

Read more: Stream Text-To-Speech with the Gradium WebSocket API, Time to First Audio benchmark.

Gradium STT: Speech-To-Text with Semantic VAD

Gradium's STT combines streaming transcription with semantic VAD, a mechanism that determines when a speaker has finished a thought, not just stopped making sound.

In conversational AI, a standard VAD cuts off after a silence threshold, so the system either interrupts mid-sentence or waits too long. Semantic VAD uses the intent of the utterance to trigger turn-taking at the right moment.
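
The trade-off with silence thresholds can be seen in a toy simulation. The sketch below implements only the silence-threshold approach (not semantic VAD, and not Gradium's implementation) over a made-up sequence of speech/non-speech frames in which the speaker pauses for 600 ms mid-utterance. The frame length and the pause durations are assumptions for illustration.

```python
from typing import List, Optional

FRAME_MS = 20  # assumed duration of one audio frame


def silence_endpoint(is_speech: List[bool], threshold_ms: int) -> Optional[int]:
    """Time (ms) at which a silence-threshold VAD declares end of turn,
    or None if the threshold is never reached."""
    silent_ms = 0
    heard_speech = False
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent_ms = 0            # any speech resets the silence counter
        elif heard_speech:
            silent_ms += FRAME_MS
            if silent_ms >= threshold_ms:
                return (i + 1) * FRAME_MS
    return None


# 1000 ms of speech, a 600 ms thinking pause, 1000 ms more speech, then silence.
frames = [True] * 50 + [False] * 30 + [True] * 50 + [False] * 100
true_end_ms = 130 * FRAME_MS  # the user actually finishes at 2600 ms

for threshold in (400, 1000):
    fired = silence_endpoint(frames, threshold)
    verdict = "cuts the user off" if fired < true_end_ms else (
        f"waits {fired - true_end_ms} ms after the user finished")
    print(f"threshold {threshold} ms -> fires at {fired} ms ({verdict})")
```

With a 400 ms threshold the detector fires during the pause, before the user has finished; with a 1000 ms threshold it survives the pause but adds a full second of dead air at the end of every turn. No single threshold avoids both failure modes, which is the gap semantic VAD is meant to close.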

Capabilities:

  • Best-in-class accuracy with controllable latency
  • Robust performance in noisy environments, designed for real-world deployment
  • Semantic VAD for smart turn-taking
  • Streaming via WebSocket

Read more: Real-time speech transcription with the Gradium WebSocket API.

Voice Cloning: Instant and Pro

Gradium offers two voice cloning tiers.

Instant Voice Clone. Create a custom voice from as little as 10 seconds of audio. The clone is immediately available for TTS generation via the API. All paid plans include up to 1,000 Instant Voice Clones per month. In a blinded human evaluation benchmark, Gradium's Instant Voice Clone achieved the highest Elo score against ElevenLabs across English, French, Spanish, and German.

Pro Voice Clone. A fine-tuned model trained on more audio, designed to be indistinguishable from the original speaker. Gradium positions Pro Voice Clones as the highest speaker-similarity option on the market. Pro clones are available from the M plan ($340/month, 5 included) and L plan ($1,615/month, 20 included).

Both clone tiers preserve the accent and speaking style of the reference sample within each of Gradium's five supported languages (see the key differences section for the full accent list).

Both clone types are accessible via the REST API, the Python SDK, or Gradium Studio. Read more about Instant vs Pro Voice Cloning.

Read more: Instant voice cloning with the Gradium API.

Languages: Native Fluency Across Five Languages

Gradium supports five languages with native fluency: English, French, Spanish, German, and Portuguese.

Mid-sentence code-switching is supported across all five, with no latency penalty and no quality degradation. This is relevant for multilingual AI companions, international customer support agents, and language learning applications.

For deployments that require coverage across a broader set of languages, ElevenLabs supports 70+ languages and is the stronger fit, particularly for content production.

Deployment Options

Gradium supports four deployment surfaces from the same model and API:

  • Cloud marketplace. Fastest path to production with standard SaaS consumption.
  • Private cloud. Tenant-isolated deployment inside the customer's cloud account.
  • On-premise. For teams with strict data-residency, regulatory, or air-gapped requirements (healthcare, financial services, defense).
  • On-device. For latency-critical or offline use cases where the model runs locally on the end-user device.

Enterprise plans also include zero data retention and SLA commitments. ElevenLabs offers enterprise deployment options with data-handling commitments. Gradium's differentiator is the breadth of explicitly supported surfaces, in particular on-device, delivered from a single model and API.

Gradium Pricing

Gradium offers a free tier and credit-based paid plans starting at $13/month. On public pricing, Gradium is on average 30% cheaper than ElevenLabs. For full pricing details, see the Gradium pricing page.

Gradium also runs a Startup Program: seed-funded startups can apply for $2,000+ in free credits, 6 months of full API access, direct engineering support, and early model access.

Who Should Choose Gradium Over ElevenLabs?

Choose Gradium if you are:

TTS-driven.

  • Shipping a voice agent that needs robust pronunciation on complex inputs (phone numbers, URLs, email addresses, dates, times, street addresses, order IDs, named entities) across English, French, Spanish, German, and Portuguese, not just on clean narration
  • Evaluating TTS with a published, reproducible latency benchmark that includes head-to-head matched-methodology numbers against ElevenLabs Turbo v2.5, Flash v2.5, and Multilingual v2

STT-driven.

  • Building conversational AI agents where natural turn-taking is the key quality driver, and native semantic VAD is a hard requirement

Voice-cloning-driven.

  • Cloning voices that need to preserve a specific accent or speaking style within a language, rather than defaulting to a single standard pronunciation

Platform-driven.

  • Working in Python or Rust and want officially supported SDKs
  • Integrating with LiveKit or Pipecat and want a voice layer that connects natively
  • Deploying across multiple surfaces (cloud marketplace, private cloud, on-premise, on-device) from a single model and API
  • Requiring enterprise-grade data control, with zero data retention and SLA commitments on enterprise plans
  • A seed-funded startup looking for production-grade voice AI with onboarding credits of $2,000+ and 6 months of full API access

Also comparing Cartesia or Deepgram? See our dedicated comparison pages.

ElevenLabs remains the stronger choice if:

  • Your primary use case is offline or near-offline content production (audiobooks, dubbing, podcasting, voiceover) where studio-grade quality on long-form reading is the key requirement
  • You need coverage across 70+ languages
  • You depend on the very large ElevenLabs voice library

Frequently Asked Questions