Does Gradium offer voice cloning? Does Deepgram?

Gradium offers two voice cloning tiers: Instant Voice Clone (from 10 seconds of audio) and Pro Voice Clone (a fine-tuned model for highest fidelity). Deepgram does not offer a voice cloning product. Explicit consent is required before cloning any voice.

How does Gradium voice cloning compare on quality?

Gradium has published a voice cloning benchmark of Instant Voice Clone quality: 890 sentences per language, 20 voices per language, 10-second source audio, blinded A/B listening tests, 3,220 voice pairs evaluated, live Elo ranking. Gradium's Instant Voice Clone achieved the highest Elo score in every language evaluated (English, French, Spanish, German). Deepgram is not in the benchmark because it does not offer voice cloning.

Can Gradium clone accents and speaking styles within a language?

Yes. Gradium's voice cloning reproduces the accent and the speaking style present in the 10-second reference sample, within each of its five supported languages, rather than defaulting to a single standard pronunciation. This covers the major regional accents in English, French, Spanish, German, and Portuguese, and speaking styles including conversational, narrative, broadcast, customer-service, expressive, and whispered delivery.

What is semantic VAD and why does it matter when comparing Gradium to Deepgram?

Semantic voice activity detection determines not just when a speaker has gone silent, but when they have finished a complete thought. In conversational AI, this prevents the system from cutting in mid-sentence. Gradium's STT includes semantic VAD natively. Deepgram's Flux is designed for conversational voice agents and uses semantic and acoustic cues for end-of-turn detection, though it is not marketed under the "semantic VAD" label.

How does Gradium STT compare to Deepgram Nova-3 and Flux?

Both offer real-time streaming STT. Deepgram's Nova-3 supports 45+ languages, a significantly broader set than Gradium's five (English, French, Spanish, German, Portuguese). Flux is designed for voice agents with turn detection and interruption handling. Gradium's STT includes semantic VAD natively and is integrated with Gradium's TTS and voice cloning in the same streaming architecture.

How does Gradium compare to Deepgram Aura on TTS?

Gradium's TTS was built alongside the STT as a streaming-first product from day one. Deepgram added Aura (Aura-1, Aura-2) to a platform that began with STT. Gradium publishes a time-to-first-audio (TTFA) benchmark with matched methodology (P50 258 ms end-to-end, P50 214 ms excluding connection establishment, measured from Paris on a 15-25 word sentence over WebSocket, 100 queries, warm state).

What SDKs does Gradium offer?

Gradium provides official SDKs in Python and Rust, and integrates natively with LiveKit and Pipecat.

Can Gradium be deployed on-premise?

Yes. Gradium supports the full range of deployment surfaces: cloud marketplace, private cloud (tenant-isolated in the customer's account), on-premise (including air-gapped), and on-device (for latency-critical or offline use cases). The same model and API surface are available across all four, so a deployment can move from cloud to on-premise or on-device without re-architecting. Enterprise plans also include zero data retention and SLA commitments. Deepgram also offers self-hosted enterprise deployment.

Does Gradium offer a free plan?

Yes. Gradium's Free plan includes 45,000 credits per month (approximately 1 hour of TTS or 4 hours of STT), 5 Instant Voice Clones, and full API and Studio access. Commercial use is not included on the Free plan. Deepgram offers $200 in free credits with no monthly commitment, then pay-as-you-go.

How does Gradium pricing work?

Gradium uses a credit system: 1 character of TTS equals 1 credit, and 1 second of STT equals 3 credits. Plans range from Free ($0/month, 45,000 credits) to Tailored (custom pricing). Additional credits can be purchased pay-as-you-go on any paid plan.
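As a worked example of the credit arithmetic above, the sketch below estimates credit usage from the two stated rates. The function names are illustrative, and it assumes every character (including spaces and punctuation) counts and that STT seconds are not rounded; the pricing page is the authority on both points.

```python
# Rates stated in the text: 1 TTS character = 1 credit, 1 STT second = 3 credits.
TTS_CREDITS_PER_CHAR = 1
STT_CREDITS_PER_SECOND = 3
FREE_PLAN_CREDITS = 45_000


def tts_credits(text: str) -> int:
    """Credits to synthesize `text` (assumes every character is billed)."""
    return len(text) * TTS_CREDITS_PER_CHAR


def stt_credits(audio_seconds: float) -> float:
    """Credits to transcribe `audio_seconds` of audio (assumes no rounding)."""
    return audio_seconds * STT_CREDITS_PER_SECOND


def stt_hours_for(credits: int) -> float:
    """Hours of STT that a credit balance covers."""
    return credits / STT_CREDITS_PER_SECOND / 3600


def tts_chars_for(credits: int) -> int:
    """Characters of TTS that a credit balance covers."""
    return credits // TTS_CREDITS_PER_CHAR


if __name__ == "__main__":
    print(tts_chars_for(FREE_PLAN_CREDITS))            # characters of TTS
    print(round(stt_hours_for(FREE_PLAN_CREDITS), 2))  # hours of STT
```

On the Free plan this works out to 45,000 characters of TTS, or 15,000 seconds (about 4.2 hours) of STT, which is consistent with the "approximately 4 hours of STT" figure.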

Where can I get started with Gradium?

Gradium is available at gradium.ai. The Free plan ($0/month, 45,000 credits) is the fastest way to test the API. Integration guides are available for streaming TTS, streaming STT, and instant voice cloning.

Deepgram Alternative: Why Developers Choose Gradium for Real-Time Voice AI

Gradium is a real-time voice AI platform built by the co-founders of Kyutai. It offers streaming Text-To-Speech and streaming Speech-To-Text with semantic voice activity detection (VAD) over WebSocket, plus a REST voice cloning API that produces an instant clone from 10 seconds of audio. Voice cloning and native semantic VAD are the two capabilities most teams evaluate when comparing Gradium to Deepgram, since Deepgram does not currently ship a voice cloning product.

Who is this for? Developers and technical teams who have built on Deepgram (Nova-3 STT, Flux for voice agents, Aura TTS, Voice Agent API) and are evaluating a provider that adds voice cloning and native semantic VAD to a streaming-first TTS and STT stack, particularly those who prefer composable voice models over a bundled Voice Agent API.

How Do Gradium and Deepgram Compare at a Glance?

Dimension	Gradium	Deepgram
Primary use case	Real-time voice agents and developer voice APIs	Speech-To-Text for transcription and voice agents, with Aura TTS and a bundled Voice Agent API
Primary strength	Integrated TTS + STT + voice cloning platform, streaming-first	STT-first platform (Nova-3, Flux) with Aura TTS and Audio Intelligence
TTS model	Streaming TTS via WebSocket, designed for voice agents with robust handling of complex inputs (phone numbers, URLs, email addresses, dates, times, street addresses, order IDs, named entities)	Aura-1, Aura-2
TTS latency (TTFA)	P50 258 ms end-to-end, P50 214 ms excluding connection establishment (published benchmark)	Not published with matched methodology
STT model	Streaming STT via WebSocket, included in all plans	Nova-3 (45+ languages), Flux (voice agents, turn detection, interruption handling)
Semantic VAD	Yes, native to the STT, enables smart turn-taking	Flux uses semantic and acoustic cues for end-of-turn detection; not marketed under the "semantic VAD" label
Voice cloning	Instant (10 s of audio) + Pro (fine-tuned). Gradium's Instant Voice Clone achieved the highest Elo score in a blinded human evaluation benchmark across English, French, Spanish, and German	Not available
Audio Intelligence	Not available (Gradium focuses on transcription and semantic VAD)	Yes: summarization, sentiment analysis, intent recognition, topic detection
Unified Voice Agent API	No (composable TTS + STT APIs, compatible with any orchestration)	Yes (bundled STT + TTS + LLM orchestration)
Mid-sentence code-switching	Yes, no latency penalty	Not documented
Languages	English, French, Spanish, German, Portuguese, with regular updates	STT: 45+ languages (Nova-3). TTS: English (Aura)
Agent framework integrations	Platform-neutral: built to plug into any voice agent stack (LiveKit, Pipecat, and others) without preference	LiveKit, Pipecat, and the Deepgram Voice Agent API
Word-level timestamps	Yes, high-precision (TTS and STT)	Yes (STT); not documented for Aura TTS
SDKs	Python, Rust (official)	Multiple official SDKs (see Deepgram docs)
Deployment options	Cloud marketplace, private cloud, on-premise, on-device, from the same model and API	Cloud SaaS; self-hosted enterprise deployment
Enterprise data control	Zero data retention, SLA commitments (enterprise plans)	SOC 2 Type II, HIPAA, GDPR, CCPA, PCI DSS certifications
Free plan	$0/month, 45,000 credits	$200 free credit, then pay-as-you-go
Founders	Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, co-founders of Kyutai	Scott Stephenson, Noah Shutty, and Adam Sypniewski

Who Is Deepgram?

Deepgram is a voice AI company best known for its Speech-To-Text models and transcription infrastructure. Nova-3 is its flagship STT model, widely used for real-time and batch transcription across 45+ languages. Flux is positioned for conversational voice agents, with documented turn detection and interruption handling. Deepgram has expanded its surface to include Aura (Aura-1, Aura-2) for Text-To-Speech, a unified Voice Agent API that bundles STT, TTS, and LLM orchestration, and Audio Intelligence features (summarization, sentiment analysis, intent recognition, topic detection). Deepgram supports cloud SaaS and self-hosted enterprise deployment, with SOC 2 Type II, HIPAA, GDPR, CCPA, and PCI DSS certifications.

Who Is Gradium?

Gradium is a real-time voice AI platform for developers and companies deploying voice agents, built by researchers and co-founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez. Its product surface is:

A streaming Text-To-Speech API over WebSocket
A streaming Speech-To-Text API with semantic voice activity detection
A voice cloning API, available as Instant (zero-shot, 10 seconds of audio) and Pro (fine-tuned model)

The TTS and STT APIs share a streaming-first WebSocket architecture, suitable for bidirectional, low-latency communication in production. Voice cloning is exposed via REST.

The founding team previously co-founded Kyutai, a research lab with peer-reviewed work on audio language models. Kyutai released world-first open systems including Moshi (real-time Speech-To-Speech) and Hibiki (live Speech-To-Speech translation).

What Should You Look for in a Deepgram Alternative?

TTS

Streaming-first TTS, designed alongside STT. Deepgram's Aura TTS was added to a platform whose primary lineage is transcription. Gradium's Text-To-Speech was built alongside the STT from the ground up.
Pronunciation robustness on complex inputs. Voice agents have to pronounce phone numbers, URLs, email addresses, dates, times, street addresses, order IDs, confirmation codes, and named entities correctly on the first attempt. Gradium's TTS is tuned for these cases. You can further fine-tune pronunciation using pronunciation dictionaries and text normalization rules.
TTFA latency at or below 300 ms, delivered over streaming. A TTFA under 300 ms end-to-end, combined with a true streaming API, is the practical threshold for natural-feeling turn-taking. Gradium's published benchmark measures P50 258 ms end-to-end (214 ms excluding connection establishment). For high-throughput pipelines, Gradium also supports multiplexing multiple TTS requests over a single WebSocket connection.

STT

Semantic voice activity detection. Semantic VAD determines when a speaker has finished a complete thought, not just gone silent. It is native to Gradium's STT. Deepgram's Flux is designed for conversational voice agents and uses semantic and acoustic cues for end-of-turn detection, though it is not marketed under the "semantic VAD" label.
Streaming STT. Both vendors offer real-time streaming STT over their respective APIs.

Voice Cloning

Availability of voice cloning. Deepgram does not offer a voice cloning product. For applications that require branded voices, AI companions, or a consistent voice identity across sessions, this is a structural limitation. Gradium offers Instant and Pro clones.

Platform

Composable voice models vs a bundled Voice Agent API. Deepgram's Voice Agent API bundles STT, TTS, and LLM orchestration into a single surface. Gradium builds only voice models and APIs, and integrates into the orchestration layer you choose (LiveKit, Pipecat, and others). Teams that want to compose their own stack, or that already run an orchestration layer they do not want to migrate, tend to prefer the platform-neutral option.
Audio Intelligence. If summarization, sentiment analysis, intent recognition, or topic detection are core requirements alongside transcription, Deepgram is the stronger fit. Gradium's STT is focused on transcription and semantic VAD.
Enterprise and deployment readiness. Private cloud, on-premise, on-device options, zero data retention, SLA commitments, certifications, and concurrency guarantees matter at production scale. Both vendors offer enterprise deployment; the specific certifications and surfaces differ.
Pricing. Gradium publishes credit-based plans from Free ($0/month, 45,000 credits) to Tailored. Deepgram starts with $200 in free credits, then pay-as-you-go.

Switching from Deepgram. Gradium's WebSocket TTS and STT follow the same streaming pattern: open a connection, send data, receive results. If you are already using Deepgram's streaming APIs, switching to Gradium requires updating the endpoint, model selection, and authentication. The streaming flow itself does not change, and the json_config parameter gives you additional control over pronunciation, speed, and expressiveness that you can tune after migration.
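The migration described above can be pictured as a provider-agnostic streaming loop in which only three settings change. The sketch below is purely illustrative: `Target`, `Connection`, `run_stream`, and every URL, model name, and key are hypothetical placeholders, not either vendor's actual API. It also sends everything before reading results for brevity, whereas a real streaming client would interleave the two.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Protocol


@dataclass(frozen=True)
class Target:
    endpoint: str  # placeholder WebSocket URL, not a real endpoint
    model: str     # placeholder model name
    api_key: str   # placeholder credential


class Connection(Protocol):
    """Hypothetical minimal interface for any streaming connection."""

    def send(self, chunk: bytes) -> None: ...
    def results(self) -> Iterable[str]: ...


def run_stream(connect: Callable[[Target], Connection],
               target: Target,
               chunks: Iterable[bytes]) -> List[str]:
    conn = connect(target)       # open a connection
    for chunk in chunks:         # send data
        conn.send(chunk)
    return list(conn.results())  # receive results


# Only the three fields named in the text differ between the two targets.
old = Target("wss://old-provider.example/stream", "old-model", "OLD_KEY")
new = replace(old,
              endpoint="wss://new-provider.example/stream",
              model="new-model",
              api_key="NEW_KEY")
```

Keeping the provider-specific values in one object like this is one way to make the switch a configuration change rather than a rewrite of the streaming loop.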

What Are the Key Differences Between Gradium and Deepgram?

Voice Cloning: Available on Gradium, Not on Deepgram

Voice cloning is the clearest functional gap between the two. Deepgram's public product surface does not include voice cloning. For teams shipping branded voices, personalized AI companions, or consistent voice identity across sessions, this is a structural blocker on Deepgram.

Gradium offers two cloning tiers:

Instant Voice Clone. Generates a custom voice from as little as 10 seconds of audio, immediately usable for TTS generation via the API.
Pro Voice Clone. A fine-tuned model trained on more audio data, designed to be indistinguishable from the original speaker. Gradium positions Pro Voice Clones as the highest speaker-similarity option on the market.

Gradium's voice cloning also preserves the accent and speaking style of the reference sample within each supported language, rather than defaulting to a single standard pronunciation. Coverage spans major regional accents in English, French, Spanish, German, and Portuguese, and speaking styles including conversational, narrative, broadcast, customer-service, expressive, and whispered delivery. Read more about Instant vs Pro Voice Cloning in Gradium.

Gradium has published a voice cloning benchmark of Instant Voice Clone quality: 890 sentences per language spanning three complexity levels (from simple conversational questions to sentences with rare named entities, URLs, email addresses, and alphanumeric codes), 20 unique voices per language, 10 seconds of source audio per clone, blinded A/B listening tests, 3,220 voice pairs evaluated, and a live Elo ranking. Gradium's Instant Voice Clone achieved the highest Elo score in every language evaluated (English, French, Spanish, German). Deepgram is not in the benchmark because it does not offer a voice cloning product.
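For readers unfamiliar with Elo rankings built from A/B listening votes, the sketch below shows the standard Elo update applied to a list of (winner, loser) pairs. It is a generic illustration, not Gradium's published procedure: the starting rating of 1000, the K-factor of 32, and the system names are all assumptions.

```python
from collections import defaultdict

K = 32          # assumed update step size
START = 1000.0  # assumed initial rating for every system


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(ratings, winner: str, loser: str, k: float = K) -> None:
    """Shift rating points from loser to winner after one A/B vote."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def rank(votes):
    """Return (system, rating) pairs sorted from highest to lowest rating."""
    ratings = defaultdict(lambda: START)
    for winner, loser in votes:
        update(ratings, winner, loser)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical votes: each tuple is (preferred clone, other clone).
    votes = [("A", "B"), ("A", "C"), ("B", "C")]
    for name, rating in rank(votes):
        print(name, round(rating, 1))
```

Because each vote moves the same number of points from loser to winner, the total rating pool stays constant, and a system that wins against strong opponents gains more than one that wins against weak ones.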

How Does Gradium's STT with Semantic VAD Compare to Deepgram Flux?

In a real-time voice pipeline, turn-taking quality depends on detecting when the user has finished a complete thought, not just when they have gone silent. Semantic VAD is the mechanism that makes this possible. Without it, voice agents fall back on silence thresholds, which produce premature cut-offs or unnatural pauses.

Gradium's STT ships semantic VAD natively. Deepgram's Flux is designed for conversational voice agents and uses semantic and acoustic cues for end-of-turn detection, though it is not marketed under the "semantic VAD" label.

How Does Gradium's TTS Compare to Deepgram Aura?

Deepgram's primary lineage is transcription. Aura TTS was added on top of an STT-first platform. Gradium's TTS and STT were designed together from the start, sharing a single streaming WebSocket architecture. This matters in a voice agent pipeline where turn-taking, partial transcripts, and incremental audio playback all depend on a coherent streaming model.

Gradium publishes a TTFA benchmark with matched methodology across providers, measured from Paris on a 15-25 word sentence over WebSocket, 100 queries, warm state: P50 258 ms end-to-end and P50 214 ms excluding connection establishment. Both figures fall under the 300 ms threshold for natural turn-taking, with headroom.
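To make the P50 figures concrete, the sketch below shows one way a TTFA measurement of this shape could be structured: time each request until the first audio chunk arrives, discard a few warm-up runs, and report the median. It is a generic harness, not Gradium's benchmark code. `open_stream` is a hypothetical callable standing in for whatever client starts a TTS request and yields audio chunks, and the warm-up count is an assumption.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio(open_stream: Callable[[], Iterator[bytes]]) -> float:
    """Milliseconds from issuing the request to receiving the first chunk."""
    start = time.perf_counter()
    stream = open_stream()
    next(stream)  # blocks until the first audio chunk is available
    return (time.perf_counter() - start) * 1000.0


def p50(samples_ms: Iterable[float]) -> float:
    """P50 is the median of the latency samples."""
    return statistics.median(samples_ms)


def benchmark(open_stream: Callable[[], Iterator[bytes]],
              queries: int = 100, warmup: int = 3) -> float:
    """Run warm-up requests, then return the P50 TTFA over `queries` runs."""
    for _ in range(warmup):
        time_to_first_audio(open_stream)
    return p50(time_to_first_audio(open_stream) for _ in range(queries))
```

Whether the result corresponds to the end-to-end figure or the figure excluding connection establishment depends on whether `open_stream` opens a new connection on each call or reuses one that is already open.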

How Do the Platform Approaches Differ?

Deepgram's Voice Agent API bundles STT, TTS, and LLM orchestration into a single surface. Some teams prefer the convenience. Others prefer to compose their own voice agent stack, or already run an orchestration layer they do not want to migrate away from.

Gradium builds voice models and APIs; it does not build its own voice agent platform. The focus is to enable every voice platform equally well, with first-class integrations into LiveKit, Pipecat, and the other orchestration layers teams are already using in production. Deepgram also integrates with external frameworks, but its Voice Agent API positions the bundled surface as the primary agent experience.

How Do Gradium and Deepgram Compare on Language Support?

Deepgram's Nova-3 supports 45+ languages for STT, one of the broadest coverages in the market. For transcription workflows that need to transcribe audio in many languages at scale, Deepgram is the stronger fit.

Gradium supports five languages with native fluency across TTS, STT, and voice cloning: English, French, Spanish, German, and Portuguese. Gradium adds mid-sentence code-switching across all five: a speaker can shift language within a single sentence with no latency penalty and no quality degradation. For voice agents in these five languages, Gradium goes deeper on accent preservation, mid-sentence code-switching, and voice-agent-tuned pronunciation.

Audio Intelligence: Deepgram Adds It, Gradium Does Not

Deepgram offers Audio Intelligence on top of transcription: summarization, sentiment analysis, intent recognition, and topic detection. Gradium's STT is focused on transcription and semantic VAD; these downstream analysis features are not part of the Gradium API surface. If audio analysis is a core requirement alongside transcription, Deepgram is the stronger fit.

Gradium TTS: Streaming Text-To-Speech for Real-Time Applications

Gradium's TTS is built for streaming delivery. It connects via WebSocket and supports bidirectional communication, which is the architecture required for voice agents that must speak while staying ready for the next user input.

Capabilities:

Real-time streaming via WebSocket. Audio is delivered incrementally as it is generated, not after the full sentence is complete.
Expressive speech with robust pronunciation. Designed to handle phone numbers, URLs, email addresses, dates, times, and named entities, the inputs that break most TTS models in agent pipelines.
Published TTFA benchmark. P50 258 ms end-to-end, P50 214 ms excluding connection establishment, measured with matched methodology against ElevenLabs Turbo v2.5, Flash v2.5, Multilingual v2, Mistral Voxtral TTS, and OpenAI GPT-4o Mini.
High-precision word-level timestamps. Useful for subtitling, lipsync, and interactive transcript display.
Multiple output formats for different integration surfaces.
Advanced configuration via json_config for controlling speed, expressiveness, voice similarity, and text normalization.

The TTS API is available via Python SDK, Rust SDK, and direct WebSocket integration, and is compatible with LiveKit and Pipecat.
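As an illustration of the subtitling use case for word-level timestamps mentioned above, the sketch below groups timed words into SubRip (SRT) cues. The input shape, a list of `(word, start_seconds, end_seconds)` tuples, is an assumption made for this example and not the documented response format; the seven-word cue length is likewise arbitrary.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_s, end_s): assumed shape


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group words into cues of at most `max_words` and render them as SRT."""
    cues = []
    for n, i in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]  # first word start, last word end
        text = " ".join(w for w, _, _ in chunk)
        cues.append(f"{n}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    print(to_srt([("Hello", 0.0, 0.4), ("world", 0.5, 0.9)]))
```

The same per-word start and end times can drive highlighted transcripts or lipsync, since each cue boundary comes directly from the timestamps rather than from estimated speaking rates.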

Gradium STT: Speech-To-Text with Semantic VAD

Gradium's STT combines streaming transcription with semantic VAD, a mechanism that determines when a speaker has finished a thought, not just stopped making sound.

In conversational AI, a standard VAD ends the turn after a fixed silence threshold, so the system either interrupts mid-sentence or waits too long. Semantic VAD uses the meaning of the utterance, not only the pause, to trigger turn-taking at the right moment.
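The trade-off of the silence-threshold baseline can be seen in a few lines of code. The sketch below implements only that baseline, not semantic VAD and not Gradium's implementation. The 20 ms frame size, the energy floor, and the thresholds are illustrative assumptions.

```python
from typing import Optional, Sequence


def end_of_turn_frame(energies: Sequence[float],
                      frame_ms: int = 20,
                      energy_floor: float = 0.1,
                      silence_ms: int = 500) -> Optional[int]:
    """Index of the frame where a silence-threshold VAD ends the turn.

    The turn ends once `silence_ms` of consecutive quiet frames follow
    speech. Returns None if that never happens.
    """
    needed = silence_ms // frame_ms
    quiet = 0
    heard_speech = False
    for i, energy in enumerate(energies):
        if energy >= energy_floor:
            heard_speech = True
            quiet = 0
        elif heard_speech:
            quiet += 1
            if quiet >= needed:
                return i
    return None


# 1 s of speech, a 600 ms mid-sentence pause, 1 s of speech, then silence.
utterance = [1.0] * 50 + [0.0] * 30 + [1.0] * 50 + [0.0] * 60

if __name__ == "__main__":
    print(end_of_turn_frame(utterance, silence_ms=500))   # fires inside the pause
    print(end_of_turn_frame(utterance, silence_ms=1000))  # waits a full second
```

With a 500 ms threshold the detector fires at frame 74, inside the mid-sentence pause (frames 50 to 79). With a 1000 ms threshold it survives the pause but does not fire until frame 179, a full second after the speaker finished at frame 129. No single threshold avoids both failure modes, which is the gap semantic end-of-turn detection is meant to close.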

Capabilities:

Best-in-class accuracy with controllable latency
Robust performance in noisy environments, designed for real-world deployment
Semantic VAD for smart turn-taking
Streaming via WebSocket

Voice Cloning: Instant and Pro

Voice cloning is the clearest differentiator between Gradium and Deepgram. Deepgram does not offer a voice cloning product. Gradium offers two tiers.

Instant Voice Clone. Create a custom voice from as little as 10 seconds of audio. The clone is immediately available for TTS generation via the API. All paid plans include up to 1,000 Instant Voice Clones per month. In a blinded human evaluation benchmark, Gradium's Instant Voice Clone achieved the highest Elo score across English, French, Spanish, and German.

Pro Voice Clone. A fine-tuned model trained on more audio, designed to be indistinguishable from the original speaker. Gradium positions Pro Voice Clones as the highest speaker-similarity option on the market. Pro clones are available from the M plan ($340/month, 5 included) and L plan ($1,615/month, 20 included).

Both clone tiers preserve the accent and speaking style of the reference sample within each of Gradium's five supported languages.

Both clone types are accessible via the REST API, the Python SDK, or Gradium Studio. Explicit consent is required before cloning any voice.

Languages: Native Fluency Across Five Languages

Gradium supports five languages with native fluency across TTS, STT, and voice cloning: English, French, Spanish, German, and Portuguese.

Mid-sentence code-switching is supported across all five, with no latency penalty and no quality degradation. This is relevant for multilingual AI companions, international customer support agents, and language learning applications.

For deployments that require STT coverage across a broader set of languages, Deepgram's Nova-3 supports 45+ languages and is the stronger fit, particularly for transcription at scale.

Deployment Options

Gradium supports four deployment surfaces from the same model and API:

Cloud marketplace. Fastest path to production with standard SaaS consumption.
Private cloud. Tenant-isolated deployment inside the customer's cloud account.
On-premise. For teams with strict data-residency, regulatory, or air-gapped requirements (healthcare, financial services, defense).
On-device. For latency-critical or offline use cases where the model runs locally on the end-user device.

Enterprise plans also include zero data retention and SLA commitments. Deepgram offers cloud SaaS and self-hosted enterprise deployment, with SOC 2 Type II, HIPAA, GDPR, CCPA, and PCI DSS certifications. Gradium's differentiator is the breadth of explicitly supported surfaces, in particular on-device, delivered from a single model and API.

Gradium Pricing

Gradium offers a free tier and credit-based paid plans starting at $13/month. For full pricing details, see the Gradium pricing page.

Gradium also runs a Startup Program: seed-funded startups can apply for $2,000+ in free credits, 6 months of full API access, direct engineering support, and early model access.

Who Should Choose Gradium Over Deepgram?

Choose Gradium if you are:

Voice-cloning-driven.

Shipping an application that needs voice cloning (branded voices, AI companions, personalized voice agents), since Deepgram does not offer voice cloning
Cloning voices that need to preserve a specific accent or speaking style within a language

STT-driven.

Building conversational AI agents where natural turn-taking is the key quality driver, and native semantic VAD is a hard requirement

TTS-driven.

Shipping a voice agent that needs robust pronunciation on complex inputs (phone numbers, URLs, email addresses, dates, times, street addresses, order IDs, named entities) across English, French, Spanish, German, and Portuguese
Evaluating TTS with a published, reproducible TTFA latency benchmark

Platform-driven.

Working in Python or Rust and wanting officially supported SDKs
Integrating with LiveKit or Pipecat
Composing your own voice agent stack rather than adopting a bundled Voice Agent API
Deploying across multiple surfaces (cloud marketplace, private cloud, on-premise, on-device) from a single model and API
Requiring enterprise-grade data control, with zero data retention and SLA commitments on enterprise plans
A seed-funded startup looking for production-grade voice AI with onboarding credits of $2,000+ and 6 months of full API access

Also comparing ElevenLabs or Cartesia? See our dedicated comparison pages.

Deepgram remains the stronger choice if:

Your primary requirement is STT accuracy across 45+ languages, particularly for transcription at scale
You need Audio Intelligence features (summarization, sentiment analysis, intent recognition, topic detection)
You want an all-in-one Voice Agent API with built-in LLM orchestration
You need the specific certifications Deepgram holds (SOC 2 Type II, HIPAA, GDPR, CCPA, PCI DSS)