Azure TTS Alternative: Gradium for Real-Time Voice AI

Gradium is a real-time voice AI platform built by the co-founders of Kyutai. It offers streaming Text-To-Speech and streaming Speech-To-Text with semantic voice activity detection (VAD) over WebSocket, plus a REST voice cloning API that produces an instant clone from 10 seconds of audio. It is built specifically for voice agents rather than as part of a broader cloud speech platform, which is the main reason teams evaluate it against Microsoft Azure Text-To-Speech.

Who is this for. Developers and technical teams running on Azure AI Speech (formerly Azure Cognitive Services Speech) who are evaluating a provider whose stack is purpose-built for real-time voice agents: streaming TTS, streaming STT with semantic VAD for turn-taking, instant voice cloning measured in seconds rather than weeks, and a provider-agnostic WebSocket API with published time-to-first-audio (TTFA) benchmarks against the broader market.

How Do Gradium and Azure TTS Compare at a Glance?

| Dimension | Gradium | Azure TTS |
|---|---|---|
| Primary use case | Real-time voice agents and developer voice APIs | Cloud speech service inside Azure AI Speech, broad TTS coverage from accessibility to IVR |
| TTS model | TTS (streaming), designed for voice agents with robust handling of complex inputs (phone numbers, URLs, email addresses, dates, times, street addresses, order IDs, named entities) | Azure Neural TTS (400+ neural voices), streaming via the Azure Speech SDK |
| TTS latency (TTFA) | P50 258 ms, P95 274 ms end-to-end; P50 214 ms, P95 228 ms excluding connection establishment (published benchmark) | Streaming supported via the Speech SDK; no equivalent published P50/P95 TTFA |
| Word-level timestamps (TTS) | Yes, high-precision | Yes (SSML word boundary events) |
| STT model | STT (streaming) | Azure Speech-To-Text (part of Azure AI Speech) |
| Semantic VAD | Yes, included in the STT | Not documented as a core feature |
| Voice cloning | Instant Voice Cloning + Professional Voice Cloning, instant clone from 10 seconds of audio, immediately usable via the API | Custom Neural Voice: formal data collection, training pipeline, Microsoft approval, typically days to weeks |
| Voice library | Curated library of voices suited for voice agents | 400+ neural voices |
| Languages | English, French, Spanish, German, Portuguese, with regular updates | 140+ languages |
| Streaming API | WebSocket-native, REST for voice cloning | Streaming via Azure Speech SDK |
| SDKs | Python, Rust (official) | Azure Speech SDK (multiple languages) |
| Deployment options | Cloud marketplace, private cloud, on-premise, on-device, from the same model and API | Azure cloud, Azure Government, containerized deployment for select scenarios |
| Enterprise data control | Zero data retention, SLA commitments (enterprise plans) | Azure-native data handling, HIPAA, GDPR, SOC certifications |
| Free plan | $0/month, 45,000 credits | Free F0 tier on Azure, character-limited |
| Founders | Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, co-founders of Kyutai | Microsoft (Azure AI division) |

Who Is Azure TTS?

Azure Text-To-Speech is the speech synthesis component of Microsoft Azure AI Speech (formerly Azure Cognitive Services Speech). It provides 140+ languages, 400+ neural voices, full SSML support, and deep integration with the broader Azure ecosystem including Azure AD, Azure Functions, Azure Bot Service, Power Platform, and Microsoft Teams. Custom Neural Voice allows enterprises to train a branded voice model through a formal data collection and training pipeline, with Microsoft approval required before production use. Azure TTS is widely deployed in accessibility tooling, content narration, IVR systems, and enterprise voice applications where breadth of language coverage and Azure-native integration are the decisive factors.

Who Is Gradium?

Gradium is a real-time voice AI platform for developers and companies deploying voice agents, built by researchers and co-founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez. Its product surface is:

  • A streaming Text-To-Speech API over WebSocket
  • A streaming Speech-To-Text API with semantic voice activity detection
  • A voice cloning API, available as Instant (zero-shot, 10 seconds of audio) and Pro (fine-tuned model)

The TTS and STT APIs share a streaming-first WebSocket architecture, suitable for bidirectional, low-latency communication in production. Voice cloning is exposed via REST. The founding team previously co-founded Kyutai, a research lab with peer-reviewed work on audio language models. Kyutai released world-first open systems including Moshi and Hibiki. Gradium translates that research into production infrastructure.

What Should You Look for in an Azure TTS Alternative?

TTS

  • Pronunciation robustness on complex inputs. Voice agents have to say phone numbers, URLs, email addresses, dates, times, street addresses, order IDs, and named entities correctly on the first attempt. Gradium's TTS is specifically tailored for these cases. You can further fine-tune pronunciation using pronunciation dictionaries and text normalization rules.
  • TTFA at or below 300 ms, delivered over streaming. Gradium's published benchmark measures P50 258 ms end-to-end and P50 214 ms excluding connection establishment.

STT

  • Semantic voice activity detection. Knowing when a user has stopped speaking is not the same as knowing when they have finished a thought. Semantic VAD is the layer that enables natural turn-taking. It is native to Gradium's STT. Azure Speech-To-Text exposes silence-based endpointing with configurable timeouts, not semantic-level turn-taking.
  • Streaming STT over WebSocket. Gradium's STT streams over WebSocket by default, with a single integration path that mirrors the TTS API.

Voice Cloning

  • Instant cloning from 10 seconds of audio, available immediately. Gradium's Instant Voice Cloning produces a usable voice from a 10-second sample with no training pipeline and no approval step. Pro Voice Cloning offers higher-fidelity, fine-tuned models for branded deployments. Azure Custom Neural Voice is purpose-built for large-scale branded deployments but operates on a different timescale (days to weeks) and requires Microsoft approval.
  • Accent and style preservation. Gradium's cloning preserves the accent and speaking style of the reference speaker within each supported language, rather than defaulting to a single standard pronunciation. See Instant vs Pro Voice Cloning for the full comparison.

Platform

  • Integrated TTS, STT, and VAD from one provider. A production voice pipeline benefits from streaming TTS, streaming STT, and semantic VAD in a single integrated stack with consistent latency characteristics.
  • Provider-agnostic. Gradium runs on any cloud and integrates natively with orchestration layers like LiveKit and Pipecat. There is no dependency on Azure-specific services, Azure AD, or Azure billing.
  • Transparent pricing. Subscription tiers published on gradium.ai/pricing, with no separate line items for voice cloning, no regional pricing variations, and no SDK licensing fees.
  • Deployment range from cloud to on-device. Cloud marketplace, private cloud, on-premise, on-device: all from the same model and API.

Switching from Azure TTS. Gradium's WebSocket TTS accepts the same streaming pattern voice agents already expect: open a connection, send text, receive audio chunks. If you are currently using the Azure Speech SDK, switching to Gradium means moving from an SDK-mediated streaming integration to a direct WebSocket connection, with no Azure credentials, no Azure AD, and no Azure-specific libraries. The json_config parameter gives you additional control over pronunciation, speed, and expressiveness that you can tune after migration. For high-throughput pipelines, Gradium also supports multiplexing multiple TTS requests over a single WebSocket connection.
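The "open a connection, send text, receive audio chunks" pattern can be sketched independently of any client library. The example below is a minimal illustration, not Gradium's documented protocol: the `Connection` interface, the JSON field names (`type`, `voice_id`, `text`), and the `end` and `error` events are all hypothetical placeholders, so check the API reference for the real message shapes. `FakeConnection` exists only so the loop can be exercised without a network.

```python
import json
from typing import Iterator, List, Protocol, Union


class Connection(Protocol):
    """Minimal WebSocket-like interface; any client library can be adapted to it."""

    def send(self, message: str) -> None: ...

    def recv(self) -> Union[str, bytes]: ...


def stream_tts(conn: Connection, text: str, voice_id: str) -> Iterator[bytes]:
    """Send one text request, then yield audio chunks until an end event arrives."""
    # Hypothetical request shape, used here for illustration only.
    conn.send(json.dumps({"type": "text", "voice_id": voice_id, "text": text}))
    while True:
        message = conn.recv()
        if isinstance(message, bytes):
            yield message  # binary frame: treated as a raw audio chunk
            continue
        event = json.loads(message)
        if event.get("type") == "end":  # hypothetical end-of-stream event
            return
        if event.get("type") == "error":  # hypothetical error event
            raise RuntimeError(event.get("message", "TTS error"))


class FakeConnection:
    """In-memory stand-in that replays canned frames instead of using a socket."""

    def __init__(self, incoming: List[Union[str, bytes]]) -> None:
        self.sent: List[str] = []
        self._incoming = list(incoming)

    def send(self, message: str) -> None:
        self.sent.append(message)

    def recv(self) -> Union[str, bytes]:
        return self._incoming.pop(0)


if __name__ == "__main__":
    conn = FakeConnection([b"\x00\x01", b"\x02", json.dumps({"type": "end"})])
    for chunk in stream_tts(conn, "Your order has shipped.", "demo-voice"):
        print(len(chunk), "bytes of audio")
```

Keeping the transport behind a small interface means the playback loop stays the same whether the frames come from a real WebSocket client or from a test double.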

What Are the Key Differences Between Gradium and Azure TTS?

How Does Gradium's TTS Compare to Azure for Voice Agents?

Azure Text-To-Speech is a general-purpose neural TTS service that covers accessibility, content narration, IVR, and enterprise voice applications across 140+ languages. Streaming is supported through the Azure Speech SDK, but the service was not architected around minimizing time to first audio as a primary constraint.

Gradium's TTS is built for streaming voice agents from the first design choice. On latency, Gradium publishes a full TTFA benchmark with matched methodology across providers, measured from Paris on a 15-25 word sentence over WebSocket, 100 queries, warm state: P50 258 ms end-to-end and P50 214 ms excluding connection establishment. Azure does not publish equivalent P50/P95 TTFA data for its real-time TTS endpoints.
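For teams that want to run a comparable measurement on their own traffic, the sketch below shows one way to time TTFA and summarize it as P50 and P95. It is an assumption-laden illustration, not the methodology behind the published numbers: the nearest-rank percentile definition and the `request` callable (anything that returns an iterator of audio chunks) are choices made for this example.

```python
import math
import time
from typing import Callable, Dict, Iterator, Sequence


def time_to_first_audio_ms(request: Callable[[], Iterator[bytes]]) -> float:
    """Time from issuing the request until the first audio chunk is available."""
    start = time.perf_counter()
    next(iter(request()))  # blocks until the first chunk; raises StopIteration if none
    return (time.perf_counter() - start) * 1000.0


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize_ttfa(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Collapse a list of TTFA measurements into the two headline percentiles."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "n": len(samples_ms),
    }
```

Whether the timer starts before or after the connection is opened determines whether the result corresponds to the end-to-end figure or to the figure excluding connection establishment, so it is worth reporting both.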

On pronunciation, voice agents rarely speak in clean prose. They have to say phone numbers, dates, times, URLs, email addresses, order IDs, confirmation codes, street addresses, dollar amounts, and named entities, and they have to say them correctly on the first attempt. Gradium's TTS is tuned specifically for these cases. Phone numbers are grouped and intoned naturally instead of read as a flat digit string. URLs and email addresses are spelled out with correct handling of domains, dots, dashes, slashes, and special characters. Dates and times are pronounced in the regional convention of the target language.
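To make the problem concrete, here is the kind of hand-written pre-normalization a pipeline typically needs when the TTS does not handle these inputs itself. This is a toy sketch and not Gradium's method: the symbol table, the English-only wording, and the 3-3-4 digit grouping are assumptions chosen for the example.

```python
import re
from typing import Sequence

# Spoken forms for a few symbols; English only, chosen for this example.
_SYMBOLS = {".": "dot", "@": "at", "-": "dash", "_": "underscore", "/": "slash"}


def speakable_address(address: str) -> str:
    """Turn an email address or bare URL into words a TTS can read aloud."""
    # Characters outside letters, digits, and the listed symbols are dropped.
    parts = re.findall(r"[A-Za-z0-9]+|[.@\-_/]", address)
    return " ".join(_SYMBOLS.get(part, part) for part in parts)


def group_digits(number: str, sizes: Sequence[int] = (3, 3, 4)) -> str:
    """Split a phone number into comma-separated groups so it is not read as one flat string."""
    digits = [c for c in number if c.isdigit()]
    if len(digits) != sum(sizes):
        raise ValueError("digit count does not match the grouping pattern")
    groups = []
    pos = 0
    for size in sizes:
        groups.append(" ".join(digits[pos:pos + size]))
        pos += size
    return ", ".join(groups)


if __name__ == "__main__":
    print(speakable_address("jane.doe@example.com"))
    print(group_digits("(415) 555-0132"))
```

Rules like these multiply quickly across languages and input types, which is why handling them inside the TTS model matters for voice agents.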

How Does Gradium's STT with Semantic VAD Compare to Azure Speech-To-Text?

Azure Speech-To-Text is a robust transcription service with broad language coverage and Azure-native integration. Endpointing is silence-based by default, with configurable timeouts.

Gradium's STT ships semantic VAD natively. In a real-time voice pipeline, turn-taking quality depends on detecting when the user has finished a complete thought, not just when they have gone silent. Without semantic VAD, voice agents fall back on silence thresholds, which produce premature cut-offs or unnatural pauses. Semantic VAD uses the intent of the utterance to trigger turn-taking at the right moment.
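The trade-off with silence thresholds can be seen in a toy endpointer. The sketch below models silence-based endpointing only; it is not Azure's or Gradium's implementation, and the per-frame speech flags, the 100 ms frame size, and the timeouts are made up for illustration.

```python
from typing import Optional, Sequence


def silence_endpoint(is_speech: Sequence[bool], frame_ms: int, timeout_ms: int) -> Optional[int]:
    """Return the frame index where a silence-timeout endpointer ends the turn, or None."""
    needed = -(-timeout_ms // frame_ms)  # ceiling division: silent frames required
    silent_run = 0
    heard_speech = False
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


if __name__ == "__main__":
    # "My number is" (500 ms), a 600 ms thinking pause, the digits (400 ms), then silence.
    frames = [True] * 5 + [False] * 6 + [True] * 4 + [False] * 10
    print(silence_endpoint(frames, frame_ms=100, timeout_ms=500))  # fires during the pause
    print(silence_endpoint(frames, frame_ms=100, timeout_ms=800))  # fires 800 ms after the end
```

With a 500 ms timeout the endpointer fires inside the mid-sentence pause, and with an 800 ms timeout it waits through the pause but adds 800 ms of dead air after the user has finished. No single threshold avoids both failures, which is the gap semantic VAD is meant to close.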

Voice Cloning: Instant on Gradium vs Custom Neural Voice on Azure

Azure Custom Neural Voice is a high-fidelity branded voice solution. It requires a formal data collection process, a training pipeline, and Microsoft approval before production use. The timescale is days to weeks. This is well-suited for large-scale branded deployments where the voice is a long-term investment.

Gradium offers two tiers. Instant Voice Cloning produces a usable voice from a 10-second sample, available immediately via the API with no training step. Professional Voice Cloning is a fine-tuned model trained on more audio, designed to be indistinguishable from the original speaker. Both tiers preserve the accent and speaking style of the reference sample within each of Gradium's five supported languages.

For dynamic voice creation at the product level, instant cloning changes what is architecturally possible: voice agents that sound like a specific user, characters generated at runtime, personalized assistants.

How Do Gradium and Azure Compare on Language Support?

Azure TTS supports 140+ languages and 400+ neural voices, one of the broadest coverages in the market. If your application requires languages outside the five Gradium supports, Azure remains the stronger fit on coverage alone.

Gradium supports five languages with native fluency: English, French, Spanish, German, and Portuguese. It also supports mid-sentence code-switching across all five: a speaker can shift language within a single sentence with no latency penalty and no quality degradation.

How Do Deployment Options Compare?

Azure TTS is available across Azure cloud, Azure Government for regulated workloads, and containerized deployments for specific scenarios. The data-handling model is Azure-native, with HIPAA, GDPR, and SOC certifications.

Gradium is available across four deployment surfaces from the same model and API:

  • Cloud marketplace for standard SaaS consumption.
  • Private cloud for tenant-isolated deployments inside a customer's account.
  • On-premise for teams with strict data-residency or air-gapped requirements.
  • On-device for latency-critical or offline use cases where the model runs locally.

A pilot that ships on the cloud can move to on-premise or on-device without re-architecting the pipeline.

Pricing Transparency: Subscription Tiers vs Per-Character With Variants

Azure TTS pricing varies by voice type (standard, neural, custom), by region, and by whether you use the SDK or the REST API. Custom Neural Voice involves separate line items for training and inference. For teams modeling unit economics before scaling, this complexity adds friction.

Gradium's pricing is structured in subscription tiers published on gradium.ai/pricing. No separate line items for voice cloning, no regional surcharges, no SDK licensing fees.

Gradium TTS: Streaming Text-To-Speech for Real-Time Applications

Gradium's TTS is built for streaming delivery. It connects via WebSocket and supports bidirectional communication, which is the architecture required for voice agents that must speak while staying ready for the next user input.

Capabilities:

  • Real-time streaming via WebSocket. Audio is delivered incrementally as it is generated, not after the full sentence is complete.
  • Expressive speech with robust pronunciation. Designed to handle phone numbers, URLs, email addresses, dates, times, and named entities, the inputs that break most TTS models in agent pipelines.
  • Published TTFA benchmark. P50 258 ms end-to-end, P50 214 ms excluding connection establishment, measured on matched methodology against ElevenLabs Turbo v2.5, Flash v2.5, Multilingual v2, Mistral Voxtral TTS, and OpenAI GPT-4o Mini.
  • High-precision word-level timestamps. Useful for subtitling, lipsync, and interactive transcript display.
  • Advanced configuration via json_config for controlling speed, expressiveness, voice similarity, and text normalization.

The TTS API is available via Python SDK, Rust SDK, and direct WebSocket integration, and is compatible with LiveKit and Pipecat.

Gradium STT: Speech-To-Text with Semantic VAD

Gradium's STT combines streaming transcription with semantic VAD, a mechanism that determines when a speaker has finished a thought, not just stopped making sound. In conversational AI, a standard VAD cuts off after a silence threshold, so the system either interrupts mid-sentence or waits too long. Semantic VAD uses the intent of the utterance to trigger turn-taking at the right moment.

Capabilities:

  • Best-in-class accuracy with controllable latency
  • Robust performance in noisy environments, designed for real-world deployment
  • Semantic VAD for smart turn-taking
  • Streaming via WebSocket

Voice Cloning: Instant and Pro

Gradium offers two voice cloning tiers.

Instant Voice Clone. Create a custom voice from as little as 10 seconds of audio. The clone is immediately available for TTS generation via the API. All paid plans include up to 1,000 Instant Voice Clones per month.

Pro Voice Clone. A fine-tuned model trained on more audio, designed to be indistinguishable from the original speaker. Gradium positions Pro Voice Clones as the highest speaker-similarity option on the market. Pro clones are available from the M plan ($340/month, 5 included) and L plan ($1,615/month, 20 included).

Both clone tiers preserve the accent and speaking style of the reference sample within each of Gradium's five supported languages. Both clone types are accessible via the REST API, the Python SDK, or Gradium Studio. Read more about Instant vs Pro Voice Cloning.

Languages: Native Fluency Across Five Languages

Gradium supports five languages with native fluency: English, French, Spanish, German, and Portuguese. Mid-sentence code-switching is supported across all five, with no latency penalty and no quality degradation. This is relevant for multilingual AI companions, international customer support agents, and language learning applications.

For deployments that require coverage across a broader set of languages, Azure TTS supports 140+ languages and is the stronger fit, particularly for accessibility, content narration, and IVR systems with wide language requirements.

Deployment Options

Gradium supports four deployment surfaces from the same model and API:

  • Cloud marketplace. Fastest path to production with standard SaaS consumption.
  • Private cloud. Tenant-isolated deployment inside the customer's cloud account.
  • On-premise. For teams with strict data-residency, regulatory, or air-gapped requirements (healthcare, financial services, defense).
  • On-device. For latency-critical or offline use cases where the model runs locally on the end-user device.

Enterprise plans also include zero data retention and SLA commitments. Azure TTS offers enterprise deployment options through Azure cloud, Azure Government, and containerized deployments. Gradium's differentiator is the breadth of explicitly supported surfaces, in particular on-device, delivered from a single model and API.

Gradium Pricing

Gradium offers a free tier and credit-based paid plans starting at $13/month. For full pricing details, see the Gradium pricing page.

Gradium also runs a Startup Program: seed-funded startups can apply for $2,000+ in free credits, 6 months of full API access, direct engineering support, and early model access.

Who Should Choose Gradium Over Azure TTS?

Choose Gradium if you are:

  • Shipping a voice agent that needs robust pronunciation on complex inputs (phone numbers, URLs, email addresses, dates, times, street addresses, order IDs, named entities) across English, French, Spanish, German, and Portuguese, with a published, reproducible TTFA benchmark
  • Building conversational AI agents where natural turn-taking is the key quality driver, and native semantic VAD is a hard requirement
  • Building voice-cloning-driven products that need dynamic, near-real-time voice cloning from a short audio sample, with no training pipeline or approval step
  • Integrating with LiveKit or Pipecat and wanting a voice layer that connects natively
  • Deploying across multiple surfaces (cloud marketplace, private cloud, on-premise, on-device) from a single model and API
  • Requiring enterprise-grade data control, with zero data retention and SLA commitments on enterprise plans, without Azure-native lock-in
  • A seed-funded startup looking for production-grade voice AI with onboarding credits of $2,000+ and 6 months of full API access

Also comparing ElevenLabs, Cartesia, or Deepgram? See our dedicated comparison pages.

Azure TTS remains the stronger choice if:

  • Your product requires coverage across 140+ languages
  • Your infrastructure is entirely Azure-native and you benefit from unified billing, Azure AD, and Azure-native data residency
  • You need Custom Neural Voice for a large-scale branded deployment with a long-term training investment
  • You require deep integration with the Azure ecosystem (Azure Bot Service, Microsoft Teams, Power Platform)

Frequently Asked Questions