What is on-device text-to-speech?

On-device text-to-speech generates spoken audio from text entirely on the end user's device, without sending data to an external server. The TTS model runs locally on CPU and does not require a network connection. This differs from cloud TTS, where text is sent to a remote server, audio is generated on GPU infrastructure, and the result is returned to the application.

When should I choose on-device TTS over a cloud API?

Three deployment contexts favor on-device TTS. First, high-volume consumer applications where per-request cloud pricing does not scale economically, and a fixed license model per device reduces cost structurally. Second, offline and low-connectivity environments where voice must function without network access. Third, data privacy and compliance contexts where text input cannot leave the user's device regardless of server-side agreements.

How accurate is on-device TTS compared to cloud TTS?

The benchmark data reported here compares Phonon with other on-device models, not with cloud systems. On the Seed-TTS English benchmark (1,008 utterances), Gradium Phonon achieves 1.48 percent WER and 56.37 percent speaker similarity, ahead of larger on-device models including Kani-TTS2 (450M parameters, 4.97 percent WER) and NeuTTS Air (552M parameters, 2.18 percent WER). Because on-device models are finetuned for a specific scope, quality comparisons are most meaningful within the designed use case.

What hardware does on-device TTS require?

Gradium Phonon has approximately 100M parameters and runs inference on a single CPU core. It fits in mobile device memory, runs at 6x real-time on a single MacBook CPU core, and is compatible with Android, iOS, and browser environments. No GPU is required for inference. This is a practical differentiator from larger on-device models at 450M parameters and above, which may exceed mobile memory and compute budgets.

What is the cost model for on-device TTS?

On-device TTS uses a license model: a fixed cost per device and model type, with unlimited generations. This contrasts with cloud TTS, which charges per character or per request. The license model becomes cost-advantaged when per-user generation volume is high relative to the cloud variable cost.

Does on-device TTS support voice cloning?

Gradium Phonon supports voice cloning from a 10-second reference audio sample, embedded in the finetuned model. On the Seed-TTS benchmark, Phonon achieves 56.37 percent speaker similarity, the highest of all compared on-device models. Unlike competing models such as NeuTTS Air and NeuTTS Nano, Phonon does not require a transcription of the reference audio for cloning.

Is on-device TTS available for production use today?

Gradium Phonon is currently in private beta. Partners apply to define the scope (language, voice, target devices), and receive a finetuned model artifact in days to weeks. The model ships as a self-contained binary inside the partner's application with no external runtime dependencies. The cloud Gradium API is available for production use at scale today.

Which on-device TTS model has the lowest WER in 2026?

Among the four on-device models in Gradium's evaluation on the Seed-TTS English benchmark (1,008 utterances), Gradium Phonon achieves the lowest Word Error Rate at 1.48 percent, ahead of NeuTTS Nano at 1.71 percent, NeuTTS Air at 2.18 percent, and Kani-TTS2 at 4.97 percent. Phonon is also the smallest model in the comparison at approximately 100M parameters.

Can on-device TTS replace cloud TTS for voice agents?

No. On-device TTS and cloud TTS solve different problems. On-device is the right choice for high-volume consumer applications, offline environments, and privacy-constrained deployments where the model needs to be scoped to a specific voice, language, and use case. Cloud TTS remains the correct architecture for real-time voice agents that require broad voice flexibility, multilingual coverage with mid-sentence code-switching, and continuous model updates.

Does on-device TTS work on iOS and Android?

Gradium Phonon runs on CPU across Android, iOS, and browser environments. It is delivered as a self-contained binary inside the partner's application, with no external endpoint calls at runtime and no GPU requirement.

How is on-device TTS priced?

Gradium Phonon uses a license model with a fixed cost per device and model type, and unlimited generations. Pricing is set per deployment scope (voice, language, target devices) and is discussed with partners during the private beta. The cloud Gradium API uses subscription tiers starting at zero dollars per month with 45,000 free credits.

Gradium was co-founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, who previously co-founded Kyutai. Kyutai released world-first open systems including Moshi and Hibiki.

Where can I get started with Gradium Phonon?

Gradium Phonon is in private beta. Partners can apply through gradium.ai by describing the target language, voice, and device list. The cloud Gradium API is available for immediate signup with a free tier at gradium.ai.

On-Device Text-to-Speech in 2026: When Edge TTS Is the Right Architecture

Most text-to-speech deployments in 2026 run on cloud APIs: a request goes out over the network, a GPU generates the audio, and the audio comes back. For the majority of production voice use cases, this is the right architecture: highest available synthesis quality, unlimited voice flexibility, continuous model improvements, and a simple integration.

For some use cases, it is not the right architecture. This guide explains when on-device text-to-speech is the correct choice, what the tradeoffs are relative to cloud, and what the current state of on-device TTS quality looks like, with Gradium Phonon as the reference implementation and published benchmark data.

What Is the Core Tradeoff Between Cloud TTS and On-Device TTS?

The distinction is not about quality in absolute terms. It is about what each architecture optimizes for.

Cloud TTS (the Gradium API, ElevenLabs, Deepgram Aura-2) runs on GPU infrastructure. The model handles many languages, voices, and content types, and updates continuously. Quality is high across a wide range of inputs. The cost model is variable: each generation is a billable request. Network access is required.

On-device TTS runs locally on the end user's hardware, typically on CPU. The model is compact, optimized for a specific voice and use case, and ships as a binary inside the application. No network request, no per-generation cost, no data leaving the device. The tradeoff is scope: the model does one thing well rather than everything adequately.

When Is On-Device TTS the Right Architecture?

Three deployment contexts favor on-device over cloud.

High-Volume Consumer Applications

Cloud TTS pricing scales with usage. For enterprise voice agents, this is expected: every call is a revenue-generating interaction and the per-request cost is justified.

Consumer applications with freemium tiers operate differently. A large base of free or low-paying users interacting with voice features generates significant TTS volume, much of which produces little or no direct revenue. At scale, variable cloud pricing becomes a structural constraint.

On-device TTS uses a license model: a fixed cost per device and model type, with unlimited generations. The break-even point between cloud variable cost and on-device license cost depends on three quantities: the license cost per device, the cloud provider's per-character price, and per-user generation volume. For high-engagement consumer apps, on-device becomes cost-advantaged beyond a usage threshold that can be reached before the product is profitable.
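The break-even reasoning can be made concrete with a small calculation. The sketch below is illustrative only: the license fee, the per-character cloud price, and the usage figures are hypothetical placeholders, not Gradium or third-party pricing.

```python
def breakeven_chars_per_device(license_fee: float, cloud_price_per_char: float) -> float:
    """Characters one device must synthesize before a fixed license
    costs less than paying the cloud per character."""
    if cloud_price_per_char <= 0:
        raise ValueError("cloud_price_per_char must be positive")
    return license_fee / cloud_price_per_char


def cheaper_option(chars_per_device: float, license_fee: float,
                   cloud_price_per_char: float) -> str:
    """Return the cheaper cost model for one device over the license period."""
    cloud_cost = chars_per_device * cloud_price_per_char
    return "on-device" if cloud_cost > license_fee else "cloud"


# Hypothetical numbers, for illustration only.
LICENSE_FEE = 2.00             # assumed fixed cost per device (USD)
CLOUD_PRICE_PER_CHAR = 0.0001  # assumed cloud price per character (USD)

threshold = breakeven_chars_per_device(LICENSE_FEE, CLOUD_PRICE_PER_CHAR)
print(f"Break-even: about {threshold:,.0f} characters per device")
print(cheaper_option(50_000, LICENSE_FEE, CLOUD_PRICE_PER_CHAR))  # heavy user
print(cheaper_option(5_000, LICENSE_FEE, CLOUD_PRICE_PER_CHAR))   # light user
```

With real quotes substituted for the placeholders, the same two functions show which side of the threshold a given user segment falls on.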

Relevant use cases: language learning apps, accessibility tools, consumer voice assistants, content or audiobook apps with high per-user playback volume.

Offline and Low-Connectivity Environments

Cloud TTS requires a functioning network connection. Every generation is a round-trip: text goes to the server, audio comes back. In conditions where connectivity is unreliable or unavailable, cloud TTS fails.

On-device TTS runs entirely locally. There is no network round-trip, no dependency on signal quality, no failure mode when coverage drops. Audio generates from the model on-device and plays immediately.

Relevant use cases: navigation and field service applications in areas with poor network coverage; consumer devices designed to function in airplane mode; rural or low-connectivity markets where network reliability cannot be assumed; applications that need voice to function continuously regardless of connection state.

Data Privacy and Compliance Constraints

Cloud TTS requires sending text to an external server. For most applications, this is unproblematic. For some, it is not permitted.

Legal, enterprise, or compliance contexts may require that text never leaves the user's device. On-device TTS is the only architecture that guarantees this by design: there is no outbound request, and no external system receives the input text. This is structurally different from a cloud provider holding a HIPAA BAA or GDPR agreement, because those arrangements still involve data leaving the device. On-device means the data does not leave at all.

Relevant use cases: enterprise applications with strict data residency requirements; healthcare applications where patient-adjacent text cannot be transmitted externally; applications in regulated markets where data sovereignty is a hard requirement; consumer apps where users expect their input to remain local.

How Do On-Device TTS Models Work?

A cloud TTS model is built for breadth: it must handle any language, any voice, any content type. The architecture that supports this breadth requires significant compute, which is why cloud TTS runs on server-side GPUs.

An on-device TTS model is built for a specific scope. Each Gradium Phonon deployment is finetuned for one voice, one target language, and one content type defined by the specific use case. Because the model only needs to handle inputs within that scope, it can be compact while producing high-quality output for the context it was built for.

The practical implication: on-device TTS quality should be evaluated within scope. A Phonon model built for NPC dialogue in a mobile game will produce high-quality audio for NPC dialogue. It will not produce the same range and flexibility as a cloud API. That is the intended design for the deployment context.

What Are the Specifications of Gradium Phonon?

Gradium Phonon is Gradium's on-device text-to-speech model. It runs on CPU across Android, iOS, and browser environments, with no network dependency and no GPU requirement.

Technical specifications:

Parameters: approximately 100M
Runtime: 6x real-time on a single MacBook CPU core
Voice cloning: from a 10-second reference audio sample
Deployment: ships as a self-contained binary inside the application
Network dependency: none
Cost model: license per device and model type, unlimited generations
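
A speed of "6x real-time" means the model produces six seconds of audio for every second of compute. The sketch below shows how such a figure could be computed and measured. The `synthesize` callable and the sample rate are hypothetical stand-ins for whatever interface a deployed model exposes; the source does not describe Phonon's API.

```python
import time
from typing import Callable, Sequence


def realtime_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Seconds of audio produced per second of compute (6.0 means 6x real-time)."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


def synthesis_time(audio_seconds: float, speedup: float) -> float:
    """Compute time needed to generate audio_seconds of speech at a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def measure_realtime_factor(synthesize: Callable[[str], Sequence[float]],
                            text: str, sample_rate: int) -> float:
    """Time one call to a (hypothetical) synthesize function returning PCM samples."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return realtime_factor(len(samples) / sample_rate, elapsed)


# At 6x real-time, a 12-second utterance needs 2 seconds of compute.
print(synthesis_time(12.0, 6.0))  # 2.0
```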

Each Phonon model is built for a specific combination of voice, language, and use case. Partners define the scope (target language, voice, device list) and receive a finetuned model artifact in days to weeks. The model ships directly inside the partner's app with no external endpoint calls at runtime. Gradium Phonon is currently in private beta.

How Does Phonon Compare to Other On-Device TTS Models?

Benchmark data is from Gradium's published evaluation on the Seed-TTS English test set (1,008 utterances). Two metrics:

WER (Word Error Rate): generated speech transcribed back to text using Whisper large v3, compared to source via edit distance (jiwer). Lower is better.
Speaker Similarity: cosine similarity between speaker embeddings extracted from reference audio and generated audio using WavLM large. Higher is better.
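
The scoring arithmetic behind both metrics can be sketched in a few lines. This is not Gradium's evaluation code: it does not run Whisper or WavLM, it uses a plain-Python word-level edit distance as a stand-in for jiwer, and the lowercasing step is an assumed normalization that the source does not specify. The similarity function computes cosine similarity, for which a higher value means the two embeddings are closer.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()   # assumed normalization: lowercase + whitespace split
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
# Toy 3-dimensional vectors standing in for real speaker embeddings.
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 2.5]))
```

In a full pipeline, the hypothesis string would come from the speech recognizer's transcript of the generated audio, and the vectors from the speaker-embedding model.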

Model	Parameters	WER	Speaker Similarity
Phonon (Gradium)	~100M	1.48%	56.37%
Kani-TTS2	450M	4.97%	40.73%
NeuTTS Air	552M	2.18%	47.51%
NeuTTS Nano	229M	1.71%	40.15%

Source: Gradium evaluation, Seed-TTS English benchmark, 1,008 utterances. WER: Whisper large v3 + jiwer edit distance. Speaker Similarity: WavLM large cosine similarity.

At approximately 100M parameters, Phonon is roughly 2.3x to 5.5x smaller than the three competing models (229M to 552M parameters). It records the lowest WER (1.48%) and the highest Speaker Similarity (56.37%) of all four. Kani-TTS2 (450M parameters) has a WER of 4.97%, about 3.4x Phonon's, even though Phonon has less than one-quarter of its parameter count.
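
The size and WER ratios can be recomputed directly from the table. The sketch below copies the published figures and treats Phonon's "approximately 100M" as exactly 100M, which is an approximation.

```python
# Figures copied from the benchmark table; Phonon's ~100M is treated as 100M.
MODELS = {
    "Phonon": {"params_m": 100, "wer": 1.48},
    "Kani-TTS2": {"params_m": 450, "wer": 4.97},
    "NeuTTS Air": {"params_m": 552, "wer": 2.18},
    "NeuTTS Nano": {"params_m": 229, "wer": 1.71},
}


def ratios_vs(baseline: str = "Phonon") -> dict:
    """Parameter-count and WER ratios of every other model relative to the baseline."""
    base = MODELS[baseline]
    return {
        name: {
            "size_ratio": round(m["params_m"] / base["params_m"], 2),
            "wer_ratio": round(m["wer"] / base["wer"], 2),
        }
        for name, m in MODELS.items()
        if name != baseline
    }


for name, r in ratios_vs().items():
    print(f"{name}: {r['size_ratio']}x the parameters, {r['wer_ratio']}x the WER")
```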

Parameter count matters specifically in on-device deployment. A 100M model fits in mobile device memory and runs on a single CPU core without GPU acceleration. Models at 450M and above face practical constraints on memory and compute at the edge.
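
A rough way to see why parameter count matters is to estimate the memory taken by the weights alone. The sketch below is a back-of-the-envelope calculation under stated assumptions: it ignores activations, runtime overhead, and any compression, and the numeric precision that Phonon or the other models ship with is not given in the source.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_size_mb(num_params: int, dtype: str) -> float:
    """Approximate size of the model weights in megabytes (10^6 bytes)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype: {dtype}")
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


# Parameter counts from the comparison; precisions are assumptions.
for params in (100_000_000, 450_000_000, 552_000_000):
    sizes = ", ".join(
        f"{dtype}: {weights_size_mb(params, dtype):,.0f} MB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{params // 1_000_000}M parameters -> {sizes}")
```

At the same precision, weight memory grows linearly with parameter count, so a 450M model needs 4.5 times the weight storage of a 100M model.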

For the full benchmark methodology and per-model audio examples, see On-Device TTS Benchmark 2026: Phonon vs Kani-TTS2 vs NeuTTS on Seed-TTS.

How Does Cloud TTS Compare to On-Device TTS at a Glance?

Criterion	Cloud TTS (Gradium API)	On-Device TTS (Phonon)
Voice flexibility	Any voice, unlimited cloning	One voice per model
Language support	English, French, German, Spanish, Portuguese	One language per model
Content breadth	Any input	Scoped to use case
Network required	Yes	No
Cost model	Variable (per request)	Fixed (license per device)
Data leaves device	Yes	No
Model updates	Continuous, no redeploy	Requires app redeployment
GPU required	Server-side only	No (CPU only)
Best for	Voice agents, real-time, breadth	Consumer apps, offline, privacy

Many products will use both architectures within the same application: cloud for real-time voice agent interactions where quality and flexibility matter, on-device for high-volume playback, offline modes, or privacy-sensitive text generation.
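
A product that uses both architectures needs some rule for deciding which backend handles a given request. The sketch below is one possible policy built from the three criteria in this guide (privacy, connectivity, scope). All names here (`Backend`, `TTSRequest`, `choose_backend`) are hypothetical and are not part of any Gradium SDK.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    CLOUD = "cloud"
    ON_DEVICE = "on_device"


@dataclass(frozen=True)
class TTSRequest:
    text_must_stay_local: bool  # privacy or compliance constraint on the input text
    network_available: bool     # current connectivity state
    in_model_scope: bool        # voice/language/content the local model was finetuned for


def choose_backend(req: TTSRequest) -> Backend:
    """Pick a backend for one request (hypothetical policy, not a Gradium API)."""
    # Text that may not leave the device can only be synthesized locally.
    if req.text_must_stay_local:
        return Backend.ON_DEVICE
    # Without a network, the cloud round-trip is impossible.
    if not req.network_available:
        return Backend.ON_DEVICE
    # In-scope content avoids per-request cost by staying on-device.
    if req.in_model_scope:
        return Backend.ON_DEVICE
    # Everything else needs the breadth of the cloud model.
    return Backend.CLOUD


print(choose_backend(TTSRequest(False, True, False)))  # Backend.CLOUD
print(choose_backend(TTSRequest(False, False, True)))  # Backend.ON_DEVICE
```

A real application would also need to decide what to do when a request is out of scope for the local model but cannot go to the cloud, for example by declining to speak or falling back to plain text.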

How Should You Choose Between Cloud and On-Device TTS?

On-device TTS and cloud TTS solve different problems within a product's voice layer. The choice between architectures is determined by deployment context: cost model at volume, network availability, and data privacy requirements. They are not alternatives in the sense of one replacing the other. Many products use both within the same application.

For consumer applications at scale, offline environments, or use cases where data cannot leave the device, a scoped, finetuned on-device model is the correct architecture. For real-time voice agents, multilingual deployments, or applications requiring broad voice flexibility, the Gradium cloud API provides capabilities that on-device models are not designed to replicate.

Gradium Phonon's published benchmark results on Seed-TTS English (1.48% WER, 56.37% speaker similarity, approximately 100M parameters on a single CPU core) provide a verifiable reference point for what current on-device TTS quality looks like at the edge.