On-Device Text-to-Speech in 2026: When Edge TTS Is the Right Architecture
Most Text-To-Speech deployments in 2026 run on cloud APIs: a request goes out over the network, a GPU generates the audio, and the audio comes back. For the majority of production voice use cases, this is the right architecture: highest available synthesis quality, unlimited voice flexibility, continuous model improvements, and a simple integration.
For some use cases, it is not the right architecture. This guide explains when on-device Text-To-Speech is the correct choice, what the tradeoffs are relative to cloud, and what the current state of on-device TTS quality looks like, with Gradium Phonon as the reference implementation and published benchmark data.
What Is the Core Tradeoff Between Cloud TTS and On-Device TTS?
The distinction is not about quality in absolute terms. It is about what each architecture optimizes for.
Cloud TTS (the Gradium API, ElevenLabs, Deepgram Aura-2) runs on GPU infrastructure. The model handles a broad range of languages, voices, and content types, and updates continuously. Quality is high across a wide range of inputs. The cost model is variable: each generation is a billable request. Network access is required.
On-device TTS runs locally on the end user's hardware, typically on CPU. The model is compact, optimized for a specific voice and use case, and ships as a binary inside the application. No network request, no per-generation cost, no data leaving the device. The tradeoff is scope: the model does one thing well rather than everything adequately.
When Is On-Device TTS the Right Architecture?
Three deployment contexts favor on-device over cloud.
High-Volume Consumer Applications
Cloud TTS pricing scales with usage. For enterprise voice agents, this is expected: every call is a revenue-generating interaction and the per-request cost is justified.
Consumer applications with freemium tiers operate differently. A large base of free or low-paying users interacting with voice features generates significant TTS volume, much of which produces little or no direct revenue. At scale, variable cloud pricing becomes a structural constraint.
On-device TTS uses a license model: a fixed cost per device and model type, with unlimited generations. The break-even point between cloud variable cost and on-device license cost depends on the cloud provider's per-character price and per-user generation volume. For high-engagement consumer apps, per-user volume can cross that break-even point before the product is profitable, and from there on the on-device license is the cheaper option.
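The break-even point can be sketched with simple arithmetic. The figures below (license fee, cloud price per million characters, device lifetime) are hypothetical placeholders rather than Gradium's or any provider's actual pricing, and the sketch assumes the license is a single fixed fee per device over that lifetime.

```python
def cloud_cost_per_user(chars_per_month: float,
                        price_per_million_chars: float,
                        months: int) -> float:
    """Variable cloud spend for one user over the given number of months."""
    return chars_per_month * months * price_per_million_chars / 1_000_000


def breakeven_chars_per_month(license_per_device: float,
                              price_per_million_chars: float,
                              months: int) -> float:
    """Monthly characters per user at which cloud spend equals the license fee."""
    return license_per_device * 1_000_000 / (price_per_million_chars * months)


# Hypothetical inputs, for illustration only.
LICENSE_PER_DEVICE = 0.50        # assumed fixed license fee per device
PRICE_PER_MILLION_CHARS = 15.0   # assumed cloud price per 1M characters
LIFETIME_MONTHS = 12             # assumed period the license fee is spread over

threshold = breakeven_chars_per_month(
    LICENSE_PER_DEVICE, PRICE_PER_MILLION_CHARS, LIFETIME_MONTHS
)
print(f"Break-even: {threshold:,.0f} characters per user per month")
```

Users who generate more than the threshold each month cost more on the cloud than the license fee; users below it cost less. Substituting real quotes for the three constants gives the actual crossover for a given product.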
Relevant use cases: language learning apps, accessibility tools, consumer voice assistants, content or audiobook apps with high per-user playback volume.
Offline and Low-Connectivity Environments
Cloud TTS requires a functioning network connection. Every generation is a round-trip: text goes to the server, audio comes back. In conditions where connectivity is unreliable or unavailable, cloud TTS fails.
On-device TTS runs entirely locally. There is no network round-trip, no dependency on signal quality, no failure mode when coverage drops. Audio generates from the model on-device and plays immediately.
Relevant use cases: navigation and field service applications in areas with poor network coverage; consumer devices designed to function in airplane mode; rural or low-connectivity markets where network reliability cannot be assumed; applications that need voice to function continuously regardless of connection state.
Data Privacy and Compliance Constraints
Cloud TTS requires sending text to an external server. For most applications, this is unproblematic. For some, it is not permitted.
Legal, enterprise, or compliance contexts may require that text never leaves the user's device. On-device TTS is the only architecture that guarantees this by design: there is no outbound request, and no external system receives the input text. This is structurally different from a cloud provider holding a HIPAA BAA or GDPR agreement, because those arrangements still involve data leaving the device. On-device means the data does not leave at all.
Relevant use cases: enterprise applications with strict data residency requirements; healthcare applications where patient-adjacent text cannot be transmitted externally; applications in regulated markets where data sovereignty is a hard requirement; consumer apps where users expect their input to remain local.
How Do On-Device TTS Models Work?
A cloud TTS model is built for breadth: it is expected to handle many languages, many voices, and any content type. The architecture that supports this breadth requires significant compute, which is why cloud TTS runs on server-side GPUs.
An on-device TTS model is built for a specific scope. Each Gradium Phonon deployment is finetuned for one voice, one target language, and one content type defined by the specific use case. Because the model only needs to handle inputs within that scope, it can be compact while producing high-quality output for the context it was built for.
The practical implication: on-device TTS quality should be evaluated within scope. A Phonon model built for NPC dialogue in a mobile game will produce high-quality audio for NPC dialogue. It will not produce the same range and flexibility as a cloud API. That is the intended design for the deployment context.
What Are the Specifications of Gradium Phonon?
Gradium Phonon is Gradium's on-device Text-To-Speech model. It runs on CPU across Android, iOS, and browser environments, with no network dependency and no GPU requirement.
Technical specifications:
- Parameters: approximately 100M
- Synthesis speed: 6x real-time on a single CPU core (audio is generated about six times faster than it plays back)
- Voice cloning: from a 10-second reference audio sample
- Deployment: ships as a self-contained binary inside the application
- Network dependency: none
- Cost model: license per device and model type, unlimited generations
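To make the speed figure concrete, here is a small sketch that converts it into compute time. It assumes "6x real-time" means synthesis runs six times faster than playback; the function name and the 30-second example are illustrative only.

```python
def synthesis_seconds(audio_seconds: float, speedup: float = 6.0) -> float:
    """Estimated compute time needed to generate a clip of the given duration.

    `speedup` is how many times faster than playback the model runs
    (assumed to be 6.0, from the specification list above).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


clip = 30.0  # seconds of speech to generate (example value)
print(f"{clip:.0f} s of audio needs about {synthesis_seconds(clip):.1f} s of CPU time")
```

Under that assumption, generation stays ahead of playback, so audio can begin playing before the full clip has been synthesized if the runtime supports streaming output.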
Each Phonon model is built for a specific combination of voice, language, and use case. Partners define the scope (target language, voice, device list) and receive a finetuned model artifact in days to weeks. The model ships directly inside the partner's app with no external endpoint calls at runtime. Gradium Phonon is currently in private beta.
How Does Phonon Compare to Other On-Device TTS Models?
Benchmark data is from Gradium's published evaluation on the Seed-TTS English test set (1,008 utterances). Two metrics:
- WER (Word Error Rate): generated speech transcribed back to text using Whisper large v3, compared to source via edit distance (jiwer). Lower is better.
- Speaker Similarity: cosine similarity between speaker embeddings extracted from reference audio and generated audio using WavLM large. Higher is better.
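The two metrics can be sketched in a few lines. This is an illustration, not the evaluation pipeline itself: the published numbers use Whisper large v3 for transcription, jiwer for the edit distance, and WavLM large for embeddings, whereas the sketch below substitutes a hand-written word-level edit distance and toy vectors. The lowercase-and-split normalization is an assumption. The speaker score is computed as cosine similarity, for which a higher value means a closer match.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()   # assumed normalization: lowercase, whitespace split
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming Levenshtein distance, kept to two rows.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("vectors must be non-zero")
    return dot / norm


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
# Toy embeddings pointing in the same direction.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

In a real evaluation, the hypothesis string would come from a speech recognizer run on the generated audio, and the vectors from a speaker-embedding model applied to the reference and generated clips.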
| Model | Parameters | WER | Speaker Similarity |
|---|---|---|---|
| Phonon (Gradium) | ~100M | 1.48% | 56.37% |
| Kani-TTS2 | 450M | 4.97% | 40.73% |
| NeuTTS Air | 552M | 2.18% | 47.51% |
| NeuTTS Nano | 229M | 1.71% | 40.15% |
Source: Gradium evaluation, Seed-TTS English benchmark, 1,008 utterances. WER: Whisper large v3 + jiwer edit distance. Speaker Similarity: WavLM large cosine similarity.
At approximately 100M parameters, Phonon is roughly 2.3x to 5.5x smaller than the three competing models. It records the lowest WER (1.48%) and the highest Speaker Similarity (56.37%) of all four. Kani-TTS2's WER (4.97%, 450M parameters) is about 3.4x higher than Phonon's, even though Phonon has less than one-quarter of its parameter count.
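The size and WER ratios can be reproduced from the table values. The sketch below only restates the published figures; because Phonon's size is given as approximately 100M, the parameter ratios are approximate too.

```python
# (parameters in millions, WER in percent), copied from the benchmark table.
RESULTS = {
    "Phonon": (100, 1.48),       # size reported as approximately 100M
    "Kani-TTS2": (450, 4.97),
    "NeuTTS Air": (552, 2.18),
    "NeuTTS Nano": (229, 1.71),
}


def ratios_vs_phonon(results: dict) -> dict:
    """Return (size ratio, WER ratio) of each other model relative to Phonon."""
    base_params, base_wer = results["Phonon"]
    return {
        name: (params / base_params, wer / base_wer)
        for name, (params, wer) in results.items()
        if name != "Phonon"
    }


for name, (size_ratio, wer_ratio) in ratios_vs_phonon(RESULTS).items():
    print(f"{name}: {size_ratio:.1f}x the parameters, {wer_ratio:.1f}x the WER")
```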
Parameter count matters specifically in on-device deployment. A 100M model fits in mobile device memory and runs on a single CPU core without GPU acceleration. Models at 450M and above face practical constraints on memory and compute at the edge.
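A rough way to see why size matters is to estimate the memory taken by the weights alone. The numeric precision each model ships in is not stated in the benchmark, so the three data types below are assumptions, and the estimate ignores activations, audio buffers, and runtime overhead.

```python
# Bytes used to store one parameter at each (assumed) precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(num_params: int, dtype: str) -> float:
    """Approximate size of the model weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


# Parameter counts from the benchmark table (Phonon's is approximate).
for label, params in [("~100M", 100_000_000), ("229M", 229_000_000),
                      ("450M", 450_000_000), ("552M", 552_000_000)]:
    sizes = ", ".join(
        f"{dtype}: {weight_megabytes(params, dtype):,.0f} MB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{label} parameters -> {sizes}")
```

Weight size scales linearly with parameter count at a fixed precision, so the same 2.3x to 5.5x spread in parameters carries over directly to the memory an app must reserve for the model.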
For the full benchmark methodology and per-model audio examples, see On-Device TTS Benchmark 2026: Phonon vs Kani-TTS2 vs NeuTTS on Seed-TTS.
How Does Cloud TTS Compare to On-Device TTS at a Glance?
| Criterion | Cloud TTS (Gradium API) | On-Device TTS (Phonon) |
|---|---|---|
| Voice flexibility | Any voice, unlimited cloning | One voice per model |
| Language support | English, French, German, Spanish, Portuguese | One language per model |
| Content breadth | Any input | Scoped to use case |
| Network required | Yes | No |
| Cost model | Variable (per request) | Fixed (license per device) |
| Data leaves device | Yes | No |
| Model updates | Continuous, no redeploy | Requires app redeployment |
| GPU required | Server-side only | No (CPU only) |
| Best for | Voice agents, real-time, breadth | Consumer apps, offline, privacy |
Many products will use both architectures within the same application: cloud for real-time voice agent interactions where quality and flexibility matter, on-device for high-volume playback, offline modes, or speech generated from privacy-sensitive text.
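A hybrid application needs some rule for choosing a backend per request. The sketch below is one possible routing policy built from the criteria in the table; the class names, fields, and priority order are hypothetical and not part of any Gradium SDK.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    CLOUD = "cloud"
    ON_DEVICE = "on_device"


@dataclass(frozen=True)
class SynthesisRequest:
    text: str
    must_stay_on_device: bool = False    # privacy or compliance constraint
    in_model_scope: bool = True          # matches the on-device model's voice, language, content type
    high_volume_playback: bool = False   # e.g. long-form or repeated playback


def choose_backend(request: SynthesisRequest, network_available: bool) -> Backend:
    """Pick a TTS backend for one request (hypothetical policy)."""
    # Hard constraints first: text that may not leave the device never goes to the cloud.
    if request.must_stay_on_device:
        return Backend.ON_DEVICE
    # Without a network, on-device is the only option that can produce audio.
    if not network_available:
        return Backend.ON_DEVICE
    # Cost optimization: keep high-volume, in-scope playback off the metered API.
    if request.high_volume_playback and request.in_model_scope:
        return Backend.ON_DEVICE
    # Otherwise prefer the cloud for breadth and flexibility.
    return Backend.CLOUD


print(choose_backend(SynthesisRequest("Turn left in 200 meters."), network_available=False))
print(choose_backend(SynthesisRequest("How can I help you today?"), network_available=True))
```

Ordering the checks from hard constraints to cost preferences keeps the privacy guarantee independent of connectivity: a request flagged as device-only is never sent out, even when the network is available.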
How Should You Choose Between Cloud and On-Device TTS?
On-device TTS and cloud TTS solve different problems within a product's voice layer. The choice between architectures is determined by deployment context: cost model at volume, network availability, and data privacy requirements. The two are complementary rather than substitutes for each other, which is why a single application can reasonably ship both.
For consumer applications at scale, offline environments, or use cases where data cannot leave the device, a scoped, finetuned on-device model is the correct architecture. For real-time voice agents, multilingual deployments, or applications requiring broad voice flexibility, the Gradium cloud API provides capabilities that on-device models are not designed to replicate.
Gradium Phonon's published benchmark results on Seed-TTS English (1.48% WER and 56.37% speaker similarity, from a model of approximately 100M parameters that runs on a single CPU core) provide a concrete reference point for what current on-device TTS quality looks like.