Gradium Phonon: On-Device TTS Benchmarks in 2026
Gradium Phonon is Gradium's on-device text-to-speech model. At approximately 100 million parameters, it runs entirely on a CPU, supports voice cloning from a 10-second sample, and reaches a word error rate of 1.00 percent on the Seed-TTS English benchmark with cloning enabled and 0.83 percent with a fixed voice. As of May 2026, both figures are the lowest among the on-device models compared in each condition on that benchmark.
This article covers what Phonon is built for, the full benchmark history from its first release through its most recent update, how it compares to every other on-device TTS model currently measured against it, and how a Phonon model actually gets built and shipped.
What Gradium Phonon is
Why on-device TTS exists alongside the cloud API
The Gradium API is built for AI voice agents: customer support agents, conversational B2B assistants, outbound sales agents, any use case where the voice layer needs to handle any language, any speaker, and complex dialogue. It runs on GPU compute, which is what makes that breadth and quality possible, and it bills per request. For most products, that is the right trade. For some, it is not.
A large base of free users generating constant TTS requests on a freemium app does not fit a variable cloud cost model well. Neither does a product that needs to function with no network connection at all. Phonon exists for exactly these cases. It takes a voice (your own custom voice, a clone, or one from Gradium's catalogue) and finetunes a single-purpose model around it for a specific language and use case. The result ships as a licensed binary inside the app: no network call, no per-request cost, no dependency on connectivity.
The architecture: continuous audio language models with flow-matching
Phonon is built on Continuous Audio Language Models with flow-matching for waveform generation. This is a deliberate departure from the phonemizer-based approach used by several other on-device TTS models on the market. Phonon uses a standard text tokenizer instead, which makes it more resilient to out-of-distribution text such as names, technical terms, and other inputs that fall outside the patterns a phonemizer was trained to recognize.
The result is a model that reproduces any voice, style, and accent at roughly 100 million parameters, generating audio about 6 times faster than real time on a single MacBook CPU core. It is small enough to run in a browser.
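As a rough illustration of what a 6x real-time figure means for latency budgeting, the sketch below converts a real-time multiple into compute time. The function name and the 30-second example are illustrative assumptions; only the 6x figure comes from the text above, and actual speed will vary by device.

```python
def synthesis_seconds(audio_seconds: float, realtime_multiple: float = 6.0) -> float:
    """Wall-clock time needed to generate `audio_seconds` of speech.

    `realtime_multiple` is seconds of audio produced per second of compute.
    The default of 6.0 is the single-core MacBook figure quoted above.
    """
    if realtime_multiple <= 0:
        raise ValueError("realtime_multiple must be positive")
    return audio_seconds / realtime_multiple


# At 6x real time, 30 seconds of speech takes about 5 seconds to generate.
print(synthesis_seconds(30.0))
```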
When on-device TTS is the right choice
Cost model at scale
The Gradium API's cost scales with usage: more requests, more cost. That works until volume makes it unsustainable, and the clearest example is a consumer app with a freemium tier. A large free user base generating constant voice requests, many of which produce little or no revenue, breaks a per-request pricing model long before it breaks the product. Phonon replaces that with a license: a fixed cost per device and model type, with unlimited generations once deployed. At sufficient volume, the economics are categorically different, not just cheaper.
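A minimal sketch of the break-even point between the two pricing models follows. The prices are hypothetical placeholders, not Gradium's actual rates, and the function name is an assumption made for illustration.

```python
import math
from decimal import Decimal


def break_even_requests(license_cost: str, cost_per_request: str) -> int:
    """Smallest request count at which a flat per-device license costs
    no more than per-request billing.

    Prices are passed as decimal strings to avoid float rounding.
    """
    license_d = Decimal(license_cost)
    per_request = Decimal(cost_per_request)
    if per_request <= 0:
        raise ValueError("cost_per_request must be positive")
    return math.ceil(license_d / per_request)


# Hypothetical prices: a 2.00 per-device license versus 0.001 per cloud request.
print(break_even_requests("2.00", "0.001"))
```

Under these made-up prices, any device that generates more than the printed number of requests over its lifetime is cheaper to serve on a license than through a per-request API.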
Offline and low-connectivity environments
On-device TTS runs entirely locally, with no network round trip and no dependency on connection quality. Field applications, consumer devices operating in low-connectivity regions, and any product that needs to keep working in airplane mode cannot have voice as a network-dependent feature. Vehicles and remote equipment fall into the same category: a voice agent that goes silent the moment connectivity drops is not a usable voice agent.
Data privacy and compliance
Some products cannot send user-adjacent text to an external server, whether for legal, contractual, or enterprise compliance reasons. On-device TTS is the only architecture that satisfies a constraint where text must never leave the device, without compromising voice quality for the use case it was built for. This applies directly to healthcare applications and consumer hardware where audio privacy is a hard requirement, not a preference.
Phonon benchmark results on Seed-TTS
All Phonon evaluations use the English subset of the Seed-TTS benchmark [1] (1,008 utterances, each paired with reference audio from the Common Voice dataset). Generated audio is transcribed back to text with whisper-large-v3 and compared to the input using edit distance via the jiwer package, with text normalization applied so that equivalent forms (such as "zero" versus "0") are not counted as errors. Speaker similarity is the cosine similarity between WavLM-large embeddings of the reference and generated audio, reported as a percentage (higher is better). Gradium deliberately avoids using its own STT model for this evaluation, to keep the transcription step free of any shared bias with the TTS model being measured.
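The two metrics can be sketched in plain Python. This is a dependency-free stand-in, not the evaluation code: the pipeline described above uses jiwer for the edit-distance step, whisper-large-v3 for transcription, and WavLM-large for embeddings. The tiny normalization map and all function names here are assumptions for illustration, and the similarity function computes the cosine of the angle between two embedding vectors.

```python
import math
import re

# Toy normalization map; a real pipeline would use a full text normalizer.
_NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3"}


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and map a few number words to digits."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [_NUMBER_WORDS.get(w, w) for w in words]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # word missing from the transcript
                curr[j - 1] + 1,          # extra word in the transcript
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


# "zero" and "0" normalize to the same token, so only "timer" vs "time" counts.
print(word_error_rate("Set the timer to zero.", "set the time to 0"))
```

In this example one of five reference words is wrong, so the function returns 0.2. A benchmark-level WER aggregates such errors over the whole evaluation set.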
With voice cloning: May 2026 results
| Model | Parameters | Word error rate | Speaker similarity |
|---|---|---|---|
| Phonon (May 2026) | ~100M | 1.00% | 59.51% |
| PocketTTS | 100M | 1.27% | 49.13% |
| NeuTTS Nano | 229M | 1.71% | 40.15% |
| NeuTTS Air | 552M | 2.18% | 47.51% |
| KaniTTS2 | 450M | 4.97% | 40.73% |
Phonon leads on both metrics at roughly the same parameter count as PocketTTS, and at a fraction of the size of NeuTTS Air and KaniTTS2. Its speaker similarity lead over the next-best model, the similarly sized PocketTTS (59.51 versus 49.13 percent), is over 10 percentage points, which is a meaningful difference in how recognizable the cloned voice actually is in the output.
Without voice cloning: fixed voice results
| Model | Parameters | Word error rate |
|---|---|---|
| Phonon (May 2026) | ~100M | 0.83% |
| Magpie (NVIDIA) | 357M | 0.89% |
| Kokoro | 82M | 0.90% |
| Supertonic 3 | 99M | 0.95% |
| Supertonic 2 | 66M | 2.63% |
With voice cloning disabled and a fixed high-quality voice, Phonon still leads the field at 0.83 percent WER, ahead of Magpie at 0.89 percent and Kokoro at 0.90 percent, both of which use a phonemizer-based approach rather than Phonon's standard text tokenizer. Gradium notes this comparison still uses the voice-cloning-capable version of Phonon with its voice conditioning fixed to a single speaker; a model fine-tuned specifically to one voice from the outset would be expected to do even better.
What changed between April and May 2026
Phonon's first published benchmark (April 9, 2026) recorded 1.48 percent WER and 56.37 percent speaker similarity. The May 2026 update lowered WER to 1.00 percent and raised speaker similarity to 59.51 percent. It also brought two infrastructure changes: the removal of a 100-token minimum input padding requirement, and support for int8 quantization with no perceivable audio quality loss.
The padding removal matters specifically for latency on short utterances. The earlier version of Phonon needed input padded to roughly 100 tokens to reach its best quality, meaning the model computed more than was strictly necessary for a short sentence. Removing that requirement means time to first audio improves precisely on the short, frequent utterances that make up most real-world TTS traffic, since the model only computes what the input actually requires.
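The effect of both infrastructure changes can be approximated with simple arithmetic. The sketch below assumes compute scales with the number of token positions processed, assumes a float32 baseline for the weight-size comparison, and uses a generic symmetric per-tensor int8 scheme. None of these details are confirmed by Gradium, and all names are hypothetical.

```python
def tokens_computed(input_tokens: int, min_padding: int = 0) -> int:
    """Token positions processed for one request, given a minimum padded length."""
    if input_tokens < 0 or min_padding < 0:
        raise ValueError("token counts must be non-negative")
    return max(input_tokens, min_padding)


def weight_megabytes(param_count: int, bytes_per_param: int) -> float:
    """Approximate weight storage in megabytes (10**6 bytes)."""
    return param_count * bytes_per_param / 1_000_000


def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]


# A hypothetical 12-token sentence: 100 positions with padding, 12 without.
print(tokens_computed(12, min_padding=100), tokens_computed(12))

# 100 million parameters: 400 MB at 4 bytes each, 100 MB at 1 byte each.
print(weight_megabytes(100_000_000, 4), weight_megabytes(100_000_000, 1))
```

With this scheme, the round-trip error of each dequantized weight is bounded by half the scale. That bound describes numerical error only; whether it stays inaudible depends on the model.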
How Phonon compares to every on-device TTS model on the market
Combining both benchmark conditions into a single view of the on-device TTS landscape as of May 2026:
| Model | Parameters | WER (best available) | Voice cloning |
|---|---|---|---|
| Phonon (May 2026) | ~100M | 0.83% (fixed voice) / 1.00% (cloning) | Yes, 10-second sample |
| Magpie (NVIDIA) | 357M | 0.89% | Not evaluated for cloning |
| Kokoro | 82M | 0.90% | No |
| Supertonic 3 | 99M | 0.95% | Not evaluated for cloning |
| PocketTTS | 100M | 1.27% (cloning) | Yes |
| NeuTTS Nano | 229M | 1.71% (cloning) | Yes, requires reference transcription |
| NeuTTS Air | 552M | 2.18% (cloning) | Yes, requires reference transcription |
| Supertonic 2 | 66M | 2.63% | Not evaluated for cloning |
| KaniTTS2 | 450M | 4.97% (cloning) | Yes |
Two patterns stand out. First, the smallest models in this table, Kokoro at 82 million parameters and Supertonic 2 and 3 in the 66 to 99 million range, were either not evaluated for voice cloning or, in Kokoro's case, do not support it. Second, most of the other cloning-capable models are considerably larger: NeuTTS Nano, KaniTTS2, and NeuTTS Air come in at roughly 2.3, 4.5, and 5.5 times Phonon's parameter count. Only PocketTTS matches Phonon's size, and it trails on both WER and speaker similarity. Phonon is the only model in this comparison that combines voice cloning with the lowest WER in both conditions (0.83 percent with a fixed voice, 1.00 percent with cloning), the highest measured speaker similarity, and a footprint small enough to run on a single CPU core.
NeuTTS Air and NeuTTS Nano also require a transcription of the reference audio to perform voice cloning. Phonon does not: a 10-second audio sample alone is enough.
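The size and similarity comparisons can be recomputed directly from the published cloning results. The sketch below hard-codes the figures from the voice-cloning table in this article, treating Phonon's approximate parameter count as exactly 100 million. The type and function names are illustrative assumptions.

```python
from typing import NamedTuple


class CloningResult(NamedTuple):
    name: str
    params_m: int          # parameter count in millions
    wer_pct: float         # word error rate with cloning enabled
    similarity_pct: float  # speaker similarity


# Figures copied from the voice-cloning table (May 2026).
RESULTS = [
    CloningResult("Phonon", 100, 1.00, 59.51),
    CloningResult("PocketTTS", 100, 1.27, 49.13),
    CloningResult("NeuTTS Nano", 229, 1.71, 40.15),
    CloningResult("NeuTTS Air", 552, 2.18, 47.51),
    CloningResult("KaniTTS2", 450, 4.97, 40.73),
]


def size_ratios(results: list[CloningResult], baseline: str = "Phonon") -> dict[str, float]:
    """Each model's parameter count as a multiple of the baseline's."""
    base = next(r for r in results if r.name == baseline)
    return {r.name: r.params_m / base.params_m for r in results if r.name != baseline}


def similarity_lead(results: list[CloningResult], baseline: str = "Phonon") -> float:
    """Baseline speaker similarity minus the best competing score, in points."""
    base = next(r for r in results if r.name == baseline)
    best_other = max(r.similarity_pct for r in results if r.name != baseline)
    return base.similarity_pct - best_other


for name, ratio in size_ratios(RESULTS).items():
    print(f"{name}: {ratio:.2f}x Phonon's parameter count")
print(f"Speaker similarity lead: {similarity_lead(RESULTS):.2f} points")
```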
How a Phonon model is built and delivered
A Phonon deployment starts with a scope definition. The partner provides the target language or languages, the voice or voices to ship (chosen from Gradium's catalogue, cloned from a 10-second sample, or built from a larger custom dataset if available), and the list of target devices. Gradium builds a model optimized for exactly that combination.
Turnaround runs from days to weeks depending on complexity. Each model is finetuned for one voice, one target language, and one content type scoped to the partner's use case, which is precisely what allows it to stay compact while still producing high-quality output for that specific context. An NPC voice for a mobile game does not need to handle enterprise call center scripts, and it does not carry the model capacity that would require.
Once finalized, the partner receives a self-contained artifact that ships directly inside their app. There are no calls to an external endpoint, no runtime network dependency, and no data leaving the device. Phonon is not designed to replace the cloud API's breadth across every language, voice, and content type; the two are built for different parts of a product's voice surface, and many products end up using both within the same application.
Get started
Phonon is currently available through private beta. Gradium scopes a model to your specific voice, language, and target device list, with delivery in days to weeks. Request access at gradium.ai/on-device-tts, or read the full benchmark write-up in "Phonon reaches 1.00% WER on Seed-TTS".
Glossary
Word error rate (WER) for TTS. A measure of pronunciation accuracy in synthesized speech. Generated audio is transcribed back to text and compared to the original input via edit distance. Phonon reaches 1.00 percent WER with voice cloning and 0.83 percent with a fixed voice on the Seed-TTS English benchmark, both the lowest in their comparison group as of May 2026.
Speaker similarity. A measure of how closely a cloned voice matches the reference speaker. Computed as the cosine similarity between speaker embeddings extracted from the reference and generated audio using WavLM-large, and reported as a percentage (higher is better). Phonon reaches 59.51 percent on the Seed-TTS benchmark, ahead of every other model evaluated with comparable cloning capability.
Seed-TTS benchmark. An English-language TTS evaluation set consisting of 1,008 utterances, each paired with reference audio from the Common Voice dataset. Used as the standard benchmark for evaluating Phonon and the other on-device models it is compared against.
Continuous Audio Language Models. The architectural family Phonon is built on, using flow-matching for waveform generation. A departure from phonemizer-based approaches used by models like Kokoro and Magpie, relying instead on a standard text tokenizer for greater resilience to out-of-distribution text.
Phonemizer-based tokenizer. A text processing approach that converts input text to phonemes before synthesis. Used by Kokoro and Magpie. Works well on standard vocabulary but can degrade on inputs outside the patterns it was trained to recognize, such as unusual names or technical terms.
int8 quantization. A model compression technique that reduces numerical precision to 8-bit integers, lowering compute and memory requirements. Phonon supports int8 quantization as of its May 2026 update, with no perceivable loss in audio quality, making on-device inference faster without a quality tradeoff.
Scoped finetuning. Gradium's delivery model for Phonon. Each deployment is finetuned for a specific combination of voice, target language, and device list defined by the partner. This scope constraint is what allows the resulting model to stay compact while still producing high-quality output for its specific use case.
References
[1] Anastassiou et al., "Seed-TTS: A Family of High-Quality Versatile Speech Generation Models," 2024. arXiv:2406.02430. Evaluation set: github.com/BytedanceSpeech/seed-tts-eval.