What is Gradium Phonon?

Gradium Phonon is Gradium's on-device text-to-speech model. It has approximately 100 million parameters, runs at 6 times real-time on a single MacBook CPU core, and supports voice cloning from a 10-second audio sample. It is built on Continuous Audio Language Models with flow-matching for waveform generation, and runs entirely on Android, iOS, or in a browser with no network connection required.

How accurate is Phonon compared to other on-device TTS models?

On the Seed-TTS English benchmark (May 2026), Phonon reaches 1.00 percent word error rate with voice cloning enabled and 59.51 percent speaker similarity, the lowest WER and highest similarity among the cloning-capable on-device models it was compared against: PocketTTS, NeuTTS Nano, NeuTTS Air, and KaniTTS2. With voice cloning disabled and a fixed voice, Phonon reaches 0.83 percent WER, ahead of Magpie, Kokoro, and Supertonic 3.

Does Phonon need a transcription of the reference voice to clone it?

No. Phonon clones a voice from a 10-second audio sample alone, with no transcription required. Some competing on-device models, including NeuTTS Air and NeuTTS Nano, require the reference audio's transcription as part of their cloning process, which adds a preparation step Phonon does not need.

What changed in Phonon's May 2026 update?

Word error rate improved from 1.48 percent to 1.00 percent with voice cloning enabled, and speaker similarity improved from 56.37 percent to 59.51 percent, compared to the April 2026 release. Gradium also removed a 100-token minimum input padding requirement, which reduces time to first audio specifically on short utterances, and added support for int8 quantization with no perceivable loss in audio quality.

When should I use Gradium Phonon instead of the Gradium cloud API?

Phonon is built for high-volume, offline, or privacy-constrained deployments where a variable per-request cloud cost is a liability: consumer apps with large freemium user bases, products that must function with no network connection, and applications where audio cannot leave the device for compliance reasons. The Gradium cloud API remains the right choice when breadth across languages, voices, and content types matters more than running fully offline. Many products use both within the same application.

How is Phonon's word error rate measured?

Generated speech is transcribed back to text using whisper-large-v3 and compared to the original input using edit distance via the jiwer package, with text normalization applied so that equivalent forms are not counted as errors. Gradium deliberately avoids using its own speech-to-text model for this step, to keep the evaluation free of any bias shared between its TTS and STT modeling approaches.
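The scoring step can be illustrated with a short sketch. This is not Gradium's evaluation code: it replaces the jiwer package with a plain word-level edit distance so it runs without dependencies, and the `normalize` function with its tiny number-word table is a hypothetical stand-in for the real text normalization. The transcription step (whisper-large-v3) is assumed to have already produced the hypothesis string.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and map a few number words to digits.

    The number-word table is an illustrative subset, not the normalizer
    used in the actual evaluation.
    """
    number_words = {"zero": "0", "one": "1", "two": "2", "three": "3"}
    cleaned = re.sub(r"[^\w\s']", " ", text.lower())
    return [number_words.get(token, token) for token in cleaned.split()]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)
```

With this normalization, "Dial zero now." and "dial 0 now" score as identical, while one substituted word in a six-word sentence gives a WER of one sixth. A benchmark-level figure would typically aggregate errors over all utterances rather than score a single sentence.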

Is Gradium Phonon available today?

Phonon is currently in private beta. Gradium works with a limited number of partners to build a model scoped to their specific voice, language, and target device list, with delivery in days to weeks from the initial scope definition. Access can be requested at gradium.ai/on-device-tts.

Gradium Phonon: On-Device TTS Benchmarks in 2026

Gradium Phonon is Gradium's on-device text-to-speech model. At approximately 100 million parameters, it runs entirely on a CPU, supports voice cloning from a 10-second sample, and reaches a 1.00 percent word error rate on the Seed-TTS English benchmark with cloning enabled, and 0.83 percent with a fixed voice. Both figures are the lowest of any model in their respective size class on that benchmark as of May 2026.

This article covers what Phonon is built for, the full benchmark history from its first release through its most recent update, how it compares to every other on-device TTS model currently measured against it, and how a Phonon model actually gets built and shipped.

What Gradium Phonon is

Why on-device TTS exists alongside the cloud API

The Gradium API is built for AI voice agents: customer support agents, conversational B2B assistants, outbound sales agents, any use case where the voice layer needs to handle any language, any speaker, and complex dialogue. It runs on GPU compute, which is what makes that breadth and quality possible, and it bills per request. For most products, that is the right trade. For some, it is not.

A large base of free users generating constant TTS requests on a freemium app does not fit a variable cloud cost model well. Neither does a product that needs to function with no network connection at all. Phonon exists for exactly these cases. It takes a voice (your own custom voice, a clone, or one from Gradium's catalogue) and finetunes a single-purpose model around it for a specific language and use case. The result ships as a licensed binary inside the app: no network call, no per-request cost, no dependency on connectivity.

The architecture: continuous audio language models with flow-matching

Phonon is built on Continuous Audio Language Models with flow-matching for waveform generation. This is a deliberate departure from the phonemizer-based approach used by several other on-device TTS models on the market. Phonon uses a standard text tokenizer instead, which makes it more resilient to out-of-distribution text such as names, technical terms, and other inputs that fall outside the patterns a phonemizer was trained to recognize.

The result is a model that reproduces any voice, style, and accent at roughly 100 million parameters, running at 6 times real-time on a single MacBook CPU core. It is small enough to run in a browser.
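To make the speed figure concrete, here is a minimal sketch of the arithmetic. It assumes "6 times real-time" means six seconds of audio generated per second of compute at steady state, and it ignores startup cost and time to first audio; the function name is illustrative.

```python
def synthesis_seconds(audio_seconds: float, speedup: float = 6.0) -> float:
    """Estimated compute time to generate `audio_seconds` of speech.

    `speedup` is the real-time multiple (6.0 for the figure quoted above).
    This is a steady-state estimate, not a latency measurement.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# A 12-second sentence would take about 2 seconds of single-core compute.
estimate = synthesis_seconds(12.0)
```

Under the same assumption, the remaining headroom is what keeps audio generation ahead of playback on slower devices.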

When on-device TTS is the right choice

Cost model at scale

The Gradium API's cost scales with usage: more requests, more cost. That works until volume makes it unsustainable, and the clearest example is a consumer app with a freemium tier. A large free user base generating constant voice requests, many of which produce little or no revenue, breaks a per-request pricing model long before it breaks the product. Phonon replaces that with a license: a fixed cost per device and model type, with unlimited generations once deployed. At sufficient volume, the economics are categorically different, not just cheaper.
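The trade-off can be sketched as a break-even calculation. The prices below are purely hypothetical placeholders, not Gradium's pricing, and the model is deliberately simple: one fixed license fee per device against a flat per-request cloud price.

```python
import math
from fractions import Fraction


def break_even_requests(license_cost_per_device: str, cloud_cost_per_request: str) -> int:
    """Requests per device at which a fixed license stops costing more than
    per-request billing.

    Prices are passed as decimal strings and handled as exact fractions so
    the result is not affected by floating-point rounding.
    """
    license_cost = Fraction(license_cost_per_device)
    per_request = Fraction(cloud_cost_per_request)
    if per_request <= 0:
        raise ValueError("per-request cost must be positive")
    return math.ceil(license_cost / per_request)


# Hypothetical numbers: a 2.00 license fee against 0.001 per cloud request.
threshold = break_even_requests("2.00", "0.001")
```

Past that threshold every additional generation on the device is free under the license, while the cloud bill keeps growing linearly with usage.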

Offline and low-connectivity environments

On-device TTS runs entirely locally, with no network round trip and no dependency on connection quality. Field applications, consumer devices operating in low-connectivity regions, and any product that needs to keep working in airplane mode cannot have voice as a network-dependent feature. Vehicles and remote equipment fall into the same category: a voice agent that goes silent the moment connectivity drops is not a usable voice agent.

Data privacy and compliance

Some products cannot send user-adjacent text to an external server, whether for legal, contractual, or enterprise compliance reasons. When text must never leave the device, synthesis has to run on the device, and a model scoped to its use case can meet that constraint without compromising voice quality. This applies directly to healthcare applications and consumer hardware where audio privacy is a hard requirement, not a preference.

Phonon benchmark results on Seed-TTS

All Phonon evaluations use the English subset of the Seed-TTS benchmark [1] (1,008 utterances, each paired with reference audio from the Common Voice dataset). Generated audio is transcribed back to text with whisper-large-v3 and compared to the input using edit distance via the jiwer package, with text normalization applied so that equivalent forms (such as "zero" versus "0") are not counted as errors. Speaker similarity is the cosine similarity between WavLM-large embeddings of the reference and generated audio. Gradium deliberately avoids using its own STT model for this evaluation, to keep the transcription step free of any shared bias with the TTS model being measured.
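The speaker similarity computation reduces to a comparison of two embedding vectors. Since higher scores indicate a closer match, the sketch below computes cosine similarity. It assumes the embeddings have already been extracted (the WavLM-large step is not shown) and uses short toy vectors; presenting the result as a percentage by multiplying by 100 is an assumption about how the published figures are scaled.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings"; real speaker embeddings are much larger.
reference_embedding = [3.0, 4.0]
generated_embedding = [4.0, 3.0]
similarity_percent = 100 * cosine_similarity(reference_embedding, generated_embedding)
```

A benchmark-level score would then be an average of this value over all utterance pairs.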

With voice cloning: May 2026 results

Model	Parameters	Word error rate	Speaker similarity
Phonon (May 2026)	~100M	1.00%	59.51%
PocketTTS	100M	1.27%	49.13%
NeuTTS Nano	229M	1.71%	40.15%
NeuTTS Air	552M	2.18%	47.51%
KaniTTS2	450M	4.97%	40.73%

Phonon leads on both metrics at roughly the same parameter count as PocketTTS, and at a fraction of the size of NeuTTS Air and KaniTTS2. The speaker similarity gap over the next-best model with comparable size (PocketTTS at 49.13 percent) is over 10 percentage points, which is a meaningful difference in how recognizable the cloned voice actually is in the output.

Without voice cloning: fixed voice results

Model	Parameters	Word error rate
Phonon (May 2026)	~100M	0.83%
Magpie (NVIDIA)	357M	0.89%
Kokoro	82M	0.90%
Supertonic 3	99M	0.95%
Supertonic 2	66M	2.63%

With voice cloning disabled and a fixed high-quality voice, Phonon still leads the field at 0.83 percent WER, ahead of Magpie at 0.89 percent and Kokoro at 0.90 percent, both of which use a phonemizer-based approach rather than Phonon's standard text tokenizer. Gradium notes this comparison still uses the voice-cloning-capable version of Phonon with its voice conditioning fixed to a single speaker; a model fine-tuned specifically to one voice from the outset would be expected to do even better.

What changed between April and May 2026

Phonon's first published benchmark (April 9, 2026) recorded 1.48 percent WER and 56.37 percent speaker similarity. The May 2026 update brought that down to 1.00 percent WER and up to 59.51 percent speaker similarity, alongside two infrastructure changes: the removal of a 100-token minimum input padding requirement, and support for int8 quantization with no perceivable audio quality loss.
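The size of the April-to-May change is easier to judge in relative terms. The sketch below uses only the four figures quoted above; the helper names are illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after` (0.25 means 25 percent lower)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


def point_gain(before: float, after: float) -> float:
    """Absolute change in percentage points."""
    return after - before


wer_drop = relative_reduction(1.48, 1.00)      # roughly a third fewer word errors
similarity_gain = point_gain(56.37, 59.51)     # about 3.14 percentage points
```

In other words, the update removed roughly 32 percent of the remaining word errors while adding a little over 3 points of speaker similarity.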

The padding removal matters specifically for latency on short utterances. The earlier version of Phonon needed input padded to roughly 100 tokens to reach its best quality, meaning the model computed more than was strictly necessary for a short sentence. Removing that requirement means time to first audio improves precisely on the short, frequent utterances that make up most real-world TTS traffic, since the model only computes what the input actually requires.
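A rough way to see the effect is to count the tokens the model has to process with and without the minimum. This is a simplified sketch that assumes compute grows with the number of input tokens; the 12-token utterance is a hypothetical example, and actual time to first audio depends on more than token count.

```python
def tokens_processed(input_tokens: int, min_tokens: int = 0) -> int:
    """Tokens the model computes over when inputs are padded up to `min_tokens`."""
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    return max(input_tokens, min_tokens)


def padding_overhead(input_tokens: int, min_tokens: int = 100) -> float:
    """How many times more tokens are processed because of the padding minimum."""
    return tokens_processed(input_tokens, min_tokens) / input_tokens


short_utterance = 12  # hypothetical token count for a brief prompt
before_update = tokens_processed(short_utterance, min_tokens=100)
after_update = tokens_processed(short_utterance)
```

Inputs already longer than the minimum are unaffected, which is why the gain shows up specifically on short utterances.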

How Phonon compares to every on-device TTS model on the market

The table below combines both benchmark conditions into a single view of the on-device TTS landscape as of May 2026, ordered by best available WER:

Model	Parameters	WER (best available)	Voice cloning
Phonon (May 2026)	~100M	0.83% (fixed voice) / 1.00% (cloning)	Yes, 10-second sample
Magpie (NVIDIA)	357M	0.89%	Not evaluated for cloning
Kokoro	82M	0.90%	No
Supertonic 3	99M	0.95%	Not evaluated for cloning
PocketTTS	100M	1.27% (cloning)	Yes
NeuTTS Nano	229M	1.71% (cloning)	Yes, requires reference transcription
NeuTTS Air	552M	2.18% (cloning)	Yes, requires reference transcription
Supertonic 2	66M	2.63%	Not evaluated for cloning
KaniTTS2	450M	4.97% (cloning)	Yes

Two patterns stand out. First, the smallest models in this table, Kokoro at 82 million parameters and Supertonic 2 and 3 at 66 and 99 million, either do not support voice cloning (Kokoro) or were not evaluated for it (Supertonic). Second, most of the other cloning-capable models are considerably larger: NeuTTS Nano, KaniTTS2, and NeuTTS Air run at roughly 2.3, 4.5, and 5.5 times Phonon's parameter count, and PocketTTS, the one model of matching size, trails Phonon on both WER and speaker similarity. Phonon is the only model in this comparison that combines sub-1 percent WER with a fixed voice, the lowest WER and highest speaker similarity with cloning, and a footprint small enough to run on a single CPU core.

NeuTTS Air and NeuTTS Nano also require a transcription of the reference audio to perform voice cloning. Phonon does not: a 10-second audio sample alone is enough.

How a Phonon model is built and delivered

A Phonon deployment starts with a scope definition. The partner provides the target language or languages, the voice or voices to ship (chosen from Gradium's catalogue, cloned from a 10-second sample, or built from a larger custom dataset if available), and the list of target devices. Gradium builds a model optimized for exactly that combination.

Turnaround runs from days to weeks depending on complexity. Each model is finetuned for one voice, one target language, and one content type scoped to the partner's use case, which is precisely what allows it to stay compact while still producing high-quality output for that specific context. An NPC voice for a mobile game does not need to handle enterprise call center scripts, so it does not carry the extra model capacity those scripts would require.

Once finalized, the partner receives a self-contained artifact that ships directly inside their app. There are no calls to an external endpoint, no runtime network dependency, and no data leaving the device. Phonon is not designed to replace the cloud API's breadth across every language, voice, and content type; the two are built for different parts of a product's voice surface, and many products end up using both within the same application.

Get started

Phonon is currently available through private beta. Gradium scopes a model to your specific voice, language, and target device list, with delivery in days to weeks. Request access at gradium.ai/on-device-tts, or read the full benchmark write-up, "Phonon reaches 1.00% WER on Seed-TTS."

Glossary

Word error rate (WER) for TTS. A measure of pronunciation accuracy in synthesized speech. Generated audio is transcribed back to text and compared to the original input via edit distance. Phonon reaches 1.00 percent WER with voice cloning and 0.83 percent with a fixed voice on the Seed-TTS English benchmark, both the lowest in their comparison group as of May 2026.

Speaker similarity. A measure of how closely a cloned voice matches the reference speaker. Computed as the cosine similarity between speaker embeddings extracted from the reference and generated audio using WavLM-large, so higher is better. Phonon reaches 59.51 percent on the Seed-TTS benchmark, ahead of every other model evaluated with comparable cloning capability.

Seed-TTS benchmark. A TTS evaluation set [1]. The English subset used here consists of 1,008 utterances, each paired with reference audio from the Common Voice dataset. Used as the standard benchmark for evaluating Phonon and the other on-device models it is compared against.

Continuous Audio Language Models. The architectural family Phonon is built on, using flow-matching for waveform generation. A departure from phonemizer-based approaches used by models like Kokoro and Magpie, relying instead on a standard text tokenizer for greater resilience to out-of-distribution text.

Phonemizer-based tokenizer. A text processing approach that converts input text to phonemes before synthesis. Used by Kokoro and Magpie. Works well on standard vocabulary but can degrade on inputs outside the patterns it was trained to recognize, such as unusual names or technical terms.

int8 quantization. A model compression technique that reduces numerical precision to 8-bit integers, lowering compute and memory requirements. Phonon supports int8 quantization as of its May 2026 update, with no perceivable loss in audio quality, making on-device inference faster without a quality tradeoff.
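As an illustration of the general idea, here is a minimal sketch of symmetric per-tensor int8 quantization. It is a generic textbook scheme, not a description of how Phonon's weights are actually quantized, and the memory estimate assumes a 32-bit float baseline for a model of roughly 100 million parameters.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    if not values:
        raise ValueError("values must not be empty")
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


def weight_megabytes(num_params: int, bytes_per_param: int) -> float:
    """Storage needed for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


weights = [1.27, -0.64, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

float32_size = weight_megabytes(100_000_000, 4)  # assumed fp32 baseline
int8_size = weight_megabytes(100_000_000, 1)
```

Each restored value differs from the original by at most half a quantization step, and under these assumptions the weights shrink from about 400 MB to about 100 MB.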

Scoped finetuning. Gradium's delivery model for Phonon. Each deployment is finetuned for a specific combination of voice, target language, and device list defined by the partner. This scope constraint is what allows the resulting model to stay compact while still producing high-quality output for its specific use case.

References

[1] Anastassiou et al., "Seed-TTS: A Family of High-Quality Versatile Speech Generation Models," 2024. arXiv:2406.02430. Evaluation set: github.com/BytedanceSpeech/seed-tts-eval.