Phonon is Gradium's on-device Text-To-Speech model. It has roughly 100M parameters, runs at 6x real-time on a single MacBook CPU core, and supports voice cloning from a 10-second reference sample. It is small enough to run in a browser or on a mobile device.
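As a rough illustration of what "6x real-time" implies, the sketch below assumes the usual reading of the term: audio is generated six times faster than it plays back. The function name and the default factor are illustrative, and actual speed will vary with hardware and input.

```python
def synthesis_seconds(audio_seconds: float, realtime_factor: float = 6.0) -> float:
    """Estimated compute time to synthesize `audio_seconds` of speech.

    Assumes a constant real-time factor, which is a simplification.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# Under this assumption, a 60-second clip would take about 10 seconds to generate.
one_minute_estimate = synthesis_seconds(60.0)
```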

What are the main use cases for on-device TTS?

On-device deployment is required when (a) audio cannot leave the device for privacy or compliance reasons, such as healthcare assistants, financial advisors, or consumer toys; (b) the device operates without reliable internet, such as in-vehicle assistants, aviation systems, or remote field equipment; (c) the application is latency-sensitive and a network round trip is unacceptable, such as real-time voice agents or interactive games; or (d) per-request API costs at scale are prohibitive. Phonon's 100M parameter size lets it run in a browser, on mobile devices, or on embedded hardware without GPU acceleration.

How does Phonon compare to other on-device TTS models?

On the Seed-TTS English benchmark, Phonon achieves a 1.00% word error rate and 59.51% speaker similarity when using voice cloning. Both scores outperform models 2x to 5x its size, including KaniTTS2 (450M parameters), NeuTTS Air (552M parameters) and NeuTTS Nano (229M parameters).

What is voice cloning and how does Phonon handle it?

Voice cloning generates speech that matches the tone, accent, and cadence of a reference speaker. Phonon requires only a 10-second audio sample as a reference. Unlike some competitors, it does not need a transcription of the reference audio.

What hardware does Phonon require?

Phonon runs inference on a single CPU core. At roughly 100M parameters, it fits comfortably in mobile device memory and can also run in-browser, making it suitable for edge deployment without GPU acceleration.
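To make the memory claim concrete, here is a back-of-the-envelope estimate of weight storage for a 100M-parameter model at a few common precisions. It counts weights only and ignores activations, caches, and runtime overhead. Which precision Phonon actually ships in is not stated here, so the dtype list is an assumption.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_footprint_mb(num_params: int, dtype: str) -> float:
    """Weight storage in megabytes (1 MB = 1,000,000 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_footprint_mb(100_000_000, dtype):.0f} MB")
# float32: ~400 MB
# float16: ~200 MB
# int8: ~100 MB
```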

How is word error rate (WER) measured in this evaluation?

Generated speech is transcribed back to text using whisper-large-v3, then compared to the original input text via edit distance using the jiwer package. A normalizer is applied before comparison so that equivalent representations (e.g. "zero" vs. "0") are not counted as errors.
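The comparison step can be sketched in a few lines. This is a minimal, dependency-free stand-in rather than the evaluation code itself: the real pipeline uses jiwer and a fuller normalizer, and `normalize` below is a deliberately tiny, hypothetical substitute that only lowercases, strips punctuation, and spells out single digits. The transcription step (whisper-large-v3) is assumed to have already produced the hypothesis string.

```python
import re

# Hypothetical mini-normalizer table; the real normalizer covers far more cases.
_DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
           "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and map single digits to words."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    return [_DIGITS.get(w, w) for w in words]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

With this normalizer, `word_error_rate("Dial zero now.", "dial 0 now")` is 0.0, which is the behavior the normalization step is meant to guarantee.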

How is speaker similarity measured?

Speaker embeddings are extracted from both the reference audio and the generated speech using WavLM large. Speaker similarity is the cosine similarity between these two embedding vectors, so higher values indicate a closer match to the reference speaker.
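Because higher values mean a closer match, the score behaves like a cosine similarity. The sketch below computes it for two embedding vectors. The vectors in the usage note are placeholders, not real WavLM outputs, and reporting the result as a percentage (100 times the similarity) is an assumption about how the published numbers are scaled.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

For example, `cosine_similarity([1.0, 0.0], [1.0, 0.0])` is 1.0 (identical direction) and `cosine_similarity([1.0, 0.0], [0.0, 1.0])` is 0.0 (orthogonal).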

Is Phonon available to use today?

Phonon is currently in private beta. You can request access at gradium.ai/on-device-tts#beta-signup.

Phonon update: 1.00% WER on Seed-TTS, smaller than every model we beat

Scatter plot of on-device TTS models comparing parameter count (millions) and word error rate (%) on the Seed-TTS English benchmark. Phonon (May 2026) sits near the bottom-left corner with the lowest word error rate, at ~100M parameters and 0.83% WER, ahead of Kokoro (82M, 0.90%), Magpie (357M, 0.89%), Supertonic 2 (66M, 2.63%), NeuTTS nano (229M, 1.71%), NeuTTS air (552M, 2.18%) and KaniTTS2 (450M, 4.97%).

Parameter count vs Word Error Rate on the Seed-TTS English benchmark. Voice cloning is disabled for this comparison. Evaluated May 2026.

Phonon, our 100M-parameter on-device Text-To-Speech model, now reaches a 1.00% word error rate on the Seed-TTS English benchmark, outperforming NeuTTS Air (552M), KaniTTS2 (450M), and NeuTTS Nano (229M). With voice cloning disabled and a fixed high-quality voice, Phonon drops to 0.83% WER, ahead of Kokoro and Magpie. Phonon, first announced in April, is based on Continuous Audio Language Models with flow-matching for waveform generation. It is in private beta. You can request access at gradium.ai/on-device-tts.

On-device Text-To-Speech enables deployment scenarios that cloud APIs cannot serve: offline voice agents in vehicles and remote equipment, latency-sensitive products where a network round trip is unacceptable, privacy-sensitive applications in healthcare and consumer hardware where audio cannot leave the device, and high-volume applications where per-request API costs are prohibitive. Phonon runs entirely on the edge, removing the network from the voice pipeline.

This post compares the updated Phonon against the version from our previous benchmark post. All evaluations use the English subset of the Seed-TTS benchmark (arxiv, github).

What changed since the April release

Word error rate: 1.48% → 1.00%
Speaker similarity: 56.37% → 59.51%
Padding removed: no more 100-token minimum input length
int8 quantization supported with no audio quality loss

Results are reported on the Seed-TTS English test set. Generated audio is transcribed with whisper-large-v3 and compared to input text using jiwer with text normalization. Speaker similarity is the cosine similarity between WavLM-large embeddings of the reference and generated audio.
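The headline changes listed above can be put in relative terms with simple arithmetic. The figures are the ones reported in this post and the helper name is illustrative.

```python
def relative_reduction(old: float, new: float) -> float:
    """Fractional decrease from `old` to `new` (0.25 means 25% lower)."""
    return (old - new) / old


# Word error rate: 1.48% -> 1.00%
wer_reduction = relative_reduction(1.48, 1.00)

# Speaker similarity: 56.37% -> 59.51%, expressed in percentage points.
similarity_gain = 59.51 - 56.37

print(f"WER relative reduction: {wer_reduction:.1%}")           # 32.4%
print(f"Speaker similarity gain: {similarity_gain:.2f} points")  # 3.14 points
```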

Side-by-side bar charts on the Seed-TTS English benchmark. Left: Word Error Rate, where Phonon (May 2026) reaches 1.00%, ahead of Phonon (April 2026) at 1.48%, NeuTTS nano at 1.71%, NeuTTS air at 2.18%, and KaniTTS2 at 4.97%. Right: Speaker Similarity, where Phonon (May 2026) reaches 59.51%, ahead of Phonon (April 2026) at 56.37%, NeuTTS air at 47.51%, KaniTTS2 at 40.73%, and NeuTTS nano at 40.15%.

Model	Parameters	WER	Speaker Similarity
Phonon (May 2026)	≈100M	1.00%	59.51%
Phonon (April 2026)	≈100M	1.48%	56.37%
NeuTTS nano	229M	1.71%	40.15%
NeuTTS air	552M	2.18%	47.51%
KaniTTS2	450M	4.97%	40.73%

Phonon achieves the lowest WER and highest speaker similarity in the table at roughly one-fifth the parameter count of NeuTTS Air (552M) and KaniTTS2 (450M), and at roughly half the size of NeuTTS Nano (229M).

Phonon with a fixed voice: 0.83% WER, beating Kokoro and Magpie

We also ran the same benchmark with voice cloning disabled, so that Phonon can be compared with fixed-voice models such as Kokoro and NVIDIA's Magpie. Because voice cloning is not needed in this setting, Phonon uses a single fixed, high-quality voice. In this setup, Phonon outperforms the other models in the comparison, with a word error rate of 0.83%.

This evaluation still uses the voice-cloning-enabled version of Phonon, with the voice conditioning fixed to a single voice. We would expect an even lower word error rate from fine-tuning the model on a single voice. Phonon uses a standard text tokenizer, whereas Kokoro and Magpie use a phonemizer-based approach, which makes Phonon more resilient to out-of-distribution text.

Bar chart of Word Error Rate on the Seed-TTS English benchmark for small fixed-voice TTS models. Phonon (May 2026) reaches 0.83% WER, ahead of Magpie at 0.89%, Kokoro at 0.90%, and Supertonic 2 at 2.63%.

Model	Parameters	WER
Phonon (May 2026)	≈100M	0.83%
Kokoro	82M	0.90%
Magpie	357M	0.89%
Supertonic 2	66M	2.63%

Listen for yourself

Five consecutive segments synthesized by the May 2026 Phonon model.


Simpler deployment and lower latency

We also took this opportunity to rework the Phonon inference stack. The early iteration of Phonon required the input text to be padded to a minimum length of ~100 tokens in order to reach the best quality. In the latest iteration, no padding is required at all. For short sentences, this significantly reduces time to first audio because the model only computes what's actually needed.
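A toy comparison shows why dropping the padding helps short inputs most. The 100-token minimum is the approximate figure quoted above. The function name is hypothetical, and treating work as proportional to token count is a simplifying assumption, not a measured property of Phonon.

```python
MIN_PADDED_TOKENS = 100  # approximate minimum input length in the earlier iteration


def tokens_processed(n_tokens: int, padded: bool) -> int:
    """Number of input tokens the model must process for one request."""
    return max(n_tokens, MIN_PADDED_TOKENS) if padded else n_tokens


# A short, 12-token sentence: 100 tokens with padding, 12 without.
short_before = tokens_processed(12, padded=True)
short_after = tokens_processed(12, padded=False)

# A 250-token paragraph already exceeds the minimum, so nothing changes.
long_before = tokens_processed(250, padded=True)
long_after = tokens_processed(250, padded=False)
```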

The inference stack now also supports quantization. Using int8 gives much faster inference with no perceptible degradation in audio quality.
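For readers unfamiliar with int8 quantization, the sketch below shows one common generic scheme, symmetric per-tensor quantization, in plain Python. The post does not say which scheme Phonon's inference stack uses, so this is only an illustration of the general idea: each weight is stored as an 8-bit integer plus one shared floating-point scale.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from int8 values and the scale."""
    return [q * scale for q in quantized]


weights = [0.42, -1.0, 0.07, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight is within about half a quantization step (scale / 2)
# of the original value.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```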

Try Phonon

Phonon is in private beta. Request access at gradium.ai/on-device-tts.

Read the previous post in this series: Evaluating Phonon.

