
Phonon update: 1.00% WER on Seed-TTS, smaller than every model we beat

Scatter plot of on-device TTS models comparing parameter count (millions) and word error rate (%) on the Seed-TTS English benchmark. Phonon (May 2026) sits in the bottom-left corner at ~100M parameters and 0.83% WER, the lowest WER in the plot and one of the smallest models, ahead of Kokoro (82M, 0.90%), Magpie (357M, 0.89%), Supertonic 2 (66M, 2.63%), NeuTTS Nano (229M, 1.71%), NeuTTS Air (552M, 2.18%) and KaniTTS2 (450M, 4.97%).

Parameter count vs Word Error Rate on the Seed-TTS English benchmark. Voice cloning is disabled for this comparison. Evaluated May 2026.

Phonon, our 100M-parameter on-device text-to-speech model, now reaches a 1.00% word error rate (WER) on the Seed-TTS English benchmark with voice cloning enabled, outperforming NeuTTS Air (552M), KaniTTS2 (450M), and NeuTTS Nano (229M). With voice cloning disabled and a fixed high-quality voice, Phonon drops to 0.83% WER, ahead of Kokoro and Magpie. Phonon is based on Continuous Audio Language Models with flow matching for waveform generation, and was first announced in April. It is in private beta; you can request access at gradium.ai/on-device-tts.

On-device text-to-speech enables deployment scenarios that cloud APIs cannot serve: offline voice agents in vehicles and remote equipment, latency-sensitive products where a network round trip is unacceptable, privacy-sensitive applications in healthcare and consumer hardware where audio cannot leave the device, and high-volume applications where per-request API costs are prohibitive. Phonon runs entirely on the edge, removing the network from the voice pipeline.

This post compares the updated Phonon against the version from our previous benchmark post. All evaluations use the English subset of the Seed-TTS benchmark (arxiv, github).

What changed since the April release

  • Word error rate: 1.48% → 1.00%
  • Speaker similarity: 56.37% → 59.51%
  • Padding removed: no more 100-token minimum input length
  • int8 quantization: supported with no perceptible loss in audio quality

Results are reported on the Seed-TTS English test set. For WER, generated audio is transcribed with whisper-large-v3 and compared to the input text using jiwer with text normalization. Speaker similarity is the cosine similarity between WavLM-large embeddings of the reference and generated audio, reported as a percentage.
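To make the two metrics concrete, here is a minimal standard-library sketch of how a per-utterance WER and an embedding similarity can be computed. It is not the evaluation code used for this post: the `normalize` rules are an assumption standing in for the jiwer text normalization, the function names are hypothetical, and the transcripts and embeddings would come from whisper-large-v3 and WavLM-large rather than from hand-written values.

```python
import math
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace.

    Assumed normalization for illustration only.
    """
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

With these definitions, a hypothesis that gets one word wrong out of four scores a WER of 0.25, and two identical embeddings score a similarity of 1.0. Higher similarity is better, which is why the speaker similarity column favors larger values.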

Side-by-side bar charts on the Seed-TTS English benchmark. Left: Word Error Rate, where Phonon (May 2026) reaches 1.00%, ahead of Phonon (April 2026) at 1.48%, NeuTTS Nano at 1.71%, NeuTTS Air at 2.18%, and KaniTTS2 at 4.97%. Right: Speaker Similarity, where Phonon (May 2026) reaches 59.51%, ahead of Phonon (April 2026) at 56.37%, NeuTTS Air at 47.51%, KaniTTS2 at 40.73%, and NeuTTS Nano at 40.15%.

| Model | Parameters | WER | Speaker Similarity |
| --- | --- | --- | --- |
| Phonon (May 2026) | ≈100M | 1.00% | 59.51% |
| Phonon (April 2026) | ≈100M | 1.48% | 56.37% |
| NeuTTS Nano | 229M | 1.71% | 40.15% |
| NeuTTS Air | 552M | 2.18% | 47.51% |
| KaniTTS2 | 450M | 4.97% | 40.73% |

Phonon achieves the lowest WER and highest speaker similarity in the table at roughly one-fifth the parameter count of NeuTTS Air (552M) and KaniTTS2 (450M), and at roughly half the size of NeuTTS Nano (229M).
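The size and accuracy comparisons above are simple ratios. The sketch below recomputes them from the table values, treating Phonon as exactly 100M parameters (an assumption, since the post only gives "≈100M"); the helper names are hypothetical.

```python
PHONON_PARAMS_M = 100  # assumption: the post reports "≈100M"

# Parameter counts (millions) taken from the table above.
BASELINE_PARAMS_M = {
    "NeuTTS Air": 552,
    "KaniTTS2": 450,
    "NeuTTS Nano": 229,
}


def size_ratio(model_params_m: float, other_params_m: float) -> float:
    """Fraction of the other model's parameter count."""
    return model_params_m / other_params_m


def relative_reduction(old: float, new: float) -> float:
    """Relative improvement when a lower-is-better metric drops from old to new."""
    return (old - new) / old


ratios = {
    name: size_ratio(PHONON_PARAMS_M, params)
    for name, params in BASELINE_PARAMS_M.items()
}
wer_gain = relative_reduction(1.48, 1.00)  # April 2026 -> May 2026

for name, ratio in ratios.items():
    print(f"Phonon is {ratio:.0%} the size of {name}")
print(f"Relative WER reduction since April: {wer_gain:.1%}")
```

Under that assumption the ratios come out to about 18% (NeuTTS Air), 22% (KaniTTS2), and 44% (NeuTTS Nano), and the drop from 1.48% to 1.00% WER is a relative reduction of about 32%.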

Phonon with a fixed voice: 0.83% WER, beating Kokoro and Magpie

We also ran the evaluation on the same dataset with voice cloning disabled, to compare Phonon with fixed-voice models such as Kokoro and NVIDIA's Magpie. Since voice cloning is not required in this setting, we use a single fixed, high-quality voice for Phonon. In this setup, Phonon outperforms all the other models, bringing the word error rate down to 0.83%, the lowest in this comparison.

This is still the voice-cloning-enabled version of Phonon; we only fix the voice conditioning to a single voice. We would expect an even lower word error rate from fine-tuning the model on a single voice. Phonon uses a standard text tokenizer, whereas Kokoro and Magpie use a phonemizer-based approach, which makes Phonon more resilient to out-of-distribution text.

Bar chart of Word Error Rate on the Seed-TTS English benchmark for small fixed-voice TTS models. Phonon (May 2026) reaches 0.83% WER, ahead of Magpie at 0.89%, Kokoro at 0.90%, and Supertonic 2 at 2.63%.

| Model | Parameters | WER |
| --- | --- | --- |
| Phonon (May 2026) | ≈100M | 0.83% |
| Kokoro | 82M | 0.90% |
| Magpie | 357M | 0.89% |
| Supertonic 2 | 66M | 2.63% |

Listen for yourself

Five consecutive segments synthesized by the May 2026 Phonon model.


Simpler deployment and lower latency

We also took this opportunity to rework the Phonon inference stack. The earlier iteration of Phonon required the input text to be padded to a minimum length of ~100 tokens to reach the best quality. In the latest iteration, no padding is required at all. For short sentences, this significantly reduces time to first audio, because the model only computes what is actually needed.
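As a rough illustration of why dropping the minimum length helps short inputs, the sketch below counts how many input positions would be padding under a minimum-length rule. The function names are hypothetical, the 100-token figure is the approximate value quoted above, and the idea that cost tracks the number of input positions is a simplifying assumption, not a measurement of Phonon.

```python
def padded_length(num_tokens: int, min_length: int = 0) -> int:
    """Sequence length after padding up to a minimum length (0 = no padding)."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return max(num_tokens, min_length)


def padding_fraction(num_tokens: int, min_length: int = 0) -> float:
    """Share of input positions that are padding rather than real text."""
    total = padded_length(num_tokens, min_length)
    return (total - num_tokens) / total


# A 12-token sentence under a 100-token minimum versus no minimum.
with_minimum = padding_fraction(12, min_length=100)
without_minimum = padding_fraction(12, min_length=0)
print(f"padding share with minimum: {with_minimum:.0%}")
print(f"padding share without minimum: {without_minimum:.0%}")
```

For a 12-token sentence, a 100-token minimum means 88% of the input positions are padding, while inputs longer than the minimum are unaffected. That is consistent with the gain showing up mainly on short sentences.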

The inference stack has also been improved to support quantization. Using int8 results in much faster inference with no perceptible degradation in audio quality.
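For readers unfamiliar with int8 quantization, here is a minimal sketch of symmetric per-tensor weight quantization in plain Python. It shows the general technique only; the post does not describe which quantization scheme Phonon uses, and the function names are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Avoid a zero scale when all weights are zero (or the list is empty).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from int8 values and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale. Whether that error is audible depends on the model, which is why the claim above is stated in terms of perceived audio quality.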

Try Phonon

Phonon is in private beta. Request access at gradium.ai/on-device-tts.

Read the previous post in this series: Evaluating Phonon.
