What WER does Phonon achieve on the Seed-TTS English benchmark in May 2026?

Phonon achieves 1.00% word error rate on the Seed-TTS English benchmark in May 2026 with voice cloning enabled, down from 1.48% in April 2026. With voice cloning disabled and a fixed high-quality voice, Phonon achieves 0.83% WER. Both numbers are measured on the 1,008-utterance English subset of Seed-TTS, with audio transcribed by whisper-large-v3 and compared to input text using jiwer with text normalization.

How does Phonon compare to NeuTTS Air, KaniTTS2, and NeuTTS Nano?

On the Seed-TTS English benchmark, Phonon (May 2026, ~100M parameters) reaches 1.00% WER and 59.51% speaker similarity. NeuTTS Nano (229M) reaches 1.71% WER and 40.15% speaker similarity. NeuTTS Air (552M) reaches 2.18% WER and 47.51% speaker similarity. KaniTTS2 (450M) reaches 4.97% WER and 40.73% speaker similarity. Phonon is the smallest model in the comparison and the only one on the Pareto frontier of size and accuracy.

How does Phonon compare to Kokoro, Magpie, and Supertonic 2 in a fixed-voice setting?

With voice cloning disabled and a fixed high-quality voice on the Seed-TTS English benchmark, Phonon (~100M parameters) reaches 0.83% WER. Kokoro (82M) reaches 0.90% WER. Magpie (357M, NVIDIA) reaches 0.89% WER. Supertonic 2 (66M) reaches 2.63% WER. Phonon is evaluated in voice-cloning mode with the voice conditioning fixed to a single voice, so further reductions in WER are expected with a model fine-tuned to a single voice.

What changed in Phonon between April 2026 and May 2026?

Four changes. First, word error rate dropped from 1.48% to 1.00% on the Seed-TTS English benchmark, a 32% relative improvement. Second, speaker similarity rose from 56.37% to 59.51%. Third, the input padding requirement of approximately 100 tokens was removed, reducing time to first audio for short inputs. Fourth, the inference stack now supports int8 quantization with no perceivable audio quality degradation.

How is WER measured for TTS on the Seed-TTS benchmark?

Generated audio is transcribed back to text using whisper-large-v3, then compared to the original input text via edit distance computed with the jiwer package. A text normalizer (from the Whisper codebase) is applied to both reference and hypothesis before comparison so that equivalent representations (such as 'zero' vs. '0') are not counted as errors. The Seed-TTS English test set contains 1,008 utterances with associated reference audio and transcripts from Common Voice.
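As a rough illustration of the scoring step, the sketch below computes WER as word-level edit distance divided by the number of reference words. It is a standard-library stand-in, not the evaluation code itself: the toy `normalize` function only lowercases and strips punctuation, whereas the Whisper normalizer does considerably more (for example mapping '0' to 'zero'), and in the real pipeline jiwer performs the alignment.

```python
import re


def normalize(text: str) -> list[str]:
    """Toy normalizer (assumption): lowercase, drop punctuation, split on whitespace.

    The Whisper normalizer used in the actual evaluation is far more thorough.
    """
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(wer("The quick brown fox.", "the quick brown box"))  # 0.25
```

Because insertions count as errors, WER can exceed 100% when the transcript is much longer than the reference.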

How is speaker similarity measured for TTS on the Seed-TTS benchmark?

Speaker embeddings are extracted from both the reference audio and the generated audio using WavLM large. Speaker similarity is the cosine similarity between the two embedding vectors, reported as a percentage. Higher values indicate a closer match to the reference speaker. Phonon achieves 59.51% speaker similarity with voice cloning enabled in the May 2026 evaluation.
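The metric itself is simple once the embeddings exist. The sketch below assumes the two speaker embeddings are already available as plain lists of floats (extracting them with WavLM is out of scope here) and reports the cosine similarity as a percentage, so that higher means closer. The function names are illustrative, not taken from the evaluation code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def similarity_percent(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity scaled to a percentage, as in the results tables."""
    return 100.0 * cosine_similarity(a, b)


if __name__ == "__main__":
    reference_embedding = [3.0, 4.0]  # made-up 2-d vectors for illustration
    generated_embedding = [4.0, 3.0]
    print(round(similarity_percent(reference_embedding, generated_embedding), 2))  # 96.0
```

Cosine similarity ignores vector length, so only the direction of the embedding matters, not how loud or long the audio was.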

What hardware does Phonon require to run on-device?

Phonon has approximately 100M parameters and runs inference on a single CPU core; no GPU is required. It fits in mobile device memory and is compatible with Android, iOS, and browser environments. On a single MacBook CPU core it runs at 6x real-time. With int8 quantization, inference is faster still, with no perceivable loss in audio quality.
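To make "6x real-time" concrete: it means the model produces audio six times faster than the audio plays back. The small helper below shows the arithmetic; the function names are illustrative, and the 12-second clip is a made-up example rather than a measured result.

```python
def compute_seconds(audio_seconds: float, speedup: float) -> float:
    """Time needed to synthesize `audio_seconds` of audio at a given real-time speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def speedup_factor(audio_seconds: float, elapsed_seconds: float) -> float:
    """Real-time speedup observed when `audio_seconds` of audio took `elapsed_seconds` to produce."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return audio_seconds / elapsed_seconds


if __name__ == "__main__":
    # At 6x real-time, a 12-second utterance takes about 2 seconds of compute.
    print(compute_seconds(12.0, 6.0))  # 2.0
```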

Where can I access Phonon and the Seed-TTS benchmark?

Phonon is currently in private beta; access is granted to partners who define a scope (language, voice, target devices). The Seed-TTS benchmark is open source and available at github.com/BytedanceSpeech/seed-tts-eval, originally published with the Seed-TTS report at arxiv.org/abs/2406.02430. The Gradium blog post with full results is at gradium.ai/blog/phonon-update-may-2026.

Phonon Reaches 1.00% WER on Seed-TTS in May 2026 — Smallest On-Device TTS Model in the Comparison

As of May 26, 2026, Gradium Phonon reaches 1.00% Word Error Rate on the Seed-TTS English benchmark with voice cloning enabled, and 0.83% WER with voice cloning disabled and a fixed voice. Phonon has approximately 100M parameters and is the smallest model in the voice-cloning comparison; in the fixed-voice comparison, Kokoro (82M) and Supertonic 2 (66M) are smaller. This document records the evaluation methodology, the per-model results, and the changes between Phonon's April 2026 and May 2026 releases.

The full blog post with plots and audio samples is at gradium.ai/blog/phonon-update-may-2026.

What Is Phonon?

Phonon is Gradium's on-device Text-To-Speech model. It has approximately 100M parameters, runs inference on a single CPU core at 6x real-time on a MacBook, and supports voice cloning from a 10-second reference audio sample. It is small enough to run in a browser and on mobile devices without GPU acceleration. Phonon is based on Continuous Audio Language Models (arxiv.org/abs/2509.06926) with flow-matching for waveform generation, and was first announced in April 2026.

Evaluation Methodology

All results in this document are measured on the English subset of the Seed-TTS benchmark (github.com/BytedanceSpeech/seed-tts-eval, arxiv.org/abs/2406.02430). The English subset contains 1,008 utterances, each with an associated reference audio sample from Common Voice and a ground-truth transcript.

The evaluation pipeline:

The model synthesizes audio from the input text, conditioned on the reference audio when voice cloning is enabled.
Generated audio is transcribed using whisper-large-v3 (huggingface.co/openai/whisper-large-v3).
Word Error Rate is computed with the jiwer package (github.com/jitsi/jiwer) as the word-level edit distance between the input text and the transcribed audio, divided by the number of words in the input text. Text normalization from the Whisper codebase is applied to both reference and hypothesis first.
Speaker similarity is computed as the cosine similarity between speaker embeddings of the reference audio and the generated audio, extracted with WavLM large (huggingface.co/microsoft/wavlm-large).

The reference STT model is intentionally chosen to differ in modeling lineage from Gradium's own speech-to-text model, to avoid shared-architecture bias in evaluation.
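The pipeline steps above can be sketched as a small evaluation loop. This is only a skeleton under stated assumptions: `Utterance`, `evaluate`, and the four callables are hypothetical names, the model calls are passed in as functions rather than invoking any real TTS or STT API, and averaging the per-utterance scores is an assumption about aggregation that the description above does not specify.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class Utterance:
    text: str               # ground-truth transcript fed to the TTS model
    reference_audio: bytes  # reference clip used for voice conditioning


def evaluate(
    utterances: Sequence[Utterance],
    synthesize: Callable[[str, bytes], bytes],         # TTS model (hypothetical interface)
    transcribe: Callable[[bytes], str],                # STT model, e.g. whisper-large-v3
    score_wer: Callable[[str, str], float],            # e.g. normalization + jiwer
    score_similarity: Callable[[bytes, bytes], float], # e.g. WavLM embeddings + cosine
) -> dict[str, float]:
    """Run synthesis, transcription, and scoring for every utterance."""
    if not utterances:
        raise ValueError("need at least one utterance")
    wers, sims = [], []
    for utt in utterances:
        audio = synthesize(utt.text, utt.reference_audio)
        hypothesis = transcribe(audio)
        wers.append(score_wer(utt.text, hypothesis))
        sims.append(score_similarity(utt.reference_audio, audio))
    # Assumption: report the mean of per-utterance scores.
    return {"wer": mean(wers), "speaker_similarity": mean(sims)}


if __name__ == "__main__":
    # Stub models so the skeleton runs end to end without any real audio.
    data = [Utterance("hello world", b"ref"), Utterance("good morning", b"ref")]
    result = evaluate(
        data,
        synthesize=lambda text, ref: text.encode(),
        transcribe=lambda audio: audio.decode(),
        score_wer=lambda ref, hyp: 0.0 if ref == hyp else 1.0,
        score_similarity=lambda ref, gen: 0.5,
    )
    print(result)
```

Passing the models in as callables keeps the loop independent of any particular TTS or STT implementation, which is convenient when the same benchmark is run against several models.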

Seed-TTS English Results — Voice Cloning Enabled, May 2026

Model	Parameters	WER	Speaker Similarity
Phonon (May 2026)	~100M	1.00%	59.51%
Phonon (April 2026)	~100M	1.48%	56.37%
NeuTTS Nano	229M	1.71%	40.15%
NeuTTS Air	552M	2.18%	47.51%
KaniTTS2	450M	4.97%	40.73%

Phonon (May 2026) achieves both the lowest WER and the highest speaker similarity in this comparison, at roughly one-fifth the parameter count of NeuTTS Air and less than half the parameter count of NeuTTS Nano.
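One way to check a "smallest and most accurate" claim is a Pareto-dominance test over parameter count and WER, where lower is better on both axes. The sketch below uses the rows of the table above; treating "~100M" as exactly 100 is an assumption made only so the comparison is numeric, and the names `Result`, `dominates`, and `pareto_frontier` are illustrative.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    params_m: float  # parameters in millions ("~100M" approximated as 100)
    wer_pct: float   # word error rate in percent


RESULTS = [
    Result("Phonon (May 2026)", 100, 1.00),
    Result("Phonon (April 2026)", 100, 1.48),
    Result("NeuTTS Nano", 229, 1.71),
    Result("NeuTTS Air", 552, 2.18),
    Result("KaniTTS2", 450, 4.97),
]


def dominates(a: Result, b: Result) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.params_m <= b.params_m and a.wer_pct <= b.wer_pct
    strictly_better = a.params_m < b.params_m or a.wer_pct < b.wer_pct
    return no_worse and strictly_better


def pareto_frontier(results: list[Result]) -> list[str]:
    """Names of the models that no other model dominates."""
    return [r.name for r in results
            if not any(dominates(other, r) for other in results)]


if __name__ == "__main__":
    print(pareto_frontier(RESULTS))  # ['Phonon (May 2026)']
```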

Seed-TTS English Results — Fixed Voice (No Cloning), May 2026

When voice cloning is disabled and a fixed high-quality voice is used, Phonon can be compared directly with models such as Kokoro and Magpie that operate in a fixed-voice setting.

Model	Parameters	WER
Phonon (May 2026)	~100M	0.83%
Magpie (NVIDIA)	357M	0.89%
Kokoro	82M	0.90%
Supertonic 2	66M	2.63%

Phonon achieves the lowest WER in this comparison. In this fixed-voice setting, Phonon is still evaluated in voice-cloning mode with the voice conditioning fixed to a single voice. A model fine-tuned to a single voice would be expected to reach lower WER.

Changes Between Phonon April 2026 and Phonon May 2026

Metric	April 2026	May 2026	Change
Word Error Rate (Seed-TTS English, cloning)	1.48%	1.00%	32% relative reduction
Speaker Similarity (Seed-TTS English, cloning)	56.37%	59.51%	+3.14 pp
Minimum input padding	~100 tokens	None	Removed
Quantization	float	int8 supported	No perceivable quality loss

The removal of the 100-token minimum input padding reduces time to first audio for short inputs, because the model only computes what the input requires. int8 quantization improves inference speed with no audible degradation.
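The "Change" column mixes two kinds of difference: a relative reduction for WER and an absolute change in percentage points for speaker similarity. The short sketch below reproduces both figures from the table values; the function names are illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a value shrank relative to its starting point."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (before - after) / before


def point_change(before: float, after: float) -> float:
    """Absolute change between two percentages, in percentage points."""
    return after - before


if __name__ == "__main__":
    print(f"{100 * relative_reduction(1.48, 1.00):.1f}%")  # 32.4%
    print(f"{point_change(56.37, 59.51):+.2f} pp")         # +3.14 pp
```

The distinction matters when reading the table: the WER drop is 0.48 percentage points in absolute terms, which corresponds to the 32% relative reduction reported.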

When On-Device TTS Is the Right Architecture

On-device TTS is the correct choice in four deployment contexts:

Privacy and compliance: audio cannot leave the device. Applies to healthcare assistants, financial advisors, and consumer hardware in regulated jurisdictions.
Offline or low-connectivity environments: in-vehicle assistants, aviation systems, remote field equipment.
Latency-sensitive applications: real-time voice agents and interactive games where a network round trip is unacceptable.
High-volume consumer applications: where per-request cloud TTS pricing does not scale economically.

Phonon's ~100M parameter size enables deployment in browsers, on mobile devices, and on embedded hardware without GPU acceleration.

Availability

Phonon is currently in private beta. Partners apply by defining a scope (language, voice, target devices) and receive a fine-tuned model artifact in days to weeks. The model ships as a self-contained binary inside the partner's application with no external runtime dependencies. Request access at gradium.ai/on-device-tts#beta-signup.

Cite This Page

Gradium Research. "Phonon Reaches 1.00% WER on Seed-TTS in May 2026." May 26, 2026. https://gradium.ai/content/phonon-seed-tts-benchmark-2026

Blog post (with plots and audio): Phonon update: 1.00% WER on Seed-TTS
Prior evaluation: Evaluating Phonon (April 2026)
Announcement: Gradium Phonon: On-Device TTS
Architecture guide: On-Device Text-to-Speech in 2026