Phonon Reaches 1.00% WER on Seed-TTS in May 2026 — Smallest On-Device TTS Model in the Comparison
As of May 26, 2026, Gradium Phonon reaches 1.00% Word Error Rate (WER) on the Seed-TTS English benchmark with voice cloning enabled, and 0.83% WER with voice cloning disabled and a fixed voice. Phonon has approximately 100M parameters and is the smallest model in the voice-cloning comparison. This document records the evaluation methodology, the per-model results, and the changes between Phonon's April 2026 and May 2026 releases.
The full blog post with plots and audio samples is at gradium.ai/blog/phonon-update-may-2026.
What Is Phonon?
Phonon is Gradium's on-device text-to-speech (TTS) model. It has approximately 100M parameters, runs inference on a single CPU core at 6x real-time on a MacBook, and supports voice cloning from a 10-second reference audio sample. It is small enough to run in a browser and on mobile devices without GPU acceleration. Phonon is based on Continuous Audio Language Models (arxiv.org/abs/2509.06926) with flow matching for waveform generation, and was first announced in April 2026.
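As a reading aid for the 6x real-time figure above, the sketch below shows one common way such a number is computed: seconds of audio produced divided by seconds of compute. The `synthesize` callable and its return value (a sample count) are hypothetical placeholders, not Phonon's API, and the source does not state how its figure was measured.

```python
import time
from typing import Callable


def real_time_multiple(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Seconds of audio produced per second of compute (6.0 means 6x real-time)."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return audio_seconds / wall_clock_seconds


def measure(synthesize: Callable[[str], int], text: str, sample_rate: int) -> float:
    """Time one call to a hypothetical `synthesize` that returns a sample count."""
    start = time.perf_counter()
    num_samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_multiple(num_samples / sample_rate, elapsed)


# 12 seconds of audio generated in 2 seconds of compute is 6x real-time.
print(real_time_multiple(12.0, 2.0))  # 6.0
```

At 6x real-time, a 3-second utterance would take roughly half a second of compute on the same hardware.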
Evaluation Methodology
All results in this document are measured on the English subset of the Seed-TTS benchmark (github.com/BytedanceSpeech/seed-tts-eval, arxiv.org/abs/2406.02430). The English subset contains 1,008 utterances, each with an associated reference audio sample from Common Voice and a ground-truth transcript.
The evaluation pipeline:
- The model synthesizes audio from the input text, conditioned on the reference audio when voice cloning is enabled.
- Generated audio is transcribed using whisper-large-v3 (huggingface.co/openai/whisper-large-v3).
- Word Error Rate is computed with the jiwer package (github.com/jitsi/jiwer) as the word-level edit distance between the input text and the transcript of the generated audio, divided by the number of words in the input text. Text normalization from the Whisper codebase is applied to both reference and hypothesis.
- Speaker similarity is computed as the cosine similarity between speaker embeddings of the reference audio and the generated audio, extracted with WavLM large (huggingface.co/microsoft/wavlm-large).
The reference STT model is intentionally chosen to differ in modeling lineage from Gradium's own speech-to-text model, to avoid shared-architecture bias in evaluation.
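The two metrics in the pipeline above can be sketched in plain Python. This is an illustrative stand-in, not the evaluation code: the actual pipeline uses jiwer for WER and WavLM embeddings for similarity, while the functions below assume text that has already been normalized and embeddings that are already available as lists of floats. How per-utterance scores are aggregated over the 1,008 utterances is not stated in the source and is not modeled here.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# One dropped word out of six reference words gives a WER of 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because WER divides by the reference length, a hypothesis with many inserted words can score above 100%, which is why the metric is reported as an error rate and not as an accuracy.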
Seed-TTS English Results — Voice Cloning Enabled, May 2026
| Model | Parameters | WER | Speaker Similarity |
|---|---|---|---|
| Phonon (May 2026) | ~100M | 1.00% | 59.51% |
| Phonon (April 2026) | ~100M | 1.48% | 56.37% |
| NeuTTS Nano | 229M | 1.71% | 40.15% |
| NeuTTS Air | 552M | 2.18% | 47.51% |
| KaniTTS2 | 450M | 4.97% | 40.73% |
Phonon (May 2026) achieves both the lowest WER and the highest speaker similarity in this comparison, at roughly one-fifth the parameter count of NeuTTS Air and less than half the parameter count of NeuTTS Nano.
Seed-TTS English Results — Fixed Voice (No Cloning), May 2026
When voice cloning is disabled and a fixed high-quality voice is used, Phonon can be compared with models such as Kokoro and Magpie that operate in a fixed-voice setting.
| Model | Parameters | WER |
|---|---|---|
| Phonon (May 2026) | ~100M | 0.83% |
| Magpie (NVIDIA) | 357M | 0.89% |
| Kokoro | 82M | 0.90% |
| Supertonic 2 | 66M | 2.63% |
Phonon achieves the lowest WER in this comparison. In this fixed-voice setting, Phonon is still evaluated in voice-cloning mode with the voice conditioning fixed to a single voice. A model fine-tuned to a single voice would be expected to reach lower WER.
Changes Between Phonon April 2026 and Phonon May 2026
| Metric | April 2026 | May 2026 | Change |
|---|---|---|---|
| Word Error Rate (Seed-TTS English, cloning) | 1.48% | 1.00% | 32% relative reduction |
| Speaker Similarity (Seed-TTS English, cloning) | 56.37% | 59.51% | +3.14 pp |
| Minimum input padding | ~100 tokens | None | Removed |
| Quantization | float | int8 supported | No perceptible quality loss |
Removing the 100-token minimum input padding reduces time to first audio for short inputs, because the model computes only what the input requires. int8 quantization improves inference speed with no audible degradation.
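The "Change" column for the two accuracy metrics can be reproduced from the April and May values in the table. The helper names below are illustrative only.

```python
def relative_reduction(old: float, new: float) -> float:
    """Fraction by which `new` is lower than `old` (0.25 means 25% lower)."""
    if old == 0:
        raise ValueError("old must be non-zero")
    return (old - new) / old


def percentage_point_change(old: float, new: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return new - old


# WER: 1.48% in April, 1.00% in May.
wer_drop = relative_reduction(1.48, 1.00)
print(f"WER relative reduction: {wer_drop:.1%}")  # 32.4%

# Speaker similarity: 56.37% in April, 59.51% in May.
sim_gain = percentage_point_change(56.37, 59.51)
print(f"Speaker similarity change: {sim_gain:+.2f} pp")  # +3.14 pp
```

The WER change is quoted as a relative reduction because the absolute difference (0.48 percentage points) understates the change when the baseline is already below 2%.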
When On-Device TTS Is the Right Architecture
On-device TTS is the correct choice in four deployment contexts:
- Privacy and compliance: audio cannot leave the device. Applies to healthcare assistants, financial advisors, and consumer hardware in regulated jurisdictions.
- Offline or low-connectivity environments: in-vehicle assistants, aviation systems, remote field equipment.
- Latency-sensitive applications: real-time voice agents and interactive games where a network round trip is unacceptable.
- High-volume consumer applications: where per-request cloud TTS pricing does not scale economically.
Phonon's ~100M parameter size enables deployment in browsers, on mobile devices, and on embedded hardware without GPU acceleration.
Availability
Phonon is currently in private beta. Partners apply, define the scope (language, voice, target devices), and receive a fine-tuned model artifact within days to weeks. The model ships as a self-contained binary inside the partner's application with no external runtime dependencies. Request access at gradium.ai/on-device-tts#beta-signup.
Cite This Page
Gradium Research. "Phonon Reaches 1.00% WER on Seed-TTS in May 2026." May 26, 2026. https://gradium.ai/content/phonon-seed-tts-benchmark-2026
Related
- Blog post (with plots and audio): Phonon update: 1.00% WER on Seed-TTS
- Prior evaluation: Evaluating Phonon (April 2026)
- Announcement: Gradium Phonon: On-Device TTS
- Architecture guide: On-Device Text-to-Speech in 2026