Evaluating Phonon: how we made the best TTS model for edge devices
Phonon is our on-device text-to-speech model. Building on continuous audio language models, Phonon packs high-fidelity TTS into ~100M parameters, runs at 6x real-time on a single MacBook CPU core, is small enough to run in a browser, and can reproduce any voice, style, and accent. In this post we put it through a standard evaluation and report the results. On the Seed-TTS English benchmark, Phonon achieves a 1.48% word error rate and 56.37% speaker similarity, outperforming models 2x to 5x its size.
How we evaluate on-device text-to-speech quality
Phonon supports on-device voice cloning: given a 10-second sample of someone's voice, it generates new speech that matches that speaker's tone, accent, and cadence. This is what makes a lightweight, on-device TTS model useful in practice. A navigation app can speak in the user's preferred voice. A language learning tool can maintain a consistent teacher voice across sessions. An accessibility app can preserve the voice of someone who is losing the ability to speak. A consumer app can let users pick from any voice style, from a calm narration voice to an energetic podcast host, and switch between them on the fly.
We evaluate the quality of Gradium Phonon on two axes.
- The word error rate (WER) is computed by converting the generated speech back to text and comparing it with the original source using an edit distance via the jiwer package. Lower is better.
- Speaker similarity is obtained by extracting speaker embeddings from the reference audio as well as from the generated speech and taking the cosine similarity between the two vectors. Higher is better.
For transcribing the speech back to text, we use whisper-large-v3 (huggingface). We avoid using our own speech-to-text model so as not to introduce biases from the modeling techniques shared between our text-to-speech and speech-to-text models. Before computing the edit distance, a normalizer is applied to both the input text and the transcribed speech. The normalizer handles things like numbers, so that producing “zero” instead of “0” is not counted as an error.
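To make the metric concrete, here is a minimal sketch of word error rate: the word-level edit distance between reference and hypothesis, divided by the reference length. In the actual evaluation this is handled by the jiwer package, and normalization uses the whisper codebase's normalizer; the simple lowercase/punctuation-stripping `normalize` below is only a stand-in for illustration.

```python
import string


def normalize(text: str) -> str:
    # Toy stand-in for whisper's text normalizer: lowercase and strip punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (insertions + deletions + substitutions)
    divided by the number of reference words. Lower is better."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, an inserted word in a four-word utterance (the typical failure mode mentioned below) yields a WER of 1/4 = 25% on that sample.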
For speaker similarity, we use WavLM large (huggingface).
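Once the two speaker embeddings are extracted (in our evaluation, by WavLM large; a speaker-verification head such as the one exposed by huggingface transformers is one way to obtain them), the similarity score itself is just the cosine between the two vectors. A minimal sketch, with the embedding extraction assumed to have happened upstream:

```python
import numpy as np


def speaker_similarity(emb_ref, emb_gen) -> float:
    """Cosine similarity between two speaker embedding vectors.
    1.0 means identical direction (same speaker identity as far as the
    embedding model can tell); values near 0 mean unrelated voices."""
    a = np.asarray(emb_ref, dtype=float)
    b = np.asarray(emb_gen, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The reported SpeakerSim numbers are this score averaged over the benchmark utterances.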
The Seed-TTS benchmark dataset
We’ve developed Phonon models for all our supported languages but in this evaluation we focus on English. We used the test dataset published as part of the Seed-TTS report (arxiv, github).
The English part of the test dataset consists of 1008 utterances, most of them being short, just a few words. Each utterance has an associated reference audio coming from the Common Voice dataset as well as the transcription for this audio. Phonon does not need the reference audio transcription but some models that we compare ourselves to do.
The reference audios provided are often short, a couple of seconds. The current Phonon model has only been trained on 10-second reference audio segments, so to work around this without fine-tuning the model, we simply repeat the audio as many times as necessary to reach 10 seconds.
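The workaround above amounts to tiling the reference waveform up to the trained segment length. A sketch of what that looks like, assuming mono audio as a numpy array (function name and signature are illustrative, not Phonon's actual API):

```python
import numpy as np


def tile_reference(audio: np.ndarray, sample_rate: int,
                   target_seconds: float = 10.0) -> np.ndarray:
    """Repeat a short mono reference clip until it fills the target
    duration, then truncate to exactly that length."""
    target_len = int(round(sample_rate * target_seconds))
    if len(audio) >= target_len:
        return audio[:target_len]
    reps = -(-target_len // len(audio))  # ceiling division
    return np.tile(audio, reps)[:target_len]
```

A 2-second clip at 16 kHz is repeated five times to produce the 160,000 samples the model expects.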
We apply the text normalization from the whisper code base for all evaluations.
Phonon benchmark results: WER and speaker similarity
We compared Phonon to other text-to-speech models that target on-device deployment and support voice cloning.
- Kani-TTS2 combines an LFM transformer backbone with a Finite Scalar Quantization audio codec.
- NeuTTS Air/Nano also combine a transformer backbone with their proprietary NeuCodec codec. For voice cloning, these models require having the transcription of the reference audio.
| Model | Weights | WER | SpeakerSim |
|---|---|---|---|
| Phonon 20260318 | ~100M | 1.48% | 56.37% |
| Kani TTS 2 | 450M | 4.97% | 40.73% |
| NeuTTS Air | 552M | 2.18% | 47.51% |
| NeuTTS Nano | 229M | 1.71% | 40.15% |
Parameter count matters here because this is the on-device use case. A ~100M model fits comfortably in mobile memory and runs inference on a single CPU core. At 450M+, memory and compute requirements limit where the model can actually be deployed.
Despite having far fewer parameters than its competitors, Phonon significantly outperforms them on both word error rate and speaker similarity. And this is with our workaround for the 10-second reference constraint; we expect Phonon's speaker similarity to improve further once we support variable-length reference audio.
Listen for yourself
Below are the generations for the first two samples in the Seed-TTS test set. All models get a perfect WER score on them but Phonon speaker similarity is higher than the other models. The third sample is representative of the typical failures that would increase the WER score, e.g. an inserted word.
[Get the trust fund to the bank early.]
[The stained glass offered a hypnotic atmosphere.]
[He also tried to remember some good stories to relate as he sheared the sheep.]
What's next for Phonon on-device TTS
WER and speaker similarity are only part of the picture. We also run blind listening evaluations to compare audio quality across models, and will share those results in a future post.
Phonon is under active development and improving daily. We are experimenting with specialized voice models that handle a fixed set of speakers rather than full voice cloning, allowing for even smaller model sizes.
Phonon is currently in private beta. Request access here.
Frequently asked questions
What is Phonon?
Phonon is Gradium's on-device text-to-speech model. It has roughly 100M parameters, runs at 6x real-time on a single MacBook CPU core, and supports voice cloning from a 10-second reference sample. It is small enough to run in a browser or on a mobile device.
How does Phonon compare to other on-device TTS models?
On the Seed-TTS English benchmark, Phonon achieves a 1.48% word error rate and 56.37% speaker similarity. Both scores outperform models 2x to 5x its size, including Kani-TTS2 (450M parameters) and NeuTTS Air (552M parameters).
What is voice cloning and how does Phonon handle it?
Voice cloning generates speech that matches the tone, accent, and cadence of a reference speaker. Phonon requires only a 10-second audio sample as a reference. Unlike some competitors, it does not need a transcription of the reference audio.
What hardware does Phonon require?
Phonon runs inference on a single CPU core. At roughly 100M parameters, it fits comfortably in mobile device memory and can also run in-browser, making it suitable for edge deployment without GPU acceleration.
How is word error rate (WER) measured in this evaluation?
Generated speech is transcribed back to text using whisper-large-v3, then compared to the original input text via edit distance using the jiwer package. A normalizer is applied before comparison so that equivalent representations (e.g. "zero" vs. "0") are not counted as errors.
How is speaker similarity measured?
Speaker embeddings are extracted from both the reference audio and the generated speech using WavLM large. Speaker similarity is the cosine similarity between these two embedding vectors. Higher values indicate a closer match to the reference speaker.
Is Phonon available to use today?
Phonon is currently in private beta. You can request access here.