On-Device TTS Benchmark 2026: Phonon vs Kani-TTS2 vs NeuTTS on Seed-TTS

On-device Text-To-Speech evaluation is harder than cloud TTS evaluation. Cloud models are benchmarked on shared infrastructure with consistent compute. On-device models run on hardware that varies by deployment, and the parameter count directly constrains where the model can run at all. A 550M parameter model may benchmark well in a lab environment and be impractical on most mobile devices in production.

This article covers Gradium's published benchmark evaluation of Phonon, its on-device TTS model, against three competing on-device models: Kani-TTS2, NeuTTS Air, and NeuTTS Nano. The evaluation uses the Seed-TTS English test dataset, two standardized quality metrics, and independent tooling to avoid evaluation bias.

Why Are On-Device TTS Benchmarks Different?

For cloud TTS, the relevant production benchmarks measure latency (time to first audio, or TTFA, reported as the median, P50, together with its interquartile range, IQR) and pronunciation accuracy (word error rate, WER) under real network and compute conditions. Coval's production benchmark, for example, measures these across hundreds to over a thousand runs per model under conditions that reflect actual deployment.

For on-device TTS, latency is less meaningful as a standalone metric because it scales with the hardware it runs on. The more critical variables are:

  • WER (Word Error Rate): Does the model accurately reproduce the input text in speech? A mispronounced word in an NPC line, a navigation instruction, or an accessibility app is a production failure.
  • Speaker Similarity: Does the model accurately reproduce the target speaker's voice? Voice cloning is the primary reason to deploy a scoped on-device TTS model: if the output sounds like a generic voice instead of the target speaker, the core value proposition fails.
  • Parameter count: A model with excellent benchmark scores that requires 500M+ parameters is not deployable on most mobile devices. Parameter count determines where the model can actually run, which makes it as important as quality scores in the on-device context.
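To make the parameter-count constraint concrete, the sketch below estimates how much memory the weights alone occupy at a given numeric precision. The precisions listed (fp32, fp16, int8) are assumptions for illustration; the article does not state which precision any of the benchmarked models ship in, and real memory use is higher once activations and the audio codec are included.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: only the weights are counted; activations, caches, and the
# audio codec are ignored, so actual usage will be higher.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_mb(num_params: int, precision: str = "fp16") -> float:
    """Return approximate weight storage in megabytes (10^6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


if __name__ == "__main__":
    for label, params in [("100M", 100_000_000), ("500M", 500_000_000)]:
        sizes = {p: weight_memory_mb(params, p) for p in BYTES_PER_PARAM}
        print(label, sizes)
```

Under these assumptions, a fivefold difference in parameter count is a fivefold difference in the memory the model needs before it produces any audio.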

For broader context on the architecture choice, see On-Device Text-To-Speech in 2026: When Edge TTS Is the Right Architecture.

What Is the Benchmark Setup?

Dataset: Seed-TTS English

The evaluation uses the English portion of the Seed-TTS test dataset, published alongside the Seed-TTS research paper. The English test set contains 1,008 utterances, most of them short (a few words per utterance). Each utterance includes an associated reference audio from the Common Voice dataset and the corresponding transcription.

Gradium uses only the reference audio for voice cloning (not the transcription), which reflects a real-world deployment constraint: Phonon does not require reference audio transcription. Where the reference audio is shorter than 10 seconds (the current Phonon training constraint), it is extended by looping.
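The looping step can be sketched as follows. This is a minimal illustration, not Gradium's implementation: the function name is hypothetical, audio is represented as a plain list of samples, and truncating references that are already longer than 10 seconds is an assumption (the article only describes what happens to shorter clips).

```python
def loop_to_length(samples: list[float], sample_rate: int,
                   target_seconds: float = 10.0) -> list[float]:
    """Repeat a short reference clip until it reaches the target duration."""
    if not samples:
        raise ValueError("reference audio is empty")
    target_len = int(round(sample_rate * target_seconds))
    if len(samples) >= target_len:
        # Assumption: longer clips are cut to the target length.
        return samples[:target_len]
    repeats = -(-target_len // len(samples))  # ceiling division
    return (samples * repeats)[:target_len]


if __name__ == "__main__":
    clip = [0.0, 0.1, -0.1] * 16_000  # 3 seconds of dummy audio at 16 kHz
    looped = loop_to_length(clip, sample_rate=16_000)
    print(len(looped) / 16_000, "seconds")
```

Looping introduces a discontinuity wherever the clip restarts, which is one reason the limitation is worth keeping in mind when reading the speaker similarity scores.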

Metric 1: Word Error Rate (WER)

Generated speech for each utterance is transcribed back to text using Whisper large v3 (Hugging Face). The transcription is compared to the original input text using edit distance via the jiwer Python package.

A text normalizer is applied to both the input text and the transcription before comparison. This handles equivalent representations (for example, "zero" vs. "0") so that they are not counted as errors. The normalizer is from the Whisper codebase.

Gradium uses Whisper large v3 rather than its own STT model for evaluation. The reason: using its own STT would introduce bias from shared modeling techniques between Gradium's TTS and STT systems.

Lower WER = better pronunciation accuracy.
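The WER computation described above reduces to a word-level edit distance divided by the number of reference words. The sketch below is dependency-free so the arithmetic is visible; the benchmark itself uses jiwer for this step. The `normalize` function is a simplified stand-in (an assumption), not the Whisper normalizer, and pooling errors across all utterances before dividing is also an assumption about how the corpus-level figure is aggregated.

```python
import re


def normalize(text: str) -> list[str]:
    """Stand-in normalizer: lowercase and strip punctuation.

    The Whisper normalizer used in the benchmark does more, for example
    treating "zero" and "0" as equivalent.
    """
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """WER over (input_text, transcription) pairs, pooled across utterances."""
    errors = 0
    total_words = 0
    for input_text, transcription in pairs:
        ref = normalize(input_text)
        errors += word_errors(ref, normalize(transcription))
        total_words += len(ref)
    if total_words == 0:
        raise ValueError("no reference words to score")
    return errors / total_words


if __name__ == "__main__":
    pairs = [("Turn left at the station.", "turn left at the station"),
             ("The cat sat.", "the cat sat down")]
    print(f"WER: {corpus_wer(pairs):.2%}")
```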

Metric 2: Speaker Similarity

Speaker embeddings are extracted from both the reference audio and the generated audio using WavLM large (Hugging Face). Speaker similarity is the cosine similarity between the two embedding vectors.

Higher cosine similarity indicates that the generated speech is closer to the reference speaker's voice in the embedding space, meaning the model more accurately reproduces the target speaker's tone, accent, and cadence.

Higher speaker similarity = better voice cloning fidelity.
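For reference, cosine similarity between two embedding vectors can be computed as below. The embeddings here are placeholder lists; in the benchmark they come from WavLM large. Reporting the score as a percentage (similarity times 100, averaged over utterances) is an assumption about how the published figures are scaled.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    reference_embedding = [0.2, 0.7, 0.1]   # placeholder values
    generated_embedding = [0.25, 0.6, 0.2]  # placeholder values
    score = cosine_similarity(reference_embedding, generated_embedding)
    print(f"speaker similarity: {score * 100:.2f}%")
```

Cosine distance is one minus this value, so a higher similarity always corresponds to a lower distance.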

Which Models Are Compared?

Gradium Phonon (20260318)

Phonon is Gradium's on-device TTS model. It is based on continuous audio language models and packages high-fidelity TTS into approximately 100M parameters. It runs at 6x real-time on a single CPU core and is small enough to run in a browser. Voice cloning is supported from a 10-second reference audio sample, with no requirement for a reference audio transcription.

Kani-TTS2

Kani-TTS2 combines an LFM (Liquid Foundation Model) transformer backbone with a Finite Scalar Quantization audio codec. Parameter count: 450M.

NeuTTS Air

NeuTTS Air combines a transformer backbone with the proprietary NeuCodec codec. For voice cloning, NeuTTS Air requires a transcription of the reference audio in addition to the audio itself. Parameter count: 552M.

NeuTTS Nano

NeuTTS Nano is the smaller version of the NeuTTS model family, also using a transformer backbone and NeuCodec. Like NeuTTS Air, it requires a reference audio transcription for voice cloning. Parameter count: 229M.

What Are the Benchmark Results?

Model | Parameters | WER (lower is better) | Speaker Similarity (higher is better)
Phonon (Gradium) | ~100M | 1.48% | 56.37%
Kani-TTS2 | 450M | 4.97% | 40.73%
NeuTTS Air | 552M | 2.18% | 47.51%
NeuTTS Nano | 229M | 1.71% | 40.15%

Source: Gradium evaluation, Seed-TTS English benchmark, 1,008 utterances. Published April 2026. WER: Whisper large v3 + jiwer. Speaker Similarity: WavLM large cosine similarity.

How Should You Read the Results?

WER Analysis

Phonon records 1.48% WER, the lowest of all four models. The gap between Phonon and the next-lowest (NeuTTS Nano at 1.71%) is 0.23 percentage points. The gap between Phonon and the highest (Kani-TTS2 at 4.97%) is 3.49 percentage points, representing more than 3x the error rate.

What these numbers mean in practice: on a 20-word utterance, a 1.48% WER corresponds to roughly 0.3 word errors (substitutions, deletions, or insertions in the Whisper transcript), or about one error every three to four utterances. At 4.97% WER, that rises to roughly 1 word error per utterance. The 20-word length is illustrative; most Seed-TTS utterances are shorter. For use cases where accurate pronunciation matters (accessibility tools, navigation, educational content), this difference is audible and consequential.
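The per-utterance arithmetic above is a single multiplication, shown here so other utterance lengths can be substituted. The 20-word length is the illustrative value from the text; the function name is hypothetical.

```python
def expected_word_errors(wer_percent: float, words_per_utterance: int) -> float:
    """Average number of word errors per utterance at a given WER."""
    return wer_percent / 100 * words_per_utterance


if __name__ == "__main__":
    for model, wer in [("Phonon", 1.48), ("Kani-TTS2", 4.97)]:
        errors = expected_word_errors(wer, words_per_utterance=20)
        print(f"{model}: {errors:.2f} expected word errors per 20-word utterance")
    print(f"error-rate ratio: {4.97 / 1.48:.2f}x")
```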

Speaker Similarity Analysis

Phonon records 56.37% speaker similarity, the highest of all four models. The nearest competitor, NeuTTS Air, records 47.51%, a gap of 8.86 percentage points. NeuTTS Nano records 40.15%, and Kani-TTS2 records 40.73%.

The speaker similarity gap is more pronounced than the WER gap. Phonon's advantage over the second-ranked model (NeuTTS Air) is 8.86 points. Its advantage over the third- and fourth-ranked models is 15.64 points (Kani-TTS2) and 16.22 points (NeuTTS Nano).

For on-device TTS use cases where voice identity matters (branded app voices, NPC character voices, personal voice preservation in accessibility tools), a 15-point speaker similarity gap is a measurable difference in how recognizable the target voice is in the generated output.

Parameter Efficiency

Phonon achieves the best scores on both metrics while being the smallest model in the comparison. At approximately 100M parameters, it is:

  • 2.3x smaller than NeuTTS Nano (229M)
  • 4.5x smaller than Kani-TTS2 (450M)
  • 5.5x smaller than NeuTTS Air (552M)
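The ratios in the list follow directly from the parameter counts. Because Phonon's size is given only as "approximately 100M", the value of 100 used below is an approximation and the ratios inherit that uncertainty.

```python
# Parameter counts in millions, as listed in the comparison.
# Phonon's count is approximate ("~100M" in the source).
PARAMS_M = {"Phonon": 100, "NeuTTS Nano": 229, "Kani-TTS2": 450, "NeuTTS Air": 552}


def size_ratio(other_m: float, baseline_m: float = PARAMS_M["Phonon"]) -> float:
    """How many times larger `other_m` is than the baseline model."""
    return other_m / baseline_m


if __name__ == "__main__":
    for name, params in PARAMS_M.items():
        if name != "Phonon":
            print(f"{name}: {size_ratio(params):.1f}x the size of Phonon")
```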

This is the most practically relevant finding for on-device deployment. The question is not which model scores best in a lab; it is which model scores best on the hardware it actually runs on in production. A 552M parameter model has limited deployment viability on mid-range mobile hardware. Phonon, at approximately 100M parameters, runs at 6x real-time on a single CPU core, fits in mobile memory, and is small enough to run in a browser.

What Are the Limitations of This Benchmark?

Self-Published Evaluation

This benchmark was published by Gradium. It has not been independently replicated by a third party as of the date of publication. The methodology is documented and the tooling (Whisper large v3, WavLM large, jiwer) is standard and reproducible, but the results should be treated as vendor-published data pending external replication.

Looping Methodology for Short Reference Audio

The Seed-TTS reference audio segments are often shorter than 10 seconds. Phonon's current training requires 10-second reference segments. Gradium addressed this by looping short audio to the required length. This is a stated limitation, and speaker similarity scores are expected to improve when Phonon adds support for variable-length reference audio.

English-Only Evaluation

This evaluation covers the English portion of the Seed-TTS dataset. Gradium states that Phonon models have been developed for all supported languages, but benchmark data for non-English languages has not been published as of April 2026.

Seed-TTS WER vs Production TTS WER

The WER in this benchmark (Seed-TTS, Whisper large v3, short utterances) is not directly comparable to WER from production TTS benchmarks such as Coval's evaluation of cloud TTS models, which relies on continuous production measurements. Gradium's cloud API and Gradium Phonon are also different models built for different deployment contexts (cloud GPU inference vs on-device CPU inference). Gradium's cloud API records 3.3% WER in the Coval production benchmark; Phonon records 1.48% WER on Seed-TTS. These figures come from different models, datasets, and methodologies, and should be interpreted independently.

What Do These Results Mean for Developers?

For developers choosing an on-device TTS model for a production deployment, the benchmark provides three actionable conclusions.

Parameter count is a deployment constraint, not just a quality variable. Models at 450M+ parameters have limited viability on most mobile devices. Filtering by deployable parameter count before comparing quality scores is the correct evaluation order.

Voice cloning fidelity (speaker similarity) does not track parameter count. NeuTTS Nano (229M) and Phonon (~100M) are the two smallest models in the comparison, yet their speaker similarity scores differ by 16.22 points (40.15% vs 56.37%), while Kani-TTS2 at 450M scores about the same as NeuTTS Nano (40.73% vs 40.15%). If voice identity is a product requirement, parameter-count filtering alone is insufficient.

Reference audio transcription requirement affects deployment complexity. NeuTTS Air and NeuTTS Nano require a transcription of the reference audio for voice cloning. Phonon does not. For consumer applications where the reference audio is user-provided and transcription is not available, models requiring transcription add integration complexity or fail entirely.
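The evaluation order from the first conclusion (filter by deployable size, then compare quality) can be sketched as below. The scores are the published table values; the 250M budget is a hypothetical example, not a threshold from the benchmark, and ranking by WER first with speaker similarity as a tie-breaker is one possible choice among several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    params_m: int           # parameters in millions (Phonon is approximate)
    wer_pct: float          # lower is better
    speaker_sim_pct: float  # higher is better


RESULTS = [
    ModelResult("Phonon", 100, 1.48, 56.37),
    ModelResult("Kani-TTS2", 450, 4.97, 40.73),
    ModelResult("NeuTTS Air", 552, 2.18, 47.51),
    ModelResult("NeuTTS Nano", 229, 1.71, 40.15),
]


def shortlist(results: list[ModelResult], max_params_m: int) -> list[ModelResult]:
    """Drop models over the size budget, then rank the rest by quality."""
    deployable = [r for r in results if r.params_m <= max_params_m]
    return sorted(deployable, key=lambda r: (r.wer_pct, -r.speaker_sim_pct))


if __name__ == "__main__":
    # Hypothetical budget: the target device can host at most 250M parameters.
    for result in shortlist(RESULTS, max_params_m=250):
        print(result)
```

A team for whom voice identity matters more than pronunciation accuracy would swap the sort key so speaker similarity comes first.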

How Should You Choose an On-Device TTS Model?

The Seed-TTS evaluation of four on-device TTS models shows that parameter count and benchmark quality do not move together in a predictable direction. Gradium Phonon at approximately 100M parameters records the lowest WER (1.48%) and highest speaker similarity (56.37%) of the four models, outperforming models roughly 2.3x to 5.5x its size on both metrics.

For developers evaluating on-device TTS for production deployment, the relevant findings are: Phonon's parameter count makes it the most deployable option across mobile and browser environments, its speaker similarity advantage over the next-best model (NeuTTS Air, 47.51%) is significant for voice identity use cases, and its voice cloning approach does not require reference audio transcription, which simplifies integration in consumer contexts.

The benchmark methodology is documented and uses standard publicly available tooling. The results are vendor-published and have not been independently replicated as of April 2026. For the product context behind Phonon, see Gradium Phonon: On-Device TTS for Mobile Apps, NPCs, and Offline Products.

Frequently Asked Questions