Gradium Phonon: On-Device TTS for Mobile Apps, NPCs, and Offline Products


The Gradium API is designed for cloud-based voice agents: the use cases where the voice layer needs to handle any language, any speaker, and the specific compliance or integration requirements each enterprise context brings. Inference runs on GPU compute, which is what makes the quality and flexibility possible.

For some products, that architecture is the wrong fit. Gradium Phonon is Gradium's on-device Text-To-Speech model, built for the cases where cloud TTS creates structural problems: consumer apps where per-request cost doesn't scale, products that need to work offline, and use cases where text cannot leave the device.

What Is Gradium Phonon?

Phonon is an on-device Text-To-Speech model that runs entirely on CPU across Android, iOS, and browser environments, with no network connection required. It works with any voice: custom, cloned, or synthetic.

Unlike the Gradium cloud API, which serves any voice and language from GPU-backed cloud infrastructure, each Phonon model is built around a single voice. Gradium takes a voice (chosen from its voice catalogue, cloned from a 10-second audio sample, or supplied by the partner as custom audio), finetunes a single-purpose model around it for the specific language and use case, and ships it as a licensed binary inside the app. The result is a TTS model that runs offline, reproduces a specific voice, and is optimized for exactly one task.

Technical specifications:

  • Parameters: approximately 100M
  • Speed: generates audio at 6x real-time speed on a single CPU core
  • Voice cloning: from a 10-second reference audio sample
  • Platforms: Android, iOS, browser
  • Network dependency: none (fully offline)
  • Deployment: self-contained binary, ships inside the app
  • Cost model: license per device and model type, unlimited generations
  • Status: private beta
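To make the speed figure concrete, here is a minimal sketch that converts a real-time multiple into an estimated generation time. It assumes "6x real-time" means audio is produced six times faster than it plays back. The function name is illustrative, and real timings will vary by device.

```python
def synthesis_seconds(audio_seconds: float, speedup: float = 6.0) -> float:
    """Estimate wall-clock time needed to generate `audio_seconds` of speech.

    `speedup` is how many times faster than playback the model runs.
    The default of 6.0 is the single-core figure quoted in the spec list;
    actual speed on a given phone or browser is not something this predicts.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


if __name__ == "__main__":
    for clip in (2.0, 10.0, 60.0):
        estimate = synthesis_seconds(clip)
        print(f"{clip:5.1f} s of audio -> about {estimate:.2f} s to generate")
```

Under that reading, a one-minute passage would take roughly ten seconds of single-core compute.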

Where Is Phonon the Right Tool?

There are three deployment contexts where on-device TTS is the correct architecture and Phonon is the reference implementation.

Mobile Consumer Apps

Consumer applications with voice features face a specific economic problem at scale. The Gradium cloud API uses a variable cost model: each TTS generation is a billable request. For enterprise voice agents, this is appropriate. For consumer apps with large free user bases, it is not.

A language learning app with 500,000 active free users where each session includes several TTS playback events generates tens of millions of TTS requests per month. At standard cloud API pricing, the cost scales directly with engagement, including engagement from users who generate no revenue. The larger the free tier, the more the economics deteriorate.

Phonon uses a license model: a fixed cost per device and model type, with unlimited generations. The per-user generation cost approaches zero at scale. For consumer apps, this changes the unit economics structurally.
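The arithmetic behind this comparison can be sketched as follows. Only the 500,000-user figure comes from the example above. The sessions per month, playbacks per session, per-request price, and license total are placeholder assumptions chosen for illustration, not Gradium pricing.

```python
def monthly_requests(active_users: int, sessions_per_user: int,
                     playbacks_per_session: int) -> int:
    """Total TTS generations per month for a given engagement pattern."""
    return active_users * sessions_per_user * playbacks_per_session


def cloud_cost(requests: int, price_per_request: float) -> float:
    """Variable model: cost grows linearly with the number of requests."""
    return requests * price_per_request


def license_cost_per_generation(total_license_cost: float, requests: int) -> float:
    """Fixed model: the same license total is spread over every generation."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return total_license_cost / requests


if __name__ == "__main__":
    # ASSUMPTIONS: 20 sessions per user per month and 5 playbacks per session.
    base = monthly_requests(500_000, 20, 5)
    # ASSUMPTION: placeholder prices, invented for this sketch.
    PLACEHOLDER_PRICE_PER_REQUEST = 0.001
    PLACEHOLDER_LICENSE_TOTAL = 20_000.0

    for growth in (1, 2, 4):
        volume = base * growth
        variable = cloud_cost(volume, PLACEHOLDER_PRICE_PER_REQUEST)
        per_gen = license_cost_per_generation(PLACEHOLDER_LICENSE_TOTAL, volume)
        print(f"{volume:>12,} requests: cloud ~{variable:,.0f}, "
              f"license ~{per_gen:.6f} per generation")
```

Whatever the real prices are, the shape is the same: the variable cost doubles when engagement doubles, while the fixed license cost per generation halves.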

Specific consumer app contexts:

  • Language learning (consistent teacher voice, offline lesson playback)
  • Accessibility tools (preserved personal voice for users losing speech ability)
  • Navigation apps (directions in a preferred or user-cloned voice, without network)
  • Audiobook or content apps (high per-user playback volume, no streaming dependency)
  • Consumer voice assistants with a defined personality voice

Game NPCs

Game development has specific requirements that cloud TTS cannot easily satisfy.

First, scale and predictability. A mobile game with millions of installs where NPCs speak dynamically requires a TTS model that runs locally, generates instantly, and has no per-request cost. Cloud API pricing and latency behavior are incompatible with this.

Second, offline play. Most mobile games are expected to function without network access. Dialogue lines and NPC responses need to generate locally with no network dependency.

Third, voice consistency. Each NPC voice in a game is a character asset. Phonon deploys one model per voice, finetuned to reproduce that voice exactly across any generated dialogue. The voice stays consistent across the entire game, across all sessions, regardless of connectivity.

Fourth, content scope. Game NPC dialogue is well-defined: a finite set of speaking styles, vocabulary types, and tone ranges for each character. This is exactly the kind of bounded scope where a compact, finetuned Phonon model performs at production quality. The model doesn't need to handle enterprise call center scripts. It needs to handle what this character says, in this game, in this language.

Offline and Privacy-Constrained Products

Any product where network availability cannot be assumed benefits from an on-device TTS architecture. Phonon generates audio with no round-trip: the model runs locally, audio plays immediately, and the interaction requires no connectivity.

For privacy-constrained products, Phonon goes further than on-premise cloud deployment. There is no server involved at any point. The text that drives the TTS generation never leaves the user's hardware. This is a categorically different privacy guarantee from cloud deployments, even those with HIPAA BAAs or GDPR compliance agreements, because those arrangements still involve data transmission. On-device means the data does not leave the device at all.

Relevant offline and privacy contexts:

  • Field service and industrial applications in low-connectivity environments
  • Healthcare applications where patient-adjacent text cannot be transmitted externally
  • Legal and compliance tooling with strict data residency requirements
  • Enterprise applications deployed in air-gapped or restricted network environments
  • Consumer apps where users require local data processing as a product feature

How Does Gradium Build a Phonon Model?

Each Phonon deployment starts with a scope definition. Partners provide:

  • Target language(s)
  • Target voice(s): chosen from Gradium's catalogue, cloned from a 10-second audio sample, or provided as additional custom audio
  • Target device list (Android, iOS, browser, specific hardware)

From that scope, Gradium builds a model optimized for exactly that combination. The model is finetuned for one voice, one language, and one content type. Turnaround is days to weeks depending on use case complexity.
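As a planning aid, the scope definition can be written down as a small data structure. This is a hypothetical sketch, not a Gradium request format: the class names, field names, and the `content_type` field are assumptions. It also assumes that each voice and language combination becomes its own model, following the "one voice, one language, one content type" description above.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple

# The three voice sources named in the scope list above (labels are illustrative).
VOICE_SOURCES = {"catalogue", "clone", "custom_audio"}


@dataclass(frozen=True)
class Voice:
    name: str
    source: str  # one of VOICE_SOURCES


@dataclass(frozen=True)
class ScopeRequest:
    languages: Tuple[str, ...]
    voices: Tuple[Voice, ...]
    devices: Tuple[str, ...]
    content_type: str  # hypothetical label for the use case, e.g. "lesson playback"

    def validate(self) -> None:
        if not (self.languages and self.voices and self.devices):
            raise ValueError("languages, voices, and devices must all be non-empty")
        for voice in self.voices:
            if voice.source not in VOICE_SOURCES:
                raise ValueError(f"unknown voice source: {voice.source!r}")

    def model_builds(self) -> List[Tuple[str, str, str]]:
        """List one (voice, language, content type) entry per assumed model."""
        self.validate()
        return [(voice.name, language, self.content_type)
                for voice, language in product(self.voices, self.languages)]


if __name__ == "__main__":
    scope = ScopeRequest(
        languages=("en", "fr"),
        voices=(Voice("teacher", "clone"), Voice("narrator", "catalogue")),
        devices=("android", "ios"),
        content_type="lesson playback",
    )
    for build in scope.model_builds():
        print(build)
```

Enumerating the combinations up front shows how quickly the number of models to build, version, and ship grows as voices and languages are added.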

Once the model is ready, the partner receives a self-contained artifact that ships directly inside the app. There are no calls to an external endpoint at runtime, no network dependency, and no data leaving the user's hardware during generation.

Models are versioned. When the scope needs to change (new voice, new language, updated content domain), a new model is built and delivered. Deploying an update requires an app update, which is different from the cloud API, where model improvements reach all users without any app change.

Phonon vs Gradium Cloud API: Which Should You Use?

Gradium Phonon and the Gradium cloud API are not alternatives. They are designed for different parts of a product's voice surface. Many products will use both within the same application.

Use the cloud API when:

  • The product needs voice agent capability (real-time dialogue, turn-taking, STT + LLM + TTS)
  • Voice flexibility is required (any speaker, any style, multiple languages)
  • Continuous model improvement without app redeployment is important
  • Network access is available and per-request cost is acceptable

Use Phonon when:

  • Per-request cloud cost is unsustainable at expected generation volume
  • The product must work without network access
  • Text cannot leave the user's device
  • The voice use case is well-defined and scoped (one voice, one language, one content type)

A practical example of both in one product: a language learning app uses the cloud API to power a real-time conversational tutor, and uses Phonon for the offline lesson playback feature where a teacher voice reads lesson text without network dependency. The tutor needs breadth and real-time behavior. The lesson reader needs offline reliability and low per-playback cost.
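The two checklists can be turned into a per-feature routing rule. The sketch below is one possible encoding, not a Gradium API: the field names and the order in which the criteria are applied are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceFeature:
    name: str
    needs_realtime_dialogue: bool = False
    needs_voice_flexibility: bool = False
    must_work_offline: bool = False
    text_must_stay_on_device: bool = False
    per_request_cost_acceptable: bool = True


def choose_backend(feature: VoiceFeature) -> str:
    """Return "cloud", "phonon", or "conflict" for a single product feature."""
    hard_on_device = feature.must_work_offline or feature.text_must_stay_on_device
    needs_cloud = feature.needs_realtime_dialogue or feature.needs_voice_flexibility
    if hard_on_device and needs_cloud:
        # The requirements pull in opposite directions; the feature needs rescoping.
        return "conflict"
    if hard_on_device:
        return "phonon"
    if needs_cloud:
        return "cloud"
    # Neither hard constraint applies, so cost decides.
    return "cloud" if feature.per_request_cost_acceptable else "phonon"


if __name__ == "__main__":
    tutor = VoiceFeature("conversational tutor", needs_realtime_dialogue=True,
                         needs_voice_flexibility=True)
    reader = VoiceFeature("offline lesson reader", must_work_offline=True,
                          per_request_cost_acceptable=False)
    for feature in (tutor, reader):
        print(f"{feature.name}: {choose_backend(feature)}")
```

Evaluating the rule per feature rather than per app is what allows both backends to coexist in one product, as in the language learning example.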

For deeper context on when on-device is the correct architecture, see On-Device Text-To-Speech in 2026: When Edge TTS Is the Right Architecture. For the cloud API, see the best Text-To-Speech API for voice agents.

How Does Phonon Quality Compare on Seed-TTS English?

Gradium published benchmark results for Phonon on the Seed-TTS English test set (1,008 utterances). The evaluation covers two metrics:

  • WER (Word Error Rate): generated speech transcribed back to text using Whisper large v3, compared to the source text via edit distance (jiwer package). Lower is better.
  • Speaker Similarity: speaker embeddings are extracted from the reference audio and the generated audio using WavLM large, and the score is the cosine similarity between the two vectors. Higher indicates a closer match to the reference speaker.

Gradium uses Whisper large v3 rather than its own STT model to avoid bias from shared modeling techniques between its TTS and STT.
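For readers unfamiliar with the two metrics, the sketch below computes both in plain Python. It is not Gradium's evaluation code. The WER function is a simple word-level edit distance standing in for what the jiwer package computes, with lowercasing and whitespace splitting as the only normalization (an assumption). The similarity function takes embeddings that are already extracted, and it treats the score as cosine similarity so that higher means closer.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 is identical direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


if __name__ == "__main__":
    source_text = "the cat sat on the mat"
    transcript = "the cat sat on a mat"  # stand-in for an STT transcript
    print(f"WER: {word_error_rate(source_text, transcript):.2%}")
    print(f"similarity: {cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.2]):.3f}")
```

In the real pipeline the transcript would come from Whisper large v3 and the vectors from WavLM large. Both are replaced here by hand-written inputs.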

Model            | Parameters | WER   | Speaker Similarity
Phonon (Gradium) | ~100M      | 1.48% | 56.37%
Kani-TTS2        | 450M       | 4.97% | 40.73%
NeuTTS Air       | 552M       | 2.18% | 47.51%
NeuTTS Nano      | 229M       | 1.71% | 40.15%

Source: Gradium evaluation, Seed-TTS English benchmark, 1,008 utterances. Published April 2026.

Phonon achieves the lowest WER and the highest Speaker Similarity of all four models despite being the smallest by parameter count. The models it outperforms are roughly 2.3x to 5.5x its size. Kani-TTS2, at 450M parameters, records 4.97% WER, more than 3x Phonon's 1.48%, even though Phonon is less than a quarter of its size.
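The size and error-rate comparisons can be checked directly from the published table. The sketch below copies the four rows and treats Phonon's "approximately 100M" as exactly 100M, which is an approximation, so the size ratios are approximate as well.

```python
# (parameters in millions, WER %, speaker similarity %) copied from the table above.
RESULTS = {
    "Phonon (Gradium)": (100, 1.48, 56.37),  # ~100M treated as exactly 100M
    "Kani-TTS2": (450, 4.97, 40.73),
    "NeuTTS Air": (552, 2.18, 47.51),
    "NeuTTS Nano": (229, 1.71, 40.15),
}


def ratios_vs(baseline: str, results: dict) -> dict:
    """Return {model: (size ratio, WER ratio)} relative to the baseline model."""
    base_params, base_wer, _ = results[baseline]
    return {
        name: (params / base_params, wer / base_wer)
        for name, (params, wer, _) in results.items()
        if name != baseline
    }


if __name__ == "__main__":
    lowest_wer = min(RESULTS, key=lambda name: RESULTS[name][1])
    highest_sim = max(RESULTS, key=lambda name: RESULTS[name][2])
    print(f"lowest WER: {lowest_wer}, highest similarity: {highest_sim}")
    for name, (size, wer) in ratios_vs("Phonon (Gradium)", RESULTS).items():
        print(f"{name}: {size:.2f}x the parameters, {wer:.2f}x the WER")
```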

A note on voice cloning methodology: the current Phonon model was trained on 10-second reference audio segments. Where the Seed-TTS reference audio is shorter than 10 seconds, Gradium extended it by looping. Speaker similarity scores are expected to improve when Phonon adds support for variable-length reference audio.
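Extending a short reference clip by looping can be sketched as below. This is an illustration of the general idea, not Gradium's preprocessing. The function name is hypothetical, and cutting the final repetition so the result is exactly 10 seconds is an assumption. Clips that are already long enough are returned unchanged, because the source does not say how those are handled.

```python
from typing import List, Sequence


def loop_to_length(samples: Sequence[float], sample_rate: int,
                   target_seconds: float = 10.0) -> List[float]:
    """Repeat a short clip end to end until it spans `target_seconds`."""
    if not samples:
        raise ValueError("samples must not be empty")
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    target_len = int(round(sample_rate * target_seconds))
    if len(samples) >= target_len:
        # Handling of longer clips is not described in the source; leave as-is.
        return list(samples)
    repeats = -(-target_len // len(samples))  # ceiling division
    # ASSUMPTION: the last repetition is cut so the output has exactly target_len samples.
    return (list(samples) * repeats)[:target_len]


if __name__ == "__main__":
    rate = 16_000  # illustrative sample rate
    short_clip = [0.0] * (4 * rate)  # a silent 4-second stand-in clip
    extended = loop_to_length(short_clip, rate)
    print(f"{len(short_clip) / rate:.1f} s -> {len(extended) / rate:.1f} s")
```

Looping adds duration but no new speaker information, which is consistent with the expectation that similarity scores improve once variable-length references are supported.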

Kani-TTS2, NeuTTS Air, and NeuTTS Nano require a transcription of the reference audio for voice cloning. Phonon does not. For the full benchmark methodology, see On-Device TTS Benchmark 2026: Phonon vs Kani-TTS2 vs NeuTTS on Seed-TTS.

How Should You Position Phonon in a Product?

Gradium Phonon extends the Gradium voice stack into deployment contexts where cloud TTS is not the right architecture: high-volume consumer applications where per-request cost becomes a structural constraint, products that need to function offline, and use cases where data privacy requires that text never leaves the device.

Phonon's approach is explicit about the tradeoff: a scoped, finetuned model that does one thing well, at unlimited scale, with no network dependency, rather than the broad flexibility of the cloud API. The published Seed-TTS benchmark results (1.48% WER, 56.37% speaker similarity, approximately 100M parameters) show what that approach produces in terms of measurable quality at the edge.

Phonon is currently in private beta. Partners can apply through gradium.ai by describing the target language, voice, and device list.

Frequently Asked Questions