How is Phonon different from the Gradium cloud API?

The Gradium cloud API runs on GPU infrastructure and supports any voice, any language, any content type, including real-time voice agents. It uses a variable cost model where each generation is a billable request. Phonon is an on-device model that runs on CPU with no network dependency, uses a license cost model with unlimited generations, and is scoped to one voice and one language per model. The two are not alternatives: they are designed for different deployment contexts within the same product.

What use cases is Phonon designed for?

Phonon targets three deployment contexts. First, high-volume consumer applications where per-request cloud TTS cost is unsustainable at scale and a fixed license model is more economical. Second, products that need to function offline or in low-connectivity environments without network dependency for voice. Third, use cases where text input cannot leave the user's device due to legal, compliance, or enterprise data residency requirements.

How does voice cloning work in Phonon?

Phonon supports voice cloning from a 10-second reference audio sample. The target voice is embedded in the finetuned model during the build process: the model is trained to reproduce that specific speaker's tone, accent, and cadence. Once the model is deployed inside the app, it generates speech in that voice locally, without any reference audio or network call at runtime. Phonon does not require a transcription of the reference audio, unlike some competing models.

How does Phonon compare to other on-device TTS models?

On the Seed-TTS English benchmark with 1,008 utterances, Phonon at approximately 100M parameters achieves 1.48 percent WER and 56.37 percent speaker similarity. This outperforms Kani-TTS2 at 450M parameters with 4.97 percent WER and 40.73 percent speaker similarity, NeuTTS Air at 552M parameters with 2.18 percent WER and 47.51 percent speaker similarity, and NeuTTS Nano at 229M parameters with 1.71 percent WER and 40.15 percent speaker similarity. Phonon is 2x to 5x smaller than the models it outperforms.

What hardware does Phonon require?

Phonon runs inference on a single CPU core. At approximately 100M parameters, it fits in mobile device memory and runs at 6x real-time without GPU acceleration. It is compatible with Android, iOS, and browser environments. No specialized hardware is required beyond the device it is deployed on.

How long does it take to receive a Phonon model?

The turnaround from scope definition to model delivery is days to weeks depending on use case complexity. The process starts with a brief scope definition: target language, target voice (catalogue, cloned, or custom), and target device list. Gradium builds the model and delivers a self-contained artifact that ships directly inside the partner's app.

Is Phonon available to integrate today?

Gradium Phonon is currently in private beta. A limited number of spots are available for partners to work with Gradium to define a scope and receive a finetuned model. The Gradium cloud API is available for production use without access restrictions.

What is the cost model for Gradium Phonon?

Phonon uses a license model: a fixed cost per device and model type, with unlimited generations. The cost is set per deployment scope (voice, language, target devices) and is discussed with partners during the private beta. This is structurally different from the cloud API, where each generation is a billable per-character request.

Can I use Phonon for game NPC voices?

Yes. Game NPC dialogue is one of the primary use cases for Phonon: each NPC voice is a character asset, the dialogue scope is well-defined, the model must run offline, and per-request cost is not viable at scale. Phonon deploys one model per voice with consistent voice reproduction across the entire game and across all sessions, regardless of connectivity.

Does Phonon work without internet access?

Yes. Phonon runs entirely on-device with no network round-trip. Audio generates from the model on the user's hardware and plays immediately, with no dependency on signal quality and no failure mode when coverage drops.

Gradium was co-founded by Neil Zeghidour, Laurent Mazaré, Olivier Teboul, and Alexandre Défossez, who previously co-founded Kyutai. Kyutai released world-first open systems including Moshi and Hibiki.

Where can I apply for the Phonon private beta?

Phonon partners apply through gradium.ai by describing the target language, voice, and device list. Gradium then defines a scope, builds the finetuned model, and delivers the binary directly to the partner.

Gradium Phonon: On-Device TTS for Mobile Apps, NPCs, and Offline Products

Q: What is Gradium Phonon?

Gradium Phonon is an on-device text-to-speech model from Gradium. It runs with approximately 100M parameters at 6x real-time on a single CPU core, with no network connection required. It supports voice cloning from a 10-second reference audio sample and runs on Android, iOS, and in-browser environments. Each Phonon model is finetuned for a specific voice, language, and use case, and ships as a self-contained binary inside the application. It is currently in private beta.

The Gradium API is designed for cloud-based voice agents: the use cases where the voice layer needs to handle any language, any speaker, and the specific compliance or integration requirements each enterprise context brings. Inference runs on GPU compute, which is what makes the quality and flexibility possible.

For some products, that architecture is the wrong fit. Gradium Phonon is Gradium's on-device Text-To-Speech model, built for the cases where cloud TTS creates structural problems: consumer apps where per-request cost doesn't scale, products that need to work offline, and use cases where text cannot leave the device.

What Is Gradium Phonon?

Phonon is an on-device Text-To-Speech model that runs entirely on CPU across Android, iOS, and browser environments, with no network connection required. It works with any voice: custom, cloned, or synthetic.

The Gradium cloud API serves any voice and language from GPU-backed cloud infrastructure. Phonon works differently. Gradium takes a voice provided by the partner (chosen from the voice catalogue, cloned from a 10-second audio sample, or supplied as custom audio), finetunes a single-purpose model around it for the specific language and use case, and ships it as a licensed binary inside the app. The result is a TTS model that runs offline, reproduces a specific voice, and is optimized for exactly one task.

Technical specifications:

Parameters: approximately 100M
Runtime: 6x real-time on a single CPU core
Voice cloning: from a 10-second reference audio sample
Platforms: Android, iOS, browser
Network dependency: none (fully offline)
Deployment: self-contained binary, ships inside the app
Cost model: license per device and model type, unlimited generations
Status: private beta
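To make the runtime figure concrete, here is a minimal sketch of what "6x real-time" implies for generation time. It treats the quoted factor as a constant ratio and ignores model load time and device variation, which are assumptions. The function name is illustrative and not part of any Gradium SDK.

```python
def synthesis_seconds(audio_seconds: float, realtime_factor: float = 6.0) -> float:
    """Estimate wall-clock time to generate `audio_seconds` of speech.

    A real-time factor of 6 means 6 seconds of audio per second of compute.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# Rough estimates for a short prompt, a sentence, and a one-minute passage.
for clip in (3.0, 10.0, 60.0):
    print(f"{clip:5.1f} s of audio -> ~{synthesis_seconds(clip):.2f} s of compute")
```

Under that assumption, a one-minute passage takes roughly ten seconds of single-core compute.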

Where Is Phonon the Right Tool?

On-device TTS is the correct architecture in three deployment contexts, and Phonon is built for each of them.

Mobile Consumer Apps

Consumer applications with voice features face a specific economic problem at scale. The Gradium cloud API uses a variable cost model: each TTS generation is a billable request. For enterprise voice agents, this is appropriate. For consumer apps with large free user bases, it is not.

Consider a language learning app with 500,000 active free users. If each session includes several TTS playback events, the app can generate tens of millions of TTS requests per month. At standard cloud API pricing, the cost scales directly with engagement, including engagement from users who generate no revenue. The larger the free tier, the more the economics deteriorate.

Phonon uses a license model: a fixed cost per device and model type, with unlimited generations. The per-user generation cost approaches zero at scale. For consumer apps, this changes the unit economics structurally.
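The arithmetic behind this comparison can be sketched as follows. Only the 500,000-user figure comes from the example above. The session count, playbacks per session, per-request price, and per-device license fee are placeholder values chosen for illustration, not Gradium pricing, and the sketch assumes the license fee is charged once per device.

```python
from dataclasses import dataclass


@dataclass
class UsageAssumptions:
    active_users: int           # 500,000 in the example above
    sessions_per_month: int     # hypothetical
    playbacks_per_session: int  # hypothetical ("several" in the text)


def monthly_requests(usage: UsageAssumptions) -> int:
    """Total TTS generations per month across all users."""
    return usage.active_users * usage.sessions_per_month * usage.playbacks_per_session


def cloud_monthly_cost(requests: int, price_per_request: float) -> float:
    """Variable model: cost grows linearly with request volume."""
    return requests * price_per_request


def license_cost_per_generation(fee_per_device: float, generations_per_device: int) -> float:
    """License model: one fixed fee per device, spread over every generation on it."""
    return fee_per_device / generations_per_device


usage = UsageAssumptions(active_users=500_000, sessions_per_month=12, playbacks_per_session=5)
volume = monthly_requests(usage)
print(f"{volume:,} TTS requests per month")

PLACEHOLDER_PRICE_PER_REQUEST = 0.001  # not a real price
PLACEHOLDER_FEE_PER_DEVICE = 0.05      # not a real price
print("cloud cost per month:", cloud_monthly_cost(volume, PLACEHOLDER_PRICE_PER_REQUEST))
for generations in (60, 600, 6_000):
    per_generation = license_cost_per_generation(PLACEHOLDER_FEE_PER_DEVICE, generations)
    print(f"{generations:>5} generations on a device -> {per_generation:.6f} per generation")
```

Whatever the real prices are, the shape is the same: the cloud total rises with every playback, while the licensed cost per generation shrinks as each device generates more audio.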

Specific consumer app contexts:

Language learning (consistent teacher voice, offline lesson playback)
Accessibility tools (preserved personal voice for users losing speech ability)
Navigation apps (directions in a preferred or user-cloned voice, without network)
Audiobook or content apps (high per-user playback volume, no streaming dependency)
Consumer voice assistants with a defined personality voice

Game NPCs

Game development has specific requirements that cloud TTS cannot easily satisfy.

First, scale and predictability. A mobile game with millions of installs where NPCs speak dynamically requires a TTS model that runs locally, generates instantly, and has no per-request cost. Cloud API pricing and latency behavior are incompatible with this.

Second, offline play. Most mobile games are expected to function without network access. Dialogue lines and NPC responses need to generate locally with no network dependency.

Third, voice consistency. Each NPC voice in a game is a character asset. Phonon deploys one model per voice, finetuned to reproduce that voice exactly across any generated dialogue. The voice stays consistent across the entire game, across all sessions, regardless of connectivity.

Fourth, content scope. Game NPC dialogue is well-defined: a finite set of speaking styles, vocabulary types, and tone ranges for each character. This is exactly the kind of bounded scope where a compact, finetuned Phonon model performs at production quality. The model doesn't need to handle enterprise call center scripts. It needs to handle what this character says, in this game, in this language.

Offline and Privacy-Constrained Products

Any product where network availability cannot be assumed benefits from an on-device TTS architecture. Phonon generates audio with no round-trip: the model runs locally, audio plays immediately, and the interaction requires no connectivity.

For privacy-constrained products, Phonon goes further than on-premises cloud deployment. There is no server involved at any point. The text that drives the TTS generation never leaves the user's hardware. This is a categorically different privacy guarantee from cloud deployments, even those with HIPAA BAAs or GDPR compliance agreements, because those arrangements still involve data transmission. On-device means the data does not leave the device at all.

Relevant offline and privacy contexts:

Field service and industrial applications in low-connectivity environments
Healthcare applications where patient-adjacent text cannot be transmitted externally
Legal and compliance tooling with strict data residency requirements
Enterprise applications deployed in air-gapped or restricted network environments
Consumer apps where users require local data processing as a product feature

How Does Gradium Build a Phonon Model?

Each Phonon deployment starts with a scope definition. Partners provide:

Target language(s)
Target voice(s): chosen from Gradium's catalogue, cloned from a 10-second audio sample, or provided as additional custom audio
Target device list (Android, iOS, browser, specific hardware)

From that scope, Gradium builds a model optimized for exactly that combination. The model is finetuned for one voice, one language, and one content type. Turnaround is days to weeks depending on use case complexity.

Once the model is ready, the partner receives a self-contained artifact that ships directly inside the app. There are no calls to an external endpoint at runtime, no network dependency, and no data leaving the user's hardware during generation.

Models are versioned. When the scope needs to change (new voice, new language, updated content domain), a new model is built and delivered. Deploying an update requires an app update, which is different from the cloud API, where model improvements reach all users without any app change.

Phonon vs Gradium Cloud API: Which Should You Use?

Gradium Phonon and the Gradium cloud API are not alternatives. They are designed for different parts of a product's voice surface. Many products will use both within the same application.

Use the cloud API when:

The product needs voice agent capability (real-time dialogue, turn-taking, STT + LLM + TTS)
Voice flexibility is required (any speaker, any style, multiple languages)
Continuous model improvement without app redeployment is important
Network access is available and per-request cost is acceptable

Use Phonon when:

Per-request cloud cost is unsustainable at expected generation volume
The product must work without network access
Text cannot leave the user's device
The voice use case is well-defined and scoped (one voice, one language, one content type)

A practical example of both in one product: a language learning app uses the cloud API to power a real-time conversational tutor, and uses Phonon for the offline lesson playback feature where a teacher voice reads lesson text without network dependency. The tutor needs breadth and real-time behavior. The lesson reader needs offline reliability and low per-playback cost.
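The two checklists can be read as a per-feature decision rule. The sketch below is one possible encoding, not a Gradium tool: the field names, the ordering of the checks, and the "re-scope" outcome for contradictory requirements are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeature:
    needs_realtime_dialogue: bool         # turn-taking, STT + LLM + TTS
    needs_many_voices_or_languages: bool  # any speaker, any style, multiple languages
    must_work_offline: bool
    text_must_stay_on_device: bool
    per_request_cost_sustainable: bool


def recommend(feature: VoiceFeature) -> str:
    """Return "cloud", "phonon", or "re-scope" for a single product feature."""
    needs_breadth = feature.needs_realtime_dialogue or feature.needs_many_voices_or_languages
    hard_on_device = feature.must_work_offline or feature.text_must_stay_on_device

    if hard_on_device and needs_breadth:
        # Requirements pull in both directions: split the feature or narrow its scope.
        return "re-scope"
    if hard_on_device:
        return "phonon"
    if needs_breadth:
        return "cloud"
    # Well-scoped feature with network access: cost decides.
    return "cloud" if feature.per_request_cost_sustainable else "phonon"


# The language learning example above, evaluated feature by feature.
tutor = VoiceFeature(True, True, False, False, True)
lesson_reader = VoiceFeature(False, False, True, False, False)
print("conversational tutor:", recommend(tutor))
print("offline lesson reader:", recommend(lesson_reader))
```

Evaluating each feature separately, rather than the product as a whole, is what lets a single app end up using both.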

For deeper context on when on-device is the correct architecture, see On-Device Text-To-Speech in 2026: When Edge TTS Is the Right Architecture. For the cloud API, see the best Text-To-Speech API for voice agents.

How Does Phonon Quality Compare on Seed-TTS English?

Gradium published benchmark results for Phonon on the Seed-TTS English test set (1,008 utterances). The evaluation covers two metrics:

WER (Word Error Rate): generated speech transcribed back to text using Whisper large v3, compared to the source text via edit distance (jiwer package). Lower is better.
Speaker Similarity: speaker embeddings extracted from reference audio and generated audio using WavLM large, compared by the cosine similarity between the two vectors. Higher indicates a closer match to the reference speaker.

Gradium uses Whisper large v3 rather than its own STT model to avoid bias from shared modeling techniques between its TTS and STT.
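For readers who want to see what the two metrics compute, here is a minimal pure-Python sketch. It is not Gradium's evaluation code: the published pipeline uses the jiwer package, Whisper large v3, and WavLM large, whereas this version reimplements word-level edit distance and a cosine score directly, and assumes a simple lowercase-and-split text normalization.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (higher means closer)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
# Toy 3-dimensional vectors standing in for real speaker embeddings.
print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.3]))
```

In the real evaluation, the hypothesis text would come from transcribing the generated audio and the vectors from a speaker embedding model. The toy inputs here only exercise the arithmetic.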

Model	Parameters	WER	Speaker Similarity
Phonon (Gradium)	~100M	1.48%	56.37%
Kani-TTS2	450M	4.97%	40.73%
NeuTTS Air	552M	2.18%	47.51%
NeuTTS Nano	229M	1.71%	40.15%

Source: Gradium evaluation, Seed-TTS English benchmark, 1,008 utterances. Published April 2026.

Phonon achieves the lowest WER and the highest Speaker Similarity of all four models despite being the smallest by parameter count. The models it outperforms are roughly 2x to 5x larger. Kani-TTS2 at 450M parameters records 4.97% WER, more than 3x Phonon's 1.48%, while Phonon is less than one-quarter its size.
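The size and error ratios quoted here can be recomputed from the table. The sketch below uses only the published figures and treats Phonon's "approximately 100M" as exactly 100M, so the size ratios are approximate.

```python
# (parameters in millions, WER %, speaker similarity %) from the table above.
RESULTS = {
    "Phonon": (100, 1.48, 56.37),  # parameter count is approximate
    "Kani-TTS2": (450, 4.97, 40.73),
    "NeuTTS Air": (552, 2.18, 47.51),
    "NeuTTS Nano": (229, 1.71, 40.15),
}


def ratios_vs(baseline: str, results: dict) -> dict:
    """For each other model, return (size ratio, WER ratio) relative to the baseline."""
    base_params, base_wer, _ = results[baseline]
    return {
        name: (params / base_params, wer / base_wer)
        for name, (params, wer, _) in results.items()
        if name != baseline
    }


for name, (size_ratio, wer_ratio) in ratios_vs("Phonon", RESULTS).items():
    print(f"{name:12s} {size_ratio:.2f}x the parameters, {wer_ratio:.2f}x the WER")
```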

A note on voice cloning methodology: the current Phonon model was trained on 10-second reference audio segments. Where the Seed-TTS reference audio is shorter than 10 seconds, Gradium extended it by looping. Speaker similarity scores are expected to improve when Phonon adds support for variable-length reference audio.
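As an illustration of the looping step, the sketch below repeats a short clip until it reaches a target duration. The source does not describe the exact procedure, so plain repetition without crossfading is an assumption, as is trimming clips that are already longer than the target. The function name is hypothetical.

```python
from typing import List, Sequence


def loop_to_duration(samples: Sequence[float], sample_rate: int,
                     target_seconds: float = 10.0) -> List[float]:
    """Repeat `samples` end to end until the clip lasts `target_seconds`."""
    if not samples:
        raise ValueError("samples must not be empty")
    target_len = int(round(sample_rate * target_seconds))
    if len(samples) >= target_len:
        # Assumption: longer clips are simply trimmed to the target length.
        return list(samples[:target_len])
    repeats = -(-target_len // len(samples))  # ceiling division
    return (list(samples) * repeats)[:target_len]


# A 4-second clip at 16 kHz, extended to 10 seconds.
short_clip = [0.0] * (4 * 16_000)
extended = loop_to_duration(short_clip, sample_rate=16_000)
print(len(extended) / 16_000, "seconds")
```

Looping repeats the same speaker content rather than adding new information, which is consistent with the expectation that scores improve once variable-length references are supported.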

Kani-TTS2, NeuTTS Air, and NeuTTS Nano require a transcription of the reference audio for voice cloning. Phonon does not. For the full benchmark methodology, see On-Device TTS Benchmark 2026: Phonon vs Kani-TTS2 vs NeuTTS on Seed-TTS.

How Should You Position Phonon in a Product?

Gradium Phonon extends the Gradium voice stack into deployment contexts where cloud TTS is not the right architecture: high-volume consumer applications where per-request cost becomes a structural constraint, products that need to function offline, and use cases where data privacy requires that text never leaves the device.

Phonon's approach is explicit about the tradeoff: a scoped, finetuned model that does one thing well, at unlimited scale, with no network dependency, rather than the broad flexibility of the cloud API. The published Seed-TTS benchmark results (1.48% WER, 56.37% speaker similarity, approximately 100M parameters) show what that approach produces in terms of measurable quality at the edge.

Phonon is currently in private beta. Partners can apply through gradium.ai by describing the target language, voice, and device list.