Gradium Phonon: On-Device TTS for Consumer Apps, NPCs, and Offline Products
The Gradium API is built for AI voice agents. Customer support agents, conversational B2B assistants, outbound sales agents: use cases where the voice layer needs to handle any language, any speaker, complex dialogue, and the specific compliance or integration requirements each enterprise context brings. Inference runs on GPU compute, which is what makes the quality and flexibility possible. It also means the cost model is variable: every generation is a billable request.
For most products, that's the right trade. For some, it isn't.
Gradium Phonon is an on-device text-to-speech model that runs entirely on CPU across Android and iOS, with no network connection required. It works with any voice: custom, cloned, or synthetic.
Where the Gradium API is serving any voice and language from the cloud, Phonon takes the opposite approach. We take any voice you bring us (or one from our catalogue), finetune a single-purpose model around it for your specific language and use case, and ship it as a licensed binary inside your app. The result is a TTS model that runs offline, reproduces your exact voice, and is optimized for exactly the task it needs to perform. The trade-off is scope for volume: Phonon doesn't do everything the API does, but what it does, it does at unlimited scale with no per-request cost and no network dependency.
When on-device text-to-speech is the right architecture
Several deployment contexts make a cloud TTS dependency untenable:
Scale and cost model. The Gradium API is a variable cost: usage goes up, cost goes up. That model works until volume makes it unsustainable. Consumer apps with freemium tiers are the clearest example: a large base of free users, whom you want to experience the voice layer of your product, generates a high volume of requests, many of which produce little or no revenue. Gradium Phonon is a licence model: a fixed cost per device and model type, with unlimited generations. At sufficient volume, the economics are categorically different.
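The fixed-versus-variable distinction can be made concrete with a break-even sketch. All numbers and function names below are illustrative assumptions, not Gradium's actual pricing:

```python
# Hypothetical break-even sketch: per-request cloud pricing vs. a fixed
# per-device licence. Prices here are placeholders, not real Gradium rates.

def cloud_cost(requests: int, price_per_request: float) -> float:
    """Variable cost: every generation is a billable request."""
    return requests * price_per_request

def licence_cost(devices: int, price_per_device: float) -> float:
    """Fixed cost: unlimited generations once a device is licensed."""
    return devices * price_per_device

def break_even_requests(devices: int, price_per_device: float,
                        price_per_request: float) -> float:
    """Request volume above which the licence beats the cloud bill."""
    return licence_cost(devices, price_per_device) / price_per_request

# Example: 100k free-tier devices, $0.50/device licence, $0.001/request.
volume = break_even_requests(100_000, 0.50, 0.001)  # 50,000,000 requests
```

Past that volume, each additional generation is free under the licence model, which is what "categorically different economics" means for a large free-user base.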
Offline and low-connectivity environments. On-device TTS models run entirely locally. No network round-trip, no dependency on connection quality, no failure mode when coverage drops. Field applications, consumer devices in low-connectivity regions, and any product that needs to function in airplane mode can't have voice as a network-dependent component.
Data privacy and compliance. Some products cannot send user-adjacent text to an external server. Legal, compliance, or enterprise requirements mean the only acceptable architecture is one where text never leaves the device. On-device TTS is the only approach that satisfies this constraint without compromising voice quality for the target use case.
Quality: what "finetuned for one use case" means
On-device TTS models are smaller than cloud models. That's a real constraint and worth addressing directly.
The Gradium API is built for breadth: it handles any language, any voice, any type of content, and the specific demands of real-time voice agents. That breadth requires a model architecture that can accommodate all of it.
Gradium Phonon works differently. Each model is finetuned for exactly one voice, one target language, and one content type scoped to your specific use case. Because the model doesn't need to handle anything outside that scope, it can be compact and still produce high-quality output for the context it was built for: an NPC voice model for a mobile game doesn't need to handle enterprise call-centre scripts. What Phonon doesn't produce is the broad, flexible output of the cloud API, and it isn't designed to.
The practical implication: if your product has a well-defined voice use case with consistent content patterns, Phonon's quality is production-ready for that use case. If you need breadth, the API is the right tool.
How Gradium Phonon on-device TTS works
Each Phonon deployment starts with a scope definition. Partners provide the target language(s), the voice(s) they want to ship (chosen from Gradium's voice catalogue, or cloned from a sample as short as 10 seconds, with more data used if available), and the list of target devices. From there, we build a model optimized for that exact combination.
The turnaround is days to weeks depending on use case complexity. Each model covers one voice. Once the model is finalized, you receive a self-contained artifact that ships directly inside your app, with no calls to an external endpoint, no runtime network dependency, and no data leaving the user's hardware.
Phonon and the Gradium API are not alternatives
Gradium Phonon is not a replacement for the Gradium cloud API. The two are designed for different parts of a product's voice surface.
The API gives you the highest quality synthesis across all languages and voices, unlimited voice cloning, full support for real-time voice agents, and continuous model improvements without redeploying your app. For any context where network access is available and quality or flexibility is the priority, the API remains the right tool.
The practical split: cloud for use cases where breadth and top-end quality matter; Gradium Phonon for high-volume, offline, or privacy-constrained generations where a variable-cost cloud dependency is a liability. Many products will use both within the same application.
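A product using both within one application needs a routing decision at each generation. A minimal sketch of that split, with every name (`Request`, `route`, the backend labels) invented here for illustration rather than taken from any Gradium SDK:

```python
# Hypothetical per-request routing between the cloud API and an embedded
# Phonon model, following the split described above: privacy-constrained
# or offline generations stay on device; everything else goes to the cloud.
from dataclasses import dataclass

@dataclass
class Request:
    text: str
    privacy_sensitive: bool  # text must never leave the device
    online: bool             # current network availability

def route(req: Request) -> str:
    """Return which backend should synthesize this request."""
    if req.privacy_sensitive or not req.online:
        return "phonon"  # on-device: no network, no data egress
    return "cloud"       # breadth and top-end quality when reachable
```

The design choice here is that on-device is the fallback-safe default: it can always run, so the router only needs to decide when the cloud's extra quality and flexibility are both available and permitted.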
Private beta
Gradium Phonon is currently available through a private beta. We work with a small number of partners to build a model scoped to their use case, voice, and target device list. The process starts with a brief scope definition and moves to model delivery in days to weeks. Partners ship the model directly in their product.
A few spots are open now. To apply: gradium.ai/ondevice