How to Use json_config in Gradium: TTS and STT Parameters Explained

Gradium's TTS model handles most common special terms and characters from context alone. But when you need reliable output in every situation, for example when dates are mispronounced, numbers are read awkwardly, or speech is too fast or too flat, json_config gives you precise control.

The json_config field is added to your setup message and works for both Text-to-Speech and Speech-to-Text.

Why Use json_config in Gradium?

A basic TTS setup works, but without json_config the pronunciation of dates, numbers, and emails may be off or slightly unnatural. Example from the tutorial:

{
  "type": "setup",
  "voice_id": "<VOICE_ID>",
  "model_name": "default",
  "output_format": "wav"
}

Adding json_config to that same setup makes the output consistently reliable for these cases:

{
  "type": "setup",
  "voice_id": "<VOICE_ID>",
  "model_name": "default",
  "output_format": "wav",
  "json_config": {
    "rewrite_rules": "en"
  }
}

What Are the json_config Parameters for TTS?

rewrite_rules: How Do You Normalize Dates, Numbers, Emails, and More?

rewrite_rules activates text normalization before synthesis. Pass a language alias to apply a preset bundle of rules:

"json_config": {
  "rewrite_rules": "en"
}

With "en", the model normalizes dates, times, numbers, URLs, emails, and more before generating speech.

If your use case requires more control, you can enable only specific rules by passing a comma-separated list:

"json_config": {
  "rewrite_rules": "TimeEn,Date,NumberEn"
}

This gives you tighter control within the same voice pipeline.
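Both forms of rewrite_rules can be assembled programmatically. The following is a minimal Python sketch, not part of any Gradium SDK: the helper name build_tts_setup is hypothetical, and it only constructs the JSON payload shown above without sending it anywhere.

```python
import json


def build_tts_setup(voice_id, rewrite_rules=None, output_format="wav"):
    """Build a TTS setup message as a plain dict (hypothetical helper).

    rewrite_rules may be a language alias such as "en", or a list of
    individual rule names, which is joined into a comma-separated string.
    """
    message = {
        "type": "setup",
        "voice_id": voice_id,
        "model_name": "default",
        "output_format": output_format,
    }
    if rewrite_rules is not None:
        if not isinstance(rewrite_rules, str):
            rewrite_rules = ",".join(rewrite_rules)
        message["json_config"] = {"rewrite_rules": rewrite_rules}
    return message


# Preset bundle for English.
preset = build_tts_setup("<VOICE_ID>", rewrite_rules="en")

# Only the specific rules named in the example above.
selective = build_tts_setup("<VOICE_ID>", rewrite_rules=["TimeEn", "Date", "NumberEn"])

print(json.dumps(selective, indent=2))
```

Passing a list keeps individual rule names easy to add or remove in code, while the payload still carries the comma-separated string the setup message expects.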

padding_bonus: How Do You Control Speech Delivery Speed?

padding_bonus adjusts how fast the model speaks. The default value is 0.

Positive values slow speech delivery down:

"json_config": {
  "padding_bonus": 2.0
}

This makes the delivery slower, more deliberate, and more natural. Useful for audiobooks or any use case where clarity matters more than pace.

Negative values speed it up:

"json_config": {
  "padding_bonus": -2.0
}

This produces a faster and tighter delivery. Useful for conversational agents where responsiveness is a priority.

The recommended approach is to try a few values and keep the one that fits your use case.
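One way to compare values side by side is to prepare one setup message per candidate and synthesize the same sentence with each. The sketch below only builds the messages; the candidate list and the helper name padding_sweep are illustrative assumptions, not recommended settings.

```python
import json

# Hypothetical candidates to audition; 0.0 is the documented default.
CANDIDATES = [-2.0, -1.0, 0.0, 1.0, 2.0]


def padding_sweep(voice_id, candidates=CANDIDATES):
    """Return one TTS setup message per padding_bonus candidate."""
    return [
        {
            "type": "setup",
            "voice_id": voice_id,
            "model_name": "default",
            "output_format": "wav",
            "json_config": {"padding_bonus": value},
        }
        for value in candidates
    ]


sweep_messages = padding_sweep("<VOICE_ID>")
for message in sweep_messages:
    # Each message differs only in its padding_bonus value.
    print(json.dumps(message["json_config"]))
```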

temp: How Do You Control Variation and Expressiveness?

temp (temperature) controls how deterministic voice generation is. Recommended values range from 0 to 1.4. The default is 0.7.

Lower values make speech more stable and predictable:

"json_config": {
  "temp": 0.3
}

Higher values introduce more variation in how speech is delivered. This is useful when you want more expressive delivery or variations of style between different generations with the same voice.

cfg_coef: How Do You Control Voice Similarity?

cfg_coef (classifier-free guidance coefficient) adjusts how closely the model matches the original voice. Recommended values range from 1.0 to 4.0. The default is 2.0.

"json_config": {
  "cfg_coef": 3.5
}

Higher values increase similarity to the original voice, but may introduce audio artifacts. Lower values allow the model more flexibility and range with the voice.

Choosing the right value depends on your preference and use case. You can experiment with this parameter in Gradium Studio across several voices.
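Since the ranges for temp and cfg_coef are described as recommended rather than enforced, a lightweight check before sending the setup message can flag values that drift outside them. This is a minimal sketch: the helper check_recommended is hypothetical, and it warns instead of rejecting because the text does not say how out-of-range values are handled.

```python
# Recommended ranges as stated in this article.
RECOMMENDED_RANGES = {
    "temp": (0.0, 1.4),
    "cfg_coef": (1.0, 4.0),
}


def check_recommended(json_config):
    """Return a list of warnings for values outside the recommended ranges."""
    warnings = []
    for name, (low, high) in RECOMMENDED_RANGES.items():
        if name in json_config and not (low <= json_config[name] <= high):
            warnings.append(
                f"{name}={json_config[name]} is outside the recommended range {low} to {high}"
            )
    return warnings


stable = {"temp": 0.3, "cfg_coef": 3.5}
risky = {"temp": 1.8, "cfg_coef": 2.0}

print(check_recommended(stable))  # prints []
print(check_recommended(risky))   # one warning, about temp
```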

What Are the json_config Parameters for STT?

json_config also works for Speech-to-Text. Pass it in the setup message alongside your input format:

{
  "type": "setup",
  "model_name": "default",
  "input_format": "pcm",
  "json_config": {
    "language": "en",
    "delay_in_frames": 10
  }
}

language: How Do You Improve Transcription Accuracy?

If you know the language being spoken, include it for better accuracy. If it is not specified, the model detects the language automatically.
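Because language is optional, a builder can simply leave the key out when the language is unknown and let auto-detection take over. The sketch below assumes nothing beyond the setup fields shown above; the helper name build_stt_setup is hypothetical.

```python
import json


def build_stt_setup(language=None, delay_in_frames=None, input_format="pcm"):
    """Build an STT setup message, omitting any option that was not given."""
    json_config = {}
    if language is not None:
        json_config["language"] = language
    if delay_in_frames is not None:
        json_config["delay_in_frames"] = delay_in_frames

    message = {
        "type": "setup",
        "model_name": "default",
        "input_format": input_format,
    }
    if json_config:
        message["json_config"] = json_config
    return message


known = build_stt_setup(language="en", delay_in_frames=10)
unknown = build_stt_setup()  # no language key, so detection is left to the model

print(json.dumps(known, indent=2))
```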

delay_in_frames: How Do You Tune Responsiveness vs. Accuracy?

delay_in_frames sets the latency, measured in audio frames. Each audio frame is 80 ms. Adjusting this value lets you tune responsiveness versus accuracy in streaming transcription.
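The frame arithmetic is easy to get wrong when working from a latency budget in milliseconds, so a small conversion helper can help. This sketch uses only the 80 ms frame size stated above; the function names and the 500 ms budget are illustrative assumptions.

```python
FRAME_MS = 80  # duration of one audio frame, as stated above


def delay_ms(delay_in_frames):
    """Convert a delay_in_frames value to milliseconds."""
    return delay_in_frames * FRAME_MS


def frames_for_budget(budget_ms):
    """Largest whole number of frames that fits within a latency budget."""
    return budget_ms // FRAME_MS


print(delay_ms(10))            # 800
print(frames_for_budget(500))  # 6
```

So the delay_in_frames value of 10 in the setup example corresponds to 800 ms, and a 500 ms budget allows at most 6 frames (480 ms).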

Summary: json_config Parameters at a Glance

| Parameter | Scope | Default | What it controls |
| --- | --- | --- | --- |
| rewrite_rules | TTS | none | Text normalization before synthesis (dates, numbers, emails, URLs) |
| padding_bonus | TTS | 0 | Speech delivery speed (positive = slower, negative = faster) |
| temp | TTS | 0.7 | Variation and expressiveness (range: 0 to 1.4) |
| cfg_coef | TTS | 2.0 | Voice similarity to original (range: 1.0 to 4.0) |
| language | STT | auto-detected | Language of the audio for better transcription accuracy |
| delay_in_frames | STT | not specified | Latency in audio frames (1 frame = 80 ms); tunes responsiveness vs. accuracy |

Frequently Asked Questions