What is json_config in Gradium?

json_config is a field added to the setup message that lets you configure advanced parameters for TTS and STT. It gives you control over text normalization, speech speed, voice expressiveness, similarity, and transcription behavior.

Does json_config work for both TTS and STT?

Yes. Some parameters are specific to TTS (rewrite_rules, padding_bonus, temp, cfg_coef) and others are specific to STT (language, delay_in_frames).

What does rewrite_rules do in TTS?

It activates text normalization before speech synthesis. With "en", the model normalizes dates, times, numbers, URLs, emails, and more. You can also pass a comma-separated list of specific rule names for tighter control.

What is the difference between temp and cfg_coef?

temp controls how much variation the model introduces in the delivery of speech. cfg_coef controls how closely the generated speech matches the original voice. They are independent parameters that address different aspects of voice generation.

What should I do if my TTS output sounds too fast, too flat, or too literal?

Look at json_config. padding_bonus controls speed, temp controls expressiveness, and rewrite_rules handles literal pronunciations of dates, numbers, and emails.

How to Use json_config in Gradium: TTS and STT Parameters Explained

Gradium's TTS model handles most common special terms and characters from context on its own. For cases where you need reliable output every time, such as dates that are mispronounced, numbers that are read awkwardly, or speech that is too fast or too flat, json_config gives you precise control.

The json_config field is added to your setup message and works for both Text-to-Speech and Speech-to-Text.

Why Use json_config in Gradium?

A basic TTS setup works, but the pronunciation of dates, numbers, and emails may be off or slightly unnatural. Example from the tutorial:

{
  "type": "setup",
  "voice_id": "<VOICE_ID>",
  "model_name": "default",
  "output_format": "wav"
}

Adding json_config to that same setup makes the output far more reliable for those inputs:

{
  "type": "setup",
  "voice_id": "<VOICE_ID>",
  "model_name": "default",
  "output_format": "wav",
  "json_config": {
    "rewrite_rules": "en"
  }
}
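
As a minimal sketch, the setup message above could be assembled in Python before it is sent. The helper name build_tts_setup is hypothetical and not part of any Gradium SDK; only the field names come from the examples above. How the message is transmitted is left out.

```python
import json


def build_tts_setup(voice_id, json_config=None, model_name="default", output_format="wav"):
    """Return a TTS setup message as a dict (hypothetical helper, not a Gradium API)."""
    message = {
        "type": "setup",
        "voice_id": voice_id,
        "model_name": model_name,
        "output_format": output_format,
    }
    # Only attach json_config when there is something to configure,
    # so the basic setup stays identical to the first example.
    if json_config:
        message["json_config"] = dict(json_config)
    return message


setup = build_tts_setup("<VOICE_ID>", {"rewrite_rules": "en"})
print(json.dumps(setup, indent=2))
```

Keeping json_config optional means the same helper covers both the basic setup and the configured one.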

What Are the json_config Parameters for TTS?

rewrite_rules: How Do You Normalize Dates, Numbers, Emails, and More?

rewrite_rules activates text normalization before synthesis. Pass a language alias to apply a preset bundle of rules:

"json_config": {
  "rewrite_rules": "en"
}

With "en", the model normalizes dates, times, numbers, URLs, emails, and more before generating speech.

If your use case requires more control, you can enable only specific rules by passing a comma-separated list:

"json_config": {
  "rewrite_rules": "TimeEn,Date,NumberEn"
}

This gives you tighter control within the same voice pipeline.
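
If the rules are kept as a list in application code, the comma-separated value can be built from it. This is only a sketch: rewrite_rules_value is a hypothetical helper, and the three rule names are the ones shown above; the full set of valid names is not listed here.

```python
def rewrite_rules_value(rule_names):
    """Join rule names into the comma-separated string used by rewrite_rules."""
    # Drop surrounding whitespace and empty entries so the result has no stray commas.
    cleaned = [name.strip() for name in rule_names if name.strip()]
    if not cleaned:
        raise ValueError("at least one rule name is required")
    return ",".join(cleaned)


json_config = {"rewrite_rules": rewrite_rules_value(["TimeEn", "Date", "NumberEn"])}
print(json_config)
```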

padding_bonus: How Do You Control Speech Delivery Speed?

padding_bonus adjusts how fast the model speaks. The default value is 0.

Positive values slow speech delivery down:

"json_config": {
  "padding_bonus": 2.0
}

This makes the delivery slower, more deliberate, and more natural. Useful for audiobooks or any use case where clarity matters more than pace.

Negative values speed it up:

"json_config": {
  "padding_bonus": -2.0
}

This produces a faster and tighter delivery. Useful for conversational agents where responsiveness is a priority.

The recommended approach is to try a few values and keep the one that fits your use case.
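
One way to compare several speeds is to generate one setup message per candidate value and listen to each result. The sketch below only builds the messages; the candidate values and the helper name padding_bonus_sweep are illustrative assumptions, not recommendations from Gradium.

```python
import json

# Candidate values to audition; 0 is the documented default.
CANDIDATES = [-2.0, -1.0, 0.0, 1.0, 2.0]


def padding_bonus_sweep(voice_id, candidates=CANDIDATES):
    """Return one TTS setup message per candidate padding_bonus value."""
    return [
        {
            "type": "setup",
            "voice_id": voice_id,
            "model_name": "default",
            "output_format": "wav",
            "json_config": {"padding_bonus": value},
        }
        for value in candidates
    ]


for message in padding_bonus_sweep("<VOICE_ID>"):
    print(json.dumps(message))
```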

temp: How Do You Control Variation and Expressiveness?

temp (temperature) controls how deterministic voice generation is. Recommended values range from 0 to 1.4. The default is 0.7.

Lower values make speech more stable and predictable:

"json_config": {
  "temp": 0.3
}

Higher values introduce more variation in how speech is delivered. This is useful when you want more expressive delivery or variations of style between different generations with the same voice.

cfg_coef: How Do You Control Voice Similarity?

cfg_coef (classifier-free guidance coefficient) adjusts how closely the model matches the original voice. Recommended values range from 1.0 to 4.0. The default is 2.0.

"json_config": {
  "cfg_coef": 3.5
}

Higher values increase similarity to the original voice, but may introduce audio artifacts. Lower values allow the model more flexibility and range with the voice.

Choosing the right value depends on your preference and use case. You can experiment with this parameter in Gradium Studio across several voices.
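
When experimenting, it can help to flag values that fall outside the recommended ranges for temp and cfg_coef before sending them. The following is a sketch under the assumption that the ranges quoted in this article are recommendations rather than hard limits, so it reports instead of rejecting; out_of_range is a hypothetical helper.

```python
# Recommended ranges quoted in this article (assumed to be advisory, not enforced).
RECOMMENDED = {
    "temp": (0.0, 1.4),
    "cfg_coef": (1.0, 4.0),
}


def out_of_range(json_config):
    """Return {name: (value, low, high)} for values outside the recommended range."""
    problems = {}
    for name, (low, high) in RECOMMENDED.items():
        if name in json_config and not (low <= json_config[name] <= high):
            problems[name] = (json_config[name], low, high)
    return problems


print(out_of_range({"temp": 0.3, "cfg_coef": 3.5}))  # both within range
print(out_of_range({"temp": 1.8, "cfg_coef": 2.0}))  # temp is flagged
```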

What Are the json_config Parameters for STT?

json_config also works for Speech-to-Text. Pass it in the setup message alongside your input format:

{
  "type": "setup",
  "model_name": "default",
  "input_format": "pcm",
  "json_config": {
    "language": "en",
    "delay_in_frames": 10
  }
}

language: How Do You Improve Transcription Accuracy?

If you know the language being spoken, include it for better accuracy. If it is not specified, the model detects the language automatically.

delay_in_frames: How Do You Tune Responsiveness vs. Accuracy?

delay_in_frames specifies the latency as a number of audio frames. Each audio frame is 80 ms, so the value 10 used in the setup message above corresponds to 10 × 80 ms = 800 ms. Adjusting this value lets you tune responsiveness versus accuracy in streaming transcription.
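
The frame arithmetic can be sketched as two small conversions. The 80 ms frame length comes from the text above; the function names and the choice to round up are assumptions made for illustration.

```python
import math

FRAME_MS = 80  # each audio frame is 80 ms, per the text above


def frames_to_ms(frames):
    """Latency in milliseconds for a given delay_in_frames value."""
    return frames * FRAME_MS


def ms_to_frames(target_ms):
    """Smallest whole number of frames that covers at least target_ms."""
    return math.ceil(target_ms / FRAME_MS)


print(frames_to_ms(10))   # 800
print(ms_to_frames(500))  # 7, since 6 frames would cover only 480 ms
```

Rounding up is a conservative choice here; rounding down would favor lower latency instead.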

Summary: json_config Parameters at a Glance

| Parameter | Scope | Default | What it controls |
| --- | --- | --- | --- |
| `rewrite_rules` | TTS | none | Text normalization before synthesis (dates, numbers, emails, URLs) |
| `padding_bonus` | TTS | `0` | Speech delivery speed (positive = slower, negative = faster) |
| `temp` | TTS | `0.7` | Variation and expressiveness (range: 0 to 1.4) |
| `cfg_coef` | TTS | `2.0` | Voice similarity to original (range: 1.0 to 4.0) |
| `language` | STT | auto-detected | Language of the audio for better transcription accuracy |
| `delay_in_frames` | STT | not specified | Latency in audio frames (1 frame = 80 ms); tunes responsiveness vs. accuracy |