How to Multiplex TTS Requests Over One WebSocket Connection in Gradium
When you stream TTS over WebSocket, the standard flow is straightforward: open a connection, send your text, receive audio back, and the connection closes when the request finishes. That works well for many applications.
But for some use cases, like a voice agent producing short responses back to back, opening a new WebSocket for each generation adds overhead and complexity. The solution is multiplexing: reusing a single WebSocket connection to handle multiple independent TTS requests.
This guide explains how multiplexing works in Gradium using the Python realtime client.
What Is Multiplexing in TTS?
Multiplexing means sending multiple independent TTS requests over the same persistent WebSocket connection, rather than opening and closing a new connection for each one. Each request runs as its own generation, and their audio chunks can arrive interleaved over the same channel.
This avoids the overhead of multiple WebSocket handshakes and removes the burden of managing multiple simultaneous connections.
How Does Multiplexing Work in Gradium?
Step 1: How Do You Open a Single WebSocket Connection?
Start by opening one WebSocket connection. In this pattern, two asynchronous tasks run concurrently over that connection: one that sends requests and one that receives audio.
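The two-task pattern can be sketched with asyncio as follows. This is only an illustration: FakeConnection is a hypothetical in-memory stand-in that loops sent messages back, so the snippet runs without a server. The names sender, receiver, and run are likewise assumptions and not part of the Gradium client.

```python
import asyncio
import json


class FakeConnection:
    """Hypothetical stand-in for a WebSocket: loops each sent message back."""

    def __init__(self):
        self._inbox = asyncio.Queue()

    async def send(self, raw: str) -> None:
        await self._inbox.put(raw)

    async def recv(self) -> str:
        return await self._inbox.get()


async def sender(conn, messages):
    # Task 1: push every outgoing message onto the shared connection.
    for message in messages:
        await conn.send(json.dumps(message))


async def receiver(conn, expected):
    # Task 2: read from the same connection until `expected` messages arrived.
    received = []
    while len(received) < expected:
        received.append(json.loads(await conn.recv()))
    return received


async def run(messages):
    conn = FakeConnection()
    # gather() runs both coroutines concurrently over the one connection.
    _, received = await asyncio.gather(
        sender(conn, messages),
        receiver(conn, len(messages)),
    )
    return received
```

With a real WebSocket, the receiver would see the server's audio and end_of_stream messages rather than an echo of what was sent.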
Step 2: How Do You Disable Auto-Close with close_ws_on_eos?
By default, the server closes the WebSocket when a generation is complete. To reuse the connection across multiple generations, you need to tell the server not to do this.
In your setup message, in addition to the regular voice and output format parameters (and any json_config you may need), set close_ws_on_eos to false:
    {
        "type": "setup",
        "voice_id": "<YOUR_VOICE_ID>",
        "output_format": "wav",
        "close_ws_on_eos": false
    }
This keeps the connection open after each generation finishes, so it can be reused for the next one.
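In Python, the same setup message can be built as a dict and serialized with the standard json module. The helper name build_setup_message is an assumption for illustration. Note that Python's False serializes to the JSON literal false.

```python
import json


def build_setup_message(voice_id: str, output_format: str = "wav") -> str:
    """Serialize a setup message that keeps the WebSocket open after each generation."""
    setup = {
        "type": "setup",
        "voice_id": voice_id,
        "output_format": output_format,
        "close_ws_on_eos": False,  # becomes `false` in the JSON payload
    }
    return json.dumps(setup)
```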
Step 3: How Do You Tag Every Message with a client_request_id?
Since multiple generations share the same WebSocket, you need a way to know which message belongs to which generation. For every message you send, including the setup message and the end_of_stream signal, assign a unique client_request_id.
This can be any string. For example, two generations could use "blue" and "red" as identifiers.
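One way to make sure no message goes out untagged is to build the full message list for a generation and stamp the identifier on every item. The sketch below assumes a text message shaped as {"type": "text", "text": ...}. That shape and the helper name tagged_messages are assumptions, not confirmed by this guide.

```python
def tagged_messages(request_id: str, voice_id: str, text: str) -> list:
    """Return the messages for one generation, each carrying client_request_id."""
    messages = [
        {
            "type": "setup",
            "voice_id": voice_id,
            "output_format": "wav",
            "close_ws_on_eos": False,
        },
        {"type": "text", "text": text},  # assumed shape of a text message
        {"type": "end_of_stream"},
    ]
    # Stamp the same identifier on every message of this generation.
    return [{**message, "client_request_id": request_id} for message in messages]
```

Calling tagged_messages("blue", ...) and tagged_messages("red", ...) then yields two independent message sequences that can be sent over the same connection.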
Step 4: How Do You Route Incoming Audio by client_req_id?
On the receiving side, listen for incoming messages and check the client_req_id field on each one:
- If it is "blue", the chunk belongs to the first generation.
- If it is "red", it belongs to the second one.
This keeps the audio for each request cleanly separated, even if chunks from different generations arrive interleaved over the same connection.
When you receive an end_of_stream message for a given client_req_id, that generation is complete.
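The routing logic can be sketched as a small demultiplexer that keeps one buffer per client_req_id. It assumes incoming messages are already parsed into dicts and that audio messages have type "audio" with the payload in an "audio" field. Both are assumptions about the message shape, and the class name Demultiplexer is hypothetical.

```python
from collections import defaultdict


class Demultiplexer:
    """Route incoming messages to per-generation buffers by client_req_id."""

    def __init__(self):
        self.chunks = defaultdict(list)  # client_req_id -> list of audio payloads
        self.completed = set()           # ids that have received end_of_stream

    def handle(self, message: dict) -> None:
        request_id = message.get("client_req_id")
        kind = message.get("type")
        if kind == "audio":  # assumed type name for audio chunks
            self.chunks[request_id].append(message["audio"])
        elif kind == "end_of_stream":
            self.completed.add(request_id)
```

Because each chunk is appended to the buffer for its own identifier, interleaved arrival order does not mix the audio of different generations.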
Step 5: How Do You Disconnect Once All Generations Are Done?
Once all your generations are complete, disconnect the WebSocket. Count the end_of_stream messages you receive and trigger the disconnect when the number of completed generations matches the expected total.
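A receive loop that stops on this condition might look like the sketch below. Here an asyncio.Queue stands in for the incoming side of the WebSocket, and close is whatever coroutine closes your connection. Both are placeholders, as is the function name receive_until_done.

```python
import asyncio


async def receive_until_done(incoming: asyncio.Queue, expected_total: int, close) -> int:
    """Consume messages until `expected_total` generations ended, then close."""
    completed = 0
    while completed < expected_total:
        message = await incoming.get()
        if message.get("type") == "end_of_stream":
            completed += 1  # one more generation has finished
    await close()  # every expected generation is done: disconnect
    return completed
```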
Why Use Multiplexing?
| | Standard flow | Multiplexing |
|---|---|---|
| WebSocket connections | One per generation | One for all generations |
| Handshake overhead | Repeated for each request | Paid once |
| Connection management | Multiple connections to handle | Single connection to manage |
| Best for | Single or infrequent requests | Voice agents with back-to-back responses |
Summary: Key Multiplexing Parameters at a Glance
| Parameter | Where it goes | What it does |
|---|---|---|
| close_ws_on_eos | Setup message | Set to false to keep the connection open after a generation ends |
| client_request_id | Every sent message (setup, end_of_stream) | Unique string that tags messages belonging to the same generation |
| client_req_id | Every received message | Identifies which generation an incoming audio chunk belongs to |
Frequently Asked Questions
- What is WebSocket multiplexing in TTS?
- It is the pattern of sending multiple independent TTS requests over a single persistent WebSocket connection, rather than opening a new connection for each generation. In Gradium, this is done using close_ws_on_eos: false and unique client_request_id values per request.
- Why would I use multiplexing instead of separate WebSocket connections?
- Multiplexing avoids the overhead of repeated WebSocket handshakes and removes the need to manage multiple simultaneous connections. It is particularly useful for voice agents that produce short responses back to back.
- What does close_ws_on_eos do?
- By default, Gradium closes the WebSocket when a generation reaches its end of stream. Setting close_ws_on_eos to false in the setup message prevents this, keeping the connection available for the next generation.
- What is client_request_id used for?
- It is a unique string you assign to every message you send (including setup and end_of_stream). On the receiving side, the corresponding client_req_id field on incoming messages lets you route each audio chunk to the correct generation, even when chunks from multiple generations arrive interleaved.
- Can client_request_id be any string?
- Yes. It can be any string you choose, as long as it is unique per generation within the session. The example in the tutorial uses "blue" and "red".
- How do I know when a generation is complete?
- When you receive an end_of_stream message with a given client_req_id, that generation is over. Once all expected generations are complete, you disconnect the WebSocket.
- What Python client is used for multiplexing in Gradium?
- Multiplexing is done using the Gradium Python realtime client, with two concurrent asynchronous tasks: one for sending requests and one for receiving audio.