API Reference

Overview

Real-time text-to-speech synthesis API. Send text, receive streaming PCM audio chunks.

After connecting and receiving SessionReady, send your text as a binary InteractionInput message followed by a JSON EndInteraction. The server synthesizes speech and streams back audio chunks as binary InteractionResponse messages. The final chunk has is_final set to true.

Production deployments: Connect to the real-time WebSocket API from a backend server to keep your API key secure.


How It Works

  1. Connect to the WebSocket endpoint with your API key and config ID

  2. Receive SessionReady — the server has allocated inference resources for your session

  3. Send text as a binary InteractionInput message (payload type 0 for text)

  4. Send EndInteraction (JSON) to signal that input is complete and synthesis should begin

  5. Receive audio chunks — binary InteractionResponse messages containing PCM int16 audio at 24 kHz

  6. Detect completion — the last response has is_final: true

Audio Output Format

Property         Value
Format           PCM signed 16-bit integers (little-endian samples)
Sample rate      24,000 Hz
Channels         1 (mono)
Bits per sample  16
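
The format above implies 48,000 bytes of audio per second (24,000 samples x 2 bytes x 1 channel). A minimal sketch of decoding a chunk and computing its duration follows; the function names are illustrative and not part of the API.

```python
import struct

SAMPLE_RATE_HZ = 24_000   # sample rate from the format table
BYTES_PER_SAMPLE = 2      # 16 bits per sample
CHANNELS = 1              # mono


def decode_pcm16le(chunk: bytes) -> list[int]:
    """Decode little-endian signed 16-bit PCM bytes into Python ints."""
    if len(chunk) % BYTES_PER_SAMPLE:
        raise ValueError("PCM chunk length must be a multiple of 2 bytes")
    count = len(chunk) // BYTES_PER_SAMPLE
    return list(struct.unpack(f"<{count}h", chunk))


def duration_seconds(num_bytes: int) -> float:
    """Playback duration of num_bytes of audio in this format."""
    return num_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)
```

Note that the samples are little-endian even though the binary message headers are big-endian, so the audio payload needs the "<" prefix rather than "!".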


Connection Flow


WebSocket Handshake

Open WebSocket connection: GET /

Connect to the WebSocket endpoint, providing your API key in the Authorization header and the configuration ID in the config_id query parameter. The server upgrades the connection to WebSocket. After sending SessionReady, the server waits for text input.

Recommended WebSocket settings:

  • open_timeout: None (model loading may take time on cold start)

  • ping_timeout: None

Authorizations
Authorization (string, required)

Raw API key (no Bearer prefix).

Query parameters
config_id (string, required)

Configuration ID for the TTS voice, created via API or in the Oris Voice tab of the dashboard.

Header parameters
Authorization (string, required)

Your raw API key. No Bearer prefix.

Example: your-api-key
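
The handshake parameters above can be assembled as in the following sketch. The endpoint URL is a placeholder (this page does not state the host), and build_connection is a hypothetical helper, not part of any SDK.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the real WebSocket URL for your deployment.
BASE_URL = "wss://example.invalid/"


def build_connection(api_key: str, config_id: str, base_url: str = BASE_URL):
    """Return (url, headers, options) for a WebSocket client."""
    url = f"{base_url}?{urlencode({'config_id': config_id})}"
    headers = {"Authorization": api_key}  # raw key, no "Bearer " prefix
    # Recommended settings: no handshake timeout, no ping timeout.
    options = {"open_timeout": None, "ping_timeout": None}
    return url, headers, options
```

With the Python websockets library, the options can be passed as keyword arguments to websockets.connect. The keyword for request headers depends on the library version (additional_headers in recent releases, extra_headers in older ones), so check the version you have installed.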

Message Format

Mixed message types: JSON (text) and binary messages share the same WebSocket connection. Your client must check the WebSocket frame type of each received message to distinguish them:

  • Text frames (JSON): SessionReady, EndInteraction, CancelInteraction, ErrorResponse

  • Binary frames: InteractionInput, InteractionResponse

Byte order: All multi-byte integer fields in binary messages use network byte order (big-endian).
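
A minimal sketch of dispatching on frame type, assuming a client library that delivers text frames as str and binary frames as bytes (as the Python websockets library does). The route_frame helper is illustrative. It also tolerates text frames that are not valid JSON, since the Error Handling guidance mentions plain text error strings.

```python
import json


def route_frame(frame):
    """Classify a received WebSocket frame by its type.

    Returns ("json", dict), ("text", str) or ("binary", bytes).
    """
    if isinstance(frame, (bytes, bytearray)):
        return "binary", bytes(frame)
    try:
        parsed = json.loads(frame)
    except json.JSONDecodeError:
        return "text", frame  # plain-text error string
    if isinstance(parsed, dict):
        return "json", parsed
    return "text", frame  # valid JSON, but not an object
```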


Messages Reference

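Server -> Client Messages

  • SessionReady (JSON)

  • InteractionResponse (binary)

  • ErrorResponse (JSON)

Client -> Server Messages

  • InteractionInput (binary)

  • EndInteraction (JSON)

  • CancelInteraction (JSON)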


Message Details

InteractionInput (Client -> Server, Binary)

Binary message for sending text to the server.

Binary structure:

Header fields use big-endian byte order. In Python: struct.pack('!BQI', payload_type, timestamp, params_size).

Text requirements:

Property          Value
Encoding          UTF-8
Payload type      0 (text)
Max message size  512 KB (entire binary message including header)

Example:
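
A minimal sketch of packing the message with the header format given above. Several details are assumptions that this page does not confirm: that the header is followed by params_size bytes of parameters and then the UTF-8 text, that the timestamp is in milliseconds since the Unix epoch, and that 512 KB means 512 x 1024 bytes. build_interaction_input is a hypothetical helper.

```python
import struct
import time

HEADER_FORMAT = "!BQI"          # payload_type, timestamp, params_size
PAYLOAD_TYPE_TEXT = 0
MAX_MESSAGE_BYTES = 512 * 1024  # assumes 1 KB = 1024 bytes


def build_interaction_input(text, params=b"", timestamp=None):
    """Pack a text InteractionInput message.

    Assumed layout (not confirmed by this page): header, then params_size
    bytes of params, then the UTF-8 text. The timestamp unit is assumed
    to be milliseconds since the Unix epoch.
    """
    if timestamp is None:
        timestamp = int(time.time() * 1000)
    header = struct.pack(HEADER_FORMAT, PAYLOAD_TYPE_TEXT, timestamp, len(params))
    message = header + params + text.encode("utf-8")
    if len(message) > MAX_MESSAGE_BYTES:
        raise ValueError("message exceeds the 512 KB limit")
    return message
```

The "!" prefix selects network byte order with no alignment padding, so the header is always 13 bytes (1 + 8 + 4).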


InteractionResponse (Server -> Client, Binary)

Binary message containing an audio chunk. The server streams these after receiving text input and EndInteraction.

Binary structure:

All multi-byte integers are big-endian. In Python: struct.unpack('!B16sQIII', header_bytes) for the main header, struct.unpack('!IB', entry_bytes) for each payload entry.

Payload types:

Type       Format                  Description
1 (audio)  PCM int16, 24 kHz mono  Streaming audio chunk (variable size)

Parsing example:
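
A minimal sketch of the first parsing step, using only the two struct formats quoted above. It separates the fixed-size main header from the bytes that follow and returns the header as a raw tuple. It deliberately does not name the fields or locate is_final, because the field meanings are defined by the message's binary structure and are not assumed here. split_response is a hypothetical helper.

```python
import struct

MAIN_HEADER = struct.Struct("!B16sQIII")  # 37 bytes, no padding with "!"
ENTRY_HEADER = struct.Struct("!IB")       # 5 bytes per payload entry


def split_response(message: bytes):
    """Return (header_fields, remaining_bytes) for an InteractionResponse.

    header_fields is the raw 6-tuple in wire order; this sketch does not
    assign names to the fields.
    """
    if len(message) < MAIN_HEADER.size:
        raise ValueError("message shorter than the 37-byte main header")
    fields = MAIN_HEADER.unpack_from(message, 0)
    return fields, message[MAIN_HEADER.size:]
```

Each payload entry header in the remaining bytes can be read the same way with ENTRY_HEADER.unpack_from at the appropriate offset.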


EndInteraction vs CancelInteraction

EndInteraction
  Purpose:          Graceful finish
  Server behavior:  Completes synthesis, sends remaining chunks with last marked is_final: true
  Use case:         Normal completion after sending text

CancelInteraction
  Purpose:          Immediate stop
  Server behavior:  Stops synthesis, discards remaining audio
  Use case:         User interruption or abort


ErrorResponse (Server -> Client, JSON)

Error codes:

Code                 Description
AUTH_FAILED          Invalid API key
UNAUTHORIZED         Caller lacks permission
MISSING_CONFIG_ID    config_id query parameter not provided
INVALID_MESSAGE      Malformed or unsupported message payload
INVALID_HEADERS      Missing or invalid headers
MODEL_NOT_FOUND      Config ID not found or invalid
BACKEND_UNAVAILABLE  No healthy inference backend available
RATE_LIMITED         Too many requests
TIMEOUT              Operation exceeded processing time
CANCELLED            Interaction cancelled by client
INTERNAL_ERROR       Unexpected server error
FRAME_SIZE_EXCEEDED  Message exceeded 512 KB limit
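
One way a client might act on these codes is sketched below. The grouping into retryable and non-retryable codes is a client-side policy inferred from the descriptions, not behavior documented by the server, and error_action is a hypothetical helper.

```python
# Assumed policy: transient server-side conditions are worth retrying.
RETRYABLE = {"BACKEND_UNAVAILABLE", "RATE_LIMITED", "TIMEOUT", "INTERNAL_ERROR"}

# Assumed policy: these need a change to the request or credentials.
CLIENT_FAULT = {
    "AUTH_FAILED", "UNAUTHORIZED", "MISSING_CONFIG_ID", "INVALID_MESSAGE",
    "INVALID_HEADERS", "MODEL_NOT_FOUND", "FRAME_SIZE_EXCEEDED",
}


def error_action(code: str) -> str:
    """Return "retry", "fix_request", "ignore" or "unknown" for an error code."""
    if code in RETRYABLE:
        return "retry"
    if code in CLIENT_FAULT:
        return "fix_request"
    if code == "CANCELLED":
        return "ignore"  # the client asked for the cancellation
    return "unknown"
```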


Rate Limits & Constraints

Constraint             Value
Max message size       512 KB per message
Max generation length  ~30 seconds per interaction (360 tokens at 12 Hz)

Exceeding limits results in an ErrorResponse with the appropriate code.
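
The generation limit can be checked with a little arithmetic, reading "12 Hz" as 12 tokens per second of audio and using the output format of 24 kHz, 16-bit mono. This is a worked calculation, not an API call.

```python
MAX_TOKENS = 360
TOKENS_PER_SECOND = 12     # "12 Hz" read as tokens per second of audio
SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2       # 16-bit mono

max_seconds = MAX_TOKENS / TOKENS_PER_SECOND
max_pcm_bytes = int(max_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
print(max_seconds, max_pcm_bytes)  # 30.0 1440000
```

A full-length interaction therefore produces about 1.44 MB of raw PCM, delivered across many chunks.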


Best Practices

Text Input

  • Send the full text in a single InteractionInput message, then immediately send EndInteraction

  • The model handles sentence segmentation and streaming internally

  • For very long texts, consider splitting into sentences and making separate requests
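
For the long-text case, a naive sentence splitter might look like the sketch below. It is a simple regular-expression heuristic, not the segmentation the model uses internally, and it will split incorrectly after abbreviations such as "Dr.".

```python
import re

# Split at whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]
```

Each returned sentence can then be sent as its own interaction (InteractionInput followed by EndInteraction).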

Streaming Playback

  • Process audio chunks as they arrive for lowest perceived latency

  • Buffer a small amount (2 to 3 chunks) before starting playback to absorb network jitter

  • The server generates audio faster than realtime, so chunks will arrive ahead of playback
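
The prebuffering advice above can be sketched as a small helper that holds chunks back until enough have arrived. PrebufferedPlayer is hypothetical, and play is assumed to be any callable that hands one PCM chunk to an audio output that paces itself.

```python
from collections import deque


class PrebufferedPlayer:
    """Hold back playback until `prebuffer` chunks have arrived."""

    def __init__(self, play, prebuffer=3):
        self._play = play            # callable that consumes one PCM chunk
        self._prebuffer = prebuffer
        self._pending = deque()
        self._started = False

    def push(self, chunk: bytes, is_final: bool = False) -> None:
        self._pending.append(chunk)
        # Start once enough chunks are queued, or when the stream ends early.
        if not self._started and (len(self._pending) >= self._prebuffer or is_final):
            self._started = True
        if self._started:
            while self._pending:
                self._play(self._pending.popleft())
```

Passing is_final lets short utterances that never reach the prebuffer threshold still play.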

Error Handling

  • Handle both JSON ErrorResponse messages and plain text error strings

  • Implement exponential backoff for reconnection

  • Check the SessionReady message before sending any input
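
Exponential backoff for reconnection can be sketched as a generator of delays. The base, factor, and cap values are illustrative defaults, not values recommended by the API.

```python
def backoff_delays(attempts, base=0.5, factor=2.0, cap=30.0):
    """Yield one delay in seconds per reconnection attempt, growing up to cap."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor
```

A caller would sleep for each yielded delay before the next connection attempt. Adding a small random jitter to each delay helps avoid many clients reconnecting at the same moment.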


Complete Example


Troubleshooting

Connection Issues

  • Verify API key and config ID

  • Check that the config exists in the dashboard

  • Ensure network allows WebSocket connections (port 443)

  • The Authorization header uses the raw API key (no Bearer prefix)

No Audio Received

  • Confirm you received SessionReady before sending text

  • Make sure you send EndInteraction after the text input — synthesis does not start until the server receives it

  • Check message size is under 512 KB

Audio Quality Issues

  • Verify the output is written as 24 kHz, 16-bit mono WAV

  • Check the language parameter matches your input text, or use "Auto"

  • For voice cloning, ensure the reference audio is clean and at least 5 seconds long
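
Wrong WAV header parameters are a common cause of audio that plays too fast, too slow, or as noise. A minimal sketch of writing the received PCM with the standard-library wave module follows; pcm_to_wav is an illustrative helper name.

```python
import wave


def pcm_to_wav(pcm: bytes, dest) -> None:
    """Write raw 24 kHz, 16-bit mono PCM to dest (a path string or binary file object)."""
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # bytes per sample (16-bit)
        wav.setframerate(24_000)  # 24 kHz
        wav.writeframes(pcm)
```

WAV stores samples little-endian, which matches the output format, so the concatenated chunk payloads can be written without byte swapping.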

Unexpected Silence or Truncation

  • Check max_new_tokens — the default (360) caps output at ~30 seconds

  • If the text is very long, consider splitting into smaller segments

