API Reference

Overview

Real-time talking head synthesis API. Send audio, receive synchronized video frames.

Connection Flow

Basic Flow

  1. Connect → Receive SessionReady

  2. Send audio chunks via InteractionInput (binary)

  3. Receive video frames via InteractionResponse (binary)

  4. End with EndInteraction (JSON)


WebSocket Handshake

Open WebSocket connection (handshake)

GET /

The client connects to the WebSocket endpoint, passing an API key in the Authorization header and a config_id query parameter. The server then upgrades the connection to WebSocket.

Header parameters

  • Authorization (string, required): Raw API key, sent without a 'Bearer ' prefix. Example: <API_KEY>

Query parameters

  • config_id (string, required): Configuration ID for the persona (e.g., oris config).

Responses

  • No content
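
A minimal sketch of preparing the handshake request. The host `wss://api.example.com/` and the helper name `build_handshake` are hypothetical, since the real endpoint is not given here. The function only builds the URL and headers; pass them to whichever WebSocket client you use.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

# ASSUMPTION: placeholder endpoint; the real host is not stated in this reference.
BASE_URL = "wss://api.example.com/"


def build_handshake(api_key: str, config_id: str, base_url: str = BASE_URL):
    """Return (url, headers) for the WebSocket upgrade request."""
    # The key is sent raw, so reject a value that already carries a scheme prefix.
    if api_key.lower().startswith("bearer "):
        raise ValueError("send the raw API key, without a 'Bearer ' prefix")
    parts = urlsplit(base_url)
    query = urlencode({"config_id": config_id})
    url = urlunsplit((parts.scheme, parts.netloc, parts.path or "/", query, ""))
    headers = {"Authorization": api_key}
    return url, headers


url, headers = build_handshake("<API_KEY>", "my-config")
```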


Messages Reference

Session Messages

Interaction Messages

Error Messages


Message Details

InteractionInput (Client → Server)

Binary message for sending audio data.

Structure:

Audio Requirements:

  • Format: PCM int16 (signed 16-bit integers)

  • Sample Rate: 16kHz

  • Channels: Mono

  • Max Size: Entire message must be under 512KB

Optional Parameters (JSON):

  • speech_filter_amount: Smoothing for speech frames (higher = smoother)

  • idle_filter_amount: Smoothing for idle frames

  • idle_mouth_opening_scale: Mouth movement scale during idle (0.0 = closed)

  • speech_mouth_opening_scale: Mouth movement scale during speech (1.0 = full)

  • client_frame_index: Current frame index client is displaying (for smooth transitions)
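
A minimal sketch of serializing the optional parameters, assuming they travel as one flat JSON object. The field names come from the list above; the helper name `build_params` is hypothetical, and how the JSON is embedded in the binary message is not shown in this section.

```python
import json
from typing import Optional


def build_params(
    speech_filter_amount: Optional[float] = None,
    idle_filter_amount: Optional[float] = None,
    idle_mouth_opening_scale: Optional[float] = None,
    speech_mouth_opening_scale: Optional[float] = None,
    client_frame_index: Optional[int] = None,
) -> str:
    """Serialize only the parameters the caller actually set."""
    candidates = {
        "speech_filter_amount": speech_filter_amount,
        "idle_filter_amount": idle_filter_amount,
        "idle_mouth_opening_scale": idle_mouth_opening_scale,
        "speech_mouth_opening_scale": speech_mouth_opening_scale,
        "client_frame_index": client_frame_index,
    }
    # Compare against None so that a legitimate 0.0 (mouth closed) is kept.
    chosen = {k: v for k, v in candidates.items() if v is not None}
    return json.dumps(chosen, separators=(",", ":"))


params_json = build_params(idle_mouth_opening_scale=0.0, client_frame_index=42)
```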

Recommended Streaming Pattern:

Send audio in 400ms chunks (6400 samples = 12800 bytes at 16kHz) for optimal frame generation.
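
The chunk arithmetic can be sketched as follows. This splits raw PCM int16 mono audio into 400 ms pieces; the helper name `chunk_pcm` is hypothetical, and 512KB is assumed to mean 512 × 1024 bytes.

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2             # PCM int16
MAX_MESSAGE_BYTES = 512 * 1024   # ASSUMPTION: 512KB read as 512 * 1024 bytes


def chunk_pcm(pcm: bytes, chunk_ms: int = 400):
    """Yield consecutive chunks of mono PCM int16 audio, each chunk_ms long."""
    if len(pcm) % BYTES_PER_SAMPLE:
        raise ValueError("PCM int16 data must have an even byte length")
    # 400 ms -> 6400 samples -> 12800 bytes
    chunk_bytes = SAMPLE_RATE_HZ * chunk_ms // 1000 * BYTES_PER_SAMPLE
    if not 0 < chunk_bytes < MAX_MESSAGE_BYTES:
        raise ValueError("chunk size must be positive and under the message limit")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]  # the last chunk may be shorter


one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
chunk_sizes = [len(c) for c in chunk_pcm(one_second_of_silence)]
```

The size check covers only the audio bytes; any header or JSON parameters in the same message also count toward the limit.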


InteractionResponse (Server → Client)

Binary message containing video frame and synchronized audio.

Structure:

Interaction ID:

  • 00000000-0000-0000-0000-000000000000: Idle frames (persona at rest)

  • Any other UUID: Speech frames (generated from your audio)
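
A minimal sketch of telling idle frames from speech frames, assuming the interaction ID has already been decoded to its textual UUID form (the binary layout is not shown in this section). The helper name `is_idle_frame` is hypothetical.

```python
import uuid

# The all-zero (nil) UUID marks idle frames.
IDLE_INTERACTION_ID = uuid.UUID(int=0)


def is_idle_frame(interaction_id: str) -> bool:
    """True for idle frames, False for speech frames."""
    return uuid.UUID(interaction_id) == IDLE_INTERACTION_ID


idle = is_idle_frame("00000000-0000-0000-0000-000000000000")
speech = is_idle_frame(str(uuid.uuid4()))
```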

Payload Types:

  • Type 1 (audio): PCM int16, 16kHz mono, typically 640 samples (1280 bytes) per frame (40ms at 25fps)

  • Type 2 (image): JPEG-encoded image (resolution depends on config, e.g., 1280x720)
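
Two small sanity checks for received payloads, as a sketch. The helper names are hypothetical. The duration check follows from the audio format (2 bytes per sample at 16 kHz); the image check looks only for the JPEG start-of-image marker (FF D8) and does not validate the rest of the file.

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # PCM int16, mono


def audio_duration_ms(payload: bytes) -> float:
    """Duration of a Type 1 payload in milliseconds."""
    return len(payload) * 1000 / (BYTES_PER_SAMPLE * SAMPLE_RATE_HZ)


def looks_like_jpeg(payload: bytes) -> bool:
    """True if a Type 2 payload starts with the JPEG SOI marker."""
    return payload[:2] == b"\xff\xd8"


frame_audio_ms = audio_duration_ms(bytes(640 * BYTES_PER_SAMPLE))  # 640 samples
```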

Playback Guidelines:

The server generates frames at 25 fps (one frame every 40ms). For smooth playback:

  1. Queue received frames in a buffer

  2. Play at exactly 25 fps (40ms per frame)

  3. Cache idle frames and loop during silence

  4. Switch to speech frames when available

  5. Stop when is_final_response: true
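
The frame selection in steps 3 and 4 can be sketched as below. The class `FrameSelector` and its method names are hypothetical, and timing is left out on purpose: call `next_frame()` once per 40 ms tick.

```python
from collections import deque
from typing import Deque, List, Optional


class FrameSelector:
    """Chooses which frame to show on each playback tick."""

    def __init__(self) -> None:
        self.idle_frames: List[bytes] = []          # cached and looped during silence
        self.speech_frames: Deque[bytes] = deque()  # played once, in arrival order
        self._idle_pos = 0

    def on_frame(self, image: bytes, is_idle: bool) -> None:
        """Store a received frame in the idle cache or the speech queue."""
        if is_idle:
            self.idle_frames.append(image)
        else:
            self.speech_frames.append(image)

    def next_frame(self) -> Optional[bytes]:
        """Speech frames take priority; otherwise loop the idle cache."""
        if self.speech_frames:
            return self.speech_frames.popleft()
        if not self.idle_frames:
            return None  # nothing received yet
        frame = self.idle_frames[self._idle_pos % len(self.idle_frames)]
        self._idle_pos += 1
        return frame


selector = FrameSelector()
selector.on_frame(b"idle-a", is_idle=True)
selector.on_frame(b"idle-b", is_idle=True)
shown = [selector.next_frame() for _ in range(3)]
selector.on_frame(b"speech-1", is_idle=False)
shown.append(selector.next_frame())
shown.append(selector.next_frame())
```

This sketch lets the idle cache grow without bound; a real client would probably cap it.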


EndInteraction vs CancelInteraction

EndInteraction

  • Purpose: Graceful finish

  • Server Behavior: Completes processing and sends all remaining frames, with the final frame marked is_final_response: true

  • Use Case: Normal conversation end

CancelInteraction

  • Purpose: Immediate stop

  • Server Behavior: Stops processing immediately and discards remaining frames

  • Use Case: User interruption, cancel request
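
A minimal sketch of the client-side choice between the two messages. The function and enum names are hypothetical, and the sketch returns only the message name, because the JSON schema of these messages is not shown in this section.

```python
from collections import deque
from enum import Enum


class Finish(Enum):
    END = "EndInteraction"
    CANCEL = "CancelInteraction"


def finish_interaction(frame_buffer: deque, user_interrupted: bool) -> Finish:
    """Pick the message to send and tidy the local frame buffer."""
    if user_interrupted:
        # The server discards its remaining frames, so drop ours as well.
        frame_buffer.clear()
        return Finish.CANCEL
    # Graceful end: keep playing buffered frames until is_final_response: true.
    return Finish.END


pending = deque([b"frame-1", b"frame-2"])
choice = finish_interaction(pending, user_interrupted=True)
```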


Rate Limits & Constraints

  • Rate Limit: 6 requests per second (rps)

  • Max Message Size: 512KB per message

  • Video Output: 25 fps (frames bundled in groups of 10)

Exceeding these limits results in an ErrorResponse (for example RATE_LIMITED or FRAME_SIZE_EXCEEDED) or in the server buffering requests.
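
A client-side pacing sketch that keeps sends under 6 per second using a sliding one-second window. The class `SendPacer` is hypothetical, and it assumes each outgoing message counts as one request. The clock is injectable so the logic can be tested without sleeping.

```python
import time
from collections import deque
from typing import Callable, Deque


class SendPacer:
    """Allows at most max_per_second sends in any one-second window."""

    def __init__(self, max_per_second: int = 6,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_per_second = max_per_second
        self.clock = clock
        self._sent: Deque[float] = deque()  # timestamps of recent sends

    def delay_before_send(self) -> float:
        """Seconds to wait before the next send is allowed (0.0 if none)."""
        now = self.clock()
        # Forget sends that have left the one-second window.
        while self._sent and now - self._sent[0] >= 1.0:
            self._sent.popleft()
        if len(self._sent) < self.max_per_second:
            return 0.0
        return self._sent[0] + 1.0 - now

    def record_send(self) -> None:
        """Call right after a message has been sent."""
        self._sent.append(self.clock())


fake_now = [0.0]
pacer = SendPacer(clock=lambda: fake_now[0])
for _ in range(6):
    pacer.record_send()
wait_at_start = pacer.delay_before_send()
```

At the recommended 400 ms chunk size, real-time streaming produces 2.5 audio messages per second, so pacing mainly matters when sending pre-recorded audio faster than real time.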


Error Codes

Common error codes returned in ErrorResponse:

  • AUTH_FAILED: Invalid API key

  • INVALID_PERSONA_ID_CONFIGURATION: Config ID not found or invalid

  • FAILED_CREATE_MODEL: Server couldn't load persona model

  • FRAME_SIZE_EXCEEDED: Message exceeded 512KB limit

  • INVALID_INTERACTION_ID: Interaction ID mismatch or invalid

  • NO_BACKEND_SERVER_AVAILABLE: Service temporarily unavailable

  • RATE_LIMITED: Too many requests

  • TIMEOUT: Operation exceeded processing time

  • INTERNAL_ERROR: Unexpected server error
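
One way a client might act on these codes, as a sketch. The split into retryable and non-retryable codes is an assumption inferred from the descriptions, not something this reference specifies, and `should_retry` is a hypothetical helper.

```python
# ASSUMPTION: this classification is a client-side guess based on the
# descriptions above; the reference does not say which errors are transient.
RETRYABLE = frozenset({
    "NO_BACKEND_SERVER_AVAILABLE",
    "RATE_LIMITED",
    "TIMEOUT",
    "INTERNAL_ERROR",
})
NOT_RETRYABLE = frozenset({
    "AUTH_FAILED",
    "INVALID_PERSONA_ID_CONFIGURATION",
    "FAILED_CREATE_MODEL",
    "FRAME_SIZE_EXCEEDED",
    "INVALID_INTERACTION_ID",
})


def should_retry(code: str) -> bool:
    """True if retrying later might succeed; unknown codes are not retried."""
    return code in RETRYABLE


retry_rate_limited = should_retry("RATE_LIMITED")
retry_auth_failed = should_retry("AUTH_FAILED")
```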


Best Practices

1. Audio Chunking

Send audio in 400ms chunks (6400 samples at 16kHz) for smooth frame generation.

2. Buffer Management

  • Queue at least 10 frames before starting playback

  • Play at exactly 25 fps to avoid jitter

  • Implement frame interpolation if needed
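
A sketch of the timing side of playback, with hypothetical helper names. Each frame's deadline is computed from the playback start time rather than by sleeping 40 ms after each frame, so small per-frame delays do not accumulate into drift.

```python
FRAME_INTERVAL_S = 1 / 25   # 40 ms per frame
PREBUFFER_FRAMES = 10       # frames to queue before starting playback


def ready_to_start(buffered_frames: int) -> bool:
    """True once enough frames are queued to begin playback."""
    return buffered_frames >= PREBUFFER_FRAMES


def frame_deadline(start_time: float, frame_index: int) -> float:
    """Time (on the same clock as start_time) at which a frame is due."""
    return start_time + frame_index * FRAME_INTERVAL_S


deadline_after_one_second = frame_deadline(100.0, 25)
```

In a playback loop, take `start_time` from `time.monotonic()` and sleep for `max(0.0, frame_deadline(start_time, i) - time.monotonic())` before showing frame `i`.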

3. Idle Frame Handling

  • Cache idle frames when you first connect

  • Loop idle frames during silence

  • Transition smoothly to speech frames using client_frame_index parameter

4. Error Handling

  • Always handle ErrorResponse messages

  • Implement exponential backoff for reconnection

  • Log errors with interaction_id for debugging
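
A minimal sketch of exponential backoff for reconnection. The base delay, growth factor, cap, and attempt count are illustrative values, not recommendations from this reference. Passing a `random.Random` instance enables full jitter, which spreads out reconnects from many clients.

```python
import random
from typing import Iterator, Optional


def backoff_delays(base: float = 0.5, factor: float = 2.0, cap: float = 30.0,
                   attempts: int = 6,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield the delay in seconds to wait before each reconnection attempt."""
    for attempt in range(attempts):
        delay = min(cap, base * factor ** attempt)
        # With an rng, pick a random delay in [0, delay] (full jitter).
        yield rng.uniform(0, delay) if rng is not None else delay


plain = list(backoff_delays())
jittered = list(backoff_delays(rng=random.Random(0)))
```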

5. Rate Limiting

  • Don't exceed 6 requests per second

  • Batch audio chunks if needed

  • Monitor server load in SessionReady message

6. Interruption Handling

  • Use CancelInteraction for immediate stops

  • Use EndInteraction for graceful endings

  • Clear frame buffers on interruption

7. Frame Synchronization

The server pre-bundles audio with each video frame, so no client-side synchronization is needed.


Troubleshooting


Complete Example

