API Reference
Overview
Real-time talking head synthesis API. Send audio, receive synchronized video frames.
Connection Flow
Basic Flow
1. Connect → Receive SessionReady
2. Send audio chunks via InteractionInput (binary)
3. Receive video frames via InteractionResponse (binary)
4. End with EndInteraction (JSON)
WebSocket Handshake
The client connects to the WebSocket endpoint, providing an API key in the Authorization header and a config_id query parameter. The server then upgrades the connection to WebSocket.
Header Parameters:
Authorization: Raw API key as used by the client (no 'Bearer ' prefix), e.g. <API_KEY>
Query Parameters:
config_id: Configuration ID for the persona (e.g., oris config)
Responses:
101 Switching Protocols: WebSocket upgrade
401 Unauthorized: Invalid or missing API key
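The handshake inputs described above can be assembled as in the following minimal sketch. The endpoint URL and the helper name build_handshake are placeholders invented for illustration; only the Authorization header and the config_id query parameter come from this reference. The returned URL and headers would then be passed to whichever WebSocket client library you use.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_handshake(endpoint: str, api_key: str, config_id: str) -> tuple[str, dict[str, str]]:
    """Return the WebSocket URL and request headers for the upgrade request.

    `endpoint` is a placeholder: substitute the real WS endpoint of the service.
    """
    # The reference asks for the raw key, so reject an accidental 'Bearer ' prefix.
    if api_key.lower().startswith("bearer "):
        raise ValueError("send the raw API key, without a 'Bearer ' prefix")

    parts = urlsplit(endpoint)
    query = urlencode({"config_id": config_id})
    # Preserve any query string that is already present on the endpoint.
    if parts.query:
        query = f"{parts.query}&{query}"

    url = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    return url, {"Authorization": api_key}


# Hypothetical values, for illustration only.
url, headers = build_handshake("wss://api.example.invalid/ws", "<API_KEY>", "my-config")
```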
Messages Reference
Session Messages
Interaction Messages
Error Messages
Message Details
InteractionInput (Client → Server)
Binary message for sending audio data.
Structure:
Audio Requirements:
Format: PCM int16 (signed 16-bit integers)
Sample Rate: 16kHz
Channels: Mono
Max Size: Entire message must be under 512KB
Optional Parameters (JSON):
speech_filter_amount: Smoothing for speech frames (higher = smoother)
idle_filter_amount: Smoothing for idle frames
idle_mouth_opening_scale: Mouth movement scale during idle (0.0 = closed)
speech_mouth_opening_scale: Mouth movement scale during speech (1.0 = full)
client_frame_index: Current frame index the client is displaying (for smooth transitions)
Recommended Streaming Pattern:
Send audio in 400ms chunks (6400 samples = 12800 bytes at 16kHz) for optimal frame generation.
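The chunk arithmetic above (16,000 samples/s × 0.4 s × 2 bytes = 12,800 bytes) can be sketched as a small splitter. This covers only the raw PCM payload; how the payload is wrapped into an InteractionInput message is not shown here, and the function name chunk_pcm is an illustrative choice.

```python
SAMPLE_RATE_HZ = 16_000          # 16kHz mono, per the audio requirements
BYTES_PER_SAMPLE = 2             # PCM int16
MAX_MESSAGE_BYTES = 512 * 1024   # whole message must stay under this


def chunk_pcm(pcm: bytes, chunk_ms: int = 400) -> list[bytes]:
    """Split raw PCM int16 mono audio into chunks of `chunk_ms` milliseconds.

    The last chunk may be shorter than the others.
    """
    if len(pcm) % BYTES_PER_SAMPLE:
        raise ValueError("PCM int16 data must contain an even number of bytes")

    chunk_bytes = SAMPLE_RATE_HZ * chunk_ms // 1000 * BYTES_PER_SAMPLE
    # The audio alone must already fit, leaving room for the rest of the message.
    if not 0 < chunk_bytes < MAX_MESSAGE_BYTES:
        raise ValueError("chunk size must be positive and under the 512KB limit")

    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]


# One second of silence: two full 400ms chunks plus a 200ms remainder.
chunks = chunk_pcm(bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE))
```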
InteractionResponse (Server → Client)
Binary message containing video frame and synchronized audio.
Structure:
Interaction ID:
00000000-0000-0000-0000-000000000000: Idle frames (persona at rest)
Any other UUID: Speech frames (generated from your audio)
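A client can tell idle frames from speech frames by comparing the interaction ID against the all-zero UUID. The sketch below assumes the ID has already been decoded from the binary message into its textual form; the decoding step itself is not shown, and is_idle_frame is a hypothetical helper name.

```python
import uuid

# The all-zero (nil) UUID marks idle frames.
IDLE_INTERACTION_ID = uuid.UUID(int=0)


def is_idle_frame(interaction_id: str) -> bool:
    """Return True if the frame belongs to the idle loop rather than to speech."""
    return uuid.UUID(interaction_id) == IDLE_INTERACTION_ID
```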
Payload Types:
Type 1 (audio): PCM int16, 16kHz mono, typically 640 bytes (40ms at 25fps)
Type 2 (image): JPEG-encoded image (resolution depends on config, e.g., 1280x720)
Playback Guidelines:
The server generates frames at 25 fps (one frame every 40ms). For smooth playback:
Queue received frames in a buffer
Play at exactly 25 fps (40ms per frame)
Cache idle frames and loop during silence
Switch to speech frames when available
Stop when is_final_response: true
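The playback guidelines above could be implemented roughly as follows. This is a sketch under stated assumptions: the Frame fields, the FramePlayer class, and the idle cache size of 50 are all invented for illustration, and the caller is expected to invoke next_frame() from its own timer once every 40ms.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

FRAME_INTERVAL_S = 1 / 25  # 25 fps, one frame every 40ms


@dataclass
class Frame:
    """Hypothetical decoded frame; field names are not taken from the wire format."""
    image: bytes
    is_idle: bool
    is_final_response: bool = False


class FramePlayer:
    """Prefers queued speech frames and loops cached idle frames otherwise."""

    def __init__(self, idle_cache_size: int = 50) -> None:
        self._speech: deque[Frame] = deque()
        self._idle: list[Frame] = []
        self._idle_cache_size = idle_cache_size  # assumed cap, to bound memory
        self._idle_pos = 0
        self.finished = False  # set once the final speech frame has been played

    def push(self, frame: Frame) -> None:
        if not frame.is_idle:
            self._speech.append(frame)
        elif len(self._idle) < self._idle_cache_size:
            self._idle.append(frame)

    def next_frame(self) -> Optional[Frame]:
        """Return the frame to display now; call once per FRAME_INTERVAL_S."""
        if self._speech:
            frame = self._speech.popleft()
            if frame.is_final_response:
                self.finished = True
            return frame
        if self._idle:
            frame = self._idle[self._idle_pos % len(self._idle)]
            self._idle_pos += 1
            return frame
        return None  # nothing received yet
```

Keeping the timing outside the class lets the same logic run under an asyncio loop, a GUI timer, or a test without any sleeping.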
EndInteraction vs CancelInteraction
EndInteraction: Graceful finish. Completes processing and sends all remaining frames, with the final frame marked is_final_response: true. Use for a normal conversation end.
CancelInteraction: Immediate stop. Stops processing immediately and discards remaining frames. Use for a user interruption or cancel request.
Rate Limits & Constraints
Rate Limit: 6 requests per second (rps)
Max Message Size: 512KB per message
Video Output: 25 fps (frames bundled in groups of 10)
Exceeding these limits results in an ErrorResponse or in request buffering.
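One way to stay under the rate limit is a client-side sliding-window throttle, sketched below. It assumes that each outgoing message counts as one request, which this reference does not state explicitly; SendThrottle and its methods are hypothetical names. Note that 400ms audio chunks produce only 2.5 messages per second, well under the limit of 6.

```python
import time
from collections import deque
from typing import Callable


class SendThrottle:
    """Allows at most `max_per_second` sends within any one-second window."""

    def __init__(self, max_per_second: int = 6,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_per_second
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._sent: deque[float] = deque()

    def delay_before_send(self) -> float:
        """Seconds to wait before the next send is allowed (0.0 if allowed now)."""
        now = self._clock()
        # Forget sends that have left the one-second window.
        while self._sent and now - self._sent[0] >= 1.0:
            self._sent.popleft()
        if len(self._sent) < self._max:
            return 0.0
        return 1.0 - (now - self._sent[0])

    def record_send(self) -> None:
        """Call right after a message has actually been sent."""
        self._sent.append(self._clock())
```

A caller would sleep for delay_before_send() seconds, send the message, and then call record_send().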
Error Codes
Common error codes returned in ErrorResponse:
AUTH_FAILED: Invalid API key
INVALID_PERSONA_ID_CONFIGURATION: Config ID not found or invalid
FAILED_CREATE_MODEL: Server couldn't load persona model
FRAME_SIZE_EXCEEDED: Message exceeded 512KB limit
INVALID_INTERACTION_ID: Interaction ID mismatch or invalid
NO_BACKEND_SERVER_AVAILABLE: Service temporarily unavailable
RATE_LIMITED: Too many requests
TIMEOUT: Operation exceeded processing time
INTERNAL_ERROR: Unexpected server error
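For logging and retry decisions, the codes above can be kept in a lookup table, as in this sketch. The descriptions are copied from the list; the choice of which codes count as transient is an assumption inferred from those descriptions, not something the service documents, and the helper names are hypothetical.

```python
ERROR_DESCRIPTIONS = {
    "AUTH_FAILED": "Invalid API key",
    "INVALID_PERSONA_ID_CONFIGURATION": "Config ID not found or invalid",
    "FAILED_CREATE_MODEL": "Server couldn't load persona model",
    "FRAME_SIZE_EXCEEDED": "Message exceeded 512KB limit",
    "INVALID_INTERACTION_ID": "Interaction ID mismatch or invalid",
    "NO_BACKEND_SERVER_AVAILABLE": "Service temporarily unavailable",
    "RATE_LIMITED": "Too many requests",
    "TIMEOUT": "Operation exceeded processing time",
    "INTERNAL_ERROR": "Unexpected server error",
}

# ASSUMPTION: these look temporary, so retrying later may succeed.
TRANSIENT_CODES = frozenset({"NO_BACKEND_SERVER_AVAILABLE", "RATE_LIMITED", "TIMEOUT"})


def describe_error(code: str) -> str:
    """Return the documented description, or a fallback for unknown codes."""
    return ERROR_DESCRIPTIONS.get(code, f"Unrecognized error code: {code}")


def is_transient(code: str) -> bool:
    """Guess whether retrying after a delay is worthwhile for this code."""
    return code in TRANSIENT_CODES
```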
Best Practices
1. Audio Chunking
Send audio in 400ms chunks (6400 samples at 16kHz) for smooth frame generation.
2. Buffer Management
Queue at least 10 frames before starting playback
Play at exactly 25 fps to avoid jitter
Implement frame interpolation if needed
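The 10-frame pre-buffer can be sketched as a queue that withholds frames until enough have arrived. PrebufferedQueue and start_now() are hypothetical names; start_now() is included on the assumption that a short interaction may end before 10 frames have been received.

```python
from collections import deque
from typing import Any, Optional


class PrebufferedQueue:
    """FIFO that releases frames only after `min_frames` have been queued."""

    def __init__(self, min_frames: int = 10) -> None:
        self._frames: deque[Any] = deque()
        self._min = min_frames
        self._started = False

    def push(self, frame: Any) -> None:
        self._frames.append(frame)
        if len(self._frames) >= self._min:
            self._started = True

    def start_now(self) -> None:
        """Begin playback even if fewer than `min_frames` are queued."""
        self._started = True

    def pop(self) -> Optional[Any]:
        """Return the next frame, or None while buffering or when empty."""
        if not self._started or not self._frames:
            return None
        return self._frames.popleft()
```

In this sketch, playback stays started after the first threshold is reached; a client that prefers to re-buffer after an underrun would reset the flag when the queue empties.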
3. Idle Frame Handling
Cache idle frames when you first connect
Loop idle frames during silence
Transition smoothly to speech frames using the client_frame_index parameter
4. Error Handling
Always handle ErrorResponse messages
Implement exponential backoff for reconnection
Log errors with interaction_id for debugging
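Exponential backoff for reconnection can be computed as in the sketch below. The base delay of 0.5s and the 30s cap are illustrative values, not recommendations from the service, and the optional jitter helper is a common addition rather than a stated requirement.

```python
import random
from typing import Callable


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0) -> float:
    """Delay before reconnect attempt number `attempt` (0-based), doubling each time."""
    return min(cap_s, base_s * 2 ** attempt)


def with_jitter(delay_s: float, rng: Callable[[], float] = random.random) -> float:
    """Pick a uniformly random delay in [0, delay_s) so clients do not retry in lockstep."""
    return delay_s * rng()


# Delays for the first eight attempts, before jitter is applied.
schedule = [backoff_delay(n) for n in range(8)]
```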
5. Rate Limiting
Don't exceed 6 requests per second
Batch audio chunks if needed
Monitor server load in the SessionReady message
6. Interruption Handling
Use CancelInteraction for immediate stops
Use EndInteraction for graceful endings
Clear frame buffers on interruption
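The three rules above might be combined on the client as in the following sketch. The JSON shape is hypothetical: this reference says the control messages are JSON but does not show their fields, so the "type" key used here is an assumption, as are the function names.

```python
import json
from collections import deque

CONTROL_KINDS = ("EndInteraction", "CancelInteraction")


def control_message(kind: str) -> str:
    """Serialize a control message. HYPOTHETICAL shape: the 'type' key is assumed."""
    if kind not in CONTROL_KINDS:
        raise ValueError(f"unknown control message: {kind}")
    return json.dumps({"type": kind})


def stop_interaction(frame_buffer: deque, interrupted: bool) -> str:
    """Return the message to send and, on interruption, drop buffered frames.

    A graceful end keeps the buffer so the remaining frames still play out.
    """
    if interrupted:
        frame_buffer.clear()
        return control_message("CancelInteraction")
    return control_message("EndInteraction")
```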
7. Frame Synchronization
The server pre-bundles audio with video frames - no client-side sync needed.
Troubleshooting
Complete Example