API Reference

Overview

Real-time talking head synthesis API. Send audio, receive synchronized video frames.

Connection Flow

Basic Flow

1. Connect → Receive SessionReady
2. Send audio chunks via InteractionInput (binary)
3. Receive video frames via InteractionResponse (binary)
4. End with EndInteraction (JSON)

WebSocket Handshake

Open WebSocket connection (handshake)

GET

The client connects to the WebSocket endpoint with its API key in the Authorization header and a config_id query parameter. The server then upgrades the connection to WebSocket.

Authorizations

Authorization (string, required)

Raw API key as used by client (no 'Bearer ').

Query parameters

config_id (string, required)

Configuration ID for the persona (e.g., oris config).

Header parameters

Authorization (string, required)

Raw API key used by client (no 'Bearer ' prefix).

Example: <API_KEY>

Responses

101

Switching Protocols: WebSocket upgrade

401

Unauthorized (invalid or missing API key)

Example request:

GET /realtime/?config_id=text HTTP/1.1
Authorization: <API_KEY>
Accept: */*

Response body: none.
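As a rough illustration of the handshake inputs described above, the following sketch builds the connection URL and header dictionary. The base URL is the one used in the Complete Example; `build_handshake` is a hypothetical helper name, not part of the API.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the Complete Example of this reference.
BASE_URL = "wss://models.ojin.ai/realtime"


def build_handshake(api_key: str, config_id: str):
    """Return (url, headers) for the WebSocket handshake.

    Hypothetical helper: it only assembles the two inputs the
    handshake needs and rejects a key that carries a 'Bearer ' prefix.
    """
    if not api_key or api_key.lower().startswith("bearer "):
        raise ValueError("pass the raw API key, without a 'Bearer ' prefix")
    url = f"{BASE_URL}?{urlencode({'config_id': config_id})}"
    return url, {"Authorization": api_key}


url, headers = build_handshake("your-api-key", "your-config-id")
```

The returned pair can be handed to whichever WebSocket client you use; a 401 response at this stage indicates an invalid or missing API key.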

Messages Reference

Session Messages

Interaction Messages

Error Messages

Message Details

InteractionInput (Client → Server)

Binary message for sending audio data.

Structure:

[1 byte]    Payload type (1 = audio)
[8 bytes]   Timestamp (uint64, milliseconds since Unix epoch)
[4 bytes]   Params size (uint32)
[N bytes]   Params JSON (if params_size > 0)
[M bytes]   Audio payload
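The layout above can be sketched with Python's struct module. This is a minimal sketch that assumes big-endian (network) byte order, matching the '!' format used in the Complete Example; `pack_interaction_input` and `unpack_interaction_input` are hypothetical helper names.

```python
import json
import struct
import time
from typing import Optional

# Payload type (1 byte), timestamp in ms (8 bytes), params size (4 bytes).
# Big-endian is an assumption taken from the Complete Example.
HEADER_FMT = "!BQI"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 13 bytes
PAYLOAD_AUDIO = 1


def pack_interaction_input(audio: bytes, params: Optional[dict] = None,
                           timestamp_ms: Optional[int] = None) -> bytes:
    """Build one binary InteractionInput message."""
    params_bytes = json.dumps(params).encode("utf-8") if params else b""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    header = struct.pack(HEADER_FMT, PAYLOAD_AUDIO, timestamp_ms, len(params_bytes))
    return header + params_bytes + audio


def unpack_interaction_input(message: bytes):
    """Inverse of pack_interaction_input, useful for local testing."""
    payload_type, timestamp_ms, params_size = struct.unpack_from(HEADER_FMT, message)
    params_end = HEADER_SIZE + params_size
    params = json.loads(message[HEADER_SIZE:params_end]) if params_size else None
    return payload_type, timestamp_ms, params, message[params_end:]
```

Round-tripping a message through both functions is a cheap way to check the framing before sending anything to the server.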

Audio Requirements:

Format: PCM int16 (signed 16-bit integers)
Sample Rate: 16kHz
Channels: Mono
Max Size: Entire message must be under 512KB
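To make the audio requirements concrete, here is a small sketch that converts float samples to 16-bit PCM and estimates how much audio fits in one message. It assumes little-endian sample order (the native order NumPy produces on common hardware, as in the Complete Example) and that "512KB" means 512 * 1024 bytes; neither is stated explicitly in this reference.

```python
import struct

MAX_MESSAGE_BYTES = 512 * 1024  # assumption: 512KB = 512 * 1024 bytes
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2  # int16, mono


def floats_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM (little-endian assumed)."""
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def max_audio_seconds(overhead_bytes: int = 13) -> float:
    """Longest audio payload that keeps a whole message strictly under the limit.

    overhead_bytes covers the 13-byte header plus any params JSON.
    """
    usable = MAX_MESSAGE_BYTES - overhead_bytes - 1
    usable -= usable % BYTES_PER_SAMPLE  # keep whole samples only
    return usable / (SAMPLE_RATE * BYTES_PER_SAMPLE)
```

Under these assumptions a single message can carry roughly 16 seconds of audio, far more than the recommended 400ms chunk, so the size limit mostly matters when batching.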

Optional Parameters (JSON):

{
  "speech_filter_amount": 1000.0,
  "idle_filter_amount": 1000.0,
  "idle_mouth_opening_scale": 0.0,
  "speech_mouth_opening_scale": 1.0,
  "client_frame_index": 0
}

speech_filter_amount: Smoothing for speech frames (higher = smoother)
idle_filter_amount: Smoothing for idle frames
idle_mouth_opening_scale: Mouth movement scale during idle (0.0 = closed)
speech_mouth_opening_scale: Mouth movement scale during speech (1.0 = full)
client_frame_index: Current frame index client is displaying (for smooth transitions)

Recommended Streaming Pattern:

Send audio in 400ms chunks (6400 samples = 12800 bytes at 16kHz) for optimal frame generation.

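import asyncio

SAMPLE_RATE = 16000
CHUNK_DURATION = 0.4  # seconds
CHUNK_SAMPLES = int(SAMPLE_RATE * CHUNK_DURATION)  # 6400 samples

async def stream_audio(ws, audio_samples):
    # send_interaction_input is your own helper that packs a chunk into the
    # binary InteractionInput layout and sends it over the socket
    for i in range(0, len(audio_samples), CHUNK_SAMPLES):
        chunk = audio_samples[i:i + CHUNK_SAMPLES]
        await send_interaction_input(ws, chunk)
        await asyncio.sleep(0.01)  # brief pause between sends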
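The chunk arithmetic can also be checked on raw PCM bytes. The sketch below is one possible way to split an already-encoded buffer; `chunk_size_bytes` and `iter_chunks` are hypothetical names.

```python
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2  # int16, mono
CHUNK_MS = 400


def chunk_size_bytes(chunk_ms: int = CHUNK_MS) -> int:
    """Bytes of PCM int16 mono audio in one chunk of chunk_ms milliseconds."""
    return SAMPLE_RATE * chunk_ms // 1000 * BYTES_PER_SAMPLE


def iter_chunks(pcm: bytes, chunk_ms: int = CHUNK_MS):
    """Yield consecutive chunks; the last one may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)  # 1 s of silence
sizes = [len(c) for c in iter_chunks(one_second)]
```

For one second of audio this yields two full 12800-byte chunks and one 6400-byte remainder.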

InteractionResponse (Server → Client)

Binary message containing video frame and synchronized audio.

Structure:

[1 byte]     Is final flag (1 = last frame, 0 = more coming)
[16 bytes]   Interaction ID (UUID)
[8 bytes]    Timestamp (uint64, milliseconds)
[4 bytes]    Usage (uint32)
[4 bytes]    Frame index (uint32)
[4 bytes]    Number of payload entries (uint32)

For each payload entry:
  [4 bytes]  Payload size (uint32)
  [1 byte]   Payload type (1 = audio, 2 = image)
  [N bytes]  Payload data

Interaction ID:

00000000-0000-0000-0000-000000000000: Idle frames (persona at rest)
Any other UUID: Speech frames (generated from your audio)

Payload Types:

Type 1 (audio): PCM int16, 16kHz mono, typically 640 samples (1280 bytes), i.e. 40ms of audio per frame at 25 fps
Type 2 (image): JPEG-encoded image (resolution depends on config, e.g., 1280x720)
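The response layout, interaction ID convention, and payload types above can be combined into a single parser. This is a sketch under the same big-endian assumption as the Complete Example; it also assumes the 16 ID bytes are a UUID in standard byte order. `parse_interaction_response` is a hypothetical name.

```python
import struct
import uuid

# is_final, interaction id, timestamp, usage, frame index, payload count
HEADER_FMT = "!B16sQIII"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 37 bytes
IDLE_ID = uuid.UUID(int=0)  # all-zero UUID marks idle frames
PAYLOAD_AUDIO, PAYLOAD_IMAGE = 1, 2


def parse_interaction_response(data: bytes) -> dict:
    """Parse one binary InteractionResponse message into a dict."""
    (is_final, id_bytes, timestamp_ms,
     usage, frame_index, count) = struct.unpack_from(HEADER_FMT, data)
    interaction_id = uuid.UUID(bytes=id_bytes)
    offset = HEADER_SIZE
    payloads = {}
    for _ in range(count):
        size, payload_type = struct.unpack_from("!IB", data, offset)
        offset += 5
        if offset + size > len(data):
            raise ValueError("truncated payload entry")
        payloads[payload_type] = data[offset:offset + size]
        offset += size
    return {
        "is_final": bool(is_final),
        "interaction_id": interaction_id,
        "is_idle": interaction_id == IDLE_ID,
        "timestamp_ms": timestamp_ms,
        "usage": usage,
        "frame_index": frame_index,
        "audio": payloads.get(PAYLOAD_AUDIO),
        "image": payloads.get(PAYLOAD_IMAGE),
    }
```

The `is_idle` flag lets playback code route a frame to the idle cache or the speech queue without inspecting the UUID again.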

Playback Guidelines:

The server generates frames at 25 fps (one frame every 40ms). For smooth playback:

Queue received frames in a buffer
Play at exactly 25 fps (40ms per frame)
Cache idle frames and loop during silence
Switch to speech frames when available
Stop when the final frame arrives (is-final flag set to 1, i.e. is_final_response: true)
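One way to follow these guidelines is a small buffer that prefers queued speech frames and otherwise loops over cached idle frames. The class below is only a sketch; `PlaybackBuffer` and the `max_idle` cap are hypothetical and not part of the API.

```python
from collections import deque

FRAME_INTERVAL_S = 1 / 25  # 40 ms per frame at 25 fps


class PlaybackBuffer:
    """Queue speech frames; loop cached idle frames when none are queued."""

    def __init__(self, max_idle: int = 250):
        self.idle_frames = []          # cached idle frames, looped during silence
        self.speech_frames = deque()   # speech frames, played once in order
        self.max_idle = max_idle       # hypothetical cap on the idle cache
        self._idle_pos = 0

    def push(self, frame, is_idle: bool) -> None:
        if not is_idle:
            self.speech_frames.append(frame)
        elif len(self.idle_frames) < self.max_idle:
            self.idle_frames.append(frame)

    def clear_speech(self) -> None:
        """Drop queued speech frames, e.g. after an interruption."""
        self.speech_frames.clear()

    def next_frame(self):
        """Return the frame to display for the next 40 ms tick, or None."""
        if self.speech_frames:
            return self.speech_frames.popleft()
        if not self.idle_frames:
            return None
        frame = self.idle_frames[self._idle_pos % len(self.idle_frames)]
        self._idle_pos += 1
        return frame
```

A playback loop would call `next_frame()` once every `FRAME_INTERVAL_S` seconds from a timer, independent of how quickly frames arrive from the network.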

EndInteraction vs CancelInteraction

EndInteraction
Purpose: Graceful finish
Server behavior: Completes processing and sends all remaining frames, with the final frame marked is_final_response: true
Use case: Normal conversation end

CancelInteraction
Purpose: Immediate stop
Server behavior: Stops processing immediately and discards remaining frames
Use case: User interruption, cancel request
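Both messages are sent as JSON text. The sketch below builds such a message in the shape the Complete Example uses for EndInteraction (a "type" string plus a "payload" with a millisecond timestamp). The type string for CancelInteraction is not shown in this reference, so the helper takes it as a parameter rather than guessing it; `control_message` is a hypothetical name.

```python
import json
import time
from typing import Optional

# Type string taken from the Complete Example in this reference.
END_INTERACTION = "endInteraction"


def control_message(message_type: str, timestamp_ms: Optional[int] = None) -> str:
    """Serialize a JSON control message with a millisecond timestamp."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return json.dumps({"type": message_type, "payload": {"timestamp": timestamp_ms}})


end_json = control_message(END_INTERACTION)
```

After a graceful end, keep reading frames until the final one arrives; after a cancel, discard any frames still buffered on the client.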

Rate Limits & Constraints

Rate Limit: 6 requests per second (rps)
Max Message Size: 512KB per message
Video Output: 25 fps (frames bundled in groups of 10)

Exceeding limits results in ErrorResponse or request buffering.
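A simple client-side pacer can help stay under the limit. The sketch below assumes the 6 rps limit applies to messages sent on a connection, which this reference does not spell out; `SendPacer` is a hypothetical helper with an injectable clock so it can be tested without sleeping.

```python
import time


class SendPacer:
    """Spaces sends at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second: float = 6.0, clock=time.monotonic):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self._next_allowed = None  # earliest time the next send may happen

    def delay(self) -> float:
        """Seconds to wait before the next send; also reserves that slot."""
        now = self.clock()
        if self._next_allowed is None or now >= self._next_allowed:
            self._next_allowed = now + self.min_interval
            return 0.0
        wait = self._next_allowed - now
        self._next_allowed += self.min_interval
        return wait
```

In an asyncio client this would be used as `await asyncio.sleep(pacer.delay())` before each send. Note that 400ms chunks sent in real time amount to 2.5 messages per second, which is already below the limit; pacing matters mainly when sending pre-recorded audio faster than real time.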

Error Codes

Common error codes returned in ErrorResponse:

AUTH_FAILED: Invalid API key
INVALID_PERSONA_ID_CONFIGURATION: Config ID not found or invalid
FAILED_CREATE_MODEL: Server couldn't load persona model
FRAME_SIZE_EXCEEDED: Message exceeded 512KB limit
INVALID_INTERACTION_ID: Interaction ID mismatch or invalid
NO_BACKEND_SERVER_AVAILABLE: Service temporarily unavailable
RATE_LIMITED: Too many requests
TIMEOUT: Operation exceeded processing time
INTERNAL_ERROR: Unexpected server error

Best Practices

1. Audio Chunking

Send audio in 400ms chunks (6400 samples at 16kHz) for smooth frame generation.

2. Buffer Management

Queue at least 10 frames before starting playback
Play at exactly 25 fps to avoid jitter
Implement frame interpolation if needed

3. Idle Frame Handling

Cache idle frames when you first connect
Loop idle frames during silence
Transition smoothly to speech frames using client_frame_index parameter

4. Error Handling

Always handle ErrorResponse messages
Implement exponential backoff for reconnection
Log errors with interaction_id for debugging
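Exponential backoff for reconnection might look like the sketch below. The delay values and the set of codes treated as retryable are assumptions made for illustration; this reference does not say which errors are safe to retry.

```python
import random

# Assumption: these codes describe transient conditions worth retrying.
# Codes such as AUTH_FAILED are unlikely to succeed on a retry.
RETRYABLE_CODES = {
    "NO_BACKEND_SERVER_AVAILABLE",
    "RATE_LIMITED",
    "TIMEOUT",
    "INTERNAL_ERROR",
}


def backoff_delays(base: float = 0.5, factor: float = 2.0, max_delay: float = 30.0,
                   attempts: int = 6, jitter: float = 0.0, rng=random.random):
    """Yield reconnect delays in seconds: base, base*factor, ... capped at max_delay.

    jitter adds up to that fraction of each delay at random, which helps
    avoid many clients reconnecting at the same instant.
    """
    delay = base
    for _ in range(attempts):
        capped = min(delay, max_delay)
        yield capped + capped * jitter * rng()
        delay *= factor


def should_retry(error_code: str) -> bool:
    return error_code in RETRYABLE_CODES
```

A reconnect loop would sleep for each yielded delay between attempts and give up once the generator is exhausted.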

5. Rate Limiting

Don't exceed 6 requests per second
Batch audio chunks if needed
Monitor server load in SessionReady message

6. Interruption Handling

Use CancelInteraction for immediate stops
Use EndInteraction for graceful endings
Clear frame buffers on interruption

7. Frame Synchronization

The server pre-bundles audio with video frames - no client-side sync needed.

Troubleshooting

Complete Example

import asyncio
import json
import struct
import time
import numpy as np
import websockets

API_KEY = "your-api-key"
CONFIG_ID = "your-config-id"
URL = f"wss://models.ojin.ai/realtime?config_id={CONFIG_ID}"

async def send_audio_chunk(ws, audio_int16_bytes, params=None):
    """Send audio chunk to server."""
    params_bytes = b""
    if params:
        params_bytes = json.dumps(params).encode('utf-8')
    
    header = struct.pack(
        '!BQI',
        1,  # Audio payload type
        int(time.time() * 1000),
        len(params_bytes)
    )
    
    await ws.send(header + params_bytes + audio_int16_bytes)

async def receive_frame(ws):
    """Receive and parse video frame."""
    data = await ws.recv()
    
    if isinstance(data, str):
        # JSON message (SessionReady, Error, etc)
        return json.loads(data)
    
    # Binary frame
    header_fmt = '!B16sQIII'
    header_size = struct.calcsize(header_fmt)
    
    (is_final, interaction_id_bytes, timestamp, 
     usage, index, num_payloads) = struct.unpack(header_fmt, data[:header_size])
    
    offset = header_size
    video_frame = None
    audio_chunk = None
    
    for _ in range(num_payloads):
        size, payload_type = struct.unpack('!IB', data[offset:offset+5])
        offset += 5
        payload_data = data[offset:offset+size]
        offset += size
        
        if payload_type == 2:
            video_frame = payload_data
        elif payload_type == 1:
            audio_chunk = payload_data
    
    return {
        'video_frame': video_frame,
        'audio_chunk': audio_chunk,
        'is_final': bool(is_final),
        'index': index
    }

async def main():
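    headers = {"Authorization": API_KEY}  # raw key, no 'Bearer ' prefix

    # Note: websockets >= 14 names this argument `additional_headers`;
    # `extra_headers` is the name accepted by older releases.
    async with websockets.connect(URL, extra_headers=headers) as ws: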
        # Wait for SessionReady
        session_ready = await receive_frame(ws)
        print(f"Session ready: {session_ready}")
        
        # Generate sample audio (1 second, 440Hz sine wave)
        sample_rate = 16000
        duration = 1.0
        t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
        audio = (32767 * 0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
        
        # Send audio in chunks
        chunk_size = 6400  # 400ms chunks
        for i in range(0, len(audio), chunk_size):
            chunk = audio[i:i + chunk_size]
            await send_audio_chunk(ws, chunk.tobytes())
            await asyncio.sleep(0.01)  # Throttle sends
        
        # End interaction
        end_message = {
            "type": "endInteraction",
            "payload": {"timestamp": int(time.time() * 1000)}
        }
        await ws.send(json.dumps(end_message))
        
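        # Receive frames until the final one arrives
        while True:
            frame = await receive_frame(ws)
            if 'is_final' not in frame:
                # JSON message (e.g. ErrorResponse) rather than a binary frame
                print(f"Received message: {frame}")
                continue
            if frame['is_final']:
                print("Received final frame")
                break
            print(f"Received frame {frame['index']}")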

if __name__ == "__main__":
    asyncio.run(main())


Last updated 1 month ago
