# API Reference

## Overview

Real-time talking head synthesis API. Send speech audio, receive synchronized video and audio frames.

After connecting and receiving `SessionReady`, send one frame message with silence to start the interaction on the server; the server then immediately begins streaming video and audio frames at 25fps. When no speech audio has been sent, the server generates **silence frames** (persona at rest with idle animation). When you send speech audio, the server generates **speech frames** with lip-synced animation synchronized to your audio.

You only need to send speech audio — no silence, padding, or keep-alive messages are required.

{% hint style="info" %}
**Production deployments:** For secure, low-latency video applications, connect to the real-time WebSocket API from a backend server rather than a front-end client. This keeps your API key secret and lets you choose a transport suited to real-time video under varying network conditions. Typically, WebRTC delivers the final media stream to end users for smooth, reliable, low-latency playback.
{% endhint %}

***

## How It Works

1. **Connect** to the WebSocket endpoint with your API key and config ID
2. **Receive `SessionReady`** — the server has allocated inference resources for your session
3. **Send** an initial audio message containing one frame of silence
4. **The server starts streaming frames immediately** — silence frames with idle animation
5. **Send speech audio** whenever it becomes available (e.g., TTS output from your language model) — also buffer it locally for playback
6. **Receive speech frames** — the server transitions to lip-synced animation and returns to silence frames when audio runs out
7. **Render video frames** at 25fps, dropping excess silence frames to manage buffer size
8. **Start playing your buffered TTS audio** when the first speech frame (`index == 1`) arrives from Ojin — stop when speech ends

### Frame Types

Every frame arrives as a binary `InteractionResponse` containing both a JPEG image and a PCM audio chunk. Frames are always delivered in order. The `index` field identifies the frame type:

| Index | Frame type  | Description                                                                                 |
| ----- | ----------- | ------------------------------------------------------------------------------------------- |
| `0`   | **Silence** | Persona at rest with idle animation. Generated automatically when no speech audio is queued |
| `1`   | **Speech**  | Lip-synced animation generated from your audio input                                        |

### Faster-than-Realtime Generation

The server generates frames slightly **faster than realtime** to build a client-side buffer that prevents stuttering during speech. During speech bursts, generation is even faster. This means your frame buffer will grow over time if you don't manage it.

**You must drop silence frames to prevent unbounded buffer growth.** When your buffer starts growing beyond what you need for smooth playback, skip 1 out of every 2 silence frames (`index == 0`) until the buffer shrinks back down. Never drop speech frames (`index == 1`).

The right buffer target depends on your network conditions and latency requirements — start by observing your buffer size during playback and tuning from there. Keep it as low as possible to minimize latency, but high enough to absorb network jitter without starving playback.

```python
# When consuming frames from the buffer:
frame = buffer.popleft()

# If the buffer is growing and this is a silence frame, skip every other one
if len(buffer) > target_buffer_size and frame.index == 0:
    skip_counter += 1
    # Only drop the next queued frame if it is also silence; never drop speech
    if skip_counter % 2 == 0 and buffer and buffer[0].index == 0:
        buffer.popleft()  # drop one silence frame
```

***

## Connection Flow

{% @mermaid/diagram content="sequenceDiagram
participant Client
participant Server

Note over Client,Server: Connection
Client->>Server: WebSocket Connect
Server->>Client: SessionReady (JSON)
Client->>Server: InteractionInput (silence audio for one frame)

Note over Client,Server: Server Streams Immediately
Server->>Client: Frame (silence, index=0)
Server->>Client: Frame (silence, index=0)
Server->>Client: Frame (silence, index=0)

Note over Client,Server: Client Sends Speech Audio
Client->>Server: InteractionInput (TTS audio chunk 1)
Client->>Server: InteractionInput (TTS audio chunk 2)

Note over Client,Server: Server Transitions to Speech
Server->>Client: Frame (speech, index=1)
Server->>Client: Frame (speech, index=1)
Server->>Client: Frame (speech, index=1)
Note right of Server: Burst: faster than realtime

Note over Client,Server: Audio Runs Out → Back to Silence
Server->>Client: Frame (silence, index=0)
Server->>Client: Frame (silence, index=0)
Note right of Client: Client drops excess silence frames" %}

***

## WebSocket Handshake

## Open WebSocket connection

> Connect to the WebSocket endpoint providing an API key in the `Authorization` header and a `config_id` query parameter. The server upgrades the connection to WebSocket and immediately begins streaming frames after sending `SessionReady`.
>
> **Recommended WebSocket settings:**
>
> - `ping_interval`: 30 seconds
> - `ping_timeout`: 10 seconds

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"servers":[{"url":"wss://models.ojin.ai/realtime","description":"Production WebSocket endpoint"}],"security":[{"ApiKeyAuth":[]}],"components":{"securitySchemes":{"ApiKeyAuth":{"type":"apiKey","in":"header","name":"Authorization","description":"Raw API key (no `Bearer` prefix)."}},"schemas":{"SessionReadyMessage":{"type":"object","description":"Sent once by the server after the WebSocket connection is established and inference resources are allocated. **The server begins streaming frames immediately after this message.**\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["sessionReady"]},"payload":{"type":"object","required":["trace_id","status","load"],"properties":{"trace_id":{"type":"string","format":"uuid","description":"Unique session identifier assigned by the server."},"status":{"type":"string","enum":["success"],"description":"Always `success`."},"load":{"type":"number","format":"float","minimum":0,"maximum":1,"description":"Current load of the inference server (0.0–1.0)."},"timestamp":{"type":"integer","format":"int64","description":"Server timestamp in milliseconds since Unix epoch."},"parameters":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional model-specific session parameters returned by the server."}}}}},"ErrorResponseMessage":{"type":"object","description":"Sent by the server when an error occurs.\n\n**Format:** JSON text frame.\n\n> **Note:** In some error conditions (e.g., no backend servers available), the server may send a plain text message instead of a structured JSON `ErrorResponse`. Your client should handle non-JSON text messages gracefully.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["errorResponse"]},"payload":{"type":"object","required":["code","message","timestamp"],"properties":{"code":{"type":"string","description":"Machine-readable error code.","enum":["AUTH_FAILED","UNAUTHORIZED","MISSING_CONFIG_ID","INVALID_MESSAGE","INVALID_HEADERS","MODEL_NOT_FOUND","BACKEND_UNAVAILABLE","RATE_LIMITED","TIMEOUT","CANCELLED","INTERNAL_ERROR","FRAME_SIZE_EXCEEDED"]},"message":{"type":"string","description":"Human-readable description of the error."},"interaction_id":{"type":"string","nullable":true,"description":"The interaction ID related to the error, if applicable."},"details":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional additional structured details about the error."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the error was sent."}}}}}}},"paths":{"/":{"get":{"summary":"Open WebSocket connection","description":"Connect to the WebSocket endpoint providing an API key in the `Authorization` header and a `config_id` query parameter. The server upgrades the connection to WebSocket and immediately begins streaming frames after sending `SessionReady`.\n\n**Recommended WebSocket settings:**\n- `ping_interval`: 30 seconds\n- `ping_timeout`: 10 seconds","operationId":"wsHandshake","parameters":[{"in":"query","name":"config_id","required":true,"schema":{"type":"string"},"description":"Configuration ID for the persona, created in the Oris 1.0 tab of the dashboard."},{"in":"header","name":"Authorization","required":true,"schema":{"type":"string"},"description":"Your raw API key. No `Bearer` prefix."}],"responses":{"101":{"description":"WebSocket upgrade successful. 
After the upgrade, the server sends a `SessionReady` JSON message and begins streaming binary `InteractionResponse` frames immediately.","content":{"application/json":{"schema":{"$ref":"#/components/schemas/SessionReadyMessage"}}}},"401":{"description":"Unauthorized — invalid or missing API key.","content":{"application/json":{"schema":{"$ref":"#/components/schemas/ErrorResponseMessage"}}}}}}}}}
```
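
For reference, a minimal connection sketch using the Python `websockets` library with the settings above; `API_KEY` and `CONFIG_ID` are placeholders:

```python
import websockets

API_KEY = "YOUR_API_KEY"      # raw key, no Bearer prefix
CONFIG_ID = "YOUR_CONFIG_ID"  # created in the Oris 1.0 tab of the dashboard

async def connect():
    # ping_interval/ping_timeout follow the recommended WebSocket settings
    return await websockets.connect(
        f"wss://models.ojin.ai/realtime?config_id={CONFIG_ID}",
        additional_headers={"Authorization": API_KEY},  # extra_headers on older versions
        ping_interval=30,
        ping_timeout=10,
    )
```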

***

## Message Format

{% hint style="info" %}
**Mixed message types:** Both JSON (text) and binary messages travel over the same WebSocket connection. Your client must check the WebSocket frame type to distinguish them:

* **Text frames (JSON):** `SessionReady`, `EndInteraction`, `CancelInteraction`, `ErrorResponse`
* **Binary frames:** `InteractionInput`, `InteractionResponse`
{% endhint %}
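
A minimal dispatch sketch; `handle_event` and `handle_frame` are hypothetical callbacks, and `parse_response` is defined later in this reference:

```python
import json

async def receive_loop(ws):
    async for message in ws:
        if isinstance(message, str):
            # JSON text frame (rarely plain text; see ErrorResponse below)
            handle_event(json.loads(message))
        else:
            # Binary frame: an InteractionResponse
            handle_frame(parse_response(message))
```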

{% hint style="info" %}
**Byte order:** All multi-byte integer fields in binary messages use **network byte order (big-endian)**.
{% endhint %}

***

## Messages Reference

### Server → Client Messages

## The SessionReadyMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"SessionReadyMessage":{"type":"object","description":"Sent once by the server after the WebSocket connection is established and inference resources are allocated. **The server begins streaming frames immediately after this message.**\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["sessionReady"]},"payload":{"type":"object","required":["trace_id","status","load"],"properties":{"trace_id":{"type":"string","format":"uuid","description":"Unique session identifier assigned by the server."},"status":{"type":"string","enum":["success"],"description":"Always `success`."},"load":{"type":"number","format":"float","minimum":0,"maximum":1,"description":"Current load of the inference server (0.0–1.0)."},"timestamp":{"type":"integer","format":"int64","description":"Server timestamp in milliseconds since Unix epoch."},"parameters":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional model-specific session parameters returned by the server."}}}}}}}}
```
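
For illustration, a sketch of reading and validating this message right after connecting:

```python
import json

async def wait_for_session_ready(ws):
    msg = json.loads(await ws.recv())
    assert msg["type"] == "sessionReady"
    payload = msg["payload"]
    # Useful fields: trace_id for support/debugging, load for monitoring
    print(f"session {payload['trace_id']} ready, server load {payload['load']:.2f}")
    return payload
```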

## The InteractionResponseMessage object

````json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"InteractionResponseMessage":{"type":"object","description":"Binary message containing a video frame and synchronized audio chunk. The server streams these continuously after `SessionReady` — silence frames when idle, speech frames when processing your audio.\n\n**Format:** Binary frame.\n\n**Binary structure (big-endian):**\n```\n[1 byte  ]  Is final flag   — uint8, 1 = last frame, 0 = more coming\n[16 bytes]  Interaction ID  — UUID bytes\n[8 bytes ]  Timestamp       — uint64, milliseconds since Unix epoch\n[4 bytes ]  Usage           — uint32, usage metric\n[4 bytes ]  Frame index     — uint32, 0 = silence, 1 = speech\n[4 bytes ]  Num payloads    — uint32, number of payload entries\n\nFor each payload entry:\n  [4 bytes]  Data size       — uint32, byte length of payload data\n  [1 byte ]  Payload type    — uint8, 1 = audio, 2 = image\n  [N bytes]  Payload data    — raw payload bytes\n```\n\nPython unpack: `struct.unpack('!B16sQIII', header)` for the main header, `struct.unpack('!IB', entry)` for each payload entry.","required":["is_final","interaction_id","timestamp","usage","index","payloads"],"properties":{"is_final":{"type":"boolean","description":"`true` if this is the last frame for the current interaction. `false` if more frames are coming."},"interaction_id":{"type":"string","format":"uuid","description":"UUID identifying this response. Use to correlate frames across a single interaction."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the frame was sent."},"usage":{"type":"integer","format":"int32","description":"Usage metric for this response."},"index":{"type":"integer","format":"int32","enum":[0,1],"description":"Frame type. `0` = silence frame (idle animation, no speech input). `1` = speech frame (lip-synced to your audio). **Drop silence frames (`0`) to manage buffer size. Never drop speech frames (`1`).**"},"payloads":{"type":"array","description":"List of payload entries in this frame. Each frame typically contains one audio entry and one image entry.","items":{"type":"object","required":["payload_type","data"],"properties":{"payload_type":{"type":"integer","enum":[1,2],"description":"`1` = audio (PCM int16, 16kHz mono, 1,280 bytes = 640 samples = 40ms). `2` = image (JPEG-encoded, resolution depends on config e.g. 1280×720)."},"data_size":{"type":"integer","format":"int32","description":"Byte length of the payload data."},"data":{"type":"string","format":"binary","description":"Raw payload bytes. For audio: PCM int16 bytes. For image: JPEG bytes."}}}}}}}}}
````

## The ErrorResponseMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"ErrorResponseMessage":{"type":"object","description":"Sent by the server when an error occurs.\n\n**Format:** JSON text frame.\n\n> **Note:** In some error conditions (e.g., no backend servers available), the server may send a plain text message instead of a structured JSON `ErrorResponse`. Your client should handle non-JSON text messages gracefully.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["errorResponse"]},"payload":{"type":"object","required":["code","message","timestamp"],"properties":{"code":{"type":"string","description":"Machine-readable error code.","enum":["AUTH_FAILED","UNAUTHORIZED","MISSING_CONFIG_ID","INVALID_MESSAGE","INVALID_HEADERS","MODEL_NOT_FOUND","BACKEND_UNAVAILABLE","RATE_LIMITED","TIMEOUT","CANCELLED","INTERNAL_ERROR","FRAME_SIZE_EXCEEDED"]},"message":{"type":"string","description":"Human-readable description of the error."},"interaction_id":{"type":"string","nullable":true,"description":"The interaction ID related to the error, if applicable."},"details":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional additional structured details about the error."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the error was sent."}}}}}}}}
```

### Client → Server Messages

## The InteractionInputMessage object

````json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"InteractionInputMessage":{"type":"object","description":"Binary message for sending speech audio to the server. **Only send speech audio** — do not send silence or padding. The server generates silence frames automatically.\n\n**Format:** Binary frame.\n\n**Binary structure (big-endian):**\n```\n[1 byte ]  Payload type   — uint8, always 1 for audio\n[8 bytes]  Timestamp      — uint64, milliseconds since Unix epoch\n[4 bytes]  Params size    — uint32, byte length of JSON params (0 if none)\n[N bytes]  Params JSON    — UTF-8 JSON (only present if params size > 0)\n[M bytes]  Audio payload  — raw PCM int16 speech audio\n```\n\nPython pack: `struct.pack('!BQI', payload_type, timestamp, params_size)`","required":["payload_type","timestamp","params_size","audio_payload"],"properties":{"payload_type":{"type":"integer","enum":[1],"description":"Always `1` for audio."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the message was sent."},"params_size":{"type":"integer","format":"int32","minimum":0,"description":"Byte length of the JSON params block. `0` if no params."},"params":{"type":"object","nullable":true,"description":"Optional per-chunk parameters. Overrides session defaults for this audio chunk.","properties":{"speech_filter_amount":{"type":"number","format":"float","default":5,"description":"Smoothing for speech animation. Higher = smoother, less responsive."},"idle_filter_amount":{"type":"number","format":"float","default":1000,"description":"Smoothing for idle animation."},"idle_mouth_opening_scale":{"type":"number","format":"float","default":0,"description":"Mouth movement scale during idle. `0.0` = closed."},"speech_mouth_opening_scale":{"type":"number","format":"float","default":1,"description":"Mouth movement scale during speech. `1.0` = full movement."},"client_frame_index":{"type":"integer","format":"int32","default":0,"description":"Frame index the client is currently displaying. Helps the server manage silence-to-speech transitions smoothly."}}},"audio_payload":{"type":"string","format":"binary","description":"Raw PCM int16 speech audio. Requirements: 16,000 Hz sample rate, mono (1 channel), little-endian int16 samples. Entire message must be under 512 KB. Recommended chunk size: 400ms = 6,400 samples = 12,800 bytes."}}}}}}
````

## The EndInteractionMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"EndInteractionMessage":{"type":"object","description":"Signal graceful end of the session. The server finishes processing all queued audio and sends remaining frames, with the last frame marked `is_final: true`.\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["endInteraction"]},"payload":{"type":"object","required":["timestamp"],"properties":{"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the message was sent."}}}}}}}}
```

## The CancelInteractionMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"CancelInteractionMessage":{"type":"object","description":"Immediately stop processing and discard all remaining frames. No final frame is sent. Use for interruptions (e.g., user starts speaking while the persona is talking).\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["cancelInteraction"]},"payload":{"type":"object","properties":{"timestamp":{"type":"integer","format":"int64","nullable":true,"description":"Optional. Milliseconds since Unix epoch when the message was sent."}}}}}}}}
```

***

## Message Details

### InteractionInput (Client → Server, Binary)

Binary message for sending speech audio to the server. **Only send speech audio** — do not send silence or padding.

**Binary structure:**

```
[1 byte ]  Payload type   — uint8, always 1 for audio
[8 bytes]  Timestamp      — uint64, milliseconds since Unix epoch
[4 bytes]  Params size    — uint32, byte length of the JSON params block (0 if no params)
[N bytes]  Params JSON    — UTF-8 encoded JSON (only present if params size > 0)
[M bytes]  Audio payload  — raw PCM int16 speech audio data
```

**Header fields** use **big-endian** byte order. The PCM audio samples in the payload use **little-endian** (standard for PCM int16). In Python: `struct.pack('!BQI', payload_type, timestamp, params_size)`.

**Audio requirements:**

| Property         | Value                                              |
| ---------------- | -------------------------------------------------- |
| Format           | PCM signed 16-bit integers (little-endian samples) |
| Sample rate      | 16,000 Hz                                          |
| Channels         | 1 (mono)                                           |
| Max message size | 512 KB (entire binary message including header)    |
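
The recommended chunk size is 400ms (6,400 samples = 12,800 bytes). As a sketch, slicing a PCM buffer into such chunks, assuming `ws` is the open connection and `build_audio_message` is defined in the pattern below:

```python
CHUNK_BYTES = 16000 * 2 * 400 // 1000  # 400ms of 16kHz mono int16 = 12,800 bytes

async def send_speech(ws, pcm_bytes):
    for start in range(0, len(pcm_bytes), CHUNK_BYTES):
        await ws.send(build_audio_message(pcm_bytes[start:start + CHUNK_BYTES]))
```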

**Recommended streaming pattern:**

Forward speech audio to the server as it arrives from your TTS service — there is no need to buffer or accumulate before sending (buffering is only needed for playback). Sending audio in realtime, as it is generated, keeps streaming and playback smooth.

```python
import struct, json, time

def build_audio_message(audio_bytes):
    """Build a binary InteractionInput message."""
    header = struct.pack('!BQI', 1, int(time.time() * 1000), 0)
    return header + audio_bytes
```
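
To include per-chunk parameters (e.g. `speech_filter_amount`), serialize them as UTF-8 JSON between the header and the audio. A sketch, reusing the imports above:

```python
def build_audio_message_with_params(audio_bytes, params=None):
    """Build a binary InteractionInput message with optional per-chunk params."""
    params_bytes = json.dumps(params).encode("utf-8") if params else b""
    header = struct.pack('!BQI', 1, int(time.time() * 1000), len(params_bytes))
    return header + params_bytes + audio_bytes
```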

***

### InteractionResponse (Server → Client, Binary)

Binary message containing a video frame and synchronized audio. The server streams these continuously after `SessionReady`. **Frames always arrive in order.**

**Binary structure:**

```
[1 byte  ]  Is final flag   — uint8, 1 = last frame for this interaction, 0 = more coming
[16 bytes]  Interaction ID  — UUID bytes
[8 bytes ]  Timestamp       — uint64, milliseconds since Unix epoch
[4 bytes ]  Usage           — uint32, usage metric for this response
[4 bytes ]  Frame index     — uint32, 0 = silence, 1 = speech
[4 bytes ]  Num payloads    — uint32, number of payload entries that follow

For each payload entry:
  [4 bytes]  Data size      — uint32, byte length of the payload data only
  [1 byte ]  Payload type   — uint8, 1 = audio, 2 = image
  [N bytes]  Payload data   — raw payload bytes
```

All multi-byte integers are **big-endian**. In Python: `struct.unpack('!B16sQIII', header_bytes)` for the main header, `struct.unpack('!IB', entry_bytes)` for each payload entry.

**Frame index:**

| Index | Meaning                                                 |
| ----- | ------------------------------------------------------- |
| `0`   | **Silence frame** — persona at rest with idle animation |
| `1`   | **Speech frame** — lip-synced animation from your audio |

**Payload types:**

| Type      | Format                | Typical size per frame                                 |
| --------- | --------------------- | ------------------------------------------------------ |
| 1 (audio) | PCM int16, 16kHz mono | **1,280 bytes** (640 samples = 40ms at 25fps)          |
| 2 (image) | JPEG-encoded image    | Variable (resolution depends on config, e.g. 1280×720) |
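
A sketch of decoding both payload types from a parsed frame (assuming `numpy` and `Pillow` are available; `parse_response` is defined below):

```python
import io
import numpy as np
from PIL import Image

def decode_payloads(frame):
    # Audio payload: little-endian int16 PCM, 16kHz mono
    samples = np.frombuffer(frame['audio'], dtype='<i2')
    # Image payload: standard JPEG bytes
    image = Image.open(io.BytesIO(frame['image']))
    return samples, image
```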

**Parsing example:**

```python
import struct, uuid

HEADER_FMT = '!B16sQIII'
HEADER_SIZE = struct.calcsize(HEADER_FMT)   # 37 bytes
ENTRY_FMT = '!IB'
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)     # 5 bytes

def parse_response(data):
    is_final, uuid_bytes, timestamp, usage, index, num_payloads = \
        struct.unpack(HEADER_FMT, data[:HEADER_SIZE])

    offset = HEADER_SIZE
    image = audio = None

    for _ in range(num_payloads):
        size, ptype = struct.unpack(ENTRY_FMT, data[offset:offset + ENTRY_SIZE])
        offset += ENTRY_SIZE
        payload = data[offset:offset + size]
        offset += size

        if ptype == 2:
            image = payload   # JPEG bytes
        elif ptype == 1:
            audio = payload   # PCM int16 bytes

    return {
        'is_final': bool(is_final),
        'index': index,              # 0 = silence, 1 = speech
        'image': image,
        'audio': audio,
    }
```

***

### EndInteraction vs CancelInteraction

| Message             | Purpose         | Server behavior                                                                | Use case          |
| ------------------- | --------------- | ------------------------------------------------------------------------------ | ----------------- |
| `EndInteraction`    | Graceful finish | Completes processing, sends remaining frames with last marked `is_final: true` | Session end       |
| `CancelInteraction` | Immediate stop  | Stops processing, discards remaining frames                                    | User interruption |
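
Both are JSON text frames. A minimal sketch of sending each, with `buffer` standing in for your local frame deque:

```python
import json, time

async def end_interaction(ws):
    # Graceful: the server flushes queued audio; the last frame has is_final = true
    await ws.send(json.dumps({
        "type": "endInteraction",
        "payload": {"timestamp": int(time.time() * 1000)},
    }))

async def cancel_interaction(ws, buffer):
    # Immediate: the server discards remaining frames; clear the local buffer too
    await ws.send(json.dumps({
        "type": "cancelInteraction",
        "payload": {"timestamp": int(time.time() * 1000)},
    }))
    buffer.clear()
```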

***

### ErrorResponse (Server → Client, JSON)

{% hint style="warning" %}
**Plain text errors:** In some error conditions (e.g., no backend servers available), the server may send a plain text message instead of a structured JSON `ErrorResponse`. Your client should handle non-JSON text messages gracefully.
{% endhint %}
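
A sketch of handling both forms on the client:

```python
import json

def handle_error_text(message):
    try:
        msg = json.loads(message)
    except json.JSONDecodeError:
        # Plain-text error (e.g. no backend servers available)
        print(f"Server error: {message}")
        return None
    if msg.get("type") == "errorResponse":
        payload = msg["payload"]
        print(f"Error {payload['code']}: {payload['message']}")
    return msg
```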

**Error codes:**

| Code                  | Description                              |
| --------------------- | ---------------------------------------- |
| `AUTH_FAILED`         | Invalid API key                          |
| `UNAUTHORIZED`        | Caller lacks permission                  |
| `MISSING_CONFIG_ID`   | `config_id` query parameter not provided |
| `INVALID_MESSAGE`     | Malformed or unsupported message payload |
| `INVALID_HEADERS`     | Missing or invalid headers               |
| `MODEL_NOT_FOUND`     | Config ID not found or invalid           |
| `BACKEND_UNAVAILABLE` | No healthy inference backend available   |
| `RATE_LIMITED`        | Too many requests                        |
| `TIMEOUT`             | Operation exceeded processing time       |
| `CANCELLED`           | Interaction cancelled by client          |
| `INTERNAL_ERROR`      | Unexpected server error                  |
| `FRAME_SIZE_EXCEEDED` | Message exceeded 512KB limit             |

***

## Rate Limits & Constraints

| Constraint       | Value                                                   |
| ---------------- | ------------------------------------------------------- |
| Rate limit       | 6 requests per second                                   |
| Max message size | 512 KB per message                                      |
| Video output     | 25 fps target (generated slightly faster than realtime) |

Exceeding limits results in an `ErrorResponse` with code `RATE_LIMITED`.
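
If your sender might exceed this (e.g. forwarding many small TTS chunks), a simple client-side pacing sketch:

```python
import asyncio
import time

MIN_SEND_INTERVAL = 1 / 6  # stay under 6 messages per second
_last_send = 0.0

async def send_paced(ws, message):
    global _last_send
    wait = _last_send + MIN_SEND_INTERVAL - time.monotonic()
    if wait > 0:
        await asyncio.sleep(wait)
    _last_send = time.monotonic()
    await ws.send(message)
```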

***

## Best Practices

### Audio Input

* **Send one silence frame first** to start the conversation, then send speech audio when available
* Forward speech audio to the server as it arrives from your TTS service — there is no need to buffer or accumulate before sending (buffering is only needed for playback). Sending audio in realtime, as it is generated, keeps streaming and playback smooth.

### Buffer Management

* Play frames at **25 fps** (40ms per frame)
* The server generates slightly faster than realtime — **you must drop silence frames** to prevent the buffer from growing
* **Drop strategy:** when the buffer grows beyond your target size, skip 1 out of every 2 silence frames (`index == 0`) until the buffer shrinks back
* **Never drop speech frames** (`index == 1`)
* Tune your target buffer size based on your network conditions — keep it as low as possible for minimal latency

### Audio and Video Synchronization

Each `InteractionResponse` contains both a JPEG image and a PCM audio chunk. However, the **recommended approach for audio playback** is:

1. **Buffer your TTS/source audio locally** as it arrives from your speech service (for later playback only; you do not need to buffer it before sending it to Ojin)
2. **Forward TTS audio to Ojin immediately** as it arrives from your speech service
3. **Wait for the first speech frame** (`index == 1`) to arrive from Ojin
4. **Start playing your buffered TTS audio** at that moment
5. **Stop audio playback** when you receive the first silence frame (`index == 0`) after speech
6. **Render video** from every frame regardless of type

Sending is immediate, but playback is gated on the first speech frame. This ensures audio and video stay in sync.

```python
# When TTS audio arrives from your speech service:
speech_audio_buffer.extend(tts_audio_chunk)      # buffer locally for playback
await ojin.send_audio(tts_audio_chunk)            # send to Ojin for lip-sync

# In your video playback loop:
frame = buffer.popleft()

if frame.index == 1 and not audio_playing:
    start_audio_playback(speech_audio_buffer)     # begin draining the buffer
    audio_playing = True

if frame.index == 0 and audio_playing:
    stop_audio_playback()
    audio_playing = False

render_video(frame.image)                         # always render the video
```

### Error Handling

* Handle both JSON `ErrorResponse` messages and plain text error strings
* Implement exponential backoff for reconnection (see the sketch after this list)
* Monitor server `load` in the `SessionReady` message
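
A minimal reconnect sketch with exponential backoff (the broad `except` is for brevity; narrow it to connection errors in production):

```python
import asyncio
import websockets

async def connect_with_backoff(url, headers, max_delay=30):
    delay = 1
    while True:
        try:
            return await websockets.connect(
                url,
                additional_headers=headers,
                ping_interval=30,
                ping_timeout=10,
            )
        except Exception as exc:
            print(f"Connect failed ({exc}), retrying in {delay}s")
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)
```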

### Interruption Handling

* Use `CancelInteraction` for immediate stops (e.g., user interrupts the bot)
* Use `EndInteraction` for graceful session endings
* Clear your frame buffer on interruption

***

## Complete Example

```python
import asyncio
import json
import struct
import time
from collections import deque
import numpy as np
import websockets
from dotenv import load_dotenv
import os

load_dotenv()

API_KEY = os.getenv("OJIN_API_KEY", "")
CONFIG_ID = os.getenv("OJIN_CONFIG_ID", "")
URL = f"wss://models.ojin.ai/realtime?config_id={CONFIG_ID}"

SAMPLE_RATE = 16000
FPS = 25
TARGET_BUFFER = 10  # Tune based on your network conditions

def build_audio_message(audio_bytes):
    """Build a binary InteractionInput message."""
    header = struct.pack('!BQI', 1, int(time.time() * 1000), 0)
    return header + audio_bytes

def parse_response(data):
    """Parse a binary InteractionResponse message."""
    fmt = '!B16sQIII'
    hdr_size = struct.calcsize(fmt)
    is_final, uuid_bytes, timestamp, usage, index, num_payloads = \
        struct.unpack(fmt, data[:hdr_size])

    offset = hdr_size
    image = audio = None

    for _ in range(num_payloads):
        size, ptype = struct.unpack('!IB', data[offset:offset + 5])
        offset += 5
        if ptype == 2:
            image = data[offset:offset+size]
        elif ptype == 1:
            audio = data[offset:offset+size]
        offset += size

    return {
        'is_final': bool(is_final),
        'index': index,              # 0 = silence, 1 = speech
        'image': image,
        'audio': audio,
    }

async def main():
    headers = {"Authorization": API_KEY}
    # For older websockets versions, use extra_headers instead.
    async with websockets.connect(URL, additional_headers=headers, ping_interval=30) as ws:
        # 1. Wait for SessionReady — server starts streaming frames immediately after
        msg = json.loads(await ws.recv())
        assert msg["type"] == "sessionReady"
        print(f"Session ready: {msg['payload']}")

        buffer = deque()
        skip_counter = 0
        playback_started = False
        frame_count = 0
        audio_sent = False

        # 2. Send one silence frame to start processing
        audio_data = b"\x00\x00" * (SAMPLE_RATE // FPS)  # one frame (40ms) of silence
        await ws.send(build_audio_message(audio_data))

        # 3. Receive and process frames
        async for data in ws:
            if isinstance(data, str):
                msg = json.loads(data)
                if msg.get("type") == "errorResponse":
                    print(f"Error: {msg['payload']}")
                    break
                continue

            frame = parse_response(data)
            buffer.append(frame)
            frame_count += 1

            # Wait for initial buffer before playback
            if not playback_started:
                if len(buffer) >= TARGET_BUFFER:
                    playback_started = True
                    print(f"Buffer filled ({TARGET_BUFFER} frames), starting playback")
                continue

            # Consume one frame
            if buffer:
                play_frame = buffer.popleft()

                # Drop excess silence frames: skip 1 out of 2 when buffer is too large
                if len(buffer) > TARGET_BUFFER and play_frame['index'] == 0:
                    skip_counter += 1
                    # Only drop the next queued frame if it is also silence
                    if skip_counter % 2 == 0 and buffer and buffer[0]['index'] == 0:
                        buffer.popleft()  # drop one silence frame

                kind = "silence" if play_frame['index'] == 0 else "speech"
                print(f"[{kind}] frame #{frame_count}, buffer={len(buffer)}")

                # In a real app: render play_frame['image'] and play play_frame['audio']

            # Demo: send speech audio after receiving some silence frames
            if frame_count == 50 and not audio_sent:
                t = np.linspace(0, 1.0, SAMPLE_RATE, endpoint=False)
                audio_data = (32767 * 0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
                chunk_size = SAMPLE_RATE  # 1s chunks (16,000 samples = 32,000 bytes)
                for i in range(0, len(audio_data), chunk_size):
                    chunk = audio_data[i:i + chunk_size]
                    await ws.send(build_audio_message(chunk.tobytes()))
                audio_sent = True
                print("Sent 1 second of speech audio")

            if frame_count > 200:
                break

asyncio.run(main())
```

***

## Troubleshooting

### Connection Issues

* ✓ Verify API key and config ID
* ✓ Check that config exists in dashboard
* ✓ Ensure network allows WebSocket connections (port 443)
* ✓ Check the `Authorization` header uses the raw API key (no `Bearer` prefix)

### No Frames Received

* ✓ Confirm you received `SessionReady` — frames start streaming immediately after
* ✓ If sending speech audio: verify format is 16kHz, int16, mono with big-endian message header
* ✓ Check message size < 512KB

### Choppy Playback

* ✓ Play at 25fps (40ms per frame)
* ✓ Buffer some frames before starting playback
* ✓ Check network latency and jitter

### Growing Latency

* ✓ You **must** drop silence frames — the server generates faster than realtime
* ✓ Skip 1 out of 2 silence frames (`index == 0`) when buffer grows beyond your target
* ✓ Never drop speech frames (`index == 1`)
* ✓ During speech bursts the buffer will grow temporarily — this is expected, trim silence frames afterward

### Frame Lag During Speech

* ✓ Reduce `speech_filter_amount` parameter (lower = more responsive, less smooth)

***
