# API Reference

## Overview

Real-time talking head synthesis API. Send speech audio, receive synchronized video and audio frames.

After connecting and receiving `SessionReady`, send one frame message with silence to start the interaction on the server; the server then immediately begins streaming video and audio frames at 25fps. When no speech audio has been sent, the server generates **silence frames** (persona at rest with idle animation). When you send speech audio, the server generates **speech frames** with lip-synced animation synchronized to your audio.

You only need to send speech audio — no silence, padding, or keep-alive messages are required.

{% hint style="info" %}
**Production deployments:** For secure, low-latency video applications, connect to the real-time WebSocket API from a backend server rather than a front-end client. This keeps your API key secret and lets you choose a transport suited to real-time video under varying network conditions. Typically, WebRTC delivers the final media stream to end users for smooth, reliable, low-latency playback.
{% endhint %}

***

## How It Works

1. **Connect** to the WebSocket endpoint with your API key and config ID
2. **Receive `SessionReady`** — the server has allocated inference resources for your session
3. **Send** an initial audio message containing one frame of silence
4. **The server starts streaming frames immediately** — silence frames with idle animation
5. **Send speech audio** whenever it becomes available (e.g., TTS output from your language model) — also buffer it locally for playback
6. **Receive speech frames** — the server transitions to lip-synced animation and returns to silence frames when audio runs out
7. **Render video frames** at 25fps, dropping excess silence frames to manage buffer size
8. **Start playing your buffered TTS audio** when the first speech frame (`index == 1`) arrives from Ojin — stop when speech ends

### Frame Types

Every frame arrives as a binary `InteractionResponse` containing both a JPEG image and a PCM audio chunk. Frames are always delivered in order. The `index` field identifies the frame type:

| Index | Frame type  | Description                                                                                 |
| ----- | ----------- | ------------------------------------------------------------------------------------------- |
| `0`   | **Silence** | Persona at rest with idle animation. Generated automatically when no speech audio is queued |
| `1`   | **Speech**  | Lip-synced animation generated from your audio input                                        |

### Faster-than-Realtime Generation

The server generates frames slightly **faster than realtime** to build a client-side buffer that prevents stuttering during speech. During speech bursts, generation is even faster. This means your frame buffer will grow over time if you don't manage it.

**You must drop silence frames to prevent unbounded buffer growth.** When your buffer starts growing beyond what you need for smooth playback, skip 1 out of every 2 silence frames (`index == 0`) until the buffer shrinks back down. Never drop speech frames (`index == 1`).

The right buffer target depends on your network conditions and latency requirements — start by observing your buffer size during playback and tuning from there. Keep it as low as possible to minimize latency, but high enough to absorb network jitter without starving playback.

```python
# When consuming frames from the buffer:
frame = buffer.popleft()

# If the buffer is growing and this is a silence frame, skip every other one
if len(buffer) > target_buffer_size and frame.index == 0:
    skip_counter += 1
    # Only drop the next queued frame if it is also silence; never drop speech
    if skip_counter % 2 == 0 and buffer and buffer[0].index == 0:
        buffer.popleft()  # drop one silence frame
```

***

## Connection Flow

{% @mermaid/diagram content="sequenceDiagram
participant Client
participant Server

Note over Client,Server: Connection
Client->>Server: WebSocket Connect
Server->>Client: SessionReady (JSON)
Client->>Server: InteractionInput (silence audio for one frame)

Note over Client,Server: Server Streams Immediately
Server->>Client: Frame (silence, index=0)
Server->>Client: Frame (silence, index=0)
Server->>Client: Frame (silence, index=0)

Note over Client,Server: Client Sends Speech Audio
Client->>Server: InteractionInput (TTS audio chunk 1)
Client->>Server: InteractionInput (TTS audio chunk 2)

Note over Client,Server: Server Transitions to Speech
Server->>Client: Frame (speech, index=1)
Server->>Client: Frame (speech, index=1)
Server->>Client: Frame (speech, index=1)
Note right of Server: Burst: faster than realtime

Note over Client,Server: Audio Runs Out → Back to Silence
Server->>Client: Frame (silence, index=0)
Server->>Client: Frame (silence, index=0)
Note right of Client: Client drops excess silence frames" %}

***

## WebSocket Handshake

## Open WebSocket connection

> Connect to the WebSocket endpoint providing an API key in the `Authorization` header and a `config_id` query parameter. The server upgrades the connection to WebSocket and immediately begins streaming frames after sending `SessionReady`.
>
> **Recommended WebSocket settings:**
>
> - `ping_interval`: 30 seconds
> - `ping_timeout`: 10 seconds

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"servers":[{"url":"wss://models.ojin.ai/realtime","description":"Production WebSocket endpoint"}],"security":[{"ApiKeyAuth":[]}],"components":{"securitySchemes":{"ApiKeyAuth":{"type":"apiKey","in":"header","name":"Authorization","description":"Raw API key (no `Bearer` prefix)."}},"schemas":{"SessionReadyMessage":{"type":"object","description":"Sent once by the server after the WebSocket connection is established and inference resources are allocated. **The server begins streaming frames immediately after this message.**\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["sessionReady"]},"payload":{"type":"object","required":["trace_id","status","load"],"properties":{"trace_id":{"type":"string","format":"uuid","description":"Unique session identifier assigned by the server."},"status":{"type":"string","enum":["success"],"description":"Always `success`."},"load":{"type":"number","format":"float","minimum":0,"maximum":1,"description":"Current load of the inference server (0.0–1.0)."},"timestamp":{"type":"integer","format":"int64","description":"Server timestamp in milliseconds since Unix epoch."},"parameters":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional model-specific session parameters returned by the server."}}}}},"ErrorResponseMessage":{"type":"object","description":"Sent by the server when an error occurs.\n\n**Format:** JSON text frame.\n\n> **Note:** In some error conditions (e.g., no backend servers available), the server may send a plain text message instead of a structured JSON `ErrorResponse`. Your client should handle non-JSON text messages gracefully.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["errorResponse"]},"payload":{"type":"object","required":["code","message","timestamp"],"properties":{"code":{"type":"string","description":"Machine-readable error code.","enum":["AUTH_FAILED","UNAUTHORIZED","MISSING_CONFIG_ID","INVALID_MESSAGE","INVALID_HEADERS","MODEL_NOT_FOUND","BACKEND_UNAVAILABLE","RATE_LIMITED","TIMEOUT","CANCELLED","INTERNAL_ERROR","FRAME_SIZE_EXCEEDED"]},"message":{"type":"string","description":"Human-readable description of the error."},"interaction_id":{"type":"string","nullable":true,"description":"The interaction ID related to the error, if applicable."},"details":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional additional structured details about the error."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the error was sent."}}}}}}},"paths":{"/":{"get":{"summary":"Open WebSocket connection","description":"Connect to the WebSocket endpoint providing an API key in the `Authorization` header and a `config_id` query parameter. The server upgrades the connection to WebSocket and immediately begins streaming frames after sending `SessionReady`.\n\n**Recommended WebSocket settings:**\n- `ping_interval`: 30 seconds\n- `ping_timeout`: 10 seconds","operationId":"wsHandshake","parameters":[{"in":"query","name":"config_id","required":true,"schema":{"type":"string"},"description":"Configuration ID for the persona, created in the Oris 1.0 tab of the dashboard."},{"in":"header","name":"Authorization","required":true,"schema":{"type":"string"},"description":"Your raw API key. No `Bearer` prefix."}],"responses":{"101":{"description":"WebSocket upgrade successful. 
After the upgrade, the server sends a `SessionReady` JSON message and begins streaming binary `InteractionResponse` frames immediately.","content":{"application/json":{"schema":{"$ref":"#/components/schemas/SessionReadyMessage"}}}},"401":{"description":"Unauthorized — invalid or missing API key.","content":{"application/json":{"schema":{"$ref":"#/components/schemas/ErrorResponseMessage"}}}}}}}}}
```
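
For reference, a minimal connection sketch using the Python `websockets` library with the settings above; `API_KEY` and `CONFIG_ID` are placeholders:

```python
import websockets

API_KEY = "YOUR_API_KEY"      # raw key, no Bearer prefix
CONFIG_ID = "YOUR_CONFIG_ID"  # created in the Oris 1.0 tab of the dashboard

async def connect():
    # ping_interval/ping_timeout follow the recommended WebSocket settings
    return await websockets.connect(
        f"wss://models.ojin.ai/realtime?config_id={CONFIG_ID}",
        additional_headers={"Authorization": API_KEY},  # extra_headers on older versions
        ping_interval=30,
        ping_timeout=10,
    )
```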

***

## Message Format

{% hint style="info" %}
**Mixed message types:** Both JSON (text) and binary messages travel over the same WebSocket connection. Your client must check the WebSocket frame type to distinguish them:

* **Text frames (JSON):** `SessionReady`, `EndInteraction`, `CancelInteraction`, `ErrorResponse`
* **Binary frames:** `InteractionInput`, `InteractionResponse`
{% endhint %}
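
A minimal dispatch sketch; `handle_event` and `handle_frame` are hypothetical callbacks, and `parse_response` is defined later in this reference:

```python
import json

async def receive_loop(ws):
    async for message in ws:
        if isinstance(message, str):
            # JSON text frame (rarely plain text; see ErrorResponse below)
            handle_event(json.loads(message))
        else:
            # Binary frame: an InteractionResponse
            handle_frame(parse_response(message))
```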

{% hint style="info" %}
**Byte order:** All multi-byte integer fields in binary messages use **network byte order (big-endian)**.
{% endhint %}

***

## Messages Reference

### Server → Client Messages

## The SessionReadyMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"SessionReadyMessage":{"type":"object","description":"Sent once by the server after the WebSocket connection is established and inference resources are allocated. **The server begins streaming frames immediately after this message.**\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["sessionReady"]},"payload":{"type":"object","required":["trace_id","status","load"],"properties":{"trace_id":{"type":"string","format":"uuid","description":"Unique session identifier assigned by the server."},"status":{"type":"string","enum":["success"],"description":"Always `success`."},"load":{"type":"number","format":"float","minimum":0,"maximum":1,"description":"Current load of the inference server (0.0–1.0)."},"timestamp":{"type":"integer","format":"int64","description":"Server timestamp in milliseconds since Unix epoch."},"parameters":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional model-specific session parameters returned by the server."}}}}}}}}
```
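
For illustration, a sketch of reading and validating this message right after connecting:

```python
import json

async def wait_for_session_ready(ws):
    msg = json.loads(await ws.recv())
    assert msg["type"] == "sessionReady"
    payload = msg["payload"]
    # Useful fields: trace_id for support/debugging, load for monitoring
    print(f"session {payload['trace_id']} ready, server load {payload['load']:.2f}")
    return payload
```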

## The InteractionResponseMessage object

````json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"InteractionResponseMessage":{"type":"object","description":"Binary message containing a video frame and synchronized audio chunk. The server streams these continuously after `SessionReady` — silence frames when idle, speech frames when processing your audio.\n\n**Format:** Binary frame.\n\n**Binary structure (big-endian):**\n```\n[1 byte  ]  Is final flag   — uint8, 1 = last frame, 0 = more coming\n[16 bytes]  Interaction ID  — UUID bytes\n[8 bytes ]  Timestamp       — uint64, milliseconds since Unix epoch\n[4 bytes ]  Usage           — uint32, usage metric\n[4 bytes ]  Frame index     — uint32, 0 = silence, 1 = speech\n[4 bytes ]  Num payloads    — uint32, number of payload entries\n\nFor each payload entry:\n  [4 bytes]  Data size       — uint32, byte length of payload data\n  [1 byte ]  Payload type    — uint8, 1 = audio, 2 = image\n  [N bytes]  Payload data    — raw payload bytes\n```\n\nPython unpack: `struct.unpack('!B16sQIII', header)` for the main header, `struct.unpack('!IB', entry)` for each payload entry.","required":["is_final","interaction_id","timestamp","usage","index","payloads"],"properties":{"is_final":{"type":"boolean","description":"`true` if this is the last frame for the current interaction. `false` if more frames are coming."},"interaction_id":{"type":"string","format":"uuid","description":"UUID identifying this response. Use to correlate frames across a single interaction."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the frame was sent."},"usage":{"type":"integer","format":"int32","description":"Usage metric for this response."},"index":{"type":"integer","format":"int32","enum":[0,1],"description":"Frame type. `0` = silence frame (idle animation, no speech input). `1` = speech frame (lip-synced to your audio). **Drop silence frames (`0`) to manage buffer size. Never drop speech frames (`1`).**"},"payloads":{"type":"array","description":"List of payload entries in this frame. Each frame typically contains one audio entry and one image entry.","items":{"type":"object","required":["payload_type","data"],"properties":{"payload_type":{"type":"integer","enum":[1,2],"description":"`1` = audio (PCM int16, 16kHz mono, 1,280 bytes = 640 samples = 40ms). `2` = image (JPEG-encoded, resolution depends on config e.g. 1280×720)."},"data_size":{"type":"integer","format":"int32","description":"Byte length of the payload data."},"data":{"type":"string","format":"binary","description":"Raw payload bytes. For audio: PCM int16 bytes. For image: JPEG bytes."}}}}}}}}}
````

## The ErrorResponseMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"ErrorResponseMessage":{"type":"object","description":"Sent by the server when an error occurs.\n\n**Format:** JSON text frame.\n\n> **Note:** In some error conditions (e.g., no backend servers available), the server may send a plain text message instead of a structured JSON `ErrorResponse`. Your client should handle non-JSON text messages gracefully.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["errorResponse"]},"payload":{"type":"object","required":["code","message","timestamp"],"properties":{"code":{"type":"string","description":"Machine-readable error code.","enum":["AUTH_FAILED","UNAUTHORIZED","MISSING_CONFIG_ID","INVALID_MESSAGE","INVALID_HEADERS","MODEL_NOT_FOUND","BACKEND_UNAVAILABLE","RATE_LIMITED","TIMEOUT","CANCELLED","INTERNAL_ERROR","FRAME_SIZE_EXCEEDED"]},"message":{"type":"string","description":"Human-readable description of the error."},"interaction_id":{"type":"string","nullable":true,"description":"The interaction ID related to the error, if applicable."},"details":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional additional structured details about the error."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the error was sent."}}}}}}}}
```

### Client → Server Messages

## The InteractionInputMessage object

````json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"InteractionInputMessage":{"type":"object","description":"Binary message for sending speech audio to the server. **Only send speech audio** — do not send silence or padding. The server generates silence frames automatically.\n\n**Format:** Binary frame.\n\n**Binary structure (big-endian):**\n```\n[1 byte ]  Payload type   — uint8, always 1 for audio\n[8 bytes]  Timestamp      — uint64, milliseconds since Unix epoch\n[4 bytes]  Params size    — uint32, byte length of JSON params (0 if none)\n[N bytes]  Params JSON    — UTF-8 JSON (only present if params size > 0)\n[M bytes]  Audio payload  — raw PCM int16 speech audio\n```\n\nPython pack: `struct.pack('!BQI', payload_type, timestamp, params_size)`","required":["payload_type","timestamp","params_size","audio_payload"],"properties":{"payload_type":{"type":"integer","enum":[1],"description":"Always `1` for audio."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the message was sent."},"params_size":{"type":"integer","format":"int32","minimum":0,"description":"Byte length of the JSON params block. `0` if no params."},"params":{"type":"object","nullable":true,"description":"Optional per-chunk parameters. Overrides session defaults for this audio chunk.","properties":{"speech_filter_amount":{"type":"number","format":"float","default":5,"description":"Smoothing for speech animation. Higher = smoother, less responsive."},"idle_filter_amount":{"type":"number","format":"float","default":1000,"description":"Smoothing for idle animation."},"idle_mouth_opening_scale":{"type":"number","format":"float","default":0,"description":"Mouth movement scale during idle. `0.0` = closed."},"speech_mouth_opening_scale":{"type":"number","format":"float","default":1,"description":"Mouth movement scale during speech. `1.0` = full movement."},"client_frame_index":{"type":"integer","format":"int32","default":0,"description":"Frame index the client is currently displaying. Helps the server manage silence-to-speech transitions smoothly."}}},"audio_payload":{"type":"string","format":"binary","description":"Raw PCM int16 speech audio. Requirements: 16,000 Hz sample rate, mono (1 channel), little-endian int16 samples. Entire message must be under 512 KB. Recommended chunk size: 400ms = 6,400 samples = 12,800 bytes."}}}}}}
````

## The EndInteractionMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"EndInteractionMessage":{"type":"object","description":"Signal graceful end of the session. The server finishes processing all queued audio and sends remaining frames, with the last frame marked `is_final: true`.\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["endInteraction"]},"payload":{"type":"object","required":["timestamp"],"properties":{"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the message was sent."}}}}}}}}
```

## The CancelInteractionMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"CancelInteractionMessage":{"type":"object","description":"Immediately stop processing and discard all remaining frames. No final frame is sent. Use for interruptions (e.g., user starts speaking while the persona is talking).\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["cancelInteraction"]},"payload":{"type":"object","properties":{"timestamp":{"type":"integer","format":"int64","nullable":true,"description":"Optional. Milliseconds since Unix epoch when the message was sent."}}}}}}}}
```

***

## Message Details

### InteractionInput (Client → Server, Binary)

Binary message for sending speech audio to the server. **Only send speech audio** — do not send silence or padding.

**Binary structure:**

```
[1 byte ]  Payload type   — uint8, always 1 for audio
[8 bytes]  Timestamp      — uint64, milliseconds since Unix epoch
[4 bytes]  Params size    — uint32, byte length of the JSON params block (0 if no params)
[N bytes]  Params JSON    — UTF-8 encoded JSON (only present if params size > 0)
[M bytes]  Audio payload  — raw PCM int16 speech audio data
```

**Header fields** use **big-endian** byte order. The PCM audio samples in the payload use **little-endian** (standard for PCM int16). In Python: `struct.pack('!BQI', payload_type, timestamp, params_size)`.

**Audio requirements:**

| Property         | Value                                              |
| ---------------- | -------------------------------------------------- |
| Format           | PCM signed 16-bit integers (little-endian samples) |
| Sample rate      | 16,000 Hz                                          |
| Channels         | 1 (mono)                                           |
| Max message size | 512 KB (entire binary message including header)    |
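
The recommended chunk size is 400ms (6,400 samples = 12,800 bytes). As a sketch, slicing a PCM buffer into such chunks, assuming `ws` is the open connection and `build_audio_message` is defined in the pattern below:

```python
CHUNK_BYTES = 16000 * 2 * 400 // 1000  # 400ms of 16kHz mono int16 = 12,800 bytes

async def send_speech(ws, pcm_bytes):
    for start in range(0, len(pcm_bytes), CHUNK_BYTES):
        await ws.send(build_audio_message(pcm_bytes[start:start + CHUNK_BYTES]))
```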

**Recommended streaming pattern:**

Forward speech audio to the server as it arrives from your TTS service — there is no need to buffer or accumulate before sending (buffering is only needed for playback). Sending audio in realtime, as it is generated, keeps streaming and playback smooth.

```python
import struct, json, time

def build_audio_message(audio_bytes):
    """Build a binary InteractionInput message."""
    header = struct.pack('!BQI', 1, int(time.time() * 1000), 0)
    return header + audio_bytes
```
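
To include per-chunk parameters (e.g. `speech_filter_amount`), serialize them as UTF-8 JSON between the header and the audio. A sketch, reusing the imports above:

```python
def build_audio_message_with_params(audio_bytes, params=None):
    """Build a binary InteractionInput message with optional per-chunk params."""
    params_bytes = json.dumps(params).encode("utf-8") if params else b""
    header = struct.pack('!BQI', 1, int(time.time() * 1000), len(params_bytes))
    return header + params_bytes + audio_bytes
```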

***

### InteractionResponse (Server → Client, Binary)

Binary message containing a video frame and synchronized audio. The server streams these continuously after `SessionReady`. **Frames always arrive in order.**

**Binary structure:**

```
[1 byte  ]  Is final flag   — uint8, 1 = last frame for this interaction, 0 = more coming
[16 bytes]  Interaction ID  — UUID bytes
[8 bytes ]  Timestamp       — uint64, milliseconds since Unix epoch
[4 bytes ]  Usage           — uint32, usage metric for this response
[4 bytes ]  Frame index     — uint32, 0 = silence, 1 = speech
[4 bytes ]  Num payloads    — uint32, number of payload entries that follow

For each payload entry:
  [4 bytes]  Data size      — uint32, byte length of the payload data only
  [1 byte ]  Payload type   — uint8, 1 = audio, 2 = image
  [N bytes]  Payload data   — raw payload bytes
```

All multi-byte integers are **big-endian**. In Python: `struct.unpack('!B16sQIII', header_bytes)` for the main header, `struct.unpack('!IB', entry_bytes)` for each payload entry.

**Frame index:**

| Index | Meaning                                                 |
| ----- | ------------------------------------------------------- |
| `0`   | **Silence frame** — persona at rest with idle animation |
| `1`   | **Speech frame** — lip-synced animation from your audio |

**Payload types:**

| Type      | Format                | Typical size per frame                                 |
| --------- | --------------------- | ------------------------------------------------------ |
| 1 (audio) | PCM int16, 16kHz mono | **1,280 bytes** (640 samples = 40ms at 25fps)          |
| 2 (image) | JPEG-encoded image    | Variable (resolution depends on config, e.g. 1280×720) |
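
A sketch of decoding both payload types from a parsed frame (assuming `numpy` and `Pillow` are available; `parse_response` is defined below):

```python
import io
import numpy as np
from PIL import Image

def decode_payloads(frame):
    # Audio payload: little-endian int16 PCM, 16kHz mono
    samples = np.frombuffer(frame['audio'], dtype='<i2')
    # Image payload: standard JPEG bytes
    image = Image.open(io.BytesIO(frame['image']))
    return samples, image
```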

**Parsing example:**

```python
import struct, uuid

HEADER_FMT = '!B16sQIII'
HEADER_SIZE = struct.calcsize(HEADER_FMT)   # 37 bytes
ENTRY_FMT = '!IB'
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)     # 5 bytes

def parse_response(data):
    is_final, uuid_bytes, timestamp, usage, index, num_payloads = \
        struct.unpack(HEADER_FMT, data[:HEADER_SIZE])

    offset = HEADER_SIZE
    image = audio = None

    for _ in range(num_payloads):
        size, ptype = struct.unpack(ENTRY_FMT, data[offset:offset + ENTRY_SIZE])
        offset += ENTRY_SIZE
        payload = data[offset:offset + size]
        offset += size

        if ptype == 2:
            image = payload   # JPEG bytes
        elif ptype == 1:
            audio = payload   # PCM int16 bytes

    return {
        'is_final': bool(is_final),
        'index': index,              # 0 = silence, 1 = speech
        'image': image,
        'audio': audio,
    }
```

***

### EndInteraction vs CancelInteraction

| Message             | Purpose         | Server behavior                                                                | Use case          |
| ------------------- | --------------- | ------------------------------------------------------------------------------ | ----------------- |
| `EndInteraction`    | Graceful finish | Completes processing, sends remaining frames with last marked `is_final: true` | Session end       |
| `CancelInteraction` | Immediate stop  | Stops processing, discards remaining frames                                    | User interruption |
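
Both are JSON text frames. A minimal sketch of sending each, with `buffer` standing in for your local frame deque:

```python
import json, time

async def end_interaction(ws):
    # Graceful: the server flushes queued audio; the last frame has is_final = true
    await ws.send(json.dumps({
        "type": "endInteraction",
        "payload": {"timestamp": int(time.time() * 1000)},
    }))

async def cancel_interaction(ws, buffer):
    # Immediate: the server discards remaining frames; clear the local buffer too
    await ws.send(json.dumps({
        "type": "cancelInteraction",
        "payload": {"timestamp": int(time.time() * 1000)},
    }))
    buffer.clear()
```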

***

### ErrorResponse (Server → Client, JSON)

{% hint style="warning" %}
**Plain text errors:** In some error conditions (e.g., no backend servers available), the server may send a plain text message instead of a structured JSON `ErrorResponse`. Your client should handle non-JSON text messages gracefully.
{% endhint %}
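
A sketch of handling both forms on the client:

```python
import json

def handle_error_text(message):
    try:
        msg = json.loads(message)
    except json.JSONDecodeError:
        # Plain-text error (e.g. no backend servers available)
        print(f"Server error: {message}")
        return None
    if msg.get("type") == "errorResponse":
        payload = msg["payload"]
        print(f"Error {payload['code']}: {payload['message']}")
    return msg
```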

**Error codes:**

| Code                  | Description                              |
| --------------------- | ---------------------------------------- |
| `AUTH_FAILED`         | Invalid API key                          |
| `UNAUTHORIZED`        | Caller lacks permission                  |
| `MISSING_CONFIG_ID`   | `config_id` query parameter not provided |
| `INVALID_MESSAGE`     | Malformed or unsupported message payload |
| `INVALID_HEADERS`     | Missing or invalid headers               |
| `MODEL_NOT_FOUND`     | Config ID not found or invalid           |
| `BACKEND_UNAVAILABLE` | No healthy inference backend available   |
| `RATE_LIMITED`        | Too many requests                        |
| `TIMEOUT`             | Operation exceeded processing time       |
| `CANCELLED`           | Interaction cancelled by client          |
| `INTERNAL_ERROR`      | Unexpected server error                  |
| `FRAME_SIZE_EXCEEDED` | Message exceeded 512KB limit             |

***

## Rate Limits & Constraints

| Constraint       | Value                                                   |
| ---------------- | ------------------------------------------------------- |
| Rate limit       | 6 requests per second                                   |
| Max message size | 512 KB per message                                      |
| Video output     | 25 fps target (generated slightly faster than realtime) |

Exceeding limits results in an `ErrorResponse` with code `RATE_LIMITED`.
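
If your sender might exceed this (e.g. forwarding many small TTS chunks), a simple client-side pacing sketch:

```python
import asyncio
import time

MIN_SEND_INTERVAL = 1 / 6  # stay under 6 messages per second
_last_send = 0.0

async def send_paced(ws, message):
    global _last_send
    wait = _last_send + MIN_SEND_INTERVAL - time.monotonic()
    if wait > 0:
        await asyncio.sleep(wait)
    _last_send = time.monotonic()
    await ws.send(message)
```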

***

## Best Practices

### Audio Input

* **Send one silence frame first** to start the conversation, then send speech audio when available
* Forward speech audio to the server as it arrives from your TTS service — there is no need to buffer or accumulate before sending (buffering is only needed for playback). Sending audio in realtime, as it is generated, keeps streaming and playback smooth.

### Buffer Management

* Play frames at **25 fps** (40ms per frame)
* The server generates slightly faster than realtime — **you must drop silence frames** to prevent the buffer from growing
* **Drop strategy:** when the buffer grows beyond your target size, skip 1 out of every 2 silence frames (`index == 0`) until the buffer shrinks back
* **Never drop speech frames** (`index == 1`)
* Tune your target buffer size based on your network conditions — keep it as low as possible for minimal latency

### Audio and Video Synchronization

Each `InteractionResponse` contains both a JPEG image and a PCM audio chunk. However, the **recommended approach for audio playback** is:

1. **Buffer your TTS/source audio locally** as it arrives from your speech service (for later playback only; you do not need to buffer it before sending it to Ojin)
2. **Forward TTS audio to Ojin immediately** as it arrives from your speech service
3. **Wait for the first speech frame** (`index == 1`) to arrive from Ojin
4. **Start playing your buffered TTS audio** at that moment
5. **Stop audio playback** when you receive the first silence frame (`index == 0`) after speech
6. **Render video** from every frame regardless of type

Sending is immediate, but playback is gated on the first speech frame. This ensures audio and video stay in sync.

```python
# When TTS audio arrives from your speech service:
speech_audio_buffer.extend(tts_audio_chunk)      # buffer locally for playback
await ojin.send_audio(tts_audio_chunk)            # send to Ojin for lip-sync

# In your video playback loop:
frame = buffer.popleft()

if frame.index == 1 and not audio_playing:
    start_audio_playback(speech_audio_buffer)     # begin draining the buffer
    audio_playing = True

if frame.index == 0 and audio_playing:
    stop_audio_playback()
    audio_playing = False

render_video(frame.image)                         # always render the video
```

### Error Handling

* Handle both JSON `ErrorResponse` messages and plain text error strings
* Implement exponential backoff for reconnection (see the sketch after this list)
* Monitor server `load` in the `SessionReady` message
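
A minimal reconnect sketch with exponential backoff (the broad `except` is for brevity; narrow it to connection errors in production):

```python
import asyncio
import websockets

async def connect_with_backoff(url, headers, max_delay=30):
    delay = 1
    while True:
        try:
            return await websockets.connect(
                url,
                additional_headers=headers,
                ping_interval=30,
                ping_timeout=10,
            )
        except Exception as exc:
            print(f"Connect failed ({exc}), retrying in {delay}s")
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)
```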

### Interruption Handling

* Use `CancelInteraction` for immediate stops (e.g., user interrupts the bot)
* Use `EndInteraction` for graceful session endings
* Clear your frame buffer on interruption

***

## Complete Example

```python
import asyncio
import json
import struct
import time
from collections import deque
import numpy as np
import websockets
from dotenv import load_dotenv
import os

load_dotenv()

API_KEY = os.getenv("OJIN_API_KEY", "")
CONFIG_ID = os.getenv("OJIN_CONFIG_ID", "")
URL = f"wss://models.ojin.ai/realtime?config_id={CONFIG_ID}"

SAMPLE_RATE = 16000
FPS = 25
TARGET_BUFFER = 10  # Tune based on your network conditions

def build_audio_message(audio_bytes):
    """Build a binary InteractionInput message."""
    header = struct.pack('!BQI', 1, int(time.time() * 1000), 0)
    return header + audio_bytes

def parse_response(data):
    """Parse a binary InteractionResponse message."""
    fmt = '!B16sQIII'
    hdr_size = struct.calcsize(fmt)
    is_final, uuid_bytes, timestamp, usage, index, num_payloads = \
        struct.unpack(fmt, data[:hdr_size])

    offset = hdr_size
    image = audio = None

    for _ in range(num_payloads):
        size, ptype = struct.unpack('!IB', data[offset:offset + 5])
        offset += 5
        if ptype == 2:
            image = data[offset:offset+size]
        elif ptype == 1:
            audio = data[offset:offset+size]
        offset += size

    return {
        'is_final': bool(is_final),
        'index': index,              # 0 = silence, 1 = speech
        'image': image,
        'audio': audio,
    }

async def main():
    headers = {"Authorization": API_KEY}
    # For older websockets versions, use extra_headers instead.
    async with websockets.connect(URL, additional_headers=headers, ping_interval=30) as ws:
        # 1. Wait for SessionReady — server starts streaming frames immediately after
        msg = json.loads(await ws.recv())
        assert msg["type"] == "sessionReady"
        print(f"Session ready: {msg['payload']}")

        buffer = deque()
        skip_counter = 0
        playback_started = False
        frame_count = 0
        audio_sent = False

        # 2. Send one silence frame to start processing
        audio_data = b"\x00\x00" * (SAMPLE_RATE // FPS)  # one frame (40ms) of silence
        await ws.send(build_audio_message(audio_data))

        # 3. Receive and process frames
        async for data in ws:
            if isinstance(data, str):
                msg = json.loads(data)
                if msg.get("type") == "errorResponse":
                    print(f"Error: {msg['payload']}")
                    break
                continue

            frame = parse_response(data)
            buffer.append(frame)
            frame_count += 1

            # Wait for initial buffer before playback
            if not playback_started:
                if len(buffer) >= TARGET_BUFFER:
                    playback_started = True
                    print(f"Buffer filled ({TARGET_BUFFER} frames), starting playback")
                continue

            # Consume one frame
            if buffer:
                play_frame = buffer.popleft()

                # Drop excess silence frames: skip 1 out of 2 when buffer is too large
                if len(buffer) > TARGET_BUFFER and play_frame['index'] == 0:
                    skip_counter += 1
                    # Only drop the next queued frame if it is also silence
                    if skip_counter % 2 == 0 and buffer and buffer[0]['index'] == 0:
                        buffer.popleft()  # drop one silence frame

                kind = "silence" if play_frame['index'] == 0 else "speech"
                print(f"[{kind}] frame #{frame_count}, buffer={len(buffer)}")

                # In a real app: render play_frame['image'] and play play_frame['audio']

            # Demo: send speech audio after receiving some silence frames
            if frame_count == 50 and not audio_sent:
                t = np.linspace(0, 1.0, SAMPLE_RATE, endpoint=False)
                audio_data = (32767 * 0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
                chunk_size = SAMPLE_RATE  # 1s chunks (16,000 samples = 32,000 bytes)
                for i in range(0, len(audio_data), chunk_size):
                    chunk = audio_data[i:i + chunk_size]
                    await ws.send(build_audio_message(chunk.tobytes()))
                audio_sent = True
                print("Sent 1 second of speech audio")

            if frame_count > 200:
                break

asyncio.run(main())
```

***

## Troubleshooting

### Connection Issues

* ✓ Verify API key and config ID
* ✓ Check that config exists in dashboard
* ✓ Ensure network allows WebSocket connections (port 443)
* ✓ Check the `Authorization` header uses the raw API key (no `Bearer` prefix)

### No Frames Received

* ✓ Confirm you received `SessionReady` — frames start streaming immediately after
* ✓ If sending speech audio: verify format is 16kHz, int16, mono with big-endian message header
* ✓ Check message size < 512KB

### Choppy Playback

* ✓ Play at 25fps (40ms per frame)
* ✓ Buffer some frames before starting playback
* ✓ Check network latency and jitter

### Growing Latency

* ✓ You **must** drop silence frames — the server generates faster than realtime
* ✓ Skip 1 out of 2 silence frames (`index == 0`) when buffer grows beyond your target
* ✓ Never drop speech frames (`index == 1`)
* ✓ During speech bursts the buffer will grow temporarily — this is expected, trim silence frames afterward

### Frame Lag During Speech

* ✓ Reduce `speech_filter_amount` parameter (lower = more responsive, less smooth)

***
