# Welcome

## What is Ojin?

Ojin is a real-time AI inference platform for developers. We provide state-of-the-art real-time models via ultra-low-latency APIs. Our flagship model `ojin/oris-1.0` provides a way to create AI personas from a single reference video and animate them in real-time by sending audio inputs.

## LLM-Ready Docs

{% hint style="info" %}
This documentation is optimized for Large Language Models access. You can access it via MCP ([MCP Server URL](https://docs.ojin.ai/~gitbook/mcp)), [llms.txt](https://docs.ojin.ai/llms.txt), [llms-full.txt](https://docs.ojin.ai/llms-full.txt), and by appending `.md` to any page URL for raw markdown content.
{% endhint %}

## Core Features

* **Realistic Persona Generation**: Create lifelike personas from a single reference video
* **Audio-to-Persona Synthesis**: Convert audio inputs into synchronized lip movements and natural expressions
* **Real-time Streaming**: Deliver persona animations with industry-leading low latency through websockets
* **API-first Design**: Integrate personas into any application with our comprehensive API
* **Cost-effective**: Get the highest quality personas at the most competitive pricing
* **Auto-scale**: Scale your applications automatically based on your needs

## Available Models

Ojin currently offers real-time WebSocket APIs for the following generative AI models:

#### ojin/oris-1.0

A lifelike persona model that transforms reference videos into natural video animations with audio-synchronized lip movements and expressions.

[Learn More →](https://docs.ojin.ai/models/oris)

## Use Cases

* **Customer Support**: Enhance support experiences with personalized persona interactions.
* **Education**: Develop interactive tutors and educational content.
* **Entertainment**: Build interactive characters for games and entertainment apps.
* **Accessibility**: Make digital content more accessible with visual communication.

## Getting Started

#### Quick Start

Create your first AI persona in minutes

[Get Started →](https://docs.ojin.ai/getting-started/quickstart)

#### Authentication

Set up API access for your application

[Learn more →](https://docs.ojin.ai/getting-started/authentication)

#### Available Models

Explore our persona models and capabilities

[View Models →](https://docs.ojin.ai/models/oris)


# Quickstart

> Get started with ojin/oris persona model in minutes.

This guide will walk you through creating your first AI persona and integrating it with Pipecat for real-time conversational experiences.

## Step 1: Create an API Key

[Get your API key from the Ojin dashboard](https://docs.ojin.ai/getting-started/authentication). This will allow you to use our models in your applications through a secure environment.

{% hint style="warning" %}
Never hardcode your API key directly in your application code or commit it to version control.
{% endhint %}

## Step 2: Create Your Model Configuration

[Create a model config through Ojin dashboard](https://docs.ojin.ai/models/oris/creating-persona).

## Step 3: Integrate with ojin/oris-1.0 model

[Integrate with your application of choice](https://docs.ojin.ai/models/oris/integrations).

## Troubleshooting

[Check for common troubleshooting questions](https://github.com/journee-live/ojin/blob/main/docs/public/troubleshooting.md).

## Next Steps

Now that you have a working persona chatbot, you can explore:

#### API Reference

Dive deeper into the complete API documentation

[View API Reference →](https://docs.ojin.ai/models/oris/api)


# Get your API key

All requests to the Ojin API require authentication using API keys. This guide explains how to create, manage, and securely use API keys in your applications.

## Creating an API Key

1. **Sign in** to your [Ojin Dashboard](https://ojin.ai)
2. Navigate to the **API Keys** section
3. Click **Create API Key**
4. Enter a descriptive name for your key (e.g., "Development", "Production")
5. Click **Create**
6. **Important**: Copy and store your API key securely. It will only be shown once.

{% hint style="warning" %}
API keys provide full access to your Ojin resources. Never expose them in client-side code, public repositories, or share them with unauthorized individuals.
{% endhint %}

## API Key Best Practices

* **Separate keys** for development and production environments
* **Use environment variables** to store API keys without exposing them publicly
* **Restrict permissions** to only what's needed for each key
* **Rotate keys** periodically for enhanced security
* **Revoke compromised keys** immediately in your dashboard
* **Use secret management services** in production environments
* **Monitor usage** to detect unusual patterns that might indicate a leak


# Support

Need help with the Ojin platform? We're here to assist you with any questions, issues, or feedback you may have.

## Contact Us

For support inquiries, please reach out to our team:

**Email**: <hello@ojin.ai>

Our support team will respond to your inquiry as soon as possible.

## What to Include in Your Support Request

To help us assist you more efficiently, please include the following information when contacting support:

* **Description**: A clear description of your issue or question
* **Model**: Which model you're working with (e.g., ojin/oris-1.0)
* **Error Messages**: Any error messages or codes you're encountering
* **Steps to Reproduce**: If applicable, steps to reproduce the issue
* **Expected vs Actual Behavior**: What you expected to happen vs what actually happened
* **Environment**: Your development environment details (language, framework, etc.)

{% hint style="info" %}
Before reaching out, check our [Troubleshooting](https://docs.ojin.ai/best-practices/troubleshooting) guide for common issues and solutions.
{% endhint %}

## Additional Resources

* [Documentation](https://docs.ojin.ai/getting-started/readme)
* [Quickstart Guide](https://docs.ojin.ai/getting-started/quickstart)
* [API Reference](https://docs.ojin.ai/models/oris/api)
* [Troubleshooting](https://docs.ojin.ai/best-practices/troubleshooting)


# ojin/oris-1.0

> A lifelike persona model that transforms reference videos into natural animated personas

## Overview

The ojin/oris-1.0 model is our flagship persona generation technology that creates realistic, expressive digital humans from a reference video. It excels at producing natural facial animations, lip-syncing, and emotional expressions that bring your persona to life with synchronized speech.

## Key Features

* **Full persona look control** - Generate a persona based on any video reference, the persona will behave exactly the same
* **No training required** - You don't need to wait for your persona to be ready, as soon as reference video is uploaded, you can start using it
* **Natural Lip-Syncing** - Precise lip movements synchronized with speech audio
* **Emotional Expressions** - Support for multiple emotional states and expressions
* **Low Latency** - Fast processing for real-time applications
* **High Resolution** - Support for up to 720p output resolution

## Quick Start

Getting started with ojin/oris-1.0 is simple:

1. [**Create an API key**](https://docs.ojin.ai/getting-started/authentication) - Set up authentication for the Ojin platform
2. [**Use a persona template**](https://docs.ojin.ai/models/oris/using-persona-template) - Use a persona template to generate your persona in seconds
3. [**Integrate with your application**](https://docs.ojin.ai/models/oris/integrations) - Use either Pipecat or WebSocket API

## Use Cases

* **Virtual Assistants** - Create responsive customer service personas
* **Educational Content** - Develop engaging tutors and instructors
* **Entertainment** - Produce animated characters for games and media
* **Presentations** - Transform static slides into dynamic video presentations
* **Healthcare** - Build empathetic virtual health assistants


# Get started

This guide explains how to integrate the `ojin/oris-1.0` persona model into your applications using either Pipecat or Websockets

## Prerequisites

1. An Ojin account with an active API key, if you don't have one [get your API key](https://github.com/journee-live/ojin/blob/main/docs/public/models/ojin/oris/authentication.md)
2. [Create a Persona](https://docs.ojin.ai/models/oris/creating-persona) or use a [Persona Template](https://docs.ojin.ai/models/oris/using-persona-template)
3. Save the Persona Configuration ID from the dashboard
4. Integrate with your application using either [Pipecat](#pipecat-integration) or [Websockets](#websocket-integration)

{% hint style="info" %}
**Production deployments:** For secure, low-latency video applications, connect to the real-time WebSocket API from a backend server rather than a front-end client (to keep your API key secure and leverage a network transport appropriate for real-time video media delivery under varying network conditions). Typically, WebRTC is used to deliver the final media stream to end users for smooth, reliable, low-latency playback.
{% endhint %}

{% tabs %}
{% tab title="Pipecat" %}

### Pipecat Integration

[Pipecat](https://github.com/pipecat-ai/pipecat) is a powerful open source framework for building conversational AI pipelines. The `ojin/oris-1.0` model integrates seamlessly with Pipecat through our dedicated `OjinVideoService`.

#### Option 1: Clone pipecat repository and checkout the ready to use [ojin-chatbot example](https://github.com/journee-live/pipecat-ojin/tree/main/examples/ojin-chatbot)

To start using it, create a python virtual environment on it and install requirements

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Create a `.env` file and add your Ojin API key and persona ID

```bash
OJIN_API_KEY="your_api_key_here"
OJIN_CONFIG_ID="your_persona_id_here"
```

Then just run the [mock\_bot.py](https://github.com/journee-live/pipecat-ojin/blob/main/examples/ojin-chatbot/mock/mock_bot.py) to check that your ojin setup is correct and see a generation out of a wav file:

```bash
python mock/mock_bot.py
```

Alternatively, you can configure all required environment variables for the services used in this example (such as Hume) by referring to env.example. Once configured, you can interact with a conversational, human-like bot using your local audio input/output

```bash
python bot.py
```

#### How It Works

1. The microphone listens for speech input
2. Voice Activity Detection identifies speech segments
3. User audio is sent to Hume to get an LLM response using their Speech-To-Speech service.
4. The OjinVideoService animates your persona based on the STS audio.
5. Video frames are received and displayed in real-time together with the audio.

{% hint style="info" %}
You can customize the pipeline by adding or removing components, or by adjusting their parameters to suit your needs.
{% endhint %}
{% endtab %}

{% tab title="WebSocket" %}

### WebSocket Integration

For Websocket integration check our [API Reference →](https://docs.ojin.ai/models/oris/api)
{% endtab %}
{% endtabs %}

## Next Steps

#### API Reference

Dive deeper into the model API for custom integrations

[API Reference →](https://docs.ojin.ai/models/oris/api)


# Using a Persona Template

Learn how to create a persona out of a template to get started ASAP with your application.

## Prerequisites

* An Ojin account with an active API key

## Creating a Persona through the Dashboard

The simplest way to create a persona is through the Ojin Dashboard:

1. Log in to the [Ojin Dashboard](https://ojin.ai)
2. Navigate to the [**Oris 1.0**](https://ojin.ai/models/ojin/oris-1.0) section
3. Navigate to [**Configs**](https://ojin.ai/models/ojin/oris-1.0/configs) sub-section
4. Select a persona template and press **Copy Template**
5. Open the newly created model configuration and save **Model Config ID** parameter which will be used by your application
6. You can now integrate it through the [**model API endpoints**](https://ojin.ai/models/ojin/oris-1.0/docs)

## Next Steps

Once your persona is ready, you can:

#### Integration Guide

Learn how to integrate your persona using Pipecat or WebSocket

[View Integration Guide →](https://docs.ojin.ai/models/oris/integrations)

#### API Reference

Explore the complete API documentation

[View API Reference →](https://docs.ojin.ai/models/oris/api)


# Creating a custom Persona

Before you can start integrating with the ojin/oris model, you'll need to create a persona configuration. This guide walks you through the process of creating a persona that looks exactly how you want.

## Prerequisites

* An Ojin account with an active API key
* A high-quality reference video of the persona you want to animate (check [Reference Video best practices](#reference-video-best-practices) for more details)

{% hint style="info" %}
For best results, follow reference video best practices next
{% endhint %}

## Creating a Persona configuration through the Dashboard

The simplest way to create a persona is through the Ojin Dashboard:

1. Log in to the [Ojin Dashboard](https://ojin.ai)
2. Select the [**ojin/oris-1.0**](https://ojin.ai/models/ojin/oris-1.0) model
3. Navigate to [**Configs**](https://ojin.ai/models/ojin/oris-1.0/configs) sub-section
4. Press **New Configuration** button to create a new configuration
5. Fill in required fields and upload a reference video. Make sure to follow the instructions bellow on how the video should look.
6. Click **Create Configuration**
7. Open your newly created configuration and copy **Model Config ID** parameter which will be used by your application

### Reference Video best practices

* **Video content**:
  * The video will be used as base for your persona idle loop and speech. Make sure to make the movements of the person balanced, not too expressive, not too still. so that it fits both states
  * Mouth should be closed during the entire video, but smiles or gestures are not a problem
  * The eyes should be looking directly into the camera.
  * Minimum length of video for better results recommended is 15 seconds
  * The length of the video should not be more than 30 seconds.
* **Resolution**: Use 1080p resolution for the reference video and at least 25 fps
* **Lighting**: Ensure even lighting with no harsh shadows
* **Face Position**: The face should be clearly visible and centered
* **Background**: Simple backgrounds work best
* **Accessories**: Avoid sunglasses or items that obscure facial features

## Next Steps

Once your persona is ready, you can:

#### Integration Guide

Learn how to integrate your persona using Pipecat or WebSocket

[View Integration Guide →](https://docs.ojin.ai/models/oris/integrations)

#### API Reference

Explore the complete API documentation

[View API Reference →](https://docs.ojin.ai/models/oris/api)


# API Reference

## Overview

Real-time talking head synthesis API. Send speech audio, receive synchronized video and audio frames.

After connecting and receiving `SessionReady`, you should send one frame message with silence to start the interaction on the server, the server then immediately begins streaming video and audio frames at 25fps. When no speech audio has been sent, the server generates **silence frames** (persona at rest with idle animation). When you send speech audio, the server generates **speech frames** with lip-synced animation synchronized to your audio.

You only need to send speech audio — no silence, padding, or keep-alive messages are required.

{% hint style="info" %}
**Production deployments:** For secure, low-latency video applications, connect to the real-time WebSocket API from a backend server rather than a front-end client (to keep your API key secure and leverage a network transport appropriate for real-time video media delivery under varying network conditions). Typically, WebRTC is used to deliver the final media stream to end users for smooth, reliable, low-latency playback.
{% endhint %}

***

## How It Works

1. **Connect** to the WebSocket endpoint with your API key and config ID
2. **Receive `SessionReady`** — the server has allocated inference resources for your session
3. **Send** initial audio message with silence for one frame.
4. **The server starts streaming frames immediately** — silence frames with idle animation
5. **Send speech audio** whenever it becomes available (e.g., TTS output from your language model) — also buffer it locally for playback
6. **Receive speech frames** — the server transitions to lip-synced animation and returns to silence frames when audio runs out
7. **Render video frames** at 25fps, dropping excess silence frames to manage buffer size
8. **Start playing your buffered TTS audio** when the first speech frame (`index == 1`) arrives from Ojin — stop when speech ends

### Frame Types

Every frame arrives as a binary `InteractionResponse` containing both a JPEG image and a PCM audio chunk. Frames are always delivered in order. The `index` field identifies the frame type:

| Index | Frame type  | Description                                                                                 |
| ----- | ----------- | ------------------------------------------------------------------------------------------- |
| `0`   | **Silence** | Persona at rest with idle animation. Generated automatically when no speech audio is queued |
| `1`   | **Speech**  | Lip-synced animation generated from your audio input                                        |

### Faster-than-Realtime Generation

The server generates frames slightly **faster than realtime** to build a client-side buffer that prevents stuttering during speech. During speech bursts, generation is even faster. This means your frame buffer will grow over time if you don't manage it.

**You must drop silence frames to prevent unbounded buffer growth.** When your buffer starts growing beyond what you need for smooth playback, skip 1 out of every 2 silence frames (`index == 0`) until the buffer shrinks back down. Never drop speech frames (`index == 1`).

The right buffer target depends on your network conditions and latency requirements — start by observing your buffer size during playback and tuning from there. Keep it as low as possible to minimize latency, but high enough to absorb network jitter without starving playback.

```python
# When consuming frames from the buffer:
frame = buffer.popleft()

# If buffer is growing and this is a silence frame, skip every other one
if len(buffer) > target_buffer_size and frame.index == 0:
    skip_counter += 1
    if skip_counter % 2 == 0 and buffer:
        buffer.popleft()  # drop one silence frame
```

***

## Connection Flow

{% @mermaid/diagram content="sequenceDiagram
participant Client
participant Server

```
Note over Client,Server: Connection
Client->>Server: WebSocket Connect
Server->>Client: SessionReady (JSON)
Client->>Server: InteractionInput (silence audio for one frame)

Note over Client,Server: Server Streams Immediately
Server->>Client: Frame (silence, index=0)
Server->>Client: Frame (silence, index=0)
Server->>Client: Frame (silence, index=0)

Note over Client,Server: Client Sends Speech Audio
Client->>Server: InteractionInput (TTS audio chunk 1)
Client->>Server: InteractionInput (TTS audio chunk 2)

Note over Client,Server: Server Transitions to Speech
Server->>Client: Frame (speech, index=1)
Server->>Client: Frame (speech, index=1)
Server->>Client: Frame (speech, index=1)
Note right of Server: Burst: faster than realtime

Note over Client,Server: Audio Runs Out → Back to Silence
Server->>Client: Frame (silence, index=0)
Server->>Client: Frame (silence, index=0)
Note right of Client: Client drops excess silence frames" %}
```

***

## WebSocket Handshake

## Open WebSocket connection

> Connect to the WebSocket endpoint providing an API key in the \`Authorization\` header and a \`config\_id\` query parameter. The server upgrades the connection to WebSocket and immediately begins streaming frames after sending \`SessionReady\`.\
> \
> \*\*Recommended WebSocket settings:\*\*\
> \- \`ping\_interval\`: 30 seconds\
> \- \`ping\_timeout\`: 10 seconds

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"servers":[{"url":"wss://models.ojin.ai/realtime","description":"Production WebSocket endpoint"}],"security":[{"ApiKeyAuth":[]}],"components":{"securitySchemes":{"ApiKeyAuth":{"type":"apiKey","in":"header","name":"Authorization","description":"Raw API key (no `Bearer` prefix)."}},"schemas":{"SessionReadyMessage":{"type":"object","description":"Sent once by the server after the WebSocket connection is established and inference resources are allocated. **The server begins streaming frames immediately after this message.**\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["sessionReady"]},"payload":{"type":"object","required":["trace_id","status","load"],"properties":{"trace_id":{"type":"string","format":"uuid","description":"Unique session identifier assigned by the server."},"status":{"type":"string","enum":["success"],"description":"Always `success`."},"load":{"type":"number","format":"float","minimum":0,"maximum":1,"description":"Current load of the inference server (0.0–1.0)."},"timestamp":{"type":"integer","format":"int64","description":"Server timestamp in milliseconds since Unix epoch."},"parameters":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional model-specific session parameters returned by the server."}}}}},"ErrorResponseMessage":{"type":"object","description":"Sent by the server when an error occurs.\n\n**Format:** JSON text frame.\n\n> **Note:** In some error conditions (e.g., no backend servers available), the server may send a plain text message instead of a structured JSON `ErrorResponse`. Your client should handle non-JSON text messages gracefully.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["errorResponse"]},"payload":{"type":"object","required":["code","message","timestamp"],"properties":{"code":{"type":"string","description":"Machine-readable error code.","enum":["AUTH_FAILED","UNAUTHORIZED","MISSING_CONFIG_ID","INVALID_MESSAGE","INVALID_HEADERS","MODEL_NOT_FOUND","BACKEND_UNAVAILABLE","RATE_LIMITED","TIMEOUT","CANCELLED","INTERNAL_ERROR","FRAME_SIZE_EXCEEDED"]},"message":{"type":"string","description":"Human-readable description of the error."},"interaction_id":{"type":"string","nullable":true,"description":"The interaction ID related to the error, if applicable."},"details":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional additional structured details about the error."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the error was sent."}}}}}}},"paths":{"/":{"get":{"summary":"Open WebSocket connection","description":"Connect to the WebSocket endpoint providing an API key in the `Authorization` header and a `config_id` query parameter. The server upgrades the connection to WebSocket and immediately begins streaming frames after sending `SessionReady`.\n\n**Recommended WebSocket settings:**\n- `ping_interval`: 30 seconds\n- `ping_timeout`: 10 seconds","operationId":"wsHandshake","parameters":[{"in":"query","name":"config_id","required":true,"schema":{"type":"string"},"description":"Configuration ID for the persona, created in the Oris 1.0 tab of the dashboard."},{"in":"header","name":"Authorization","required":true,"schema":{"type":"string"},"description":"Your raw API key. No `Bearer` prefix."}],"responses":{"101":{"description":"WebSocket upgrade successful. After the upgrade, the server sends a `SessionReady` JSON message and begins streaming binary `InteractionResponse` frames immediately.","content":{"application/json":{"schema":{"$ref":"#/components/schemas/SessionReadyMessage"}}}},"401":{"description":"Unauthorized — invalid or missing API key.","content":{"application/json":{"schema":{"$ref":"#/components/schemas/ErrorResponseMessage"}}}}}}}}}
```

***

## Message Format

{% hint style="info" %}
**Mixed message types:** The server sends both JSON (text) and binary messages on the same WebSocket connection. Your client must check the WebSocket frame type to distinguish them:

* **Text frames (JSON):** `SessionReady`, `EndInteraction`, `CancelInteraction`, `ErrorResponse`
* **Binary frames:** `InteractionInput`, `InteractionResponse`
  {% endhint %}

{% hint style="info" %}
**Byte order:** All multi-byte integer fields in binary messages use **network byte order (big-endian)**.
{% endhint %}

***

## Messages Reference

### Server → Client Messages

## The SessionReadyMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"SessionReadyMessage":{"type":"object","description":"Sent once by the server after the WebSocket connection is established and inference resources are allocated. **The server begins streaming frames immediately after this message.**\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["sessionReady"]},"payload":{"type":"object","required":["trace_id","status","load"],"properties":{"trace_id":{"type":"string","format":"uuid","description":"Unique session identifier assigned by the server."},"status":{"type":"string","enum":["success"],"description":"Always `success`."},"load":{"type":"number","format":"float","minimum":0,"maximum":1,"description":"Current load of the inference server (0.0–1.0)."},"timestamp":{"type":"integer","format":"int64","description":"Server timestamp in milliseconds since Unix epoch."},"parameters":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional model-specific session parameters returned by the server."}}}}}}}}
```

## The InteractionResponseMessage object

````json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"InteractionResponseMessage":{"type":"object","description":"Binary message containing a video frame and synchronized audio chunk. The server streams these continuously after `SessionReady` — silence frames when idle, speech frames when processing your audio.\n\n**Format:** Binary frame.\n\n**Binary structure (big-endian):**\n```\n[1 byte  ]  Is final flag   — uint8, 1 = last frame, 0 = more coming\n[16 bytes]  Interaction ID  — UUID bytes\n[8 bytes ]  Timestamp       — uint64, milliseconds since Unix epoch\n[4 bytes ]  Usage           — uint32, usage metric\n[4 bytes ]  Frame index     — uint32, 0 = silence, 1 = speech\n[4 bytes ]  Num payloads    — uint32, number of payload entries\n\nFor each payload entry:\n  [4 bytes]  Data size       — uint32, byte length of payload data\n  [1 byte ]  Payload type    — uint8, 1 = audio, 2 = image\n  [N bytes]  Payload data    — raw payload bytes\n```\n\nPython unpack: `struct.unpack('!B16sQIII', header)` for the main header, `struct.unpack('!IB', entry)` for each payload entry.","required":["is_final","interaction_id","timestamp","usage","index","payloads"],"properties":{"is_final":{"type":"boolean","description":"`true` if this is the last frame for the current interaction. `false` if more frames are coming."},"interaction_id":{"type":"string","format":"uuid","description":"UUID identifying this response. Use to correlate frames across a single interaction."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the frame was sent."},"usage":{"type":"integer","format":"int32","description":"Usage metric for this response."},"index":{"type":"integer","format":"int32","enum":[0,1],"description":"Frame type. `0` = silence frame (idle animation, no speech input). `1` = speech frame (lip-synced to your audio). **Drop silence frames (`0`) to manage buffer size. Never drop speech frames (`1`).**"},"payloads":{"type":"array","description":"List of payload entries in this frame. Each frame typically contains one audio entry and one image entry.","items":{"type":"object","required":["payload_type","data"],"properties":{"payload_type":{"type":"integer","enum":[1,2],"description":"`1` = audio (PCM int16, 16kHz mono, 1,280 bytes = 640 samples = 40ms). `2` = image (JPEG-encoded, resolution depends on config e.g. 1280×720)."},"data_size":{"type":"integer","format":"int32","description":"Byte length of the payload data."},"data":{"type":"string","format":"binary","description":"Raw payload bytes. For audio: PCM int16 bytes. For image: JPEG bytes."}}}}}}}}}
````

## The ErrorResponseMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"ErrorResponseMessage":{"type":"object","description":"Sent by the server when an error occurs.\n\n**Format:** JSON text frame.\n\n> **Note:** In some error conditions (e.g., no backend servers available), the server may send a plain text message instead of a structured JSON `ErrorResponse`. Your client should handle non-JSON text messages gracefully.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["errorResponse"]},"payload":{"type":"object","required":["code","message","timestamp"],"properties":{"code":{"type":"string","description":"Machine-readable error code.","enum":["AUTH_FAILED","UNAUTHORIZED","MISSING_CONFIG_ID","INVALID_MESSAGE","INVALID_HEADERS","MODEL_NOT_FOUND","BACKEND_UNAVAILABLE","RATE_LIMITED","TIMEOUT","CANCELLED","INTERNAL_ERROR","FRAME_SIZE_EXCEEDED"]},"message":{"type":"string","description":"Human-readable description of the error."},"interaction_id":{"type":"string","nullable":true,"description":"The interaction ID related to the error, if applicable."},"details":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional additional structured details about the error."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the error was sent."}}}}}}}}
```

### Client → Server Messages

## The InteractionInputMessage object

````json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"InteractionInputMessage":{"type":"object","description":"Binary message for sending speech audio to the server. **Only send speech audio** — do not send silence or padding. The server generates silence frames automatically.\n\n**Format:** Binary frame.\n\n**Binary structure (big-endian):**\n```\n[1 byte ]  Payload type   — uint8, always 1 for audio\n[8 bytes]  Timestamp      — uint64, milliseconds since Unix epoch\n[4 bytes]  Params size    — uint32, byte length of JSON params (0 if none)\n[N bytes]  Params JSON    — UTF-8 JSON (only present if params size > 0)\n[M bytes]  Audio payload  — raw PCM int16 speech audio\n```\n\nPython pack: `struct.pack('!BQI', payload_type, timestamp, params_size)`","required":["payload_type","timestamp","params_size","audio_payload"],"properties":{"payload_type":{"type":"integer","enum":[1],"description":"Always `1` for audio."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the message was sent."},"params_size":{"type":"integer","format":"int32","minimum":0,"description":"Byte length of the JSON params block. `0` if no params."},"params":{"type":"object","nullable":true,"description":"Optional per-chunk parameters. Overrides session defaults for this audio chunk.","properties":{"speech_filter_amount":{"type":"number","format":"float","default":5,"description":"Smoothing for speech animation. Higher = smoother, less responsive."},"idle_filter_amount":{"type":"number","format":"float","default":1000,"description":"Smoothing for idle animation."},"idle_mouth_opening_scale":{"type":"number","format":"float","default":0,"description":"Mouth movement scale during idle. `0.0` = closed."},"speech_mouth_opening_scale":{"type":"number","format":"float","default":1,"description":"Mouth movement scale during speech. `1.0` = full movement."},"client_frame_index":{"type":"integer","format":"int32","default":0,"description":"Frame index the client is currently displaying. Helps the server manage silence-to-speech transitions smoothly."}}},"audio_payload":{"type":"string","format":"binary","description":"Raw PCM int16 speech audio. Requirements: 16,000 Hz sample rate, mono (1 channel), little-endian int16 samples. Entire message must be under 512 KB. Recommended chunk size: 400ms = 6,400 samples = 12,800 bytes."}}}}}}
````

## The EndInteractionMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"EndInteractionMessage":{"type":"object","description":"Signal graceful end of the session. The server finishes processing all queued audio and sends remaining frames, with the last frame marked `is_final: true`.\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["endInteraction"]},"payload":{"type":"object","required":["timestamp"],"properties":{"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the message was sent."}}}}}}}}
```

## The CancelInteractionMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Oris 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"CancelInteractionMessage":{"type":"object","description":"Immediately stop processing and discard all remaining frames. No final frame is sent. Use for interruptions (e.g., user starts speaking while the persona is talking).\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["cancelInteraction"]},"payload":{"type":"object","properties":{"timestamp":{"type":"integer","format":"int64","nullable":true,"description":"Optional. Milliseconds since Unix epoch when the message was sent."}}}}}}}}
```

***

## Message Details

### InteractionInput (Client → Server, Binary)

Binary message for sending speech audio to the server. **Only send speech audio** — do not send silence or padding.

**Binary structure:**

```
[1 byte ]  Payload type      — uint8, always 1 for audio
[8 bytes]  Timestamp          — uint64, milliseconds since Unix epoch
[4 bytes ]  Params size        — uint32, byte length of the JSON params block (0 if no params)
[N bytes]  Params JSON        — UTF-8 encoded JSON (only present if params size > 0)
[M bytes]  Audio payload      — raw PCM int16 speech audio data
```

**Header fields** use **big-endian** byte order. The PCM audio samples in the payload use **little-endian** (standard for PCM int16). In Python: `struct.pack('!BQI', payload_type, timestamp, params_size)`.

**Audio requirements:**

| Property         | Value                                              |
| ---------------- | -------------------------------------------------- |
| Format           | PCM signed 16-bit integers (little-endian samples) |
| Sample rate      | 16,000 Hz                                          |
| Channels         | 1 (mono)                                           |
| Max message size | 512 KB (entire binary message including header)    |

**Recommended streaming pattern:**

Forward speech audio to the server as it arrives from your TTS service — no need to buffer or accumulate before sending (buffering only needed for playback). Send the audio in realtime to achieve a realtime streaming and smooth playback experience.

```python
import struct, json, time

def build_audio_message(audio_bytes):
    """Build a binary InteractionInput message."""
    header = struct.pack('!BQI', 1, int(time.time() * 1000), 0)
    return header + audio_bytes
```

***

### InteractionResponse (Server → Client, Binary)

Binary message containing a video frame and synchronized audio. The server streams these continuously after `SessionReady`. **Frames always arrive in order.**

**Binary structure:**

```
[1 byte  ]  Is final flag     — uint8, 1 = last frame for this interaction, 0 = more coming
[16 bytes]  Interaction ID     — UUID bytes (big-endian)
[8 bytes ]  Timestamp          — uint64, milliseconds since Unix epoch
[4 bytes ]  Usage              — uint32, usage metric for this response
[4 bytes ]  Frame index        — uint32, 0 = silence, 1 = speech
[4 bytes ]  Num payloads       — uint32, number of payload entries that follow

For each payload entry:
  [4 bytes]  Data size          — uint32, byte length of the payload data only
  [1 byte ]  Payload type       — uint8, 1 = audio, 2 = image
  [N bytes]  Payload data       — raw payload bytes
```

All multi-byte integers are **big-endian**. In Python: `struct.unpack('!B16sQIII', header_bytes)` for the main header, `struct.unpack('!IB', entry_bytes)` for each payload entry.

**Frame index:**

| Index | Meaning                                                 |
| ----- | ------------------------------------------------------- |
| `0`   | **Silence frame** — persona at rest with idle animation |
| `1`   | **Speech frame** — lip-synced animation from your audio |

**Payload types:**

| Type      | Format                | Typical size per frame                                 |
| --------- | --------------------- | ------------------------------------------------------ |
| 1 (audio) | PCM int16, 16kHz mono | **1,280 bytes** (640 samples = 40ms at 25fps)          |
| 2 (image) | JPEG-encoded image    | Variable (resolution depends on config, e.g. 1280×720) |

**Parsing example:**

```python
import struct, uuid

HEADER_FMT = '!B16sQIII'
HEADER_SIZE = struct.calcsize(HEADER_FMT)   # 37 bytes
ENTRY_FMT = '!IB'
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)     # 5 bytes

def parse_response(data):
    is_final, uuid_bytes, timestamp, usage, index, num_payloads = \
        struct.unpack(HEADER_FMT, data[:HEADER_SIZE])

    offset = HEADER_SIZE
    image = audio = None

    for _ in range(num_payloads):
        size, ptype = struct.unpack(ENTRY_FMT, data[offset:offset + ENTRY_SIZE])
        offset += ENTRY_SIZE
        payload = data[offset:offset + size]
        offset += size

        if ptype == 2:
            image = payload   # JPEG bytes
        elif ptype == 1:
            audio = payload   # PCM int16 bytes

    return {
        'is_final': bool(is_final),
        'index': index,              # 0 = silence, 1 = speech
        'image': image,
        'audio': audio,
    }
```

***

### EndInteraction vs CancelInteraction

| Message             | Purpose         | Server behavior                                                                | Use case          |
| ------------------- | --------------- | ------------------------------------------------------------------------------ | ----------------- |
| `EndInteraction`    | Graceful finish | Completes processing, sends remaining frames with last marked `is_final: true` | Session end       |
| `CancelInteraction` | Immediate stop  | Stops processing, discards remaining frames                                    | User interruption |

***

### ErrorResponse (Server → Client, JSON)

{% hint style="warning" %}
**Plain text errors:** In some error conditions (e.g., no backend servers available), the server may send a plain text message instead of a structured JSON `ErrorResponse`. Your client should handle non-JSON text messages gracefully.
{% endhint %}

**Error codes:**

| Code                  | Description                              |
| --------------------- | ---------------------------------------- |
| `AUTH_FAILED`         | Invalid API key                          |
| `UNAUTHORIZED`        | Caller lacks permission                  |
| `MISSING_CONFIG_ID`   | `config_id` query parameter not provided |
| `INVALID_MESSAGE`     | Malformed or unsupported message payload |
| `INVALID_HEADERS`     | Missing or invalid headers               |
| `MODEL_NOT_FOUND`     | Config ID not found or invalid           |
| `BACKEND_UNAVAILABLE` | No healthy inference backend available   |
| `RATE_LIMITED`        | Too many requests                        |
| `TIMEOUT`             | Operation exceeded processing time       |
| `CANCELLED`           | Interaction cancelled by client          |
| `INTERNAL_ERROR`      | Unexpected server error                  |
| `FRAME_SIZE_EXCEEDED` | Message exceeded 512KB limit             |

***

## Rate Limits & Constraints

| Constraint       | Value                                                   |
| ---------------- | ------------------------------------------------------- |
| Rate limit       | 6 requests per second                                   |
| Max message size | 512 KB per message                                      |
| Video output     | 25 fps target (generated slightly faster than realtime) |

Exceeding limits results in an `ErrorResponse` with code `RATE_LIMITED`.

***

## Best Practices

### Audio Input

* **Send one silence frame first** to start the conversation, then send speech audio when available
* Forward speech audio to the server as it arrives from your TTS service — no need to buffer or accumulate before sending (buffering only needed for playback). Send the audio in realtime to achieve a realtime streaming and smooth playback experience.

### Buffer Management

* Play frames at **25 fps** (40ms per frame)
* The server generates slightly faster than realtime — **you must drop silence frames** to prevent the buffer from growing
* **Drop strategy:** when the buffer grows beyond your target size, skip 1 out of every 2 silence frames (`index == 0`) until the buffer shrinks back
* **Never drop speech frames** (`index == 1`)
* Tune your target buffer size based on your network conditions — keep it as low as possible for minimal latency

### Audio and Video Synchronization

Each `InteractionResponse` contains both a JPEG image and a PCM audio chunk. However, the **recommended approach for audio playback** is:

1. **Buffer your TTS/source audio locally** as it arrives from your speech service (for later playback only. You do not need to buffer it to send it to Ojin)
2. **Forward TTS audio to Ojin immediately** as it arrives from your speech service
3. **Wait for the first speech frame** (`index == 1`) to arrive from Ojin
4. **Start playing your buffered TTS audio** at that moment
5. **Stop audio playback** when you receive the first silence frame (`index == 0`) after speech
6. **Render video** from every frame regardless of type

Sending is immediate, but playback is gated on the first speech frame. This ensures audio and video stay in sync.

```python
# When TTS audio arrives from your speech service:
speech_audio_buffer.extend(tts_audio_chunk)      # buffer locally for playback
await ojin.send_audio(tts_audio_chunk)            # send to Ojin for lip-sync

# In your video playback loop:
frame = buffer.popleft()

if frame.index == 1 and not audio_playing:
    start_audio_playback(speech_audio_buffer)     # begin draining the buffer
    audio_playing = True

if frame.index == 0 and audio_playing:
    stop_audio_playback()
    audio_playing = False

render_video(frame.image)                         # always render the video
```

### Error Handling

* Handle both JSON `ErrorResponse` messages and plain text error strings
* Implement exponential backoff for reconnection
* Monitor server `load` in the `SessionReady` message

### Interruption Handling

* Use `CancelInteraction` for immediate stops (e.g., user interrupts the bot)
* Use `EndInteraction` for graceful session endings
* Clear your frame buffer on interruption

***

## Complete Example

```python
import asyncio
import json
import struct
import time
from collections import deque
import numpy as np
import websockets
from dotenv import load_dotenv
import os

load_dotenv()

API_KEY = os.getenv("OJIN_API_KEY", "")
CONFIG_ID = os.getenv("OJIN_CONFIG_ID", "")
URL = f"wss://models.ojin.ai/realtime?config_id={CONFIG_ID}"

SAMPLE_RATE = 16000
FPS = 25
TARGET_BUFFER = 10  # Tune based on your network conditions

def build_audio_message(audio_bytes):
    """Build a binary InteractionInput message."""
    header = struct.pack('!BQI', 1, int(time.time() * 1000), 0)
    return header + audio_bytes

def parse_response(data):
    """Parse a binary InteractionResponse message."""
    fmt = '!B16sQIII'
    hdr_size = struct.calcsize(fmt)
    is_final, uuid_bytes, timestamp, usage, index, num_payloads = \
        struct.unpack(fmt, data[:hdr_size])

    offset = hdr_size
    image = audio = None

    for _ in range(num_payloads):
        size, ptype = struct.unpack('!IB', data[offset:offset + 5])
        offset += 5
        if ptype == 2:
            image = data[offset:offset+size]
        elif ptype == 1:
            audio = data[offset:offset+size]
        offset += size

    return {
        'is_final': bool(is_final),
        'index': index,              # 0 = silence, 1 = speech
        'image': image,
        'audio': audio,
    }

async def main():
    headers = {"Authorization": API_KEY}
    # For older websockets versions, use extra_headers instead.
    async with websockets.connect(URL, additional_headers=headers, ping_interval=30) as ws:
        # 1. Wait for SessionReady — server starts streaming frames immediately after
        msg = json.loads(await ws.recv())
        assert msg["type"] == "sessionReady"
        print(f"Session ready: {msg['payload']}")

        buffer = deque()
        skip_counter = 0
        playback_started = False
        frame_count = 0
        audio_sent = False

        # 2. Send one silence frame to start processing
        audio_data = b"\00\00" * (SAMPLE_RATE // 25)
        await ws.send(build_audio_message(audio_data))

        # 3. Receive and process frames
        async for data in ws:
            if isinstance(data, str):
                msg = json.loads(data)
                if msg.get("type") == "errorResponse":
                    print(f"Error: {msg['payload']}")
                    break
                continue

            frame = parse_response(data)
            buffer.append(frame)
            frame_count += 1

            # Wait for initial buffer before playback
            if not playback_started:
                if len(buffer) >= TARGET_BUFFER:
                    playback_started = True
                    print(f"Buffer filled ({TARGET_BUFFER} frames), starting playback")
                continue

            # Consume one frame
            if buffer:
                play_frame = buffer.popleft()

                # Drop excess silence frames: skip 1 out of 2 when buffer is too large
                if len(buffer) > TARGET_BUFFER and play_frame['index'] == 0:
                    skip_counter += 1
                    if skip_counter % 2 == 0 and len(buffer) > 0:
                        buffer.popleft()  # drop one silence frame

                kind = "silence" if play_frame['index'] == 0 else "speech"
                print(f"[{kind}] frame #{frame_count}, buffer={len(buffer)}")

                # In a real app: render play_frame['image'] and play play_frame['audio']

            # Demo: send speech audio after receiving some silence frames
            if frame_count == 50 and not audio_sent:
                t = np.linspace(0, 1.0, SAMPLE_RATE, endpoint=False)
                audio_data = (32767 * 0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
                chunk_size = SAMPLE_RATE # 1s chunks
                for i in range(0, len(audio_data), chunk_size):
                    chunk = audio_data[i:i + chunk_size]
                    await ws.send(build_audio_message(chunk.tobytes()))
                audio_sent = True
                print("Sent 1 second of speech audio")

            if frame_count > 200:
                break

asyncio.run(main())
```

***

## Troubleshooting

### Connection Issues

* ✓ Verify API key and config ID
* ✓ Check that config exists in dashboard
* ✓ Ensure network allows WebSocket connections (port 443)
* ✓ Check the `Authorization` header uses the raw API key (no `Bearer` prefix)

### No Frames Received

* ✓ Confirm you received `SessionReady` — frames start streaming immediately after
* ✓ If sending speech audio: verify format is 16kHz, int16, mono with big-endian message header
* ✓ Check message size < 512KB

### Choppy Playback

* ✓ Play at 25fps (40ms per frame)
* ✓ Buffer some frames before starting playback
* ✓ Check network latency and jitter

### Growing Latency

* ✓ You **must** drop silence frames — the server generates faster than realtime
* ✓ Skip 1 out of 2 silence frames (`index == 0`) when buffer grows beyond your target
* ✓ Never drop speech frames (`index == 1`)
* ✓ During speech bursts the buffer will grow temporarily — this is expected, trim silence frames afterward

### Frame Lag During Speech

* ✓ Reduce `speech_filter_amount` parameter (lower = more responsive, less smooth)

***


# ojin/flashhead-lite-1.0

> A lifelike persona model that transforms reference videos into natural animated personas

## Overview

The ojin/flashhead-lite-1.0 model is our flagship persona generation technology that creates realistic, expressive digital humans from a reference video. It excels at producing natural facial animations, lip-syncing, and emotional expressions that bring your persona to life with synchronized speech.

## Key Features

* **Full persona look control** - Generate a persona based on any video reference, the persona will behave exactly the same
* **No training required** - You don't need to wait for your persona to be ready, as soon as the reference video is uploaded, you can start using it
* **Natural Lip-Syncing** - Precise lip movements synchronized with speech audio
* **Emotional Expressions** - Support for multiple emotional states and expressions
* **Low Latency** - Fast processing for real-time applications
* **High Resolution** - Support for up to 720p output resolution

## Quick Start

Getting started with ojin/flashhead-lite-1.0 is simple:

1. [**Create an API key**](https://docs.ojin.ai/getting-started/authentication) - Set up authentication for the Ojin platform
2. [**Use a persona template**](https://docs.ojin.ai/models/flashhead-lite/using-persona-template) - Use a persona template to generate your persona in seconds
3. [**Integrate with your application**](https://docs.ojin.ai/models/flashhead-lite/integrations) - Use either Pipecat or WebSocket API

## Use Cases

* **Virtual Assistants** - Create responsive customer service personas
* **Educational Content** - Develop engaging tutors and instructors
* **Entertainment** - Produce animated characters for games and media
* **Presentations** - Transform static slides into dynamic video presentations
* **Healthcare** - Build empathetic virtual health assistants


# Get started

This guide explains how to integrate the `ojin/flashhead-lite-1.0` persona model into your applications using either Pipecat or WebSockets

## Prerequisites

1. An Ojin account with an active API key, if you don't have one [get your API key](https://docs.ojin.ai/getting-started/authentication)
2. [Create a Persona](https://docs.ojin.ai/models/flashhead-lite/creating-persona) or use a [Persona Template](https://docs.ojin.ai/models/flashhead-lite/using-persona-template)
3. Save the Persona Configuration ID from the dashboard
4. Integrate with your application using either [Pipecat](#pipecat-integration) or [WebSockets](#websocket-integration)

{% hint style="info" %}
**Staging deployments:** For secure, low-latency video applications, connect to the real-time WebSocket API from a backend server rather than a front-end client (to keep your API key secure and leverage a network transport appropriate for real-time video media delivery under varying network conditions). Typically, WebRTC is used to deliver the final media stream to end users for smooth, reliable, low-latency playback.
{% endhint %}

{% tabs %}
{% tab title="Pipecat" %}

#### Pipecat Integration

[Pipecat](https://github.com/pipecat-ai/pipecat) is a powerful open source framework for building conversational AI pipelines. The `ojin/flashhead-lite-1.0` model integrates seamlessly with Pipecat through our dedicated `OjinVideoService`.

**Option 1: Clone pipecat repository and checkout the ready to use** [**ojin-chatbot example**](https://github.com/journee-live/pipecat-ojin/tree/main/examples/ojin-chatbot)

To start using it, create a python virtual environment on it and install requirements

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Create a `.env` file and add your Ojin API key and persona ID

```bash
OJIN_API_KEY="your_api_key_here"
OJIN_CONFIG_ID="your_persona_id_here"
```

Then just run the [mock\_bot.py](https://github.com/journee-live/pipecat-ojin/blob/main/examples/ojin-chatbot/mock/mock_bot.py) to check that your ojin setup is correct and see a generation out of a wav file:

```bash
python mock/mock_bot.py
```

Alternatively, you can configure all required environment variables for the services used in this example (such as Hume) by referring to env.example. Once configured, you can interact with a conversational, human-like bot using your local audio input/output

```bash
python bot.py
```

**How It Works**

1. The microphone listens for speech input
2. Voice Activity Detection identifies speech segments
3. User audio is sent to Hume to get an LLM response using their Speech-To-Speech service.
4. The OjinVideoService animates your persona based on the STS audio.
5. Video frames are received and displayed in real-time together with the audio.

{% hint style="info" %}
You can customize the pipeline by adding or removing components, or by adjusting their parameters to suit your needs.
{% endhint %}
{% endtab %}

{% tab title="WebSocket" %}

#### WebSocket Integration

For WebSocket integration check our [API Reference →](https://docs.ojin.ai/models/flashhead-lite/api)
{% endtab %}
{% endtabs %}

## Next Steps

#### API Reference

Dive deeper into the model API for custom integrations

[API Reference →](https://docs.ojin.ai/models/flashhead-lite/api)


# Using a Persona Template

Learn how to create a persona out of a template to get started ASAP with your application.

## Prerequisites

* An Ojin account with an active API key

## Creating a Persona through the Dashboard

The simplest way to create a persona is through the Ojin Dashboard:

1. Log in to the [Ojin Dashboard](https://ojin.ai)
2. Navigate to the [**Flashhead Lite 1.0**](https://ojin.ai/models/ojin/flashhead-lite-1.0) section
3. Navigate to [**Configs**](https://ojin.ai/models/ojin/flashhead-lite-1.0/configs) sub-section
4. Select a persona template and press **Copy Template**
5. Open the newly created model configuration and save **Model Config ID** parameter which will be used by your application
6. You can now integrate it through the [**model API endpoints**](https://ojin.ai/models/ojin/flashhead-lite-1.0/docs)

## Next Steps

Once your persona is ready, you can:

#### Integration Guide

Learn how to integrate your persona using Pipecat or WebSocket

[View Integration Guide →](https://docs.ojin.ai/models/flashhead-lite/integrations)

#### API Reference

Explore the complete API documentation

[View API Reference →](https://docs.ojin.ai/models/flashhead-lite/api)


# Creating a custom Persona

Before you can start integrating with the ojin/flashhead-lite-1.0 model, you'll need to create a persona configuration. This guide walks you through the process of creating a persona that looks exactly how you want.

## Prerequisites

* An Ojin account with an active API key
* A high-quality reference video of the persona you want to animate (check [Reference Video best practices](#reference-video-best-practices) for more details)

{% hint style="info" %}
For best results, follow reference video best practices next
{% endhint %}

## Creating a Persona configuration through the Dashboard

The simplest way to create a persona is through the Ojin Dashboard:

1. Log in to the [Ojin Dashboard](https://ojin.ai)
2. Select the [**ojin/flashhead-lite-1.0**](https://ojin.ai/models/ojin/flashhead-lite) model
3. Navigate to [**Configs**](https://ojin.ai/models/ojin/flashhead-lite/configs) sub-section
4. Press **New Configuration** button to create a new configuration
5. Fill in required fields and upload a reference video. Make sure to follow the instructions below on how the video should look.
6. Click **Create Configuration**
7. Open your newly created configuration and copy **Model Config ID** parameter which will be used by your application

### Reference Video best practices

* **Video content**:
  * The video will be used as the base for your persona's idle loop and speech. Make sure the movements of the person are balanced — not too expressive, not too still — so that it fits both states
  * Mouth should be closed during the entire video, but smiles or gestures are not a problem
  * The eyes should be looking directly into the camera.
  * A minimum video length of 15 seconds is recommended for best results
  * The length of the video should not be more than 30 seconds.
* **Resolution**: Use 1080p resolution for the reference video and at least 25 fps
* **Lighting**: Ensure even lighting with no harsh shadows
* **Face Position**: The face should be clearly visible and centered
* **Background**: Simple backgrounds work best
* **Accessories**: Avoid sunglasses or items that obscure facial features

## Next Steps

Once your persona is ready, you can:

#### Integration Guide

Learn how to integrate your persona using Pipecat or WebSocket

[View Integration Guide →](https://docs.ojin.ai/models/flashhead-lite/integrations)

#### API Reference

Explore the complete API documentation

[View API Reference →](https://docs.ojin.ai/models/flashhead-lite/api)


# API Reference

## Overview

Real-time talking head synthesis API. Send speech audio, receive synchronized video and audio frames.

After connecting and receiving `SessionReady`, you should send one initial audio message with silence (one frame worth of silent audio) to start the interaction on the server. The server then immediately begins streaming video and audio frames at 25fps. When no speech audio has been sent, the server generates **silence frames** (persona at rest with idle animation). When you send speech audio, the server generates **speech frames** with lip-synced animation synchronized to your audio.

After the initial silence frame, you only need to send speech audio — no additional silence, padding, or keep-alive messages are required.

{% hint style="info" %}
**Staging deployments:** For secure, low-latency video applications, connect to the real-time WebSocket API from a backend server rather than a front-end client (to keep your API key secure and leverage a network transport appropriate for real-time video media delivery under varying network conditions). Typically, WebRTC is used to deliver the final media stream to end users for smooth, reliable, low-latency playback.
{% endhint %}

***

## How It Works

1. **Connect** to the WebSocket endpoint with your API key and config ID
2. **Receive `SessionReady`** — the server has allocated inference resources for your session
3. **Send** initial audio message with silence for one frame.
4. **The server starts streaming frames immediately** — silence frames with idle animation
5. **Send speech audio** whenever it becomes available (e.g., TTS output from your language model) — also buffer it locally for playback
6. **Receive speech frames** — the server transitions to lip-synced animation and returns to silence frames when audio runs out
7. **Render video frames** at 25fps, dropping excess silence frames to manage buffer size
8. **Start playing your buffered TTS audio** when the first speech frame (`index == 1`) arrives from Ojin — stop when speech ends

### Frame Types

Every frame arrives as a binary `InteractionResponse` containing both a JPEG image and a PCM audio chunk. Frames are always delivered in order. The `index` field identifies the frame type:

| Index | Frame type  | Description                                                                                 |
| ----- | ----------- | ------------------------------------------------------------------------------------------- |
| `0`   | **Silence** | Persona at rest with idle animation. Generated automatically when no speech audio is queued |
| `1`   | **Speech**  | Lip-synced animation generated from your audio input                                        |

### Faster-than-Realtime Generation

The server generates frames slightly **faster than realtime** to build a client-side buffer that prevents stuttering during speech. During speech bursts, generation is even faster. This means your frame buffer will grow over time if you don't manage it.

**You must drop silence frames to prevent unbounded buffer growth.** When your buffer starts growing beyond what you need for smooth playback, skip 1 out of every 2 silence frames (`index == 0`) until the buffer shrinks back down. Never drop speech frames (`index == 1`).

The right buffer target depends on your network conditions and latency requirements — start by observing your buffer size during playback and tuning from there. Keep it as low as possible to minimize latency, but high enough to absorb network jitter without starving playback.

```python
# When consuming frames from the buffer:
frame = buffer.popleft()

# If buffer is growing and this is a silence frame, skip every other one
if len(buffer) > target_buffer_size and frame.index == 0:
    skip_counter += 1
    if skip_counter % 2 == 0 and buffer:
        buffer.popleft()  # drop one silence frame
```

***

## Connection Flow

{% @mermaid/diagram content="sequenceDiagram
participant Client
participant Server

```
Note over Client,Server: Connection
Client->>Server: WebSocket Connect
Server->>Client: SessionReady (JSON)
Client->>Server: InteractionInput (silence audio for one frame)

Note over Client,Server: Server Streams Immediately
Server->>Client: Frame (silence, index=0)
Server->>Client: Frame (silence, index=0)
Server->>Client: Frame (silence, index=0)

Note over Client,Server: Client Sends Speech Audio
Client->>Server: InteractionInput (TTS audio chunk 1)
Client->>Server: InteractionInput (TTS audio chunk 2)

Note over Client,Server: Server Transitions to Speech
Server->>Client: Frame (speech, index=1)
Server->>Client: Frame (speech, index=1)
Server->>Client: Frame (speech, index=1)
Note right of Server: Burst: faster than realtime

Note over Client,Server: Audio Runs Out → Back to Silence
Server->>Client: Frame (silence, index=0)
Server->>Client: Frame (silence, index=0)
Note right of Client: Client drops excess silence frames" %}
```

***

## WebSocket Handshake

## Open WebSocket connection

> Connect to the WebSocket endpoint providing an API key in the \`Authorization\` header and a \`config\_id\` query parameter. The server upgrades the connection to WebSocket and immediately begins streaming frames after sending \`SessionReady\`.\
> \
> \*\*Recommended WebSocket settings:\*\*\
> \- \`ping\_interval\`: 30 seconds\
> \- \`ping\_timeout\`: 10 seconds

```json
{"openapi":"3.0.3","info":{"title":"Ojin Flashhead Lite 1.0 Realtime API","version":"1.0.0"},"servers":[{"url":"wss://models.ojin.foo/realtime","description":"Staging WebSocket endpoint"}],"security":[{"ApiKeyAuth":[]}],"components":{"securitySchemes":{"ApiKeyAuth":{"type":"apiKey","in":"header","name":"Authorization","description":"Raw API key (no `Bearer` prefix)."}},"schemas":{"SessionReadyMessage":{"type":"object","description":"Sent once by the server after the WebSocket connection is established and inference resources are allocated. **The server begins streaming frames immediately after this message.**\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["sessionReady"]},"payload":{"type":"object","required":["trace_id","status","load"],"properties":{"trace_id":{"type":"string","format":"uuid","description":"Unique session identifier assigned by the server."},"status":{"type":"string","enum":["success"],"description":"Always `success`."},"load":{"type":"number","format":"float","minimum":0,"maximum":1,"description":"Current load of the inference server (0.0–1.0)."},"timestamp":{"type":"integer","format":"int64","description":"Server timestamp in milliseconds since Unix epoch."},"parameters":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional model-specific session parameters returned by the server."}}}}},"ErrorResponseMessage":{"type":"object","description":"Sent by the server when an error occurs.\n\n**Format:** JSON text frame.\n\n> **Note:** In some error conditions (e.g., no backend servers available), the server may send a plain text message instead of a structured JSON `ErrorResponse`. Your client should handle non-JSON text messages gracefully.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["errorResponse"]},"payload":{"type":"object","required":["code","message","timestamp"],"properties":{"code":{"type":"string","description":"Machine-readable error code.","enum":["AUTH_FAILED","UNAUTHORIZED","MISSING_CONFIG_ID","INVALID_MESSAGE","INVALID_HEADERS","MODEL_NOT_FOUND","BACKEND_UNAVAILABLE","RATE_LIMITED","TIMEOUT","CANCELLED","INTERNAL_ERROR","FRAME_SIZE_EXCEEDED"]},"message":{"type":"string","description":"Human-readable description of the error."},"interaction_id":{"type":"string","nullable":true,"description":"The interaction ID related to the error, if applicable."},"details":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional additional structured details about the error."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the error was sent."}}}}}}},"paths":{"/":{"get":{"summary":"Open WebSocket connection","description":"Connect to the WebSocket endpoint providing an API key in the `Authorization` header and a `config_id` query parameter. The server upgrades the connection to WebSocket and immediately begins streaming frames after sending `SessionReady`.\n\n**Recommended WebSocket settings:**\n- `ping_interval`: 30 seconds\n- `ping_timeout`: 10 seconds","operationId":"wsHandshake","parameters":[{"in":"query","name":"config_id","required":true,"schema":{"type":"string"},"description":"Configuration ID for the persona, created in the Flashhead Lite 1.0 tab of the dashboard."},{"in":"header","name":"Authorization","required":true,"schema":{"type":"string"},"description":"Your raw API key. No `Bearer` prefix."}],"responses":{"101":{"description":"WebSocket upgrade successful. After the upgrade, the server sends a `SessionReady` JSON message and begins streaming binary `InteractionResponse` frames immediately.","content":{"application/json":{"schema":{"$ref":"#/components/schemas/SessionReadyMessage"}}}},"401":{"description":"Unauthorized — invalid or missing API key.","content":{"application/json":{"schema":{"$ref":"#/components/schemas/ErrorResponseMessage"}}}}}}}}}
```

***

## Message Format

{% hint style="info" %}
**Mixed message types:** Both JSON (text) and binary messages are exchanged on the same WebSocket connection. Your client must check the WebSocket frame type to distinguish them:

* **Text frames (JSON):** `SessionReady`, `ErrorResponse` (server → client), `EndInteraction`, `CancelInteraction` (client → server)
* **Binary frames:** `InteractionResponse` (server → client), `InteractionInput` (client → server)
  {% endhint %}

{% hint style="info" %}
**Byte order:** All multi-byte integer fields in binary messages use **network byte order (big-endian)**.
{% endhint %}

***

## Messages Reference

### Server → Client Messages

## The SessionReadyMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Flashhead Lite 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"SessionReadyMessage":{"type":"object","description":"Sent once by the server after the WebSocket connection is established and inference resources are allocated. **The server begins streaming frames immediately after this message.**\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["sessionReady"]},"payload":{"type":"object","required":["trace_id","status","load"],"properties":{"trace_id":{"type":"string","format":"uuid","description":"Unique session identifier assigned by the server."},"status":{"type":"string","enum":["success"],"description":"Always `success`."},"load":{"type":"number","format":"float","minimum":0,"maximum":1,"description":"Current load of the inference server (0.0–1.0)."},"timestamp":{"type":"integer","format":"int64","description":"Server timestamp in milliseconds since Unix epoch."},"parameters":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional model-specific session parameters returned by the server."}}}}}}}}
```

## The InteractionResponseMessage object

````json
{"openapi":"3.0.3","info":{"title":"Ojin Flashhead Lite 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"InteractionResponseMessage":{"type":"object","description":"Binary message containing a video frame and synchronized audio chunk. The server streams these continuously after `SessionReady` — silence frames when idle, speech frames when processing your audio.\n\n**Format:** Binary frame.\n\n**Binary structure (big-endian):**\n```\n[1 byte  ]  Is final flag   — uint8, 1 = last frame, 0 = more coming\n[16 bytes]  Interaction ID  — UUID bytes\n[8 bytes ]  Timestamp       — uint64, milliseconds since Unix epoch\n[4 bytes ]  Usage           — uint32, usage metric\n[4 bytes ]  Frame index     — uint32, 0 = silence, 1 = speech\n[4 bytes ]  Num payloads    — uint32, number of payload entries\n\nFor each payload entry:\n  [4 bytes]  Data size       — uint32, byte length of payload data\n  [1 byte ]  Payload type    — uint8, 1 = audio, 2 = image\n  [N bytes]  Payload data    — raw payload bytes\n```\n\nPython unpack: `struct.unpack('!B16sQIII', header)` for the main header, `struct.unpack('!IB', entry)` for each payload entry.","required":["is_final","interaction_id","timestamp","usage","index","payloads"],"properties":{"is_final":{"type":"boolean","description":"`true` if this is the last frame for the current interaction. `false` if more frames are coming."},"interaction_id":{"type":"string","format":"uuid","description":"UUID identifying this response. Use to correlate frames across a single interaction."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the frame was sent."},"usage":{"type":"integer","format":"int32","description":"Usage metric for this response."},"index":{"type":"integer","format":"int32","enum":[0,1],"description":"Frame type. `0` = silence frame (idle animation, no speech input). `1` = speech frame (lip-synced to your audio). **Drop silence frames (`0`) to manage buffer size. Never drop speech frames (`1`).**"},"payloads":{"type":"array","description":"List of payload entries in this frame. Each frame typically contains one audio entry and one image entry.","items":{"type":"object","required":["payload_type","data_size","data"],"properties":{"payload_type":{"type":"integer","enum":[1,2],"description":"`1` = audio (PCM int16, 16kHz mono, 1,280 bytes = 640 samples = 40ms). `2` = image (JPEG-encoded, resolution depends on config e.g. 1280×720)."},"data_size":{"type":"integer","format":"int32","description":"Byte length of the payload data."},"data":{"type":"string","format":"binary","description":"Raw payload bytes. For audio: PCM int16 bytes. For image: JPEG bytes."}}}}}}}}}
````

## The ErrorResponseMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Flashhead Lite 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"ErrorResponseMessage":{"type":"object","description":"Sent by the server when an error occurs.\n\n**Format:** JSON text frame.\n\n> **Note:** In some error conditions (e.g., no backend servers available), the server may send a plain text message instead of a structured JSON `ErrorResponse`. Your client should handle non-JSON text messages gracefully.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["errorResponse"]},"payload":{"type":"object","required":["code","message","timestamp"],"properties":{"code":{"type":"string","description":"Machine-readable error code.","enum":["AUTH_FAILED","UNAUTHORIZED","MISSING_CONFIG_ID","INVALID_MESSAGE","INVALID_HEADERS","MODEL_NOT_FOUND","BACKEND_UNAVAILABLE","RATE_LIMITED","TIMEOUT","CANCELLED","INTERNAL_ERROR","FRAME_SIZE_EXCEEDED"]},"message":{"type":"string","description":"Human-readable description of the error."},"interaction_id":{"type":"string","nullable":true,"description":"The interaction ID related to the error, if applicable."},"details":{"type":"object","additionalProperties":true,"nullable":true,"description":"Optional additional structured details about the error."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the error was sent."}}}}}}}}
```

### Client → Server Messages

## The InteractionInputMessage object

````json
{"openapi":"3.0.3","info":{"title":"Ojin Flashhead Lite 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"InteractionInputMessage":{"type":"object","description":"Binary message for sending speech audio to the server. **Only send speech audio** — do not send silence or padding. The server generates silence frames automatically.\n\n**Format:** Binary frame.\n\n**Binary structure (big-endian):**\n```\n[1 byte ]  Payload type   — uint8, always 1 for audio\n[8 bytes]  Timestamp      — uint64, milliseconds since Unix epoch\n[4 bytes]  Params size    — uint32, byte length of JSON params (0 if none)\n[N bytes]  Params JSON    — UTF-8 JSON (only present if params size > 0)\n[M bytes]  Audio payload  — raw PCM int16 speech audio\n```\n\nPython pack: `struct.pack('!BQI', payload_type, timestamp, params_size)`","required":["payload_type","timestamp","params_size","audio_payload"],"properties":{"payload_type":{"type":"integer","enum":[1],"description":"Always `1` for audio."},"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the message was sent."},"params_size":{"type":"integer","format":"int32","minimum":0,"description":"Byte length of the JSON params block. `0` if no params."},"params":{"type":"object","nullable":true,"description":"Optional per-chunk parameters. Overrides session defaults for this audio chunk.","properties":{"speech_filter_amount":{"type":"number","format":"float","default":5,"description":"Smoothing for speech animation. Higher = smoother, less responsive."},"idle_filter_amount":{"type":"number","format":"float","default":1000,"description":"Smoothing for idle animation."},"idle_mouth_opening_scale":{"type":"number","format":"float","default":0,"description":"Mouth movement scale during idle. `0.0` = closed."},"speech_mouth_opening_scale":{"type":"number","format":"float","default":1,"description":"Mouth movement scale during speech. `1.0` = full movement."},"client_frame_index":{"type":"integer","format":"int32","default":0,"description":"Frame index the client is currently displaying. Helps the server manage silence-to-speech transitions smoothly."}}},"audio_payload":{"type":"string","format":"binary","description":"Raw PCM int16 speech audio. Requirements: 16,000 Hz sample rate, mono (1 channel), little-endian int16 samples. Entire message must be under 512 KB. Recommended chunk size: 400ms = 6,400 samples = 12,800 bytes."}}}}}}
````

## The EndInteractionMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Flashhead Lite 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"EndInteractionMessage":{"type":"object","description":"Signal graceful end of the session. The server finishes processing all queued audio and sends remaining frames, with the last frame marked `is_final: true`.\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["endInteraction"]},"payload":{"type":"object","required":["timestamp"],"properties":{"timestamp":{"type":"integer","format":"int64","description":"Milliseconds since Unix epoch when the message was sent."}}}}}}}}
```

## The CancelInteractionMessage object

```json
{"openapi":"3.0.3","info":{"title":"Ojin Flashhead Lite 1.0 Realtime API","version":"1.0.0"},"components":{"schemas":{"CancelInteractionMessage":{"type":"object","description":"Immediately stop processing and discard all remaining frames. No final frame is sent. Use for interruptions (e.g., user starts speaking while the persona is talking).\n\n**Format:** JSON text frame.","required":["type","payload"],"properties":{"type":{"type":"string","enum":["cancelInteraction"]},"payload":{"type":"object","properties":{"timestamp":{"type":"integer","format":"int64","nullable":true,"description":"Optional. Milliseconds since Unix epoch when the message was sent."}}}}}}}}
```

***

## Message Details

### InteractionInput (Client → Server, Binary)

Binary message for sending speech audio to the server. **Only send speech audio** — do not send silence or padding.

**Binary structure:**

```
[1 byte ]  Payload type      — uint8, always 1 for audio
[8 bytes]  Timestamp          — uint64, milliseconds since Unix epoch
[4 bytes]  Params size        — uint32, byte length of the JSON params block (0 if no params)
[N bytes]  Params JSON        — UTF-8 encoded JSON (only present if params size > 0)
[M bytes]  Audio payload      — raw PCM int16 speech audio data
```

**Header fields** use **big-endian** byte order. The PCM audio samples in the payload use **little-endian** (standard for PCM int16). In Python: `struct.pack('!BQI', payload_type, timestamp, params_size)`.

**Audio requirements:**

| Property         | Value                                              |
| ---------------- | -------------------------------------------------- |
| Format           | PCM signed 16-bit integers (little-endian samples) |
| Sample rate      | 16,000 Hz                                          |
| Channels         | 1 (mono)                                           |
| Max message size | 512 KB (entire binary message including header)    |

**Recommended streaming pattern:**

Forward speech audio to the server as it arrives from your TTS service — no need to buffer or accumulate before sending (buffering only needed for playback). Send the audio in realtime to achieve a realtime streaming and smooth playback experience.

```python
import struct, json, time

def build_audio_message(audio_bytes):
    header = struct.pack('!BQI',
        1,                         # payload type: audio
        int(time.time() * 1000),   # timestamp ms
        0,                         # params size (unused for flashhead)
    )
    return header + audio_bytes
```

***

### InteractionResponse (Server → Client, Binary)

Binary message containing a video frame and synchronized audio. The server streams these continuously after `SessionReady`. **Frames always arrive in order.**

**Binary structure:**

```
[1 byte  ]  Is final flag     — uint8, 1 = last frame for this interaction, 0 = more coming
[16 bytes]  Interaction ID     — UUID bytes (big-endian)
[8 bytes ]  Timestamp          — uint64, milliseconds since Unix epoch
[4 bytes ]  Usage              — uint32, usage metric for this response
[4 bytes ]  Frame index        — uint32, 0 = silence, 1 = speech
[4 bytes ]  Num payloads       — uint32, number of payload entries that follow

For each payload entry:
  [4 bytes]  Data size          — uint32, byte length of the payload data only
  [1 byte ]  Payload type       — uint8, 1 = audio, 2 = image
  [N bytes]  Payload data       — raw payload bytes
```

All multi-byte integers are **big-endian**. In Python: `struct.unpack('!B16sQIII', header_bytes)` for the main header, `struct.unpack('!IB', entry_bytes)` for each payload entry.

**Frame index:**

| Index | Meaning                                                 |
| ----- | ------------------------------------------------------- |
| `0`   | **Silence frame** — persona at rest with idle animation |
| `1`   | **Speech frame** — lip-synced animation from your audio |

**Payload types:**

| Type      | Format                | Typical size per frame                                 |
| --------- | --------------------- | ------------------------------------------------------ |
| 1 (audio) | PCM int16, 16kHz mono | **1,280 bytes** (640 samples = 40ms at 25fps)          |
| 2 (image) | JPEG-encoded image    | Variable (resolution depends on config, e.g. 1280×720) |

**Parsing example:**

```python
import struct, uuid

HEADER_FMT = '!B16sQIII'
HEADER_SIZE = struct.calcsize(HEADER_FMT)   # 37 bytes
ENTRY_FMT = '!IB'
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)     # 5 bytes

def parse_response(data):
    is_final, uuid_bytes, timestamp, usage, index, num_payloads = \
        struct.unpack(HEADER_FMT, data[:HEADER_SIZE])

    offset = HEADER_SIZE
    image = audio = None

    for _ in range(num_payloads):
        size, ptype = struct.unpack(ENTRY_FMT, data[offset:offset + ENTRY_SIZE])
        offset += ENTRY_SIZE
        payload = data[offset:offset + size]
        offset += size

        if ptype == 2:
            image = payload   # JPEG bytes
        elif ptype == 1:
            audio = payload   # PCM int16 bytes

    return {
        'is_final': bool(is_final),
        'index': index,              # 0 = silence, 1 = speech
        'image': image,
        'audio': audio,
    }
```

***

### EndInteraction vs CancelInteraction

| Message             | Purpose         | Server behavior                                                                | Use case          |
| ------------------- | --------------- | ------------------------------------------------------------------------------ | ----------------- |
| `EndInteraction`    | Graceful finish | Completes processing, sends remaining frames with last marked `is_final: true` | Session end       |
| `CancelInteraction` | Immediate stop  | Stops processing, discards remaining frames                                    | User interruption |

***

### ErrorResponse (Server → Client, JSON)

{% hint style="warning" %}
**Plain text errors:** In some error conditions (e.g., no backend servers available), the server may send a plain text message instead of a structured JSON `ErrorResponse`. Your client should handle non-JSON text messages gracefully.
{% endhint %}

**Error codes:**

| Code                  | Description                              |
| --------------------- | ---------------------------------------- |
| `AUTH_FAILED`         | Invalid API key                          |
| `UNAUTHORIZED`        | Caller lacks permission                  |
| `MISSING_CONFIG_ID`   | `config_id` query parameter not provided |
| `INVALID_MESSAGE`     | Malformed or unsupported message payload |
| `INVALID_HEADERS`     | Missing or invalid headers               |
| `MODEL_NOT_FOUND`     | Config ID not found or invalid           |
| `BACKEND_UNAVAILABLE` | No healthy inference backend available   |
| `RATE_LIMITED`        | Too many requests                        |
| `TIMEOUT`             | Operation exceeded processing time       |
| `CANCELLED`           | Interaction cancelled by client          |
| `INTERNAL_ERROR`      | Unexpected server error                  |
| `FRAME_SIZE_EXCEEDED` | Message exceeded 512KB limit             |

***

## Rate Limits & Constraints

| Constraint       | Value                                                   |
| ---------------- | ------------------------------------------------------- |
| Rate limit       | 6 requests per second                                   |
| Max message size | 512 KB per message                                      |
| Video output     | 25 fps target (generated slightly faster than realtime) |

Exceeding limits results in an `ErrorResponse` with code `RATE_LIMITED`.

***

## Best Practices

### Audio Input

* **Send one silence frame first** to start the conversation, then send speech audio when available
* Forward speech audio to the server as it arrives from your TTS service — no need to buffer or accumulate before sending(buffering only needed for playback). Send the audio in realtime to achieve a realtime streaming and smooth playback experience.

### Buffer Management

* Play frames at **25 fps** (40ms per frame)
* The server generates slightly faster than realtime — **you must drop silence frames** to prevent the buffer from growing
* **Drop strategy:** when the buffer grows beyond your target size, skip 1 out of every 2 silence frames (`index == 0`) until the buffer shrinks back
* **Never drop speech frames** (`index == 1`)
* Tune your target buffer size based on your network conditions — keep it as low as possible for minimal latency

### Audio and Video Synchronization

Each `InteractionResponse` contains both a JPEG image and a PCM audio chunk. However, the **recommended approach for audio playback** is:

1. **Buffer your TTS/source audio locally** as it arrives from your speech service (for later playback only. You do not need to buffer it to send it to Ojin)
2. **Forward TTS audio to Ojin immediately** as it arrives from your speech service
3. **Wait for the first speech frame** (`index == 1`) to arrive from Ojin
4. **Start playing your buffered TTS audio** at that moment
5. **Stop audio playback** when you receive the first silence frame (`index == 0`) after speech
6. **Render video** from every frame regardless of type

Sending is immediate, but playback is gated on the first speech frame. This ensures audio and video stay in sync.

```python
# When TTS audio arrives from your speech service:
speech_audio_buffer.extend(tts_audio_chunk)      # buffer locally for playback
await ojin.send_audio(tts_audio_chunk)            # send to Ojin for lip-sync

# In your video playback loop:
frame = buffer.popleft()

if frame.index == 1 and not audio_playing:
    start_audio_playback(speech_audio_buffer)     # begin draining the buffer
    audio_playing = True

if frame.index == 0 and audio_playing:
    stop_audio_playback()
    audio_playing = False

render_video(frame.image)                         # always render the video
```

### Error Handling

* Handle both JSON `ErrorResponse` messages and plain text error strings
* Implement exponential backoff for reconnection
* Monitor server `load` in the `SessionReady` message

### Interruption Handling

* Use `CancelInteraction` for immediate stops (e.g., user interrupts the bot)
* Use `EndInteraction` for graceful session endings
* Clear your frame buffer on interruption

***

## Complete Example

```python
import asyncio
import json
import struct
import time
from collections import deque
import numpy as np
import websockets
from dotenv import load_dotenv
import os

load_dotenv()

API_KEY = os.getenv("OJIN_API_KEY", "")
CONFIG_ID = os.getenv("OJIN_CONFIG_ID", "")
URL = f"wss://models.ojin.ai/realtime?config_id={CONFIG_ID}"

SAMPLE_RATE = 16000
FPS = 25
TARGET_BUFFER = 10  # Tune based on your network conditions

def build_audio_message(audio_bytes):
    """Build a binary InteractionInput message."""
    header = struct.pack('!BQI', 1, int(time.time() * 1000), 0)
    return header + audio_bytes

def parse_response(data):
    """Parse a binary InteractionResponse message."""
    fmt = '!B16sQIII'
    hdr_size = struct.calcsize(fmt)
    is_final, uid_bytes, ts, usage, index, n_payloads = struct.unpack(fmt, data[:hdr_size])

    offset = hdr_size
    image = audio = None
    for _ in range(n_payloads):
        size, ptype = struct.unpack('!IB', data[offset:offset+5])
        offset += 5
        if ptype == 2:
            image = data[offset:offset+size]
        elif ptype == 1:
            audio = data[offset:offset+size]
        offset += size

    return {
        'is_final': bool(is_final),
        'index': index,              # 0 = silence, 1 = speech
        'image': image,
        'audio': audio,
    }

async def main():
    headers = {"Authorization": API_KEY}
    # For older websockets versions, use extra_headers instead.
    async with websockets.connect(URL, additional_headers=headers, ping_interval=30) as ws:
        # 1. Wait for SessionReady — server starts streaming frames immediately after
        msg = json.loads(await ws.recv())
        assert msg["type"] == "sessionReady"
        print(f"Session ready: {msg['payload']}")

        buffer = deque()
        skip_counter = 0
        playback_started = False
        frame_count = 0
        audio_sent = False

        # 2. Send one silence frame to start processing
        audio_data = b"\00\00" * (SAMPLE_RATE // 25)
        await ws.send(build_audio_message(audio_data))

        # 3. Receive and process frames
        async for data in ws:
            if isinstance(data, str):
                msg = json.loads(data)
                if msg.get("type") == "errorResponse":
                    print(f"Error: {msg['payload']}")
                    break
                continue

            frame = parse_response(data)
            buffer.append(frame)
            frame_count += 1

            # Wait for initial buffer before playback
            if not playback_started:
                if len(buffer) >= TARGET_BUFFER:
                    playback_started = True
                    print(f"Buffer filled ({TARGET_BUFFER} frames), starting playback")
                continue

            # Consume one frame
            if buffer:
                play_frame = buffer.popleft()

                # Drop excess silence frames: skip 1 out of 2 when buffer is too large
                if len(buffer) > TARGET_BUFFER and play_frame['index'] == 0:
                    skip_counter += 1
                    if skip_counter % 2 == 0 and len(buffer) > 0:
                        buffer.popleft()  # drop one silence frame

                kind = "silence" if play_frame['index'] == 0 else "speech"
                print(f"[{kind}] frame #{frame_count}, buffer={len(buffer)}")

                # In a real app: render play_frame['image'] and play play_frame['audio']

            # Demo: send speech audio after receiving some silence frames
            if frame_count == 50 and not audio_sent:
                t = np.linspace(0, 1.0, SAMPLE_RATE, endpoint=False)
                audio_data = (32767 * 0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
                chunk_size = SAMPLE_RATE # 1s chunks
                for i in range(0, len(audio_data), chunk_size):
                    chunk = audio_data[i:i + chunk_size]
                    await ws.send(build_audio_message(chunk.tobytes()))
                audio_sent = True
                print("Sent 1 second of speech audio")

            if frame_count > 200:
                break

asyncio.run(main())
```

***

## Troubleshooting

### Connection Issues

* ✓ Verify API key and config ID
* ✓ Check that config exists in dashboard
* ✓ Ensure network allows WebSocket connections (port 443)
* ✓ Check the `Authorization` header uses the raw API key (no `Bearer` prefix)

### No Frames Received

* ✓ Confirm you received `SessionReady` — frames start streaming immediately after
* ✓ If sending speech audio: verify format is 16kHz, int16, mono with big-endian message header
* ✓ Check message size < 512KB

### Choppy Playback

* ✓ Play at 25fps (40ms per frame)
* ✓ Buffer some frames before starting playback
* ✓ Check network latency and jitter

### Growing Latency

* ✓ You **must** drop silence frames — the server generates faster than realtime
* ✓ Skip 1 out of 2 silence frames (`index == 0`) when buffer grows beyond your target
* ✓ Never drop speech frames (`index == 1`)
* ✓ During speech bursts the buffer will grow temporarily — this is expected, trim silence frames afterward

### Frame Lag During Speech

* ✓ Reduce `speech_filter_amount` parameter (lower = more responsive, less smooth)

***

## Example Implementation

A complete working Python example integrating Ojin Flashhead with a speech-to-speech service (Hume EVI) is available here:

[**github.com/journee-live/speech-to-video-samples/tree/main/samples/hume-sts-flashhead**](https://github.com/journee-live/speech-to-video-samples/tree/main/samples/hume-sts-flashhead)

The repository demonstrates the full integration pattern: microphone capture → STS service → TTS audio → Ojin lip-sync → synchronized video and audio playback at 25fps. It includes the buffer management and frame handling approach described in [Best Practices](#best-practices) above.

***


# Troubleshooting

## Common Issues

### Connection Issues

* ✓ Verify API key and config ID
* ✓ Check that config exists in dashboard
* ✓ Ensure network allows WebSocket connections

### No Frames Received

* ✓ Confirm you received `SessionReady` before sending audio
* ✓ Verify audio format (16kHz, int16, mono)
* ✓ Check message size < 512KB

### Choppy Playback

* ✓ Play at exactly 25 fps
* ✓ Buffer at least 10 frames before playback
* ✓ Check network latency

### Frame Lag

* ✓ Reduce `speech_filter_amount` parameter