API Reference

Overview

Real-time talking head synthesis API. Send speech audio, receive synchronized video and audio frames.

After connecting and receiving SessionReady, send one initial audio message containing one frame's worth of silent audio to start the interaction on the server. The server then immediately begins streaming video and audio frames at 25fps. When no speech audio has been sent, the server generates silence frames (persona at rest with idle animation). When you send speech audio, the server generates speech frames with lip-synced animation synchronized to your audio.

After the initial silence frame, you only need to send speech audio — no additional silence, padding, or keep-alive messages are required.

Note: For secure, low-latency video applications, connect to the real-time WebSocket API from a backend server rather than a front-end client. This keeps your API key secret and lets you use a network transport suited to real-time video delivery under varying network conditions. Typically, WebRTC is used to deliver the final media stream to end users for smooth, reliable, low-latency playback.


How It Works

  1. Connect to the WebSocket endpoint with your API key and config ID

  2. Receive SessionReady — the server has allocated inference resources for your session

  3. Send an initial audio message containing one frame of silence

  4. The server starts streaming frames immediately — silence frames with idle animation

  5. Send speech audio whenever it becomes available (e.g., TTS output from your language model) — also buffer it locally for playback

  6. Receive speech frames — the server transitions to lip-synced animation and returns to silence frames when audio runs out

  7. Render video frames at 25fps, dropping excess silence frames to manage buffer size

  8. Start playing your buffered TTS audio when the first speech frame (index == 1) arrives from Ojin — stop when speech ends

Frame Types

Every frame arrives as a binary InteractionResponse containing both a JPEG image and a PCM audio chunk. Frames are always delivered in order. The index field identifies the frame type:

| Index | Frame type | Description |
| --- | --- | --- |
| 0 | Silence | Persona at rest with idle animation. Generated automatically when no speech audio is queued |
| 1 | Speech | Lip-synced animation generated from your audio input |

Faster-than-Realtime Generation

The server generates frames slightly faster than realtime to build a client-side buffer that prevents stuttering during speech. During speech bursts, generation is even faster. This means your frame buffer will grow over time if you don't manage it.

You must drop silence frames to prevent unbounded buffer growth. When your buffer starts growing beyond what you need for smooth playback, skip 1 out of every 2 silence frames (index == 0) until the buffer shrinks back down. Never drop speech frames (index == 1).

The right buffer target depends on your network conditions and latency requirements — start by observing your buffer size during playback and tuning from there. Keep it as low as possible to minimize latency, but high enough to absorb network jitter without starving playback.
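As a concrete illustration, the drop strategy can be wrapped in a small buffer class. This is a minimal sketch: the `FrameBuffer` name and the 10-frame default target are illustrative choices, not part of the API.

```python
from collections import deque

class FrameBuffer:
    """Frame queue that sheds silence frames once it grows past a target size."""

    def __init__(self, target=10):  # ~400 ms at 25 fps; tune for your network
        self.frames = deque()
        self.target = target
        self._skip = False  # toggles so 1 out of every 2 silence frames is dropped

    def push(self, index, payload):
        if index == 0 and len(self.frames) > self.target:
            # Buffer is over target: skip every other silence frame (index == 0).
            self._skip = not self._skip
            if self._skip:
                return
        # Speech frames (index == 1) are always kept.
        self.frames.append((index, payload))

    def pop(self):
        """Next frame for the 25 fps renderer, or None if the buffer is empty."""
        return self.frames.popleft() if self.frames else None
```

Dropping only silence frames keeps latency bounded without ever cutting into lip-synced speech.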


Connection Flow


WebSocket Handshake

Open WebSocket connection


Connect to the WebSocket endpoint providing an API key in the Authorization header and a config_id query parameter. The server upgrades the connection to WebSocket and immediately begins streaming frames after sending SessionReady.

Recommended WebSocket settings:

  • ping_interval: 30 seconds

  • ping_timeout: 10 seconds
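With those settings, a connection sketch using the third-party `websockets` library might look like the following. The endpoint URL comes from your dashboard, and `build_ws_url` is a hypothetical helper; note that `websockets` versions before 14 call the header argument `extra_headers` instead of `additional_headers`.

```python
from urllib.parse import urlencode

def build_ws_url(base_url: str, config_id: str) -> str:
    """Append the required config_id query parameter to the endpoint URL."""
    return f"{base_url}?{urlencode({'config_id': config_id})}"

async def connect(base_url: str, api_key: str, config_id: str):
    import websockets  # third-party: pip install websockets
    return await websockets.connect(
        build_ws_url(base_url, config_id),
        additional_headers={"Authorization": api_key},  # raw key, no Bearer prefix
        ping_interval=30,  # recommended keep-alive settings
        ping_timeout=10,
    )
```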

Query parameters

  • config_id (string, required): Configuration ID for the persona, created in the Flashhead Lite 1.0 tab of the dashboard.

Header parameters

  • Authorization (string, required): Your raw API key, with no Bearer prefix. Example: your-api-key

Message Format


Mixed message types: Both JSON (text) and binary messages are exchanged on the same WebSocket connection. Your client must check the WebSocket frame type to distinguish them:

  • Text frames (JSON): SessionReady, ErrorResponse (server → client), EndInteraction, CancelInteraction (client → server)

  • Binary frames: InteractionResponse (server → client), InteractionInput (client → server)


Byte order: All multi-byte integer fields in binary messages use network byte order (big-endian).


Messages Reference

Server → Client Messages

  • SessionReady (JSON): the session is ready; the server has allocated inference resources and will begin streaming frames
  • InteractionResponse (binary): one video frame plus a synchronized audio chunk
  • ErrorResponse (JSON): error details

Client → Server Messages

  • InteractionInput (binary): speech audio for lip-sync
  • EndInteraction (JSON): gracefully finish the interaction
  • CancelInteraction (JSON): immediately stop and discard remaining frames


Message Details

InteractionInput (Client → Server, Binary)

Binary message for sending speech audio to the server. Only send speech audio — do not send silence or padding.

Binary structure:

Header fields use big-endian byte order. The PCM audio samples in the payload use little-endian (standard for PCM int16). In Python: struct.pack('!BQI', payload_type, timestamp, params_size).
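A packing sketch based on the header format above. The exact layout after the 13-byte header is not reproduced in this page, so the assumption that the params bytes and PCM samples directly follow the header is ours; verify against the binary structure for your deployment.

```python
import struct
import time

AUDIO_PAYLOAD_TYPE = 1  # assumed to match payload type 1 (audio)

def pack_interaction_input(pcm_audio: bytes, params: bytes = b"") -> bytes:
    """Big-endian header ('!BQI') followed by params and little-endian int16 PCM."""
    header = struct.pack(
        "!BQI",
        AUDIO_PAYLOAD_TYPE,
        int(time.time() * 1000),  # timestamp in milliseconds (assumed unit)
        len(params),
    )
    message = header + params + pcm_audio
    if len(message) > 512 * 1024:
        raise ValueError("message exceeds the 512 KB limit")
    return message
```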

Audio requirements:

| Property | Value |
| --- | --- |
| Format | PCM signed 16-bit integers (little-endian samples) |
| Sample rate | 16,000 Hz |
| Channels | 1 (mono) |
| Max message size | 512 KB (entire binary message including header) |

Recommended streaming pattern:

Forward speech audio to the server as it arrives from your TTS service — no need to buffer or accumulate before sending (buffering is only needed for playback). Sending in real time keeps streaming latency low and playback smooth.


InteractionResponse (Server → Client, Binary)

Binary message containing a video frame and synchronized audio. The server streams these continuously after SessionReady. Frames always arrive in order.

Binary structure:

All multi-byte integers are big-endian. In Python: struct.unpack('!B16sQIII', header_bytes) for the main header, struct.unpack('!IB', entry_bytes) for each payload entry.

Frame index:

| Index | Meaning |
| --- | --- |
| 0 | Silence frame — persona at rest with idle animation |
| 1 | Speech frame — lip-synced animation from your audio |

Payload types:

| Type | Format | Typical size per frame |
| --- | --- | --- |
| 1 (audio) | PCM int16, 16kHz mono | 1,280 bytes (640 samples = 40ms at 25fps) |
| 2 (image) | JPEG-encoded image | Variable (resolution depends on config, e.g. 1280×720) |

Parsing example:
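The original parsing snippet is not embedded in this page, so the following is a reconstruction from the struct formats above. The unpacked field names (message type, session id, timestamp, frame index, flags, payload count) and the size-then-type order of each payload entry are assumptions:

```python
import struct

HEADER_FMT = "!B16sQIII"  # big-endian main header, 37 bytes
ENTRY_FMT = "!IB"         # per-payload entry: size (uint32) + payload type (uint8)
HEADER_SIZE = struct.calcsize(HEADER_FMT)
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)

def parse_interaction_response(message: bytes):
    """Return (frame_index, payloads) where payloads maps payload type -> bytes."""
    # Field meanings here are assumed; only the struct formats come from the docs.
    (_msg_type, _session_id, _timestamp,
     frame_index, _flags, payload_count) = struct.unpack(HEADER_FMT, message[:HEADER_SIZE])
    offset, payloads = HEADER_SIZE, {}
    for _ in range(payload_count):
        size, ptype = struct.unpack(ENTRY_FMT, message[offset:offset + ENTRY_SIZE])
        offset += ENTRY_SIZE
        payloads[ptype] = message[offset:offset + size]  # 1 = PCM audio, 2 = JPEG
        offset += size
    return frame_index, payloads
```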


EndInteraction vs CancelInteraction

| Message | Purpose | Server behavior | Use case |
| --- | --- | --- | --- |
| EndInteraction | Graceful finish | Completes processing, sends remaining frames with the last marked is_final: true | Session end |
| CancelInteraction | Immediate stop | Stops processing, discards remaining frames | User interruption |


ErrorResponse (Server → Client, JSON)


Error codes:

| Code | Description |
| --- | --- |
| AUTH_FAILED | Invalid API key |
| UNAUTHORIZED | Caller lacks permission |
| MISSING_CONFIG_ID | config_id query parameter not provided |
| INVALID_MESSAGE | Malformed or unsupported message payload |
| INVALID_HEADERS | Missing or invalid headers |
| MODEL_NOT_FOUND | Config ID not found or invalid |
| BACKEND_UNAVAILABLE | No healthy inference backend available |
| RATE_LIMITED | Too many requests |
| TIMEOUT | Operation exceeded processing time |
| CANCELLED | Interaction cancelled by client |
| INTERNAL_ERROR | Unexpected server error |
| FRAME_SIZE_EXCEEDED | Message exceeded the 512 KB limit |


Rate Limits & Constraints

| Constraint | Value |
| --- | --- |
| Rate limit | 6 requests per second |
| Max message size | 512 KB per message |
| Video output | 25 fps target (generated slightly faster than realtime) |

Exceeding limits results in an ErrorResponse with code RATE_LIMITED.


Best Practices

Audio Input

  • Send one silence frame first to start the conversation, then send speech audio when available

  • Forward speech audio to the server as it arrives from your TTS service — no need to buffer or accumulate before sending (buffering is only needed for playback). Sending in real time keeps streaming latency low and playback smooth.

Buffer Management

  • Play frames at 25 fps (40ms per frame)

  • The server generates slightly faster than realtime — you must drop silence frames to prevent the buffer from growing

  • Drop strategy: when the buffer grows beyond your target size, skip 1 out of every 2 silence frames (index == 0) until the buffer shrinks back

  • Never drop speech frames (index == 1)

  • Tune your target buffer size based on your network conditions — keep it as low as possible for minimal latency

Audio and Video Synchronization

Each InteractionResponse contains both a JPEG image and a PCM audio chunk. However, the recommended approach for audio playback is:

  1. Buffer your TTS/source audio locally as it arrives from your speech service (for later playback only; you do not need to buffer it before sending to Ojin)

  2. Forward TTS audio to Ojin immediately as it arrives from your speech service

  3. Wait for the first speech frame (index == 1) to arrive from Ojin

  4. Start playing your buffered TTS audio at that moment

  5. Stop audio playback when you receive the first silence frame (index == 0) after speech

  6. Render video from every frame regardless of type

Sending is immediate, but playback is gated on the first speech frame. This ensures audio and video stay in sync.
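The gating logic above fits a tiny state machine; `start_playback` and `stop_playback` are hypothetical callbacks you would wire to your audio player:

```python
class PlaybackGate:
    """Starts buffered TTS playback on the first speech frame, stops on return to silence."""

    def __init__(self, start_playback, stop_playback):
        self.start_playback = start_playback
        self.stop_playback = stop_playback
        self.speaking = False

    def on_frame(self, index: int) -> None:
        if index == 1 and not self.speaking:
            self.speaking = True
            self.start_playback()  # first speech frame: begin buffered audio
        elif index == 0 and self.speaking:
            self.speaking = False
            self.stop_playback()   # first silence frame after speech: stop audio
```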

Error Handling

  • Handle both JSON ErrorResponse messages and plain text error strings

  • Implement exponential backoff for reconnection

  • Monitor server load in the SessionReady message

Interruption Handling

  • Use CancelInteraction for immediate stops (e.g., user interrupts the bot)

  • Use EndInteraction for graceful session endings

  • Clear your frame buffer on interruption


Complete Example
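The original example is not embedded in this page, so here is a hedged end-to-end sketch assembled from the message layouts and buffer strategy described above. The endpoint URL, the meanings of the unpacked header fields, and the silence-message layout are assumptions; treat this as a starting point rather than a reference client (`websockets` versions before 14 use `extra_headers` instead of `additional_headers`).

```python
import json
import struct

HEADER_FMT = "!B16sQIII"  # big-endian InteractionResponse header (assumed field meanings)
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def pack_silence_frame() -> bytes:
    """One 40 ms frame of silence: 640 int16 samples at 16 kHz (assumed layout)."""
    return struct.pack("!BQI", 1, 0, 0) + b"\x00" * 1280

def parse_frame(message: bytes):
    """Return (frame_index, payloads); header field meanings are assumed."""
    _, _, _, frame_index, _, payload_count = struct.unpack(HEADER_FMT, message[:HEADER_SIZE])
    offset, payloads = HEADER_SIZE, {}
    for _ in range(payload_count):
        size, ptype = struct.unpack("!IB", message[offset:offset + 5])
        offset += 5
        payloads[ptype] = message[offset:offset + size]
        offset += size
    return frame_index, payloads

async def run(url: str, api_key: str, config_id: str) -> None:
    import websockets  # third-party: pip install websockets
    async with websockets.connect(
        f"{url}?config_id={config_id}",
        additional_headers={"Authorization": api_key},  # raw key, no Bearer prefix
        ping_interval=30,
        ping_timeout=10,
    ) as ws:
        await ws.recv()                      # SessionReady (JSON text frame)
        await ws.send(pack_silence_frame())  # kick off streaming with one silent frame
        buffer = []
        async for msg in ws:
            if isinstance(msg, str):         # text frames carry JSON control messages
                print("server message:", json.loads(msg))
                continue
            frame_index, payloads = parse_frame(msg)
            # Shed every other silence frame once the buffer passes ~10 frames.
            if frame_index == 0 and len(buffer) > 10 and len(buffer) % 2 == 0:
                continue
            buffer.append((frame_index, payloads))
            # Hand (JPEG image, PCM chunk) pairs to your 25 fps renderer here.
```

Invoke the coroutine with `asyncio.run(run(url, api_key, config_id))` from your own entry point.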


Troubleshooting

Connection Issues

  • ✓ Verify API key and config ID

  • ✓ Check that config exists in dashboard

  • ✓ Ensure network allows WebSocket connections (port 443)

  • ✓ Check the Authorization header uses the raw API key (no Bearer prefix)

No Frames Received

  • ✓ Confirm you received SessionReady — frames start streaming immediately after

  • ✓ If sending speech audio: verify format is 16kHz, int16, mono with big-endian message header

  • ✓ Check message size < 512KB

Choppy Playback

  • ✓ Play at 25fps (40ms per frame)

  • ✓ Buffer some frames before starting playback

  • ✓ Check network latency and jitter

Growing Latency

  • ✓ You must drop silence frames — the server generates faster than realtime

  • ✓ Skip 1 out of 2 silence frames (index == 0) when buffer grows beyond your target

  • ✓ Never drop speech frames (index == 1)

  • ✓ During speech bursts the buffer will grow temporarily — this is expected, trim silence frames afterward

Frame Lag During Speech

  • ✓ Reduce the speech_filter_amount parameter (lower = more responsive, less smooth)


Example Implementation

A complete working Python example integrating Ojin Flashhead with a speech-to-speech service (Hume EVI) is available here:

github.com/journee-live/speech-to-video-samples/tree/main/samples/hume-sts-flashhead

The repository demonstrates the full integration pattern: microphone capture → STS service → TTS audio → Ojin lip-sync → synchronized video and audio playback at 25fps. It includes the buffer management and frame handling approach described in Best Practices above.

