API Reference
Overview
Real-time talking head synthesis API. Send speech audio, receive synchronized video and audio frames.
After connecting and receiving SessionReady, you should send one initial audio message with silence (one frame worth of silent audio) to start the interaction on the server. The server then immediately begins streaming video and audio frames at 25fps. When no speech audio has been sent, the server generates silence frames (persona at rest with idle animation). When you send speech audio, the server generates speech frames with lip-synced animation synchronized to your audio.
After the initial silence frame, you only need to send speech audio — no additional silence, padding, or keep-alive messages are required.
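In the required audio format (PCM int16, 16 kHz, mono; see Audio requirements below), one frame of silence is 40 ms of zero samples. A minimal sketch:

```python
SAMPLE_RATE = 16_000       # Hz, mono, PCM signed 16-bit
FRAME_DURATION_MS = 40     # one video frame at 25 fps
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_DURATION_MS // 1000  # 640 samples
SILENCE_FRAME = b"\x00\x00" * SAMPLES_PER_FRAME              # 1,280 bytes of silence
```

These bytes form the audio payload of the initial InteractionInput message.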
Deployment note: For secure, low-latency video applications, connect to the real-time WebSocket API from a backend server rather than a front-end client. This keeps your API key secret and lets you use a network transport suited to real-time video delivery under varying network conditions. Typically, WebRTC is used to deliver the final media stream to end users for smooth, reliable, low-latency playback.
How It Works
1. Connect to the WebSocket endpoint with your API key and config ID
2. Receive SessionReady — the server has allocated inference resources for your session
3. Send an initial audio message containing one frame of silence
4. The server starts streaming frames immediately — silence frames with idle animation
5. Send speech audio whenever it becomes available (e.g., TTS output from your language model) — also buffer it locally for playback
6. Receive speech frames — the server transitions to lip-synced animation and returns to silence frames when audio runs out
7. Render video frames at 25 fps, dropping excess silence frames to manage buffer size
8. Start playing your buffered TTS audio when the first speech frame (index == 1) arrives from Ojin — stop when speech ends
Frame Types
Every frame arrives as a binary InteractionResponse containing both a JPEG image and a PCM audio chunk. Frames are always delivered in order. The index field identifies the frame type:
Index 0 (Silence): Persona at rest with idle animation. Generated automatically when no speech audio is queued.
Index 1 (Speech): Lip-synced animation generated from your audio input.
Faster-than-Realtime Generation
The server generates frames slightly faster than realtime to build a client-side buffer that prevents stuttering during speech. During speech bursts, generation is even faster. This means your frame buffer will grow over time if you don't manage it.
You must drop silence frames to prevent unbounded buffer growth. When your buffer starts growing beyond what you need for smooth playback, skip 1 out of every 2 silence frames (index == 0) until the buffer shrinks back down. Never drop speech frames (index == 1).
The right buffer target depends on your network conditions and latency requirements — start by observing your buffer size during playback and tuning from there. Keep it as low as possible to minimize latency, but high enough to absorb network jitter without starving playback.
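The drop strategy can be sketched as a small buffer class. The class and its names are illustrative, not part of the Ojin API:

```python
from collections import deque

class FrameBuffer:
    """Client-side frame buffer that skips every other silence frame while
    the buffer holds more than `target` frames. Illustrative sketch."""

    def __init__(self, target: int = 5):
        self.target = target
        self.frames = deque()
        self._drop_next_silence = False

    def push(self, index: int, frame: bytes) -> None:
        # Only silence frames (index == 0) are ever dropped.
        if index == 0 and len(self.frames) > self.target:
            self._drop_next_silence = not self._drop_next_silence
            if self._drop_next_silence:
                return  # skip 1 out of every 2 silence frames
        self.frames.append((index, frame))  # speech frames are always kept

    def pop(self):
        """Called by the 25 fps render loop."""
        return self.frames.popleft() if self.frames else None
```

In real use, `pop()` runs every 40 ms, so a buffer above `target` shrinks back as silence frames are skipped.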
Connection Flow
WebSocket Handshake
Connect to the WebSocket endpoint providing an API key in the Authorization header and a config_id query parameter. The server upgrades the connection to WebSocket and immediately begins streaming frames after sending SessionReady.
Recommended WebSocket settings:
ping_interval: 30 seconds
ping_timeout: 10 seconds
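With the Python websockets library, these settings map directly onto connect() arguments. A sketch, where the base URL is a placeholder (substitute the endpoint from your dashboard) and the helper name is ours:

```python
def connection_kwargs(api_key: str, config_id: str,
                      base_url: str = "wss://example.invalid/flashhead") -> dict:
    """Build arguments for websockets.connect(). The base URL is a
    placeholder -- use the endpoint from your dashboard."""
    return {
        "uri": f"{base_url}?config_id={config_id}",
        # Raw API key, no Bearer prefix:
        "additional_headers": {"Authorization": api_key},
        "ping_interval": 30,     # recommended keep-alive settings
        "ping_timeout": 10,
        "max_size": 512 * 1024,  # matches the 512 KB per-message limit
    }

# Usage (requires `pip install websockets`; versions before 14 name the
# header argument `extra_headers` instead of `additional_headers`):
#
#   import websockets
#   async with websockets.connect(**connection_kwargs(KEY, CFG)) as ws:
#       session_ready = await ws.recv()  # first message: SessionReady (JSON text)
```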
Authorization (header): Your raw API key, no Bearer prefix (e.g. your-api-key).
config_id (query parameter): Configuration ID for the persona, created in the Flashhead Lite 1.0 tab of the dashboard.
101: WebSocket upgrade successful. After the upgrade, the server sends a SessionReady JSON message and begins streaming binary InteractionResponse frames immediately.
401: Unauthorized — invalid or missing API key.
Message Format
Mixed message types: Both JSON (text) and binary messages are exchanged on the same WebSocket connection. Your client must check the WebSocket frame type to distinguish them:
Text frames (JSON): SessionReady, ErrorResponse (server → client); EndInteraction, CancelInteraction (client → server)
Binary frames: InteractionResponse (server → client); InteractionInput (client → server)
Byte order: All multi-byte integer fields in binary messages use network byte order (big-endian).
Messages Reference
Server → Client Messages
Client → Server Messages
Message Details
InteractionInput (Client → Server, Binary)
Binary message for sending speech audio to the server. Apart from the initial one-frame silence message, send only speech audio — no additional silence or padding.
Binary structure:
Header fields use big-endian byte order. The PCM audio samples in the payload use little-endian (standard for PCM int16). In Python: struct.pack('!BQI', payload_type, timestamp, params_size).
Audio requirements:
Format: PCM signed 16-bit integers (little-endian samples)
Sample rate: 16,000 Hz
Channels: 1 (mono)
Max message size: 512 KB (entire binary message, including header)
Recommended streaming pattern:
Forward speech audio to the server as it arrives from your TTS service — there is no need to buffer or accumulate it before sending (buffering is only needed for local playback). Sending audio in realtime keeps streaming and playback smooth.
InteractionResponse (Server → Client, Binary)
Binary message containing a video frame and synchronized audio. The server streams these continuously after SessionReady. Frames always arrive in order.
Binary structure:
All multi-byte integers are big-endian. In Python: struct.unpack('!B16sQIII', header_bytes) for the main header, struct.unpack('!IB', entry_bytes) for each payload entry.
Frame index:
0: Silence frame — persona at rest with idle animation
1: Speech frame — lip-synced animation from your audio
Payload types:
1 (audio): PCM int16, 16 kHz mono; 1,280 bytes (640 samples = 40 ms at 25 fps)
2 (image): JPEG-encoded image; variable size (resolution depends on config, e.g. 1280×720)
Parsing example:
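A sketch using the format strings above. The (size, payload_type) ordering of each '!IB' entry is an assumption here, and the header field positions should be mapped against the binary structure table:

```python
import struct

HEADER_FMT = "!B16sQIII"  # main header (per the format string above)
ENTRY_FMT = "!IB"         # per payload entry; (size, payload_type) order assumed
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 33 bytes
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)    # 5 bytes

def parse_interaction_response(msg: bytes):
    """Split an InteractionResponse into its header fields and payloads.
    The header tuple contains, among others, the frame index -- map the
    positions against the binary structure table above."""
    header = struct.unpack(HEADER_FMT, msg[:HEADER_SIZE])
    payloads = {}
    offset = HEADER_SIZE
    while offset + ENTRY_SIZE <= len(msg):
        size, ptype = struct.unpack(ENTRY_FMT, msg[offset:offset + ENTRY_SIZE])
        offset += ENTRY_SIZE
        payloads[ptype] = msg[offset:offset + size]  # 1 = audio (PCM), 2 = image (JPEG)
        offset += size
    return header, payloads
```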
EndInteraction vs CancelInteraction
EndInteraction — graceful finish: completes processing and sends the remaining frames, with the last one marked is_final: true. Use when ending a session.
CancelInteraction — immediate stop: stops processing and discards the remaining frames. Use for user interruption.
ErrorResponse (Server → Client, JSON)
Plain text errors: In some error conditions (e.g., no backend servers available), the server may send a plain text message instead of a structured JSON ErrorResponse. Your client should handle non-JSON text messages gracefully.
Error codes:
AUTH_FAILED: Invalid API key
UNAUTHORIZED: Caller lacks permission
MISSING_CONFIG_ID: config_id query parameter not provided
INVALID_MESSAGE: Malformed or unsupported message payload
INVALID_HEADERS: Missing or invalid headers
MODEL_NOT_FOUND: Config ID not found or invalid
BACKEND_UNAVAILABLE: No healthy inference backend available
RATE_LIMITED: Too many requests
TIMEOUT: Operation exceeded processing time
CANCELLED: Interaction cancelled by client
INTERNAL_ERROR: Unexpected server error
FRAME_SIZE_EXCEEDED: Message exceeded the 512 KB limit
Rate Limits & Constraints
Rate limit: 6 requests per second
Max message size: 512 KB per message
Video output: 25 fps target (generated slightly faster than realtime)
Exceeding limits results in an ErrorResponse with code RATE_LIMITED.
Best Practices
Audio Input
Send one silence frame first to start the conversation, then send speech audio when available
Forward speech audio to the server as it arrives from your TTS service — there is no need to buffer or accumulate it before sending (buffering is only needed for local playback). Sending audio in realtime keeps streaming and playback smooth.
Buffer Management
Play frames at 25 fps (40ms per frame)
The server generates slightly faster than realtime — you must drop silence frames to prevent the buffer from growing
Drop strategy: when the buffer grows beyond your target size, skip 1 out of every 2 silence frames (index == 0) until the buffer shrinks back
Never drop speech frames (index == 1)
Tune your target buffer size based on your network conditions — keep it as low as possible for minimal latency
Audio and Video Synchronization
Each InteractionResponse contains both a JPEG image and a PCM audio chunk. However, the recommended approach for audio playback is:
Buffer your TTS/source audio locally as it arrives from your speech service (for later playback only — you do not need to buffer it before sending it to Ojin)
Forward TTS audio to Ojin immediately as it arrives from your speech service
Wait for the first speech frame (index == 1) to arrive from Ojin
Start playing your buffered TTS audio at that moment
Stop audio playback when you receive the first silence frame (index == 0) after speech
Render video from every frame regardless of type
Sending is immediate, but playback is gated on the first speech frame. This ensures audio and video stay in sync.
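This gating logic reduces to a small state machine. A sketch — the class and event names are ours, not part of the API:

```python
class PlaybackGate:
    """Start local TTS playback on the first speech frame and stop it on
    the first silence frame after speech. Illustrative sketch."""

    def __init__(self):
        self.playing = False

    def on_frame(self, index: int):
        # Video is rendered for every frame regardless; this only gates audio.
        if index == 1 and not self.playing:
            self.playing = True
            return "start_audio"  # begin playing buffered TTS audio now
        if index == 0 and self.playing:
            self.playing = False
            return "stop_audio"   # speech ended
        return None
```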
Error Handling
Handle both JSON ErrorResponse messages and plain text error strings
Implement exponential backoff for reconnection
Monitor the server load field in the SessionReady message
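Since some error conditions produce plain text instead of JSON, normalizing incoming text frames can be sketched as (the wrapper shape for non-JSON text is our own convention):

```python
import json

def parse_text_message(text: str) -> dict:
    """Normalize a text frame from the server. Structured messages
    (SessionReady, ErrorResponse) are JSON; plain text errors are wrapped
    in a local convention so callers see a uniform shape."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {"type": "PlainTextError", "message": text}  # local convention
```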
Interruption Handling
Use CancelInteraction for immediate stops (e.g., user interrupts the bot)
Use EndInteraction for graceful session endings
Clear your frame buffer on interruption
Complete Example
Troubleshooting
Connection Issues
✓ Verify API key and config ID
✓ Check that config exists in dashboard
✓ Ensure network allows WebSocket connections (port 443)
✓ Check the Authorization header uses the raw API key (no Bearer prefix)
No Frames Received
✓ Confirm you received SessionReady — frames start streaming immediately after
✓ If sending speech audio: verify the format is 16 kHz, int16, mono with a big-endian message header
✓ Check message size < 512KB
Choppy Playback
✓ Play at 25fps (40ms per frame)
✓ Buffer some frames before starting playback
✓ Check network latency and jitter
Growing Latency
✓ You must drop silence frames — the server generates faster than realtime
✓ Skip 1 out of 2 silence frames (index == 0) when the buffer grows beyond your target
✓ Never drop speech frames (index == 1)
✓ During speech bursts the buffer will grow temporarily — this is expected; trim silence frames afterward
Frame Lag During Speech
✓ Reduce the speech_filter_amount parameter (lower = more responsive, less smooth)
Example Implementation
A complete working Python example integrating Ojin Flashhead with a speech-to-speech service (Hume EVI) is available here:
github.com/journee-live/speech-to-video-samples/tree/main/samples/hume-sts-flashhead
The repository demonstrates the full integration pattern: microphone capture → STS service → TTS audio → Ojin lip-sync → synchronized video and audio playback at 25fps. It includes the buffer management and frame handling approach described in Best Practices above.