# Playback Tips

## Basics

* Video models such as `ojin/oris-portrait` deliver frames at a fixed rate (e.g. 25fps).
* The server departs from the fixed rate in some cases. For example, when the model delivers the first speech frames, it tries to send them at a higher rate (e.g. 50fps) to compensate for network jitter.
* Speech frames are tagged with `frame_idx = 1`, and silence frames with `frame_idx = 0`.
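
As a rough illustration of the points above, the following sketch classifies frames by `frame_idx` and estimates how many frames pile up in the client buffer while the server bursts faster than playback. Only `frame_idx` and the example rates come from this page; `VideoFrame`, its `image` field, and `buffered_after_burst` are hypothetical names used for illustration.

```python
from dataclasses import dataclass

PLAYBACK_FPS = 25  # example fixed rate from the text
BURST_FPS = 50     # example burst rate for the first speech frames


@dataclass
class VideoFrame:
    frame_idx: int      # 1 = speech frame, 0 = silence frame
    image: bytes = b""  # hypothetical payload field


def is_speech(frame: VideoFrame) -> bool:
    """True when the frame is tagged as a speech frame."""
    return frame.frame_idx == 1


def buffered_after_burst(burst_seconds: float,
                         delivery_fps: int = BURST_FPS,
                         playback_fps: int = PLAYBACK_FPS) -> int:
    """Frames accumulated while delivery outpaces playback for burst_seconds."""
    return int(round((delivery_fps - playback_fps) * burst_seconds))
```

With these example rates, a burst lasting 0.4 seconds would leave about 10 extra frames in the buffer.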

## Playback core

* We recommend driving playback from the audio clock: audio plays at its natural rate, and video follows the audio playback head.
* Buffer incoming video frames so that you can release them at a rate dictated by the audio head.
* Hold TTS audio from the TTS service in a buffer as well, rather than playing it immediately. The client must wait for the first speech frame before starting audio playback.
* Once speech starts, audio should play at a steady rate to avoid crackling, while speech video frames try to stay in sync with the current audio head. If video frames do not arrive fast enough, audio should never stop. This creates a small desync, which can be corrected later by selectively skipping incoming speech video frames if the frame rate allows. A good compromise is to skip every second frame while the speech video frame buffer is larger than a small threshold and video is lagging behind the audio.
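
The audio clock technique described above might be sketched as follows. This is a minimal illustration, not a reference implementation: `SAMPLE_RATE`, `SKIP_THRESHOLD`, `frames_due`, and `next_frame` are assumed names and values, and the audio head is modeled simply as a count of samples already played.

```python
from collections import deque

FPS = 25              # example video rate from the text
SAMPLE_RATE = 16_000  # assumed TTS sample rate (hypothetical)
SKIP_THRESHOLD = 3    # assumed "small threshold" of buffered speech frames


def frames_due(audio_samples_played: int) -> int:
    """Number of video frames covered by the audio played so far."""
    return audio_samples_played * FPS // SAMPLE_RATE


def next_frame(buffer: deque, video_shown: int,
               audio_samples_played: int, last_frame):
    """Pick the frame to display on this tick.

    Returns (frame, video_shown). Audio is never paused here: if the
    buffer is empty, the last frame repeats and video falls behind.
    """
    due = frames_due(audio_samples_played)
    if video_shown >= due:
        return last_frame, video_shown  # video is not behind audio: hold
    if not buffer:
        return last_frame, video_shown  # starved: repeat the last frame
    lag = due - video_shown
    if lag > 1 and len(buffer) > SKIP_THRESHOLD:
        buffer.popleft()                # skip one frame to catch up
        video_shown += 1
    frame = buffer.popleft()
    return frame, video_shown + 1
```

While video lags and the buffer stays above the threshold, each tick drops one frame and shows the next, which corresponds to skipping every second frame.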

## Turn detection (bot speaking, bot not speaking)

* Silence -> speech (bot starts speaking): as soon as the first speech video frame is ready to be played, consider the bot to have started speaking and start audio playback.
* Speech -> silence (bot stops speaking): as soon as the first silence video frame after speech is played, consider the bot to have stopped speaking and stop audio playback.
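
The two transitions above amount to a small state machine keyed on `frame_idx`. The sketch below assumes a hypothetical `TurnDetector` class that is called once for each frame about to be played. The event names `speech_start` and `speech_stop` follow the edge cases listed later on this page.

```python
class TurnDetector:
    """Tracks whether the bot is speaking, based on played frames."""

    def __init__(self):
        self.is_speaking = False

    def on_frame_played(self, frame_idx: int):
        """Return "speech_start", "speech_stop", or None for this frame."""
        if frame_idx == 1 and not self.is_speaking:
            self.is_speaking = True
            return "speech_start"   # start audio playback here
        if frame_idx == 0 and self.is_speaking:
            self.is_speaking = False
            return "speech_stop"    # stop audio playback here
        return None
```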

## Interruptions

To interrupt a video model, the client must send a `CancelInteractionMessage`. The server flushes almost all video frames being processed at that moment and starts generating silence frames right away. Some older frames may still reach the client and need to be handled. There are several ways to handle interruptions, depending on your needs. The choice is usually a trade-off between visible jumps in the avatar and added interruption latency:

1. **Smooth transitions**: when interrupting, the client does not discard any frames, either buffered or incoming. Transitions are smooth, but the bot keeps speaking (moving its lips) for some time, because frames already sent by the server or already in the client buffer cannot be cleared.
2. **Instant cut**: when interrupting, the client clears its current buffer and discards any incoming speech frames until it receives the first silence frame after the interruption. This can cause a small freeze and/or jump between the last speech frame played and the first silence frame after the interruption, but interruption latency is zero and audio can stop right away.
3. **Smooth video / hard cut audio**: the client keeps playing video frames for smooth transitions but stops audio playback immediately. When the number of incoming plus client-buffered frames is small, this usually gives the best experience.
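
The three strategies can be compared side by side in a small client-side sketch. Everything here (`InterruptMode`, `Player`, and its fields) is hypothetical and only models the buffers. Sending the `CancelInteractionMessage` itself is out of scope, and the sketch assumes that the smooth strategy leaves audio untouched.

```python
from collections import deque
from enum import Enum


class InterruptMode(Enum):
    SMOOTH = 1                  # keep every frame, keep audio
    INSTANT_CUT = 2             # drop speech frames, stop audio
    SMOOTH_VIDEO_CUT_AUDIO = 3  # keep frames, stop audio


class Player:
    def __init__(self, mode: InterruptMode):
        self.mode = mode
        self.video = deque()  # buffered frame_idx values (1 speech, 0 silence)
        self.audio = deque()  # buffered audio chunks
        self.audio_muted = False     # should be reset when the next turn starts
        self.dropping_stale = False  # True until first silence frame arrives

    def interrupt(self):
        """Call right after the cancel message has been sent."""
        if self.mode is InterruptMode.SMOOTH:
            return                    # nothing is discarded
        self.audio.clear()            # strategies 2 and 3 stop audio at once
        self.audio_muted = True
        if self.mode is InterruptMode.INSTANT_CUT:
            self.video.clear()
            self.dropping_stale = True

    def on_frame(self, frame_idx: int):
        """Handle an incoming frame from the server."""
        if self.dropping_stale:
            if frame_idx == 1:
                return                # stale speech frame: discard
            self.dropping_stale = False  # first silence frame after interrupt
        self.video.append(frame_idx)
```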

## Edge cases

The playback loop must handle these scenarios correctly:

1. **Audio exhausted during speech (deadlock prevention)**: when TTS audio buffer runs out while `is_speaking` is True and the video sync guard blocks frame consumption (`video_sent >= audio_released`), the loop must force a transition to idle. Otherwise frames pile up but are never consumed, and `is_speaking` stays True forever.
2. **Speech frames queued ahead of silence frames**: when transitioning from speech to silence, silence frames may be behind unconsumed speech frames in the buffer. The sync guard blocks those speech frames (no audio left), so the silence-transition check never sees a silence frame at position `[0]`. The audio-exhaustion guard (edge case 1) resolves this by forcing idle mode.
3. **TTS audio arrives before video frames**: audio buffer fills but `is_speaking` is False because no speech video frame has arrived yet. Audio must NOT play until the first speech video frame is ready. Once it arrives, counters reset and sync begins.
4. **TTS audio arrives in bursts with gaps**: audio buffer may run dry mid-speech, then refill when more TTS audio arrives. During the dry period, video must pause (not advance). When audio resumes, sync continues from where it left off without losing frames.
5. **Video buffer drains completely during speech**: if the server is slow and no video frames are available, the last played frame should repeat. Audio continues uninterrupted. When video frames arrive again, they catch up via the sync mechanism.
6. **Rapid turn succession (speech -> silence -> speech)**: each speech turn must reset audio/video counters independently. A brief silence gap (even 1-2 frames) between turns must correctly trigger stop then start events.
7. **Zero-volume speech frames at turn boundary**: the first frame with `frame_idx=1` might have zero volume (model warmup). It should NOT be treated as the first real speech frame; wait for a frame with actual audio content to trigger speech start.
8. **Interruption during silence**: if an interruption arrives while already in idle/silence mode, it should be a no-op for the playback loop (no state to clean up).
9. **Interruption with stale speech frames in buffer**: after interruption, the server sends silence but stale speech frames may still be in the client buffer. Depending on the interruption strategy, these must be either played through, discarded, or played without audio.
10. **Single speech frame turn**: a turn with only 1 speech frame followed by silence must correctly trigger `speech_start`, play the frame with audio, then trigger `speech_stop` on the next tick.
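
One possible shape for the audio-exhaustion guard from edge case 1 is sketched below. The names `is_speaking`, `video_sent`, and `audio_released` come from this page; `should_force_idle`, `audio_buffered`, `dry_ticks`, and `max_dry_ticks` are hypothetical. The grace period is an assumption of this sketch, added so that a short gap between TTS bursts (edge case 4) pauses video instead of ending the turn. This page does not specify how the two cases are told apart.

```python
def should_force_idle(is_speaking: bool,
                      audio_buffered: int,
                      video_sent: int,
                      audio_released: int,
                      dry_ticks: int,
                      max_dry_ticks: int = 12) -> bool:
    """Decide whether the loop must force a transition to idle.

    audio_buffered: audio still queued for playback (any unit).
    dry_ticks: consecutive ticks with an empty audio buffer (assumed counter).
    """
    guard_blocks = video_sent >= audio_released  # sync guard from edge case 1
    audio_exhausted = audio_buffered == 0
    grace_expired = dry_ticks >= max_dry_ticks   # assumed grace period
    return is_speaking and audio_exhausted and guard_blocks and grace_expired
```

When this returns `True`, the loop would leave speech mode and fall back to idle handling, which also unblocks the case where silence frames sit behind unconsumed speech frames (edge case 2).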


