Creating a Configuration

Before you can start using the ojin/oris-voice model, you need to create a voice configuration. This guide walks you through the process.

Prerequisites

  • An Ojin account with an active API key

  • For voice cloning: a reference audio clip of the target voice

Creating a Configuration through the Dashboard

  1. Log in to the Ojin Dashboard

  2. Select Oris Voice (ojin/oris-voice)

  3. Navigate to the Library sub-section

  4. Click New Configuration

  5. Select the voice mode and fill in the required fields (see below)

  6. Click Create Configuration

  7. Open your newly created configuration and copy the Model Config ID for use in your application

Voice Mode Configuration

Clone Mode

Reproduces a voice from a reference audio sample.

Field | Required | Description
Voice Mode | Yes | Set to "Clone"
Reference audio | Yes | Reference audio file (WAV recommended). Can be a file upload or an asset UUID.
Language | No | Language of the text to synthesize (default: "English")

Reference audio best practices:

  • Duration: 5-15 seconds of clean speech

  • Quality: Clear recording with minimal background noise

  • Format: WAV, 16 kHz or higher sample rate

  • Content: Natural conversational speech (not whispered or shouted)
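
The duration and sample-rate guidelines above can be checked locally before uploading a clip. The following is a minimal sketch using Python's standard `wave` module; the function name `check_reference_clip` and the threshold constants are illustrative and not part of any Ojin tooling. Recording quality and speech content still need to be judged by ear.

```python
import wave

# Thresholds taken from the best-practice list above.
MIN_SECONDS = 5.0
MAX_SECONDS = 15.0
MIN_SAMPLE_RATE_HZ = 16_000


def check_reference_clip(path):
    """Return a list of problems found in a WAV clip (empty if none)."""
    with wave.open(path, "rb") as clip:
        sample_rate = clip.getframerate()
        duration = clip.getnframes() / sample_rate

    problems = []
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.1f} s is outside 5-15 s")
    if sample_rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {sample_rate} Hz is below 16 kHz")
    return problems
```

Calling `check_reference_clip("reference.wav")` on a clip that meets both guidelines returns an empty list.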

Built-in Voices Mode

Uses a built-in speaker identity.

Field | Required | Description
Voice Mode | Yes | Set to "Built-in Voices"
Speaker | Yes | Name of the built-in speaker identity
instruct | No | Optional styling instruction layered on the selected speaker (e.g., "speak slowly and warmly")
Language | No | Language of the text to synthesize (default: "English")

Voice Design Mode

Generates a voice from a natural language description.

Field | Required | Description
Voice Mode | Yes | Set to "Design"
instruct | Yes | Natural language description of the desired voice (e.g., "a deep male voice with a calm tone")
Language | No | Language of the text to synthesize (default: "English")
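
To summarize the three modes, the required fields can be written down as a small lookup. This is only a sketch: the dictionary keys (`voice_mode`, `reference_audio`, `speaker`, `instruct`, `language`) and the function `prepare_voice_config` are hypothetical names chosen for illustration and do not describe the dashboard's or the API's actual schema.

```python
# Hypothetical keys mirroring the dashboard fields described above.
REQUIRED_BY_MODE = {
    "Clone": ("reference_audio",),
    "Built-in Voices": ("speaker",),
    "Design": ("instruct",),
}
DEFAULT_LANGUAGE = "English"


def prepare_voice_config(fields):
    """Check the mode's required fields and fill in the language default."""
    mode = fields.get("voice_mode")
    if mode not in REQUIRED_BY_MODE:
        raise ValueError(f"unknown voice mode: {mode!r}")

    # A field counts as missing if it is absent or empty.
    missing = [key for key in REQUIRED_BY_MODE[mode] if not fields.get(key)]
    if missing:
        raise ValueError(f"{mode} mode is missing: {', '.join(missing)}")

    prepared = dict(fields)
    prepared.setdefault("language", DEFAULT_LANGUAGE)
    return prepared


design = prepare_voice_config({
    "voice_mode": "Design",
    "instruct": "a deep male voice with a calm tone",
})
print(design["language"])  # English
```

A Clone configuration without reference audio, or a Built-in Voices configuration without a speaker, raises a `ValueError` in this sketch.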

Generation Parameters

These optional parameters can be set in any mode to control the generation behavior:

Parameter | Default | Description
Temperature | 0.9 | Controls randomness in speech generation. 0.1 = highly consistent, robotic. 0.9 = natural variation. 1.0+ = maximum variety, may introduce artifacts. For production use cases, 0.7-0.9 is recommended.
Top-k | 50 | Limits token sampling to the top-k most probable candidates per step. Lower values (e.g., 20) produce more predictable speech; higher values (e.g., 100) allow more variety.
Max new tokens | 360 | Maximum tokens to generate per interaction. At a 12 Hz codec rate, 360 tokens ≈ 30 seconds of audio. Increase for longer utterances; decrease to cap generation time.
Repetition penalty | 1.05 | Discourages the model from repeating the same sounds or patterns. Values above 1.0 reduce repetition; values that are too high (e.g., 1.5+) may degrade naturalness.
Random seed | null | Set a fixed integer for reproducible output (same text + seed = same audio). Leave null for natural variation between generations.
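
The relationship between Max new tokens and audio length is simple arithmetic at the stated 12 Hz codec rate. The sketch below collects the documented defaults and converts between seconds and tokens; the key names (`temperature`, `top_k`, and so on) and the helper functions are illustrative assumptions, not the API's field names.

```python
import math

CODEC_RATE_HZ = 12  # codec tokens per second of audio, as stated above

# Documented defaults, stored under hypothetical key names.
DEFAULTS = {
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 360,
    "repetition_penalty": 1.05,
    "random_seed": None,
}


def tokens_for_seconds(seconds):
    """Smallest token budget that covers `seconds` of audio."""
    return math.ceil(seconds * CODEC_RATE_HZ)


def seconds_for_tokens(tokens):
    """Approximate audio length produced by a given token budget."""
    return tokens / CODEC_RATE_HZ


def with_overrides(**overrides):
    """Return the defaults with selected parameters replaced."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


# Allow roughly 45 seconds of audio and pin the seed for reproducibility.
params = with_overrides(max_new_tokens=tokens_for_seconds(45), random_seed=1234)
print(seconds_for_tokens(DEFAULTS["max_new_tokens"]))  # 30.0
print(params["max_new_tokens"])  # 540
```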

Next Steps

Once your configuration is ready, you can:

  • Integration Guide: Learn how to integrate TTS using WebSocket

  • API Reference: Explore the complete API documentation
