Creating a Configuration
Before you can start using the ojin/oris-voice model, you need to create a voice configuration. This guide walks you through the process.
Prerequisites
An Ojin account with an active API key
For voice cloning: a reference audio clip of the target voice
Creating a Configuration through the Dashboard
Log in to the Ojin Dashboard
Select Oris Voice (ojin/oris-voice)
Navigate to the Library sub-section
Press New Configuration to create a new configuration
Select the voice mode and fill in the required fields (see below)
Click Create Configuration
Open your newly created configuration and copy the Model Config ID for use in your application
Voice Mode Configuration
Clone Mode
Reproduces a voice from a reference audio sample.
Voice Mode (required): Set to "Clone".
Reference audio (required): Reference audio file (WAV recommended). Can be a file upload or an asset UUID.
Language (optional): Language of the text to synthesize (default: "English").
Reference audio best practices:
Duration: 5-15 seconds of clean speech
Quality: Clear recording with minimal background noise
Format: WAV, 16 kHz or higher sample rate
Content: Natural conversational speech (not whispered or shouted)
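The duration and sample-rate guidelines above can be checked locally before uploading a clip. The following is a minimal sketch using Python's standard `wave` module; the function name `check_reference_audio` and the threshold constants are illustrative and not part of the Ojin API. It only inspects WAV headers, so it cannot judge noise level or speaking style.

```python
import wave

# Thresholds taken from the best practices above.
MIN_SECONDS = 5.0
MAX_SECONDS = 15.0
MIN_SAMPLE_RATE = 16_000  # Hz


def check_reference_audio(path):
    """Return a list of problems found; an empty list means the clip passes."""
    problems = []
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    if sample_rate < MIN_SAMPLE_RATE:
        problems.append(
            f"sample rate {sample_rate} Hz is below {MIN_SAMPLE_RATE} Hz"
        )
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(
            f"duration {duration:.1f} s is outside "
            f"{MIN_SECONDS:.0f}-{MAX_SECONDS:.0f} s"
        )
    return problems
```

Calling `check_reference_audio("reference.wav")` on a 10-second, 16 kHz recording should return an empty list.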
Built-in Voices Mode
Uses a built-in speaker identity.
Voice Mode (required): Set to "Built-in Voices".
Speaker (required): Name of the built-in speaker identity.
instruct (optional): Styling instruction layered on the selected speaker (e.g., "speak slowly and warmly").
Language (optional): Language of the text to synthesize (default: "English").
Voice Design Mode
Generates a voice from a natural language description.
Voice Mode (required): Set to "Design".
instruct (required): Natural language description of the desired voice (e.g., "a deep male voice with a calm tone").
Language (optional): Language of the text to synthesize (default: "English").
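The required fields for the three voice modes above can be summarized as a small validation routine. This is only a sketch: the dictionary keys (`voice_mode`, `reference_audio`, `speaker`, `instruct`, `language`) are hypothetical names chosen for illustration, and the actual Dashboard or API may label these fields differently.

```python
# Required fields per voice mode, as listed in the three mode sections above.
# Key names are hypothetical; the real API may spell them differently.
REQUIRED_FIELDS = {
    "Clone": {"reference_audio"},
    "Built-in Voices": {"speaker"},
    "Design": {"instruct"},
}
DEFAULT_LANGUAGE = "English"


def validate_voice_config(config):
    """Check required fields for the chosen mode and fill in the default language."""
    mode = config.get("voice_mode")
    if mode not in REQUIRED_FIELDS:
        raise ValueError(f"unknown voice mode: {mode!r}")
    # Treat None and empty strings as "not provided".
    provided = {key for key, value in config.items() if value not in (None, "")}
    missing = REQUIRED_FIELDS[mode] - provided
    if missing:
        raise ValueError(f"{mode} mode is missing: {sorted(missing)}")
    normalized = dict(config)
    if not normalized.get("language"):
        normalized["language"] = DEFAULT_LANGUAGE
    return normalized
```

For example, `validate_voice_config({"voice_mode": "Design", "instruct": "a deep male voice with a calm tone"})` returns the same fields with `language` set to `"English"`, while a Clone configuration without `reference_audio` raises `ValueError`.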
Generation Parameters
These optional parameters can be set in any mode to control the generation behavior:
Temperature (default: 0.9): Controls randomness in speech generation. 0.1 = highly consistent, robotic. 0.9 = natural variation. 1.0+ = maximum variety, may introduce artifacts. For production use cases, 0.7–0.9 is recommended.
Top-k (default: 50): Limits token sampling to the top-k most probable candidates per step. Lower values (e.g., 20) produce more predictable speech; higher values (e.g., 100) allow more variety.
Max new tokens (default: 360): Maximum tokens to generate per interaction. At 12 Hz codec rate, 360 tokens ≈ 30 seconds of audio. Increase for longer utterances; decrease to cap generation time.
Repetition penalty (default: 1.05): Discourages the model from repeating the same sounds or patterns. Values above 1.0 reduce repetition; too high (e.g., 1.5+) may degrade naturalness.
Random seed (default: null): Set a fixed integer for reproducible output (same text + seed = same audio). Leave null for natural variation between generations.
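The defaults above, and the relationship between max new tokens and audio length, can be captured in a few lines. This sketch assumes the 12 Hz codec rate means 12 tokens per second of audio, which matches the stated 360 tokens ≈ 30 seconds. The class and function names are illustrative and not part of the Ojin API.

```python
import math
from dataclasses import dataclass
from typing import Optional

CODEC_RATE_HZ = 12  # tokens per second of audio, per the description above


@dataclass
class GenerationParams:
    """Documented defaults; attribute names are hypothetical."""
    temperature: float = 0.9
    top_k: int = 50
    max_new_tokens: int = 360
    repetition_penalty: float = 1.05
    seed: Optional[int] = None  # None = natural variation between generations

    def max_audio_seconds(self) -> float:
        """Approximate upper bound on audio length for one interaction."""
        return self.max_new_tokens / CODEC_RATE_HZ


def tokens_for_seconds(seconds: float) -> int:
    """Smallest max_new_tokens value that covers the requested duration."""
    return math.ceil(seconds * CODEC_RATE_HZ)
```

With the defaults, `GenerationParams().max_audio_seconds()` gives 30.0, and covering a 45-second utterance would need `tokens_for_seconds(45)`, which is 540 tokens.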
Next Steps
Once your configuration is ready, continue with:
Integration Guide: Learn how to integrate TTS using WebSocket
API Reference: Explore the complete API documentation