VoiceBox

Parameter Guide

Language

Sets the target language for voice generation. The model will synthesize speech in this language regardless of the input text language.

Instruct (Preset only)

Text instruction for guiding preset voice style. Examples: "Speak slowly and warmly", "Read as a news anchor". Only available with preset voices.

Temperature

Controls randomness in token sampling. Lower values (0.1-0.3) produce more deterministic, consistent output. Higher values (0.7-1.0) increase variation and expressiveness. Default: 0.7.
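
As a rough illustration of why lower values give more consistent output, the sketch below divides a set of made-up scores (logits) by the temperature before turning them into probabilities. The function name and the example numbers are assumptions for illustration only, not VoiceBox internals.

```python
import math

def softmax_with_temperature(logits, temperature=0.7):
    """Convert raw scores to probabilities after scaling by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
example_logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(example_logits, temperature=0.2)
warm = softmax_with_temperature(example_logits, temperature=1.0)
print("temperature 0.2:", [round(p, 3) for p in cold])
print("temperature 1.0:", [round(p, 3) for p in warm])
```

At the lower temperature almost all of the probability lands on the top-scoring token, so repeated runs tend to pick the same token; at the higher temperature the other tokens keep a meaningful share.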

Top-K

Limits the number of candidate tokens at each generation step to the K most probable. Lower values (10-30) produce safer output; higher values (50-100) allow more diversity. Default: 50.
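
A minimal sketch of the idea, assuming the candidates are given as a plain list of probabilities: keep the K most probable entries, zero out the rest, and rescale so the kept entries sum to 1. The helper name is hypothetical.

```python
def top_k_filter(probs, k=50):
    """Keep only the k most probable tokens and renormalize."""
    if k <= 0:
        raise ValueError("k must be at least 1")
    # Indices sorted from most to least probable, truncated to k entries.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical distribution over four tokens, filtered down to the top two.
example_probs = [0.5, 0.3, 0.15, 0.05]
top_two = top_k_filter(example_probs, k=2)
print([round(p, 3) for p in top_two])
```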

Top-P (Nucleus Sampling)

Cumulative probability threshold for token selection. The model considers the smallest set of tokens whose combined probability exceeds this value. Lower values make the output more focused; higher values make it more diverse. Default: 0.9.
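
The selection rule can be sketched as follows, assuming a plain list of probabilities: add tokens from most to least probable until their combined probability exceeds the threshold, then rescale the kept tokens. The helper name and example values are illustrative assumptions.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative probability exceeds p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set()
    cumulative = 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative > p:
            break  # the nucleus is complete
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical distribution: the first three tokens sum to 0.95, which exceeds 0.9.
example_probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(example_probs, p=0.9)
print([round(x, 3) for x in nucleus])
```

Unlike Top-K, the number of tokens kept here changes from step to step: a confident distribution keeps very few, a flat one keeps many.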

Repetition Penalty

Penalizes tokens that have already appeared, reducing loops and stuttering. A value of 1.0 applies no penalty. Values above 1.2 can reduce naturalness. Default: 1.0.
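
One common way to apply such a penalty, shown here as an assumption rather than as VoiceBox's actual implementation, is to shrink the score of every token that has already been generated: positive scores are divided by the penalty and negative scores are multiplied by it. The function name and sample values are hypothetical.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.0):
    """Lower the scores of tokens that already appeared in the output."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both make
        # the token less likely when penalty > 1.0.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

# Hypothetical scores; tokens 0 and 1 have already been generated.
example_logits = [2.0, -1.0, 0.5]
unchanged = apply_repetition_penalty(example_logits, [0, 1], penalty=1.0)
penalized = apply_repetition_penalty(example_logits, [0, 1], penalty=1.2)
print(unchanged)
print([round(x, 3) for x in penalized])
```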

Seed

Fixed random seed for reproducible generations. Leave empty for random output each time. The same seed with the same parameters produces the same audio.
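
The effect of a fixed seed can be illustrated with Python's standard random number generator. This is only an analogy for how seeded sampling behaves; the function name and probabilities are assumptions.

```python
import random

def sample_tokens(probs, count, seed=None):
    """Draw token indices at random; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)  # seed=None picks a fresh random state each call
    return rng.choices(range(len(probs)), weights=probs, k=count)

# Hypothetical distribution over four tokens.
example_probs = [0.5, 0.3, 0.15, 0.05]
first_run = sample_tokens(example_probs, count=20, seed=42)
second_run = sample_tokens(example_probs, count=20, seed=42)
print(first_run == second_run)
```

Changing any sampling parameter alters the probabilities being drawn from, which is why the seed alone does not guarantee identical output.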

Voice Samples

Audio clips used to clone a voice. More samples with clear speech and low background noise produce better results. Each sample should be under 30 seconds. A transcript matching the spoken words is highly recommended.

Clone Voice

Name

Description (Optional)

Language

Add Sample (Optional)

Add an audio sample to get started immediately. You can add more samples later.

Audio File

Supported formats: WAV, MP3, M4A. Maximum duration: 0:30. Click "Transcribe" to automatically extract text from the audio.
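
If you prepare samples in bulk, a quick pre-check against the limits above can save failed uploads. The sketch below is a hypothetical helper, not part of VoiceBox; it assumes you already know each clip's duration in seconds.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a"}
MAX_DURATION_SECONDS = 30.0

def check_sample(filename, duration_seconds):
    """Return a list of problems with a sample; an empty list means it passes."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'no extension'}")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append(f"too long: {duration_seconds:.1f}s (maximum is 30s)")
    return problems

print(check_sample("greeting.wav", 12.5))
print(check_sample("interview.ogg", 45.0))
```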
