Sets the target language for voice generation. The model will synthesize speech in this language regardless of the input text language.
Instruct (Preset only)
Text instruction for guiding preset voice style. Examples: "Speak slowly and warmly", "Read as a news anchor". Only available with preset voices.
Temperature
Controls randomness in token sampling. Lower values (0.1-0.3) produce more deterministic, consistent output. Higher values (0.7-1.0) increase variation and expressiveness. Default: 0.7.
0.7
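As a rough illustration of what the temperature setting does, the sketch below divides a set of raw token scores by the temperature before converting them to probabilities. The function name and the example scores are hypothetical, and the model's actual sampling code may differ.

```python
import math

def softmax_with_temperature(logits, temperature=0.7):
    """Convert raw scores to probabilities after scaling by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.2)   # sharply peaked on the top token
high = softmax_with_temperature(logits, 1.0)  # flatter, so more variation
```

At a low temperature almost all of the probability lands on the top-scoring token, which is why output becomes more consistent.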
Top-K
Limits the number of candidate tokens at each generation step to the K most probable. Lower values (10-30) produce more conservative, predictable output; higher values (50-100) allow more diversity. Default: 50.
50
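A minimal sketch of Top-K filtering, assuming the candidates arrive as a plain list of probabilities: everything outside the K most probable entries is zeroed and the survivors are renormalized. The function name and example values are illustrative only.

```python
def top_k_filter(probs, k=50):
    """Keep the k most probable tokens, zero the rest, and renormalize."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(ranked[:k])
    total = sum(probs[i] for i in kept)
    return [p / total if i in kept else 0.0 for i, p in enumerate(probs)]

# Hypothetical distribution over four candidate tokens.
probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_filter(probs, k=2)  # only the two most likely tokens survive
```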
Top-P (Nucleus Sampling)
Cumulative probability threshold for token selection. The model considers the smallest set of tokens whose combined probability exceeds this value. Lower values give more focused output; higher values give more diverse output. Default: 0.9.
0.9
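The nucleus selection described above can be sketched as follows. This version walks the tokens from most to least probable and stops as soon as the running total reaches the threshold. The function name and example values are hypothetical, and a real implementation may treat the boundary case slightly differently.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest high-probability set whose total reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in ranked:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [q / cumulative if i in kept else 0.0 for i, q in enumerate(probs)]

# Hypothetical distribution over four candidate tokens.
probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_p_filter(probs, p=0.9)  # keeps three tokens (0.5 + 0.3 + 0.15)
```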
Repetition Penalty
Penalizes tokens that have already appeared, reducing loops and stuttering. 1.0 = no penalty. Values above 1.2 can reduce naturalness. Default: 1.0.
1.0
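One common way to apply a repetition penalty is to shrink the score of every token that has already been generated: positive scores are divided by the penalty and negative scores are multiplied by it. The sketch below assumes that formulation; it is not confirmed to be the one this model uses, and the names are illustrative.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.0):
    """Lower the scores of tokens that have already appeared."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    adjusted = list(logits)
    for i in set(seen_token_ids):
        if adjusted[i] > 0:
            adjusted[i] /= penalty  # make a likely token less likely
        else:
            adjusted[i] *= penalty  # push an unlikely token further down
    return adjusted

# Hypothetical scores; tokens 0 and 1 have already been generated.
logits = [2.0, -1.0, 0.5]
unchanged = apply_repetition_penalty(logits, [0, 1], penalty=1.0)
penalized = apply_repetition_penalty(logits, [0, 1], penalty=2.0)
```

With a penalty of 1.0 the scores come back unchanged, which matches the "no penalty" default.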
Seed
Fixed random seed for reproducible generations. Leave empty for random output each time. Same seed + same parameters = same audio.
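To show why a fixed seed gives repeatable results, the sketch below draws tokens from a hypothetical distribution with a seeded random number generator. The function name is illustrative; the real model samples audio tokens internally, but the principle is the same.

```python
import random

def sample_tokens(probs, count, seed=None):
    """Draw `count` token indices; seed=None gives a different draw each run."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=count)

# Hypothetical distribution over three candidate tokens.
probs = [0.5, 0.3, 0.2]
first = sample_tokens(probs, 10, seed=42)
second = sample_tokens(probs, 10, seed=42)  # same seed, same sequence
```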
Voice Samples
Audio clips used to clone a voice. More samples with clear speech and low background noise produce better results. Each sample should be under 30 seconds. A transcript matching the spoken words is highly recommended.
Clone Voice
Add an audio sample to get started immediately. You can add more samples later.
Drop audio or click
Supported formats: WAV, MP3, M4A. Maximum duration: 0:30. Click "Transcribe" to automatically extract text from the audio.
The transcript should match exactly what is spoken in the audio. Required if you add a sample.
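The sample requirements above (supported format, 30-second limit, transcript present) could be checked before upload with something like the sketch below. All names here are hypothetical and not part of the application. Duration is only measured for WAV files, because the Python standard library cannot read MP3 or M4A lengths.

```python
import wave
from pathlib import Path

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a"}
MAX_SECONDS = 30.0

def wav_duration_seconds(path):
    """Return the length of a WAV file in seconds."""
    with wave.open(str(path), "rb") as clip:
        return clip.getnframes() / clip.getframerate()

def check_sample(path, transcript):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    path = Path(path)
    ext = path.suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    elif ext == ".wav" and wav_duration_seconds(path) > MAX_SECONDS:
        problems.append("clip is longer than 30 seconds")
    if not transcript.strip():
        problems.append("transcript is empty")
    return problems
```

For example, `check_sample("clip.ogg", "")` reports both an unsupported format and an empty transcript.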