Model | GitHub
Flow-matching-based Japanese TTS model (500M parameters). It generates speech from text by running rectified flow over DACVAE latents.
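As a rough illustration of what sampling with rectified flow involves, the sketch below integrates a velocity field from noise (t = 0) to a latent (t = 1) with plain Euler steps. It is not this model's code: `euler_sample` and `make_toy_velocity` are hypothetical names, the step count is arbitrary, and the toy velocity field stands in for the learned network, which in the real model would be conditioned on the text and the optional reference audio, with the resulting latent decoded to a waveform afterwards.

```python
import random


def euler_sample(velocity, x0, num_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Toy stand-in for a learned velocity network (assumption, not the model).

    On a straight noise-to-target path, the velocity at point x and time t
    is (target - x) / (1 - t), so Euler integration lands on the target.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
target = [0.5, -1.0, 2.0]                       # pretend "latent" to reach
noise = [rng.gauss(0.0, 1.0) for _ in target]   # starting point at t=0
latent = euler_sample(make_toy_velocity(target), noise, num_steps=8)
```

Because the toy path is exactly straight, a handful of Euler steps is enough to reach the target here; a learned velocity field is only approximately straight, so the number of steps becomes a quality and speed trade-off.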
- Reference audio: Optional. Upload a clip to condition the speaker's voice; leave blank for unconditional generation.
- Duration: By default, v3 predicts the output duration automatically. Use Duration Scale for small adjustments, or Seconds to set an exact duration manually.
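One plausible way the duration controls could combine is sketched below. The function name, parameter names, and precedence rule (an explicit Seconds value overriding the scaled prediction) are assumptions made for illustration, not the demo's actual logic.

```python
def resolve_duration(predicted_seconds, duration_scale=1.0, seconds=None):
    """Pick the output duration in seconds (hypothetical helper).

    Assumed behavior: an explicit `seconds` value wins; otherwise the
    model's predicted duration is multiplied by `duration_scale`.
    """
    if seconds is not None:
        if seconds <= 0:
            raise ValueError("seconds must be positive")
        return float(seconds)
    if duration_scale <= 0:
        raise ValueError("duration_scale must be positive")
    return predicted_seconds * duration_scale


auto = resolve_duration(4.0)                        # use the prediction as-is
slower = resolve_duration(4.0, duration_scale=1.1)  # small adjustment
exact = resolve_duration(4.0, seconds=6.0)          # manual override
```

Under these assumptions, Duration Scale nudges the predicted length proportionally, while Seconds ignores the prediction entirely.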