Irodori-TTS-500M-v3 Demo

Flow-matching-based Japanese TTS model (500M parameters) that generates speech from text using rectified flow over DACVAE latents.
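Rectified-flow generation amounts to integrating a learned velocity field from Gaussian noise to a clean latent. The following is a minimal sketch under stated assumptions, not the model's actual sampler. It assumes t=0 is noise and t=1 is data, uses a plain Euler solver, and substitutes a hypothetical toy field `toy_velocity` for the real network. It is also an assumption that the Num Steps setting corresponds to `num_steps` here.

```python
import numpy as np

def sample_rectified_flow(velocity_fn, x0, num_steps):
    """Euler-integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 (data).

    Assumed convention: t=0 is Gaussian noise, t=1 is the clean latent.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # one explicit Euler step
    return x


# Hypothetical toy field: the exact rectified-flow velocity toward one fixed target,
# (target - x) / (1 - t). A real model would predict the velocity from text and speaker.
target = np.array([0.5, -1.0, 2.0])

def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(3)
latent = sample_rectified_flow(toy_velocity, noise, num_steps=8)
print(np.allclose(latent, target))  # True: Euler follows a straight path exactly
```

Because rectified flow aims for nearly straight noise-to-data paths, a coarse solver can already land close to the target. This is why few-step sampling is viable. With a learned field, paths are only approximately straight, so more steps generally trade speed for accuracy.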

Reference audio: Optional. Upload a clip to condition the speaker voice, or leave it blank to generate without a speaker reference.
Duration: By default, v3 predicts the output duration automatically. Use Duration Scale for small adjustments or Seconds for exact manual control.
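The precedence between the two duration controls can be sketched as a small helper. This is a sketch under assumptions: an explicit Seconds value overrides the prediction, and Duration Scale only multiplies the automatic estimate. The names `resolve_duration`, `seconds_to_frames`, and `frames_per_second` are hypothetical, and the DACVAE latent frame rate is not stated here, so it is left as a parameter.

```python
from typing import Optional

def resolve_duration(predicted_seconds: float,
                     duration_scale: float = 1.0,
                     seconds: Optional[float] = None) -> float:
    """Return the target duration in seconds (assumed precedence: Seconds > scaled prediction)."""
    if seconds is not None:
        # Manual mode: an explicit Seconds value is used as-is
        if seconds <= 0:
            raise ValueError("seconds must be positive")
        return seconds
    # Auto mode: scale the model's predicted duration (slider range 0.5 to 1.5)
    if not 0.5 <= duration_scale <= 1.5:
        raise ValueError("duration_scale must be in [0.5, 1.5]")
    return predicted_seconds * duration_scale

def seconds_to_frames(duration_s: float, frames_per_second: float) -> int:
    """Convert a duration to a latent length; frames_per_second is a hypothetical parameter."""
    return max(1, round(duration_s * frames_per_second))

print(resolve_duration(4.0))                      # 4.0 (auto)
print(resolve_duration(4.0, duration_scale=1.25)) # 5.0 (auto, stretched)
print(resolve_duration(4.0, 1.25, seconds=3.0))   # 3.0 (manual override)
```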

Reference Audio Upload: optional; leave blank for no-reference mode.

Num Steps: 1 to 120

Num Candidates: 1 to 32

Seed: leave blank for a random seed.
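Num Candidates and Seed interact naturally if every candidate starts from its own Gaussian noise drawn from one seeded generator. The following is a sketch under assumptions, not the demo's code. The helper `initial_noise` and the `latent_shape` argument are hypothetical. It returns the seed that was actually used, so a run with a blank seed can still be reproduced.

```python
import random
from typing import Optional, Tuple

import numpy as np

MAX_CANDIDATES = 32  # upper bound of the Num Candidates slider

def initial_noise(num_candidates: int,
                  latent_shape: Tuple[int, ...],
                  seed: Optional[int] = None) -> Tuple[int, np.ndarray]:
    """Draw one Gaussian starting latent per candidate from a single seed."""
    if not 1 <= num_candidates <= MAX_CANDIDATES:
        raise ValueError(f"num_candidates must be in [1, {MAX_CANDIDATES}]")
    if seed is None:
        # Blank seed: pick a fresh random seed, but return it for reproducibility
        seed = random.SystemRandom().randrange(2**32)
    rng = np.random.default_rng(seed)
    # Shape (num_candidates, *latent_shape): candidates differ, yet the batch is reproducible
    noise = rng.standard_normal((num_candidates, *latent_shape))
    return seed, noise

used_seed, batch = initial_noise(4, (64, 32), seed=1234)
print(used_seed, batch.shape)  # 1234 (4, 64, 32)
```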

Seconds: leave blank for automatic duration.

Duration Scale: 0.5 to 1.5

CFG Guidance Mode

CFG Scale Text: 0 to 10

CFG Scale Speaker: 0 to 10
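The page does not list the options behind CFG Guidance Mode. The following sketches one common way to apply separate text and speaker classifier-free guidance scales to a velocity prediction, and it may differ from what any given mode does. It assumes three forward passes: unconditional (`v_uncond`), text-only (`v_text`), and text plus reference speaker (`v_full`). All names are hypothetical.

```python
import numpy as np

def guided_velocity(v_uncond, v_text, v_full, cfg_scale_text, cfg_scale_speaker):
    """Combine three predictions with separate text and speaker guidance scales.

    v_uncond: prediction with neither text nor speaker condition (assumed)
    v_text:   prediction with text only
    v_full:   prediction with text and reference speaker
    """
    for name, s in (("cfg_scale_text", cfg_scale_text),
                    ("cfg_scale_speaker", cfg_scale_speaker)):
        if not 0.0 <= s <= 10.0:
            raise ValueError(f"{name} must be in [0, 10]")
    text_term = cfg_scale_text * (v_text - v_uncond)      # push toward the text condition
    speaker_term = cfg_scale_speaker * (v_full - v_text)  # push toward the reference voice
    return v_uncond + text_term + speaker_term
```

In this nested form, both scales at 1.0 reproduce the fully conditioned prediction `v_full`, and larger values extrapolate further along each direction. In no-reference mode `v_full` equals `v_text`, so the speaker term vanishes and CFG Scale Speaker has no effect.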