Voice & audio

Neural voices, per-account cloning, word-level alignment, optional music.

Stage 4 of 5 · 3 min read · Updated Jun 2026

NOTE: Cloning a voice needs your explicit consent to use the sample, and we don't train models on it. A clone is a voice scoped to your account.

One voice provider, one alignment stage, one optional music track. Voice is where your script becomes audio, alignment makes that audio legible to the shot planner, and music sits underneath the whole thing at the volume you set.

One voice engine

All narration runs on our neural TTS. The voice selector shows the live catalog as of the moment you open the panel: searchable, with a play-preview on every voice, and varied in gender and tonal character. We don't publish a fixed voice count, because the catalog shifts as voices are added.

Clone your own

You're not limited to the library. Three routes: upload a short audio sample (5–15 seconds; WAV, MP3, or WebM), record one directly in the browser, or describe the voice you want in words and have one designed. Cloning requires your explicit consent to use the sample. You can keep up to 5 custom voices on your account (delete one to make room for another), and cloning is free within that cap. Custom voices appear in the same selector as the library and work in every project on your account. A cloned voice is scoped to your account, and we don't train models on your samples.
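The limits above can be read as a pre-flight check. What follows is a minimal sketch, not our actual upload code: `CloneRequest`, `validate_clone_request`, and the constant names are hypothetical, and only the numbers (5–15 seconds, three formats, a cap of 5) come from the text.

```python
from dataclasses import dataclass

# Limits taken from the text above; every name here is hypothetical.
MIN_SECONDS = 5.0
MAX_SECONDS = 15.0
ALLOWED_EXTENSIONS = {"wav", "mp3", "webm"}
MAX_CUSTOM_VOICES = 5


@dataclass
class CloneRequest:
    filename: str
    duration_seconds: float
    consent_given: bool


def validate_clone_request(req: CloneRequest, existing_custom_voices: int) -> list[str]:
    """Return a list of problems; an empty list means the request passes."""
    problems = []
    if not req.consent_given:
        problems.append("explicit consent is required")
    # Take the text after the last dot as the extension, if there is one.
    ext = req.filename.rsplit(".", 1)[-1].lower() if "." in req.filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        problems.append("unsupported format")
    if not MIN_SECONDS <= req.duration_seconds <= MAX_SECONDS:
        problems.append("sample must be 5-15 seconds long")
    if existing_custom_voices >= MAX_CUSTOM_VOICES:
        problems.append("custom voice cap reached; delete one first")
    return problems
```

A 12-second WAV with consent, on an account holding two custom voices, comes back with no problems. A 3-second FLAC without consent, on an account already at the cap, fails all four checks.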

Channel defaults

Each channel can pin a default voice — stock or cloned. Every new project under that channel opens the Voiceover stage with it pre-selected, so a recurring series keeps a consistent sound without re-picking. Override per project any time.

Multi-voice mode

Assign a different voice per character: the narrator stays consistent across the project (one voice for all narration), and named characters get their own voices, cloned or stock. The per-character voice lives on the project's Character record.
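One way to picture how channel defaults, per-project overrides, and per-character voices fit together is a small lookup. This is a sketch under assumptions: the `Project` and `Character` shapes, the field names, and the fallback for a character with no voice of its own are all hypothetical and not taken from our data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Character:
    name: str
    voice_id: Optional[str] = None  # the per-character voice lives on the Character record


@dataclass
class Project:
    channel_default_voice: Optional[str] = None  # pinned on the channel
    narrator_override: Optional[str] = None      # per-project override, if any
    characters: dict[str, Character] = field(default_factory=dict)


def resolve_voice(project: Project, speaker: Optional[str] = None) -> Optional[str]:
    """Pick the voice for a line; speaker=None means narration."""
    # The project override wins over the channel default.
    narrator = project.narrator_override or project.channel_default_voice
    if speaker is None:
        return narrator
    character = project.characters.get(speaker)
    if character is not None and character.voice_id:
        return character.voice_id
    # Assumption: a character without its own voice falls back to the narrator.
    return narrator
```

With a channel default of `"stock-1"` and a character whose `voice_id` is `"clone-7"`, narration resolves to `"stock-1"` and that character's lines resolve to `"clone-7"`. Setting `narrator_override` changes narration only.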

Per-chapter regenerate

Voiceover is generated one chapter at a time. If chapter 3 came out flat, regenerate only chapter 3 without touching the rest. Each regen is a fresh vendor call — charged accordingly.

Alignment

Once all chapters are voiced and merged, our alignment engine takes the merged audio and the script text and produces word-level timestamps. The next stage depends on them: scenes are cut to the actual speech rather than to estimated durations. Alignment is a separate pass billed at vendor cost. It's a small line item, but it is charged.
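To make "cut to actual speech" concrete, here is a minimal sketch of turning word-level timestamps into scene boundaries. The `Word` shape and the `scene_spans` helper are hypothetical. The real alignment output and the shot planner's logic may look quite different.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds into the merged audio
    end: float


def scene_spans(words: list[Word], scene_word_counts: list[int]) -> list[tuple[float, float]]:
    """Map each scene, given as a count of consecutive script words, to a (start, end) span."""
    if any(count <= 0 for count in scene_word_counts):
        raise ValueError("every scene must contain at least one word")
    if sum(scene_word_counts) != len(words):
        raise ValueError("scene word counts must cover every aligned word exactly once")
    spans = []
    i = 0
    for count in scene_word_counts:
        chunk = words[i:i + count]
        # A scene runs from the start of its first word to the end of its last.
        spans.append((chunk[0].start, chunk[-1].end))
        i += count
    return spans
```

Four aligned words split into two scenes of two words each give two spans whose edges land on real word boundaries. A pause between scenes stays outside both spans instead of being guessed at from an estimated reading speed.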

Music (optional)

A generated background score, mixed under the voiceover at a volume you set. Each run of the music endpoint produces 2 takes; pick the one that fits. Music doesn't gate the next stage, so you can ship without it. If the LLM-suggested style ('ambient', 'cinematic', 'lofi-hip-hop') isn't quite right, type a custom style into the style input instead.

When voice fails

TTS occasionally returns garbled audio (vendor model hiccup, unusual punctuation, very long sentences). The Voiceover card flags the chapter with a Failed badge. Click Regenerate; the credit for the failed call is refunded.

Picking guide, short version

Your own voice on your own channel → record 15 seconds and clone it. A character sound you can describe but can't record → voice design, in words. Just need a solid narrator → preview the library and pin the winner as the channel default.

Spend the ten minutes previewing before you generate a full chapter. A voice that's 90% right across 30 minutes of finished video is a regen bill you could have previewed away.
