Three TTS providers, one alignment stage, one optional music track. Voice is where your script becomes audio, alignment gives the shot planner word-level timestamps for that audio, and music sits underneath the whole thing at the volume you set.
Three providers, no custom fine-tunes. ElevenLabs (via the GenAIPro wrapper), Inworld, and MiniMax. There are no in-house trained voices; the catalog you see is each provider's live catalog at the moment you open the panel. Counts drift as providers add and remove voices.
ElevenLabs. Strongest for emotional range and expressive delivery; pick it when you have a single narrator and English content where prosody matters. Per-voice settings expose stability, similarity boost, and style, and the defaults are rarely the right values for your specific narrator. Spend ten minutes tuning them before you generate a full chapter and you'll save a regen round. One Quick-mode caveat: if a Quick run lands on the ElevenLabs default with no voice ID picked, voiceover silently falls back to MiniMax with a sensible default voice, so the hands-off run can't be derailed by a provider hiccup. In Director mode you pick the voice yourself, so this doesn't apply.
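The Quick-mode fallback described above can be pictured as a small resolution step that runs before generation. This is a minimal sketch, not the product's actual code: the names VoiceChoice, resolve_voice, and MINIMAX_DEFAULT_VOICE are hypothetical, and the real fallback voice is not named in this guide.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholder; the actual default MiniMax voice is not documented here.
MINIMAX_DEFAULT_VOICE = "minimax-default"


@dataclass(frozen=True)
class VoiceChoice:
    provider: str             # "elevenlabs", "minimax", or "inworld"
    voice_id: Optional[str]   # None means no voice ID was picked


def resolve_voice(choice: VoiceChoice, mode: str) -> VoiceChoice:
    """Return the voice that would be used for generation.

    Only a Quick run that sits on ElevenLabs with no voice ID is rerouted;
    every other combination passes through unchanged.
    """
    if mode == "quick" and choice.provider == "elevenlabs" and choice.voice_id is None:
        return VoiceChoice("minimax", MINIMAX_DEFAULT_VOICE)
    return choice
```

A Director-mode choice is returned as-is, which matches the note that the fallback only concerns Quick runs.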
MiniMax. The widest model selection (speech-2.6-hd, speech-2.6-turbo, speech-02-hd, speech-02-turbo, speech-01-hd, speech-01-turbo). The hd variants are slower but cleaner; the turbo variants run at about half the latency with a noticeable quality loss. Multilingual support is the standout: if you're shipping in anything other than English, MiniMax is the default choice.
Inworld. Newer in the lineup, uses the inworld-tts-1.5-max model. Strength is real-time / streaming use cases; tradeoff is fewer voices and less prosody than ElevenLabs. Good fit for short-form where speed matters more than nuance.
Multi-voice mode. VoiceMode.MULTI lets you assign a different voice ID per character. The project-level Character model carries an optional voiceId; voiceover uses it when generating that character's lines. The narrator stays consistent across the project (one voice for all narration), and named characters get their own voices.
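As a rough illustration of how a per-line voice could be resolved in multi-voice mode, here is a minimal sketch. VoiceMode.MULTI and the optional voiceId field come from this guide; VoiceMode.SINGLE, voice_for_line, and the fallback to the narrator voice when a character has no voiceId are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class VoiceMode(Enum):
    SINGLE = "single"  # assumed counterpart; only MULTI is named in this guide
    MULTI = "multi"


@dataclass
class Character:
    name: str
    voiceId: Optional[str] = None  # optional, mirroring the project-level Character model


def voice_for_line(speaker: Optional[str], mode: VoiceMode, narrator_voice: str,
                   characters: Dict[str, Character]) -> str:
    """Pick the voice ID for one script line; speaker=None means narration."""
    if mode is VoiceMode.MULTI and speaker is not None:
        character = characters.get(speaker)
        if character is not None and character.voiceId:
            return character.voiceId
    # Narration, single-voice mode, or a character without a voiceId (assumed fallback).
    return narrator_voice


cast = {"Mara": Character("Mara", "voice-mara"), "Tom": Character("Tom")}
print(voice_for_line("Mara", VoiceMode.MULTI, "voice-narrator", cast))
print(voice_for_line(None, VoiceMode.MULTI, "voice-narrator", cast))
```

Keeping the narrator voice as a single argument reflects the rule that narration uses one voice across the whole project.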
Per-chapter regenerate. Voiceover is generated one chapter at a time. If chapter 3 came out flat, you can regenerate only chapter 3 without touching the rest. Each regen is a fresh vendor call — charged accordingly.
Alignment. Once all chapters are voiced and merged, AssemblyAI takes the merged audio and the script text and produces word-level timestamps. The next stage needs these; they are how scenes get cut to actual speech rather than to estimated durations. Alignment is an AssemblyAI pass billed at vendor cost: a small line item, but not free.
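To make "cut to actual speech" concrete, here is a minimal sketch of turning word-level timestamps into scene boundaries. It does not call AssemblyAI; the Word shape, the millisecond units, and the scene_bounds helper are assumptions for illustration, and the sample timings are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    text: str
    start_ms: int  # when the word starts in the merged audio
    end_ms: int    # when the word ends in the merged audio


def scene_bounds(words: List[Word], first_word: int, last_word: int) -> Tuple[int, int]:
    """Return (start_ms, end_ms) for a scene spanning words[first_word..last_word], inclusive."""
    if not 0 <= first_word <= last_word < len(words):
        raise ValueError("word range out of bounds")
    return words[first_word].start_ms, words[last_word].end_ms


# Illustrative timings only.
words = [
    Word("The", 0, 180),
    Word("storm", 200, 640),
    Word("broke", 660, 1100),
    Word("at", 1150, 1260),
    Word("dawn", 1280, 1800),
]
print(scene_bounds(words, 1, 2))  # (200, 1100)
```

The scene's duration comes from when the words were actually spoken, so a slow or fast read shifts the cut points automatically instead of relying on a words-per-minute estimate.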
Music (optional). A Suno-generated background score, mixed under the voiceover at a volume you set. The music endpoint produces 2 takes per run; pick the one that fits. Music doesn't gate the next stage — you can ship without it. If the LLM-suggested style ('ambient', 'cinematic', 'lofi-hip-hop') isn't quite right, type a custom style in the input below.
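Mixing music "under" the voiceover amounts to scaling the music by the chosen volume before summing it with the voice. The sketch below shows that idea on raw float samples in the range -1.0 to 1.0; mix_under is a hypothetical helper, and the real mixer's handling of track length and clipping is not described in this guide.

```python
from typing import List


def mix_under(voice: List[float], music: List[float], music_volume: float) -> List[float]:
    """Mix music under voice, sample by sample, for the length of the voice track."""
    if not 0.0 <= music_volume <= 1.0:
        raise ValueError("music_volume must be between 0.0 and 1.0")
    mixed = []
    for i, v in enumerate(voice):
        m = music[i] if i < len(music) else 0.0   # silence once the music runs out (assumption)
        s = v + music_volume * m                  # music is attenuated, voice is untouched
        mixed.append(max(-1.0, min(1.0, s)))      # clamp so the sum stays in the valid range
    return mixed


print(mix_under([0.5, -0.5, 1.0], [0.5, 0.5], 0.25))  # [0.625, -0.375, 1.0]
```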
When voice fails. TTS occasionally returns garbled audio (vendor model hiccup, unusual punctuation, very long sentences). The Voiceover card flags the chapter with a Failed badge. Click Regenerate; the credit for the failed call is refunded.
Picking guide, short version. English single narrator with emotional range → ElevenLabs. Multilingual or fast iteration → MiniMax (hd for quality, turbo when latency matters more). Real-time / streaming → Inworld. More than one character → multi-voice mode (VoiceMode.MULTI) with per-character voice IDs.
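The picking guide can also be read as a small decision function. This is only a sketch of the rules above: pick_voice_setup and its parameters are hypothetical, and the precedence between rules (real-time first, then multilingual or fast iteration, then ElevenLabs) is an assumption, since the guide does not rank them.

```python
from typing import Tuple


def pick_voice_setup(language: str = "en", character_count: int = 1,
                     realtime: bool = False, fast_iteration: bool = False) -> Tuple[str, str]:
    """Return (provider, voice_mode) following the short picking guide."""
    # More than one character means per-character voice IDs, whatever the provider.
    mode = "multi" if character_count > 1 else "single"
    if realtime:
        provider = "inworld"       # real-time / streaming
    elif language != "en" or fast_iteration:
        provider = "minimax"       # multilingual or fast iteration
    else:
        provider = "elevenlabs"    # English narration with emotional range
    return provider, mode


print(pick_voice_setup())                               # ('elevenlabs', 'single')
print(pick_voice_setup(language="es"))                  # ('minimax', 'single')
print(pick_voice_setup(realtime=True))                  # ('inworld', 'single')
print(pick_voice_setup(character_count=3))              # ('elevenlabs', 'multi')
```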
Voice quality variance across providers is smaller than the variance from prosody settings within a single provider. Tune before you generate a full chapter.