ENGINEERING · 9 min · TUE · MAY 12 · 2026 · By The VidFlow Team

How our 8-stage AI pipeline works

Ten DB states, eight working stages, five UI tiles — the actual handoffs between them, with the vendors that do the work.


VidFlow's pipeline is shaped by three different views of the same thing. The Prisma schema has ten ProjectStatus values. Eight of them do work — the other two (DRAFT, COMPLETED) are bookends. The workspace UI groups those eight into five visible tiles. This post unpacks what each stage actually does, what calls it, and where the handoffs live.
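As a rough sketch of the first two views, here are the ten statuses and the eight that do work. The status names come from this post; the ordering, the constant names, and the `isWorkingStatus` helper are illustrative assumptions, not a copy of the Prisma schema.

```typescript
// The ten status names come from the post; their order here is an assumption.
const PROJECT_STATUSES = [
  "DRAFT",
  "IDEATION",
  "SCRIPTING",
  "VOICEOVER",
  "ALIGNMENT",
  "SHOT_PLANNING",
  "GENERATION",
  "REVIEW",
  "RENDERING",
  "COMPLETED",
] as const;

type ProjectStatus = (typeof PROJECT_STATUSES)[number];

// DRAFT and COMPLETED are the bookends; every other status does work.
const BOOKENDS: ReadonlySet<ProjectStatus> = new Set<ProjectStatus>([
  "DRAFT",
  "COMPLETED",
]);

// Hypothetical helper: true for the eight statuses that do work.
function isWorkingStatus(status: ProjectStatus): boolean {
  return !BOOKENDS.has(status);
}

const WORKING_STATUSES = PROJECT_STATUSES.filter(isWorkingStatus);
```

The five UI tiles are a grouping on top of `WORKING_STATUSES`; the exact grouping is not spelled out here, so the sketch leaves it out.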

Stage 1 — Ideation (DB: IDEATION). The project lands here with a topic and a channel. We pull the channel's recent uploads, score them against an outlier model, and ask an OpenRouter-routed LLM to propose a slate of titles. The creator picks one. The picked title becomes the seed for everything downstream. Nothing in the pipeline runs without it.

Stage 2 — Script (DB: SCRIPTING). This is the longest editing stage. The LLM writes an outline, breaks the outline into chapters, drafts each chapter, and writes a hook in a separate sub-step you can reroll independently. The hook lives in its own panel because hooks are the highest-leverage prose in the video — they get rerolled more than anything else.

Stage 3 — Visual Bible. Not a DB status — it shares SCRIPTING with the outline phase, but it's its own data object. Once the script is approved, the LLM extracts named characters and locations. KIE image models generate a portrait for each character (and a combined character sheet — one image with every character at consistent scale), plus a reference shot for each location. The Visual Bible is the thing that makes every later scene look like the same video.

Stage 4 — Voiceover (DB: VOICEOVER). Three TTS providers are wired: ElevenLabs (via the GenAIPro wrapper), Inworld, MiniMax. The creator picks one voice per narrator, and per character if they're using MultiVoice. Output is one merged audio file per chapter, then a single merged audio file for the project.

Stage 5 — Alignment (DB: ALIGNMENT). AssemblyAI takes the merged audio and the script text and produces word-level timestamps. We need this for the next stage — it's how scenes get cut to actual speech, not estimated durations.
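A minimal sketch of why word-level timestamps matter: once every word has a start and an end, a scene's duration falls out of the first and last word it covers. The `AlignedWord` and `WordSpan` shapes and the `sceneTiming` helper are assumptions for illustration, not AssemblyAI's response format or our internal types.

```typescript
// Hypothetical shape for one aligned word; field names are assumptions.
interface AlignedWord {
  text: string;
  startMs: number;
  endMs: number;
}

// Hypothetical: a scene covers an inclusive range of word indices.
interface WordSpan {
  firstWord: number;
  lastWord: number;
}

// Derives a scene's start, end, and duration from real speech timing.
function sceneTiming(words: AlignedWord[], span: WordSpan) {
  const outOfRange =
    span.firstWord < 0 ||
    span.lastWord >= words.length ||
    span.firstWord > span.lastWord;
  if (outOfRange) {
    throw new RangeError("span is outside the aligned words");
  }
  const startMs = words[span.firstWord].startMs;
  const endMs = words[span.lastWord].endMs;
  return { startMs, endMs, durationMs: endMs - startMs };
}
```

Because the duration is read from the audio rather than estimated from word count, a slow or fast narrator does not throw the cut points off.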

Stage 6 — Shot planning (DB: SHOT_PLANNING). The LLM reads the timestamped script and decomposes it into semantic beats — HOOK, CLAIM, EXAMPLE, TURN, RESOLUTION — and emits a Shot record for each visual moment. Each shot gets an assigned location, character references, motion intent, and a draft prompt spec.
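To make the Shot record concrete, here is one way it could be typed. The five beat names are from this post; every field name and the `countShotsByBeat` helper are assumptions, not the actual schema.

```typescript
// Beat names are from the post; everything else below is an assumption.
type Beat = "HOOK" | "CLAIM" | "EXAMPLE" | "TURN" | "RESOLUTION";

interface Shot {
  id: string;
  beat: Beat;
  locationId: string;      // the assigned location from the Visual Bible
  characterIds: string[];  // character references used in this shot
  motionIntent: string;    // e.g. "slow push-in"
  promptSpec: string;      // draft prompt, refined before generation
}

// Hypothetical helper: how many shots each beat received.
function countShotsByBeat(shots: Shot[]): Record<Beat, number> {
  const counts: Record<Beat, number> = {
    HOOK: 0,
    CLAIM: 0,
    EXAMPLE: 0,
    TURN: 0,
    RESOLUTION: 0,
  };
  for (const shot of shots) {
    counts[shot.beat] += 1;
  }
  return counts;
}
```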

Stage 7 — Generation (DB: GENERATION). This is where the bill grows. Each shot generates an image first (KIE — nano-banana-2 or gpt-image-1.5), then a video clip from that image (KIE — Kling 2.6 or Veo3). The image-before-video order is a hard precondition; the new BullMQ worker enforces it. Generation runs in parallel across shots, bounded by your credit balance.
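The image-before-video rule can be sketched as a small state check. This is an illustration only: the `ShotAssets` shape, the helper names, and the flat per-shot credit cost are assumptions, and the real worker's checks may look different.

```typescript
// Hypothetical per-shot asset state; field names are assumptions.
interface ShotAssets {
  shotId: string;
  imageUrl: string | null;
  videoUrl: string | null;
}

type NextStep = "generate-image" | "generate-video" | "done";

// Image first, then video: a video job is only eligible once an image exists.
function nextGenerationStep(shot: ShotAssets): NextStep {
  if (shot.imageUrl === null) return "generate-image";
  if (shot.videoUrl === null) return "generate-video";
  return "done";
}

// Hypothetical credit bound: which unfinished shots can start now,
// assuming a flat cost per shot (real pricing varies by model).
function affordableShots(
  shots: ShotAssets[],
  balance: number,
  costPerShot: number,
): ShotAssets[] {
  if (costPerShot <= 0) throw new RangeError("costPerShot must be positive");
  const pending = shots.filter((s) => nextGenerationStep(s) !== "done");
  const limit = Math.max(0, Math.floor(balance / costPerShot));
  return pending.slice(0, limit);
}
```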

Stage 8 — Review (DB: REVIEW). Every shot has a thumbnail and a clip; you flip through and approve or kick back individual shots for regeneration. Nothing renders until you approve.
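A minimal sketch of the review gate, assuming rendering is unlocked only when every shot is approved. The decision names and both helpers are hypothetical.

```typescript
// Decision names are assumptions; the post only says shots are
// approved or kicked back for regeneration.
type ReviewDecision = "PENDING" | "APPROVED" | "REGENERATE";

interface ReviewedShot {
  shotId: string;
  decision: ReviewDecision;
}

// Assumption: rendering is unlocked only when every shot has been approved.
function canRender(shots: ReviewedShot[]): boolean {
  return shots.length > 0 && shots.every((s) => s.decision === "APPROVED");
}

// IDs of the shots the creator kicked back for another generation pass.
function shotsToRegenerate(shots: ReviewedShot[]): string[] {
  return shots
    .filter((s) => s.decision === "REGENERATE")
    .map((s) => s.shotId);
}
```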

Stage 9 — Rendering (DB: RENDERING). Local FFmpeg is the default render path. A Rendi Cloud client is built but not wired in by default; that path is reserved for jobs too heavy for the local renderer.

Stage 10 — Completed (DB: COMPLETED). This is the closing bookend rather than a working stage. A project flips to COMPLETED once any Publication row reaches PUBLISHED. Publication is currently YouTube-only.
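The completion rule is a one-liner in spirit. In this sketch only `PUBLISHED` and the YouTube-only constraint come from this post; the other status names and the `Publication` shape are assumptions.

```typescript
// Only PUBLISHED is named in the post; the other values are assumptions.
type PublicationStatus = "PENDING" | "PUBLISHED" | "FAILED";

interface Publication {
  platform: "YOUTUBE"; // publication is currently YouTube-only
  status: PublicationStatus;
}

// A project is complete once any one publication has gone live.
function shouldMarkCompleted(publications: Publication[]): boolean {
  return publications.some((p) => p.status === "PUBLISHED");
}
```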

Where the worker fits. The Phase 1.5 BullMQ worker (`npm run worker`) is the new path for per-shot generation. Behind two feature flags — `JOBS_ORCHESTRATOR_NEW=1` and `JOBS_GENERATE_SHOT_NEW=1` — the orchestrator enqueues to a Redis-backed `generate-shot` queue and awaits per-shot results. Vendor webhooks land at `/api/{vendor}/callback`. The old synchronous path is still the default while we're proving the queue out.
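Here is a hedged sketch of the flag check and the callback route. The flag names and the route pattern are from this post; the helper names, the "both flags must be `1`" reading, and the `kie` vendor slug in the usage note are assumptions.

```typescript
// Environment passed in explicitly so the check is easy to test.
type Env = Record<string, string | undefined>;

// Assumption: the queue path is taken only when both flags are set to "1".
function queuePathEnabled(env: Env): boolean {
  return (
    env.JOBS_ORCHESTRATOR_NEW === "1" && env.JOBS_GENERATE_SHOT_NEW === "1"
  );
}

// Picks between the new queue path and the default synchronous path.
function generationPath(env: Env): "queue" | "sync" {
  return queuePathEnabled(env) ? "queue" : "sync";
}

// Builds the webhook route for a vendor, following /api/{vendor}/callback.
function callbackPath(vendor: string): string {
  return `/api/${encodeURIComponent(vendor)}/callback`;
}
```

With no flags set, `generationPath({})` returns `"sync"`, which matches the synchronous path being the default. A call such as `callbackPath("kie")` would give `/api/kie/callback`, though the actual vendor slugs may differ.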

Thumbnail generation runs alongside, not in line. It's its own job (`src/jobs/functions/generate-thumbnail.ts`), kicked off in parallel with shot generation. Three thumbnail variants land on the project; the creator picks one as the cover. No DB status for it — it's metadata.

That's the pipeline. Ten statuses, eight working stages, five tiles. The same thing, viewed three ways.
