Two days. That's how long it took to go from "we need a better way to generate film clips" to a full 8-phase production protocol with provider abstraction, validation rubrics, cost-impact gates, and a post-production reference doc.

Two days is not a long time to build a pipeline. It's barely enough time to discover what the problems actually are.

But we didn't start from zero. We started from two films, 941 source files, and a growing pile of things that had already broken.


The Context: Two Films, Zero Shared Infrastructure

By June 2026, we had two active AI film projects:

The Primordial Stroke — 62 shots, ~10 minutes runtime, 790 files. A psychological drama about a paralyzed artist pulled backward through nine ancestral art forms. Generated mostly with Seedance 2.0 via OpenRouter.

Camino del Genesis — 151 files. A separate, shorter project with a different visual style.

Each had its own generator scripts. Its own provider configuration. Its own set of scattered validation documents. Its own error handling — or lack thereof.

They shared exactly one thing: both had things that had gone wrong.

This is not unusual for AI projects. You start with a script, it works well enough, you add another, you move on. Before you know it, you have five Python files all calling slightly different APIs with slightly different payload formats and slightly different auth headers. And none of them can talk to each other.

The catalyst for the pipeline wasn't ambition. It was the failures.


Every Problem Has A Name

The Content Moderation Wall

We hit this first. Seedance 2.0 on Replicate has content moderation filters that flag certain words: "bone", "cave", "ancestor", "tribe", "prehistoric". These are not unusual film terms. But when paired with a reference image, the combination triggered an E005 error every time.

The workaround was simple but limiting: text-only prompts, no reference images. This meant every generation was independent — no frame chaining, no visual continuity. Each shot had to describe its entire visual world in text, hoping the model would maintain consistency.

The Cost Estimation Mistake

This one hurt. Video generation on Seedance 2.0 costs about $0.75 per 5-second shot. But the initial calculation used GPU compute time at L40S rates ($0.000975/s), giving a quoted cost of ~$0.16/shot.

The actual cost was roughly 5x higher ($0.75 against the quoted $0.16 for a 5-second shot). The mistake was assuming billing tracked GPU time rather than output duration. Replicate charges per second of generated video, not per second of compute. A 14-second emotional shot costs closer to $2.80.

This is one of those mistakes you only make once. After doing the math correctly — 62 shots at actual rates — the total came to $95-115 for the full film. Manageable, but the gap between expectation and reality was instructive.
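The gap between the two billing models is easy to see with a small calculator. This is a sketch rather than the project's actual estimator: the function names are made up for illustration, the per-second output rate is inferred from the $0.75 per 5-second figure above, and the 164 seconds of compute is a hypothetical value chosen only to reproduce the ~$0.16 quote. Replace both rates with the provider's current pricing.

```python
# Rates quoted in the text; treat both as inputs, not as current pricing.
GPU_RATE_PER_COMPUTE_SECOND = 0.000975    # L40S rate behind the wrong estimate
OUTPUT_RATE_PER_VIDEO_SECOND = 0.75 / 5   # inferred from $0.75 per 5-second shot


def gpu_time_estimate(compute_seconds: float) -> float:
    """The mistaken model: cost scales with GPU compute time."""
    return round(compute_seconds * GPU_RATE_PER_COMPUTE_SECOND, 2)


def output_duration_estimate(video_seconds: float,
                             rate: float = OUTPUT_RATE_PER_VIDEO_SECOND) -> float:
    """The corrected model: cost scales with seconds of generated video."""
    return round(video_seconds * rate, 2)


def film_estimate(shot_durations: list[float],
                  rate: float = OUTPUT_RATE_PER_VIDEO_SECOND) -> float:
    """Sum the per-shot estimates over a whole shot list."""
    return round(sum(duration * rate for duration in shot_durations), 2)


if __name__ == "__main__":
    # 164 s of compute is a hypothetical figure, not a measured one.
    print("GPU-time model:       ", gpu_time_estimate(164))
    print("Output-duration model:", output_duration_estimate(5))
```

Feeding the real shot list into film_estimate before the first generation is the check that would have caught the mistake.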

The API Key Limit Mid-Chain

Sequential generation means shots depend on each other: shot N's last frame becomes shot N+1's reference. This imposes a linear pipeline that cannot be parallelized. So when the OpenRouter API key hit its total limit 4 shots from the end of the chain, the entire generation stopped.

Not with a graceful pause. With a 403 error: "Key limit exceeded (total limit)."

The chain had been running for hours. Shots 1-58 were complete. Shots 58a and 58b were blocked. There was no resume mechanism, no checkpoint system, no budget pre-check.

The Workspace Reorg Data Loss

During a routine workspace reorganization, some film assets got orphaned. No catastrophic loss, but enough to make it clear that there was no single source of truth for anything. Provider logic lived in 6 scripts across 2 projects. Validation rubrics were scattered across 5 overlapping documents. Post-production commands existed only as terminal history.

The Validation Loop That Owned Itself

The original validation system used validate-scene.py — a static Python script that made direct API calls to 3 providers, parsed natural language output, detected convergence and oscillation, made pass/fail decisions, and wrote rewrite prompts. The agent was a passenger.

Ghassan called this "old pre-agent style" and pushed to flip the architecture. Static scripts should not own orchestration loops. They should generate structured prompts and provide parsing utilities. The agent should make the decisions.

This was the architecture review that changed everything.


The Human In The Loop

Here is the truth that the pipeline won't fix: agents are confidently wrong.

Every failure above was discovered and corrected by a human. Not by a validator model, not by a rubric, not by a pipeline gate. By a person reading the output and saying "this doesn't look right."

The content moderation workaround was manual. The cost recalculation was manual. The architecture decision to flip the validation loop was manual. The human restarted the chain after the API key limit, then built the resume mechanism for next time.

Agents think they have the full picture. They don't. They make plausible-sounding mistakes — wrong stack assumptions, overengineered solutions, skipped edge cases — with total confidence. An agent reviewing its own output will almost always approve it.

This is not a bug. It's a structural limitation. The model cannot see what it cannot see.

What this means in practice:

The pipeline is complicated because every complication has a real failure behind it. The provider abstraction exists because of content moderation. The cost-impact gate exists because of the GPU rate mistake. The no-retry rule exists because retries mask provider failures. The agent-owned validation loop exists because the script-owned one was brittle.

None of these were premeditated. They were all reactive.


What The Pipeline Actually Does

The unified protocol that emerged from these 2 days has 8 phases:

| Phase | What It Does | What Broke Before |
|---|---|---|
| 0: Init | Scaffold project, write film profile JSON | No standard structure, scattered files |
| 1: Pitch | Logline, treatment, type classification | None (pre-production was fine) |
| 2: Pre-Pro | Character sheets, environment plates, references | No shared visual library |
| 3: Shot Matrix | Continuity ledger, prompt assembly, readiness gate | Prompt inconsistency across 62 shots |
| 4: Generation | Sequential chain gen, no retries, cost gates | API limits, moderation blocks, no resume |
| 5: Validation | Writer model → Gemini validator per artifact | Script-owned loop was brittle |
| 6: Post-Pro | ffmpeg concat, cross-dissolve, audio sync, color grade | Completely undocumented, reverse-engineered |
| 7: Wrap | Cost report, lessons, archive | No close-out process |

Key architecture decisions baked into the pipeline:

Provider abstraction (ABC-first). VideoProvider and ImageProvider abstract base classes define submit(), poll(), build_payload(). Adding a provider means subclassing — not rewriting scripts.
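A minimal sketch of what that contract could look like. Only the class name VideoProvider and the three method names come from the text; the argument lists, return types, and the FakeProvider backend are assumptions made up for illustration, not the contents of providers.py.

```python
from abc import ABC, abstractmethod
from typing import Any


class VideoProvider(ABC):
    """Contract every video backend implements (signatures are assumed)."""

    @abstractmethod
    def build_payload(self, prompt: str, duration_s: int) -> dict[str, Any]:
        """Translate a shot into this provider's request format."""

    @abstractmethod
    def submit(self, payload: dict[str, Any]) -> str:
        """Start a generation job and return its id."""

    @abstractmethod
    def poll(self, job_id: str) -> dict[str, Any]:
        """Return the job's current status."""


class FakeProvider(VideoProvider):
    """Hypothetical in-memory backend, for illustration only."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict[str, Any]] = {}

    def build_payload(self, prompt: str, duration_s: int) -> dict[str, Any]:
        return {"prompt": prompt, "duration": duration_s}

    def submit(self, payload: dict[str, Any]) -> str:
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = payload
        return job_id

    def poll(self, job_id: str) -> dict[str, Any]:
        # A real backend would query its API here.
        return {"id": job_id, "status": "succeeded"}


if __name__ == "__main__":
    provider = FakeProvider()
    job = provider.submit(provider.build_payload("A hand on wet clay.", 5))
    print(provider.poll(job))
```

Because the base class is abstract, a subclass that forgets one of the three methods fails at instantiation instead of halfway through a chain.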

Single source of truth. scripts/film/providers.py lives in one place, symlinked into each film project. Fixes apply everywhere.

Sequential generation with no retries. Each clip fires exactly once. If it fails, log it and move on.
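The fire-once rule can be sketched as a short loop. The generate callable here is hypothetical: it stands in for whatever submits a shot and returns the path of its last frame. The choice to keep the last successful frame as the reference after a failure is also an assumption; the text only says to log and move on.

```python
from typing import Callable, Optional


def run_chain(shot_ids: list[str],
              generate: Callable[[str, Optional[str]], str]) -> dict:
    """Fire each shot exactly once, in order, with no retries.

    `generate(shot_id, reference_frame)` is a hypothetical callable that
    returns the path of the shot's last frame, or raises on failure.
    """
    results: dict[str, str] = {}
    failures: list[tuple[str, str]] = []
    reference: Optional[str] = None

    for shot_id in shot_ids:
        try:
            reference = generate(shot_id, reference)
            results[shot_id] = reference
        except Exception as exc:
            # Log the failure and move on. The reference stays at the
            # last frame that was generated successfully (an assumption).
            failures.append((shot_id, str(exc)))

    return {"results": results, "failures": failures}
```

The failures list is the point: a retry would hide the provider error, while a logged failure leaves a record a human can act on.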

Cost-impact gate. Before any provider switch, compare the delta against remaining budget. If delta > 15% of budget, flag for human approval.
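As a sketch, the gate is one comparison. The function name and arguments are illustrative, and it assumes the 15% is measured against the remaining budget rather than the original total, which the rule leaves open.

```python
def needs_human_approval(current_cost: float,
                         candidate_cost: float,
                         remaining_budget: float,
                         threshold: float = 0.15) -> bool:
    """Return True when a provider switch should be flagged for approval.

    Assumption: the threshold applies to the remaining budget.
    """
    if remaining_budget <= 0:
        # Nothing left to spend: any switch needs a human decision.
        return True
    delta = candidate_cost - current_cost
    return delta > threshold * remaining_budget
```

With $50 remaining, a switch that adds $10 to the projected spend is flagged (the limit is $7.50), while one that adds $2 goes through.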

Validation is a per-phase artifact. Each phase produces structured output that the next phase validates. No overlapping checks, no gaps.


Consistency And Flow: What Improved

Before the pipeline, every shot was effectively a one-off. Prompt wording drifted shot-to-shot. Character descriptions varied. Camera angles changed unprompted. The same environment looked different in consecutive frames.

After the pipeline, the shot matrix enforces locked atomic blocks — character, environment, and camera descriptions that never vary across shots. The continuity ledger tracks per-shot state: position, props, lighting, costume. The chain generator passes the previous shot's last frame as the next shot's reference, enforced by sequential generation.
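One way to picture locked blocks and the ledger is a frozen dataclass plus a per-shot record. The class and field names below are assumptions based on the description above, not the project's actual schema, and the prompt layout is only one possible ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockedBlocks:
    """Text that must be identical in every shot; frozen so it cannot drift."""
    character: str
    environment: str
    camera: str


@dataclass
class LedgerEntry:
    """Per-shot continuity state tracked by the ledger."""
    shot_id: str
    position: str
    props: str
    lighting: str
    costume: str


def assemble_prompt(blocks: LockedBlocks, entry: LedgerEntry, action: str) -> str:
    """Build a shot prompt: locked blocks first, then state, then the action."""
    parts = [
        blocks.character,
        blocks.environment,
        blocks.camera,
        f"Position: {entry.position}. Props: {entry.props}. "
        f"Lighting: {entry.lighting}. Costume: {entry.costume}.",
        action,
    ]
    return " ".join(parts)
```

Attempting to reassign a field on LockedBlocks raises dataclasses.FrozenInstanceError, so wording drift has to be a deliberate change to the block rather than an accident in one shot's prompt.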

The result is not perfect consistency. AI video models still drift, especially across long chains. But the drift is measurable now, and the protocol has a drift detection step every 5 shots to re-anchor against master references.

Flow validation — checking that consecutive scenes cut logically — went from an afterthought to a formal gate. Props carry across cuts. Lighting temperature shifts are intentional instead of random. Emotional arcs track logically between scenes.


What Went Wrong This Time

A few things that didn't make it into the protocol:

The OpenRouter key limit. Still a risk. The protocol documents a budget pre-check but doesn't enforce key-level limits. If you're generating 62 shots in one chain and your key hits its total limit, the chain dies.
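An enforced pre-check could be as small as the sketch below. It deliberately takes the remaining credit as an input rather than fetching it, since how the key's balance is read is provider-specific and not covered here. The function name, the 10% safety margin, and the example rate are all assumptions.

```python
def chain_fits_budget(shot_durations_s: list[float],
                      rate_per_video_second: float,
                      remaining_credit: float,
                      safety_margin: float = 0.10) -> tuple[bool, float]:
    """Check whether a whole chain fits in the key's remaining credit.

    Returns (fits, required_credit). The margin pads the estimate so the
    chain does not start when it would finish right at the limit.
    """
    estimated = sum(shot_durations_s) * rate_per_video_second
    required = round(estimated * (1 + safety_margin), 2)
    return required <= remaining_credit, required


if __name__ == "__main__":
    # Four remaining 5-second shots at an assumed $0.15 per video second.
    fits, required = chain_fits_budget([5, 5, 5, 5], 0.15, remaining_credit=10.0)
    print(fits, required)
```

Refusing to start is cheap. A chain that dies 4 shots from the end is not.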

Drift detection is documented but not automated. The actual comparison is still done by eyeballing frames. A proper image similarity metric would catch drift before it compounds.
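The scheduling half of that check is simple to automate even before a metric is chosen. In the sketch below, similarity is a hypothetical callable that would wrap SSIM or a CLIP embedding comparison in practice; the 0.8 threshold is a placeholder, not a tuned value, and the every-5-shots cadence follows the protocol.

```python
from typing import Callable, Sequence


def shots_to_reanchor(frames: Sequence[str],
                      master: str,
                      similarity: Callable[[str, str], float],
                      every: int = 5,
                      threshold: float = 0.8) -> list[int]:
    """Return the 0-based indices of checked shots that drifted too far.

    Every `every`-th shot's frame is compared with the master reference.
    `similarity` is a hypothetical scorer returning a value in [0, 1],
    where higher means closer to the master.
    """
    flagged = []
    for index in range(every - 1, len(frames), every):
        if similarity(frames[index], master) < threshold:
            flagged.append(index)
    return flagged
```

Keeping the metric behind a callable means the eyeball check can be replaced by a pixel-level or embedding-based score without touching the loop.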

Post-production is documented but not scripted. The ffmpeg commands exist as a reference doc. Assembly still requires manual invocations.

The cost-impact gate is a rule, not a check. There's no automated budget tracker that triggers it. It depends on the operator checking.


Next Steps

  1. Lightweight MCP tool for scene context. Replace the prompt generator script with a direct tool call. Eliminates file-based context assembly.
  2. Drift detection automation. Add a frame similarity check (SSIM or CLIP-based) to the generation loop. Auto-trigger re-anchoring when similarity drops below threshold.
  3. Resumable generation from any point. Add a state file that captures generation progress and allows restart from any shot, not just the last one.
  4. Post-production script. Package the ffmpeg commands into a single assemble-film.sh that reads an assembly manifest.
  5. Key budget monitoring. Pre-check API key remaining credit before starting a long chain.
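For item 3, a state file does not need much. The sketch below is one possible shape, not a committed design: the JSON layout, the function names, and the start_from behavior (regenerate everything from that shot onward) are all assumptions.

```python
import json
from pathlib import Path
from typing import Optional


def load_state(path: str) -> dict:
    """Read generation progress, or start fresh if no state file exists."""
    state_file = Path(path)
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {"completed": {}}


def mark_done(path: str, state: dict, shot_id: str, last_frame: str) -> None:
    """Record a finished shot and persist the state after every shot."""
    state["completed"][shot_id] = last_frame
    tmp = Path(path + ".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)  # write-then-rename so a crash cannot leave a half-written file


def pending_shots(shot_ids: list[str], state: dict,
                  start_from: Optional[str] = None) -> list[str]:
    """List the shots still to generate.

    With `start_from`, everything from that shot onward is regenerated,
    even if it was completed before (restart from any shot).
    """
    if start_from is not None:
        return shot_ids[shot_ids.index(start_from):]
    return [s for s in shot_ids if s not in state["completed"]]
```

Storing each shot's last frame in the state file also gives a restarted chain the reference it needs to continue from.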

What I'd Tell Someone Starting Today

If you're building an AI film pipeline, start with the simplest thing that generates a single clip. Then generate two clips in sequence. Then handle the failure when the second clip doesn't match the first.

Don't design the validation rubric before you've seen what AI video actually breaks on. Don't abstract the provider before you've hit the one that blocks your vocabulary. Don't write scripts that own loops the agent should own.

The pipeline will find its own shape. The failures tell you where it needs to go.


Filed under: AI film, pipeline design, postmortem