Case 01 · Generative Media · Pipeline

Content Automation Pipeline

A scenes.json storyboard becomes a finished, captioned, fully narrated long-form explainer video, driven by a single shell entrypoint.

Role: Architect & Engineer
Year: 2026
Discipline: Generative Media
Stack: 8 technologies

Python · FFmpeg · Gemini · Imagen 4 · Veo 3.1 · ElevenLabs · Whisper · PostgreSQL

What it does

Generate scene images: A structured scenes.json drives Gemini / Imagen with a brand-locked flat-vector visual system, enforced by an art-director agent’s prompt rules.
Generate video clips: Veo 3.1 (preview / fast / lite tiers) produces motion shots that ride alongside still images.
Synthesize voiceover: ElevenLabs or Gemini controllable TTS, picked at the command line. Same scene script, swappable provider.
Cut captions: OpenAI Whisper produces an ASS subtitle file when captions are enabled; otherwise the stage is skipped cleanly.
Assemble the final MP4: FFmpeg handles image/clip timing, the voiceover mix, an ambient music bed, and optional burned-in captions.
Tally true cost: Every API call’s line in _cost.jsonl is priced through a central pricing table, and a per-run summary is emitted.
Sync to Postgres: One row per video and one row per API call, for the dashboard layer that tracks what each video actually cost to make.

Tech stack

Orchestration

A single run.sh drives the staged Python entrypoints in order (images → clips → voiceover → captions → assemble → cost → db). Idempotent by slug; re-running a stage overwrites its own artifacts and nothing else.
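A minimal sketch of the staged, slug-keyed run described above, written in Python rather than shell purely for illustration. The stage names come from the list above; `STAGES`, `run_stages`, and the handler signature are hypothetical and not the project's actual code.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Stage order taken from the text; everything else here is an assumption.
STAGES = ("images", "clips", "voiceover", "captions", "assemble", "cost", "db")

# A handler receives the slug's directory and rewrites only its own artifacts.
StageFn = Callable[[Path], None]


def run_stages(slug: str, handlers: Dict[str, StageFn],
               root: Path = Path("output/videos"),
               only: Iterable[str] = STAGES) -> List[str]:
    """Run the selected stages in pipeline order against one slug directory."""
    selected = set(only)
    unknown = selected - set(STAGES)
    if unknown:
        raise ValueError(f"unknown stages: {sorted(unknown)}")

    video_dir = root / slug
    video_dir.mkdir(parents=True, exist_ok=True)

    ran = []
    for name in STAGES:  # canonical order, regardless of how `only` was given
        if name in selected:
            handlers[name](video_dir)
            ran.append(name)
    return ran
```

Because each handler only touches its own files under the slug directory, calling `run_stages(slug, handlers, only=["voiceover"])` a second time would redo that one stage and leave the others alone.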

Generative APIs

Gemini (research, scripting, image gen, controllable TTS), Imagen 4 (alternate image-gen path), Veo 3.1 (video clips), ElevenLabs (TTS), OpenAI Whisper (forced-alignment captions). Every call is wrapped in an isolated module, so providers are swappable without touching pipeline code.

Media tooling

FFmpeg / ffprobe for mixing, normalization, duration probing, and final assembly. Whisper is also used as an optional timing verifier: voiceover generation exits non-zero if a scene overruns its window.
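The overrun guard can be sketched roughly as follows. This version measures clip length with ffprobe instead of Whisper timestamps, which is a simplification; the function names, the scene-id keys, and the 0.05 s tolerance are all assumptions made for the example.

```python
import subprocess
import sys
from typing import Dict, List


def probe_duration(path: str) -> float:
    """Return a media file's container duration in seconds via ffprobe."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def find_overruns(durations: Dict[str, float], windows: Dict[str, float],
                  tolerance: float = 0.05) -> List[str]:
    """Return the scene ids whose audio exceeds its window plus tolerance."""
    return [scene_id for scene_id, seconds in durations.items()
            if seconds > windows[scene_id] + tolerance]


def enforce(durations: Dict[str, float], windows: Dict[str, float]) -> None:
    """Exit non-zero, naming the offending scenes, if any scene overruns."""
    overruns = find_overruns(durations, windows)
    if overruns:
        sys.exit(f"voiceover overruns scene window: {', '.join(overruns)}")
```

In a real run, `durations` would be filled by calling `probe_duration` on each scene's audio file, and `windows` would come from the scene timings in scenes.json.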

Data layer

PostgreSQL with a two-table schema (videos, video_costs) and an idempotent upsert sync. JSONL cost ledger on disk is the source of truth; the database is a queryable mirror.
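A sketch of what an idempotent upsert sync into the two tables might look like. To stay runnable without a server it uses the standard-library sqlite3 module as a stand-in for PostgreSQL; the `ON CONFLICT ... DO UPDATE` clause has the same shape in both, though a PostgreSQL driver would use different parameter placeholders. The table names come from the text, while every column name and the `call_id` key are assumptions.

```python
import sqlite3
from typing import Dict, List

# Hypothetical columns; only the table names are taken from the text.
SCHEMA = """
CREATE TABLE videos (
    slug TEXT PRIMARY KEY,
    title TEXT,
    total_usd REAL
);
CREATE TABLE video_costs (
    slug TEXT REFERENCES videos(slug),
    call_id TEXT,
    model TEXT,
    usd REAL,
    PRIMARY KEY (slug, call_id)
);
"""


def sync(conn: sqlite3.Connection, slug: str, title: str,
         calls: List[Dict]) -> None:
    """Mirror one video's ledger into the database; safe to run repeatedly."""
    total = sum(call["usd"] for call in calls)
    conn.execute(
        "INSERT INTO videos (slug, title, total_usd) VALUES (?, ?, ?) "
        "ON CONFLICT (slug) DO UPDATE SET "
        "title = excluded.title, total_usd = excluded.total_usd",
        (slug, title, total),
    )
    conn.executemany(
        "INSERT INTO video_costs (slug, call_id, model, usd) "
        "VALUES (?, ?, ?, ?) "
        "ON CONFLICT (slug, call_id) DO UPDATE SET "
        "model = excluded.model, usd = excluded.usd",
        [(slug, c["call_id"], c["model"], c["usd"]) for c in calls],
    )
    conn.commit()
```

Since the ledger on disk is the source of truth, replaying it through a sync like this any number of times should leave the row counts unchanged.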

Content authoring

A markdown wiki (knowledge-base/wiki/) of production rules — video structure, hook framework, script principles, visual principles, audio principles, retention mechanics — that the content-prep agents read before generating output.

Engineering highlights

Slug-as-unit-of-work
Every video lives under output/videos/<slug>/ with its own scenes.json, images, clips, voiceover, captions, cost log, and final MP4. A run is fully reproducible from the slug alone.
Provider-pluggable voiceover
ElevenLabs and Gemini TTS share the same scene contract and output layout; bash run.sh <slug> <voice> [provider] picks at runtime. Adding a third provider is one wrapper module, no core edits.
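One way such a shared contract could look is a small registry keyed by provider name. This is a sketch under assumptions: the `Synthesizer` signature, the `register` decorator, and the stub bodies are invented for illustration, and the stubs write empty files instead of calling either vendor's API.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumed contract: (scene text, voice id, destination) -> written audio path.
Synthesizer = Callable[[str, str, Path], Path]

PROVIDERS: Dict[str, Synthesizer] = {}


def register(name: str) -> Callable[[Synthesizer], Synthesizer]:
    """Decorator that adds a provider wrapper to the registry."""
    def decorator(fn: Synthesizer) -> Synthesizer:
        PROVIDERS[name] = fn
        return fn
    return decorator


@register("elevenlabs")
def elevenlabs_tts(text: str, voice: str, dest: Path) -> Path:
    dest.write_bytes(b"")  # stub: a real wrapper would call ElevenLabs here
    return dest


@register("gemini")
def gemini_tts(text: str, voice: str, dest: Path) -> Path:
    dest.write_bytes(b"")  # stub: a real wrapper would call Gemini TTS here
    return dest


def synthesize(provider: str, text: str, voice: str, dest: Path) -> Path:
    """Dispatch to the chosen provider, failing loudly on an unknown name."""
    if provider not in PROVIDERS:
        raise SystemExit(
            f"unknown TTS provider {provider!r}; choose from {sorted(PROVIDERS)}"
        )
    return PROVIDERS[provider](text, voice, dest)
```

Under this layout a third provider is one more decorated function in its own module, and `synthesize` never changes.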
Cost-as-data, not as logs
Each generation tool appends a structured JSONL line per API call. A central pricing.py table maps (model, unit, count) → USD. The summarizer reads the ledger; the DB sync replays it into video_costs. Per-run cost is observable, not estimated.
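The ledger-plus-pricing-table idea can be sketched in a few lines. The record fields, model names, and prices below are placeholders invented for the example; real rates would live in pricing.py and are not reproduced here.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, Tuple

# Placeholder prices in USD per unit, keyed by (model, unit).
PRICING: Dict[Tuple[str, str], float] = {
    ("example-image-model", "image"): 0.04,
    ("example-tts-model", "1k_chars"): 0.30,
}


def log_call(ledger: Path, model: str, unit: str, count: float) -> None:
    """Append one structured line per API call to the JSONL ledger."""
    record = {"model": model, "unit": unit, "count": count}
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def summarize(ledger: Path) -> Tuple[Dict[str, float], float]:
    """Price every ledger line and return (per-model USD, total USD)."""
    per_model: Dict[str, float] = defaultdict(float)
    with ledger.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            rec = json.loads(line)
            rate = PRICING[(rec["model"], rec["unit"])]  # KeyError if unpriced
            per_model[rec["model"]] += rate * rec["count"]
    rounded = {model: round(usd, 6) for model, usd in per_model.items()}
    return rounded, round(sum(per_model.values()), 6)
```

Letting an unpriced (model, unit) pair raise rather than default to zero keeps the summary from silently under-reporting.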
Optional dependencies, hard failures when used
Whisper, the database, and captions are all opt-in via env flags — but if you opt in and the prerequisite is missing, the run fails loud instead of silently degrading.
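A preflight check along these lines is one way to get that behavior. The flag names `ENABLE_CAPTIONS` and `ENABLE_DB` and the `DATABASE_URL` variable are assumptions, since the text does not name the actual env flags.

```python
import importlib.util
import os
from typing import List, Mapping


def flag_on(env: Mapping[str, str], name: str) -> bool:
    """Treat 1/true/yes/on (any case) as enabled; anything else as off."""
    return env.get(name, "").strip().lower() in {"1", "true", "yes", "on"}


def missing_prerequisites(env: Mapping[str, str]) -> List[str]:
    """List every opted-in feature whose prerequisite is absent."""
    problems = []
    # Captions need the Whisper package to be importable.
    if flag_on(env, "ENABLE_CAPTIONS") and \
            importlib.util.find_spec("whisper") is None:
        problems.append("ENABLE_CAPTIONS is set but whisper is not installed")
    # The database sync needs a connection string.
    if flag_on(env, "ENABLE_DB") and not env.get("DATABASE_URL"):
        problems.append("ENABLE_DB is set but DATABASE_URL is empty")
    return problems


def preflight(env: Mapping[str, str] = os.environ) -> None:
    """Abort the run before any paid call if an opted-in feature can't work."""
    problems = missing_prerequisites(env)
    if problems:
        raise SystemExit("\n".join(problems))
```

With no flags set, `preflight` passes silently; opting in without the prerequisite stops the run with a message naming the flag.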
Knowledge base as a load-bearing artifact
raw/ is human source material; wiki/ is the agent-readable distillation. New raw drops are ingested into wiki pages with index.md + log.md updated in lockstep — an LLM-WIKI pattern that keeps the rules current without manual curation.
Agent-authored prep, deterministic execution
Strategy, Scriptwriter, Art Director, Audio Engineer, and QC agents produce the scenes.json that the deterministic Python pipeline consumes. Creative work is gated and reviewed before any paid API call fires.

What it demonstrates

Designing a multi-vendor generative-media pipeline (Gemini, ElevenLabs, Veo, Whisper) where every stage is independently runnable, idempotent, and observable — without a job queue, a worker pool, or external orchestration.
Treating cost as first-class output. Real per-video unit economics fall out of the run rather than being estimated after the fact.
Separation of concerns between creative spec (agents + wiki) and execution (Python tools + FFmpeg). The script doesn’t know about retention mechanics; the agents don’t know about FFmpeg flags.
Pragmatic stack choices: shell entrypoint over a workflow engine, JSONL on disk over a queue, Postgres for queryability rather than orchestration, optional Whisper guardrails over heavyweight QA infrastructure.

Stack at a glance

Python 3.10+ · FFmpeg · Gemini · Imagen 4 · Veo 3.1 · ElevenLabs · OpenAI Whisper · PostgreSQL · JSONL cost ledger · Markdown knowledge wiki · Claude Code orchestration

