Case 01 · Generative Media · Pipeline

Content Automation Pipeline

A scenes.json storyboard becomes a finished, captioned, fully narrated long-form explainer video, driven by a single shell entrypoint.

Role
Architect & Engineer
Year
2026
Discipline
Generative Media
Stack
8 technologies
Python · FFmpeg · Gemini · Imagen 4 · Veo 3.1 · ElevenLabs · Whisper · PostgreSQL

What it does

  • Generate scene images: Structured scenes.json drives Gemini / Imagen with a brand-locked flat-vector visual system enforced by an art-director agent’s prompt rules.
  • Generate video clips: Veo 3.1 (preview / fast / lite tiers) for motion shots that ride alongside still images.
  • Synthesize voiceover: ElevenLabs or Gemini controllable TTS, picked at the command line; same scene script, swappable provider.
  • Cut captions: OpenAI Whisper transcribes the voiceover into an ASS subtitle file when captions are enabled; otherwise the stage is skipped cleanly.
  • Assemble the final MP4: FFmpeg handles image/clip timing, voiceover mix, ambient music bed, and optional burned-in captions.
  • Tally true cost: Every API call appends a line to _cost.jsonl; the lines are priced through a central pricing table and a per-run summary is emitted.
  • Sync to Postgres: One row per video, one row per API call, feeding the dashboard layer that tracks what each video actually cost to make.
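
The cost-tally step can be sketched roughly as follows. This is a minimal illustration, not the project's code: the ledger field names (model, unit, count), the model names, and the rates in PRICING are all hypothetical placeholders.

```python
import json
from collections import defaultdict
from pathlib import Path

# Hypothetical rates in USD per unit; real prices belong in the project's pricing table.
PRICING = {
    ("example-image-model", "image"): 0.04,
    ("example-tts-model", "character"): 0.00003,
}


def append_call(ledger: Path, model: str, unit: str, count: float) -> None:
    """Append one structured JSONL line for a single API call."""
    with ledger.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"model": model, "unit": unit, "count": count}) + "\n")


def summarize(ledger: Path) -> dict[str, float]:
    """Price every ledger line and total the cost per model."""
    totals: dict[str, float] = defaultdict(float)
    for line in ledger.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        call = json.loads(line)
        # A KeyError here means an unpriced (model, unit) pair: fail rather than guess.
        rate = PRICING[(call["model"], call["unit"])]
        totals[call["model"]] += rate * call["count"]
    return dict(totals)
```

Because the summary is derived from the ledger on every run, re-pricing old runs after a rate change only requires editing the table.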

Tech stack

Orchestration

A single run.sh drives the staged Python entrypoints through seven stages (images → clips → voiceover → captions → assemble → cost → db). Idempotent by slug; re-running a stage overwrites its own artifacts and nothing else.
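
The staging idea can be sketched in Python as below. The real entrypoint is a shell script, so this is only an illustration of the control flow; the handlers mapping, the start parameter, and the function names are assumptions introduced for the example.

```python
from pathlib import Path
from typing import Callable

# Stage order as listed in the paragraph above.
STAGES = ["images", "clips", "voiceover", "captions", "assemble", "cost", "db"]


def slug_dir(root: Path, slug: str) -> Path:
    """All artifacts for one video live under <root>/videos/<slug>/."""
    return root / "videos" / slug


def run_pipeline(
    slug: str,
    handlers: dict[str, Callable[[Path], None]],
    root: Path = Path("output"),
    start: str = "images",
) -> list[str]:
    """Run stages in order from `start`; the first exception aborts the run."""
    if start not in STAGES:
        raise ValueError(f"unknown stage: {start}")
    workdir = slug_dir(root, slug)
    workdir.mkdir(parents=True, exist_ok=True)
    ran = []
    for stage in STAGES[STAGES.index(start):]:
        handlers[stage](workdir)  # each handler overwrites only its own artifacts
        ran.append(stage)
    return ran
```

Passing start="assemble" would re-run only the last three stages against artifacts already on disk, which is what makes a stage independently re-runnable.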

Generative APIs

Gemini (research, scripting, image gen, controllable TTS), Imagen 4 (image gen alt path), Veo 3.1 (video clips), ElevenLabs (TTS), OpenAI Whisper (forced-alignment captions). Every call wrapped in an isolated module so providers are swappable without touching pipeline code.
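
One way to keep providers swappable is a small interface plus a registry, sketched here. TTSProvider, FakeProvider, and get_provider are hypothetical names, and the fake provider writes placeholder bytes instead of calling any vendor API.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


class TTSProvider(Protocol):
    """The contract every voiceover wrapper is assumed to satisfy."""

    def synthesize(self, text: str, voice: str, out_path: Path) -> Path: ...


@dataclass
class FakeProvider:
    """Stand-in for a real wrapper module; no network calls are made."""

    name: str

    def synthesize(self, text: str, voice: str, out_path: Path) -> Path:
        out_path.write_bytes(f"{self.name}:{voice}:{text}".encode("utf-8"))
        return out_path


# Adding a provider means adding one entry here; pipeline code is untouched.
PROVIDERS: dict[str, TTSProvider] = {
    "elevenlabs": FakeProvider("elevenlabs"),
    "gemini": FakeProvider("gemini"),
}


def get_provider(name: str) -> TTSProvider:
    """Resolve the provider named on the command line."""
    try:
        return PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown provider: {name}") from None
```

The pipeline would only ever call get_provider(name).synthesize(...), so it never imports a vendor SDK directly.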

Media tooling

FFmpeg / ffprobe for mixing, normalization, duration probing, and final assembly. Whisper also used as an optional timing verifier — voiceover generation exits non-zero if a scene overruns its window.
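
A timing guard in this spirit could look like the sketch below. Note the substitution: it measures clip length with ffprobe rather than Whisper, which is a simplification of what the text describes. The function names, the windows/durations dictionaries, and the tolerance parameter are assumptions.

```python
import subprocess
import sys


def probe_duration(path: str) -> float:
    """Ask ffprobe for the container duration in seconds."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())


def find_overruns(
    windows: dict[str, float],
    durations: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return the scenes whose audio is longer than their allotted window."""
    return [
        scene for scene, limit in windows.items()
        if durations[scene] > limit + tolerance
    ]


def check(windows: dict[str, float], files: dict[str, str]) -> int:
    """Return a process exit code: 1 if any scene overruns, else 0."""
    durations = {scene: probe_duration(path) for scene, path in files.items()}
    overruns = find_overruns(windows, durations)
    for scene in overruns:
        print(f"{scene}: {durations[scene]:.2f}s exceeds {windows[scene]:.2f}s",
              file=sys.stderr)
    return 1 if overruns else 0
```

A caller would finish with sys.exit(check(windows, files)) so the shell entrypoint stops before assembly.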

Data layer

PostgreSQL with a two-table schema (videos, video_costs) and an idempotent upsert sync. JSONL cost ledger on disk is the source of truth; the database is a queryable mirror.
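
The idempotent sync can be illustrated with an upsert keyed on the ledger position. This sketch uses Python's built-in sqlite3 as a stand-in so it runs without a server; the ON CONFLICT ... DO UPDATE form is also valid PostgreSQL, though a PostgreSQL driver such as psycopg would use %s placeholders instead of ?. The column names and the call_index key are assumptions, not the project's actual schema.

```python
import sqlite3

# Hypothetical two-table schema; the real columns are not given in the text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS videos (
    slug TEXT PRIMARY KEY,
    total_usd REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS video_costs (
    slug TEXT NOT NULL REFERENCES videos(slug),
    call_index INTEGER NOT NULL,
    model TEXT NOT NULL,
    usd REAL NOT NULL,
    PRIMARY KEY (slug, call_index)
);
"""


def sync(conn: sqlite3.Connection, slug: str, calls: list[dict]) -> None:
    """Replay priced ledger entries into the mirror; safe to run repeatedly."""
    conn.executescript(SCHEMA)
    total = sum(c["usd"] for c in calls)
    conn.execute(
        "INSERT INTO videos (slug, total_usd) VALUES (?, ?) "
        "ON CONFLICT (slug) DO UPDATE SET total_usd = excluded.total_usd",
        (slug, total),
    )
    conn.executemany(
        "INSERT INTO video_costs (slug, call_index, model, usd) VALUES (?, ?, ?, ?) "
        "ON CONFLICT (slug, call_index) DO UPDATE SET "
        "model = excluded.model, usd = excluded.usd",
        [(slug, i, c["model"], c["usd"]) for i, c in enumerate(calls)],
    )
    conn.commit()
```

Keying each row on (slug, call_index) means a second sync of the same ledger rewrites rows in place rather than duplicating them.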

Content authoring

A markdown wiki (knowledge-base/wiki/) of production rules — video structure, hook framework, script principles, visual principles, audio principles, retention mechanics — that the content-prep agents read before generating output.

Engineering highlights

  • Slug-as-unit-of-work

    Every video lives under output/videos/<slug>/ with its own scenes.json, images, clips, voiceover, captions, cost log, and final MP4. A run is fully reproducible from the slug alone.

  • Provider-pluggable voiceover

    ElevenLabs and Gemini TTS share the same scene contract and output layout; bash run.sh <slug> <voice> [provider] picks at runtime. Adding a third provider is one wrapper module, no core edits.

  • Cost-as-data, not as logs

    Each generation tool appends a structured JSONL line per API call. A central pricing.py table maps (model, unit, count) → USD. The summarizer reads the ledger; the DB sync replays it into video_costs. Per-run cost is observable, not estimated.

  • Optional dependencies, hard failures when used

    Whisper, the database, and captions are all opt-in via env flags — but if you opt in and the prerequisite is missing, the run fails loud instead of silently degrading.

  • Knowledge base as a load-bearing artifact

    raw/ is human source material; wiki/ is the agent-readable distillation. New raw drops are ingested into wiki pages with index.md + log.md updated in lockstep — an LLM-WIKI pattern that keeps the rules current without manual curation.

  • Agent-authored prep, deterministic execution

    Strategy, Scriptwriter, Art Director, Audio Engineer, and QC agents produce the scenes.json that the deterministic Python pipeline consumes. Creative work is gated and reviewed before any paid API call fires.
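
The "opt-in, but fail loud" rule from the highlights above can be sketched as a small guard. The flag name ENABLE_CAPTIONS, the accepted truthy values, and the function names are hypothetical; only shutil.which and os.environ are real standard-library APIs.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def flag_enabled(name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Treat 1/true/yes (any case) as opted in; anything else as off."""
    env = os.environ if env is None else env
    return env.get(name, "").strip().lower() in {"1", "true", "yes"}


def require_binary(
    feature_flag: str,
    binary: str,
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Return False when the feature is off, True when it is on and usable.

    Raises RuntimeError when the feature is on but its prerequisite is missing,
    so the run stops instead of silently skipping the stage.
    """
    if not flag_enabled(feature_flag, env):
        return False
    if which(binary) is None:
        raise RuntimeError(f"{feature_flag} is set but '{binary}' is not on PATH")
    return True
```

A stage would call require_binary("ENABLE_CAPTIONS", "whisper") up front and skip itself only on a False return.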

What it demonstrates

  • Designing a multi-vendor generative-media pipeline (Gemini, ElevenLabs, Veo, Whisper) where every stage is independently runnable, idempotent, and observable — without a job queue, a worker pool, or external orchestration.
  • Treating cost as first-class output. Real per-video unit economics fall out of the run rather than being estimated after the fact.
  • Separation of concerns between creative spec (agents + wiki) and execution (Python tools + FFmpeg). The script doesn’t know about retention mechanics; the agents don’t know about FFmpeg flags.
  • Pragmatic stack choices: shell entrypoint over a workflow engine, JSONL on disk over a queue, Postgres for queryability rather than orchestration, optional Whisper guardrails over heavyweight QA infrastructure.

Stack at a glance

Python 3.10+ · FFmpeg · Gemini · Imagen 4 · Veo 3.1 · ElevenLabs · OpenAI Whisper · PostgreSQL · JSONL cost ledger · Markdown knowledge wiki · Claude Code orchestration