# IIW English episode rebuild plan

Date: 2026-05-03

Scope: proceed with the licensed IIW/Banijay package using **English-title episode masters only**. French-title alternates remain available but stay out of scope until a duplicate/variant pass shows they are needed.

## Files created for this pass

- `tools/build_iiw_english_source_manifest.py`
  - Scans the IIW H.264 master folder.
  - Selects one English-title `.mov` master per production episode.
  - Adds technical metadata with `ffprobe`.
  - Maps the 13 old YouTube/bible episodes to production episode numbers where possible.

- `materials/training-data/iiw_english_source_manifest.json`
  - Canonical source manifest for 26 English IIW masters.

- `docs/internal/iiw-english-episode-source-manifest.csv`
  - CSV view of the same source manifest.

- `tools/prepare_iiw_english_training_data.py`
  - New multi-episode extraction utility.
  - Reads `iiw_english_source_manifest.json`.
  - Writes a new isolated dataset under `materials/training-data/iiw-english/` by default.
  - Does **not** overwrite current YouTube-derived `clips/` or `first_frames/`.
  - Supports `--episode`, `--dry-run`, `--force`, plus scene-threshold (`--scene-threshold`) and clip-duration (`--min-duration`, `--max-duration`) controls.
  - Now includes title/opening exclusion, credits/outro exclusion, perceptual near-duplicate suppression, optional max clips per episode, and rejection reason reporting in `extraction_plan.json`.
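
The `ffprobe` metadata step can be pictured with the minimal sketch below. It is not the code in `build_iiw_english_source_manifest.py`; the function names and the shape of the returned summary are assumptions for illustration. Only the `ffprobe` flags and the JSON keys it reads (`streams`, `codec_type`, `codec_name`, `width`, `height`, `pix_fmt`, `r_frame_rate`) are standard.

```python
import json
import subprocess
from pathlib import Path


def ffprobe_json(path: Path) -> dict:
    """Run ffprobe on one file and return its JSON report."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", str(path)],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def summarize_probe(report: dict) -> dict:
    """Reduce an ffprobe report to a few fields (hypothetical summary shape)."""
    streams = report.get("streams", [])
    video = next((s for s in streams if s.get("codec_type") == "video"), {})
    audio = [s for s in streams if s.get("codec_type") == "audio"]
    return {
        "width": video.get("width"),
        "height": video.get("height"),
        "video_codec": video.get("codec_name"),
        "pix_fmt": video.get("pix_fmt"),
        "frame_rate": video.get("r_frame_rate"),  # a string such as "25/1"
        "audio_tracks": len(audio),
        "audio_codecs": sorted({s.get("codec_name", "unknown") for s in audio}),
    }
```

Keeping the probe call separate from the summary makes the summary testable against a saved report without needing the masters on disk.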

## Current English source manifest status

- 26 English production masters indexed.
- 13 mapped to the existing YouTube/bible metadata.
- 13 marked `new_needs_bible_metadata`.
- Common source spec: 1920×1080, 25 fps, H.264, yuv420p.
- Audio: two AAC mono tracks per master.

## Existing metadata mapping

| IIW production episode | IIW English title | Existing bible mapping |
| --- | --- | --- |
| 01 | PANDAPOCALYPSE | Frankenpanda |
| 02 | IT TAKES A SLOB | It Takes A Slob |
| 03 | TOTALLY VINTAGE | Totally Vintage |
| 04 | STINK-O-RAMA | new metadata needed |
| 05 | CREEPY CRAWLY CREATURE CATCHER | Creepy Crawly Creature Catcher |
| 06 | TOTALLY TROLLING MUCH | Totally Trolling, Much? |
| 07 | OVER-SIMULATED | Over |
| 08 | IT'S TOTALLY A TEST | It's Totally a Test |
| 09 | TERRIBLE TODDLER TOYS | new metadata needed |
| 10 | TOTALLY TALENTED | Totally Talented |
| 11 | THE DAH WHO | The DAH |
| 12 | MEGA MOON CHEESE | Mega Moon Cheese |
| 13 | THE WILD LIFE | new metadata needed |
| 14 | WHAT WOOLLY MAMMOTH | What Woolly Mammoth |
| 15 | MYSTERY ON THE WOOHP EXPRESS | new metadata needed |
| 16 | PUMPKIN PARTICLE PERIL V2 | new metadata needed |
| 17 | UNDERCOVER SUPERVILLAINS | Undercover Supervillains |
| 18 | MANDYS MIND-BLOWING MAINFRAME | new metadata needed |
| 19 | OLDIES AND GOODIES | new metadata needed |
| 20 | TOTALLY PAWSOME | Totally Pawsome |
| 21 | A DOG GONE DAY | new metadata needed |
| 22 | SOMETHINGS FISHY | new metadata needed |
| 23 | FOREVER LIPTASTIC | new metadata needed |
| 24 | GLITTERSPY | new metadata needed |
| 25 | LOCKED IN SPACE PERIL | new metadata needed |
| 26 | CYBER SWEETHEART | new metadata needed |

## Extraction utility smoke test

Dry-run command:

```bash
python tools/prepare_iiw_english_training_data.py \
  --episode 01 \
  --dry-run \
  --output-dir "${TMPDIR:-/tmp}/iiw-english-dry-run"
```

Result:

| EP | Threshold | Detected cuts | Planned clips | Planned duration |
| --- | ---: | ---: | ---: | ---: |
| 01 | 0.30 | 527 | 320 | 21.2 min |
| 01 | 0.40 | 427 | 298 | 20.9 min |
| 01 | 0.50 | 333 | 253 | 19.8 min |

Observation: the EP01 dry-run segments more finely than the old YouTube-derived dataset for the same story. The old dataset had 159 clips for Frankenpanda and 1,551 clips across 13 episodes. Even at threshold 0.50, EP01 still yields 253 planned clips, so full-season extraction needs filters beyond the raw scene threshold: title/credits removal, near-duplicate suppression, and possibly caps per scene type or per episode.

## Tuned licensed-English dry-runs

Updated dry-run command shape:

```bash
python tools/prepare_iiw_english_training_data.py \
  --episode 01 \
  --scene-threshold 0.60 \
  --min-duration 3.0 \
  --max-duration 7.0 \
  --exclude-start 12 \
  --exclude-end 45 \
  --dedupe-threshold 4 \
  --dedupe-window 8 \
  --dry-run \
  --output-dir "${TMPDIR:-/tmp}/iiw-english-dryruns/ep01_t060"
```

Representative dry-run results:

| EP | Title | Mapped? | Threshold | Candidates | Accepted | Rejected | Planned duration | Rejection reasons |
| --- | --- | --- | ---: | ---: | ---: | ---: | ---: | --- |
| 01 | PANDAPOCALYPSE | yes | 0.60 | 151 | 142 | 9 | 13.2 min | title=3, credits=6 |
| 05 | CREEPY CRAWLY CREATURE CATCHER | yes | 0.60 | 159 | 148 | 11 | 13.5 min | title=3, credits=8 |
| 20 | TOTALLY PAWSOME | yes | 0.60 | 160 | 154 | 6 | 14.2 min | title=3, credits=3 |
| 04 | STINK-O-RAMA | new metadata | 0.60 | 172 | 163 | 9 | 14.8 min | title=3, credits=6 |
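
The `title` and `credits` rejection reasons above could come from logic along the lines of the sketch below. This is an illustration, not the tool's implementation. It assumes `--exclude-start` and `--exclude-end` are seconds measured from the start and end of the episode, that shots shorter than the minimum are never candidates, and that long shots are trimmed to the maximum; `plan_clips` and its return shape are hypothetical.

```python
from collections import Counter


def plan_clips(cuts, episode_duration, min_duration=3.0, max_duration=7.0,
               exclude_start=12.0, exclude_end=45.0):
    """Turn scene-cut timestamps into accepted and rejected clip candidates.

    Assumption: exclusion windows are in seconds from each end of the episode.
    """
    accepted, rejected = [], []
    bounds = sorted(set([0.0, *cuts, episode_duration]))
    for start, end in zip(bounds, bounds[1:]):
        if end - start < min_duration:
            continue  # too short to be a candidate at all
        end = min(end, start + max_duration)  # trim long shots to the cap
        if start < exclude_start:
            rejected.append((start, end, "title"))
        elif end > episode_duration - exclude_end:
            rejected.append((start, end, "credits"))
        else:
            accepted.append((start, end))
    reasons = Counter(reason for _, _, reason in rejected)
    return accepted, rejected, dict(reasons)
```

Recording a reason per rejected candidate is what makes a per-episode summary such as `title=3, credits=6` cheap to report.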

Additional EP01 comparison:

| EP | Threshold | Min–max duration (s) | Accepted clips | Planned duration |
| --- | ---: | --- | ---: | ---: |
| 01 | 0.50 | 2.5–7.0 | 218 | 18.0 min |
| 01 | 0.60 | 3.0–7.0 | 142 | 13.2 min |

Current recommended pilot extraction settings:

- `--scene-threshold 0.60`
- `--min-duration 3.0`
- `--max-duration 7.0`
- `--exclude-start 12`
- `--exclude-end 45`
- `--dedupe-threshold 4`
- `--dedupe-window 8`
- no hard max clip cap for the pilot, because the representative episodes land within the 120–180 clip target
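
For orientation, the two dedupe settings can be read as in the sketch below. It assumes, without confirmation from the tool, that `--dedupe-threshold` is a maximum Hamming distance between 64-bit perceptual hashes and that `--dedupe-window` is the number of most recently kept clips to compare against. The function names are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def suppress_near_duplicates(hashes, threshold=4, window=8):
    """Return indices of candidates to keep, in timeline order.

    hashes: one perceptual hash (as an int) per candidate clip.
    Assumption: a candidate is dropped when it is within `threshold` bits
    of any of the last `window` kept candidates.
    """
    kept = []
    for i, h in enumerate(hashes):
        recent = kept[-window:]
        if any(hamming(h, hashes[j]) <= threshold for j in recent):
            continue  # near-duplicate of a recently kept clip
        kept.append(i)
    return kept
```

A bounded window only suppresses repeats that sit close together on the timeline, so a shot that recurs much later in the episode would still be kept under this reading.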

Pilot candidate set:

- EP01 `PANDAPOCALYPSE` — mapped old bible/captions
- EP05 `CREEPY CRAWLY CREATURE CATCHER` — mapped old bible/captions
- EP20 `TOTALLY PAWSOME` — mapped old bible/captions
- EP04 `STINK-O-RAMA` — unmapped/new metadata stress test

This yields 607 accepted clips and about 55.7 minutes of licensed H.264 training media before captioning/filter review.

## Four-episode pilot extraction

Command used:

```bash
python tools/prepare_iiw_english_training_data.py \
  --episode 01 \
  --episode 05 \
  --episode 20 \
  --episode 04 \
  --scene-threshold 0.60 \
  --min-duration 3.0 \
  --max-duration 7.0 \
  --exclude-start 12 \
  --exclude-end 45 \
  --dedupe-threshold 4 \
  --dedupe-window 8 \
  --output-dir materials/training-data/iiw-english-pilot
```

Output:

- `materials/training-data/iiw-english-pilot/clips/` — 607 MP4 clips
- `materials/training-data/iiw-english-pilot/first_frames/` — 607 PNG first frames
- `materials/training-data/iiw-english-pilot/manifest.json`
- `materials/training-data/iiw-english-pilot/extraction_plan.json`

Validation:

| Metric | Value |
| --- | ---: |
| Clips | 607 |
| First frames | 607 |
| Total clip duration | 55.73 min |
| Resolution | 1920×1080 |
| FPS | 25 |
| Disk usage | ~2.2 GB |

Episode counts:

| EP | Title | Clips |
| --- | --- | ---: |
| 01 | PANDAPOCALYPSE | 142 |
| 04 | STINK-O-RAMA | 163 |
| 05 | CREEPY CRAWLY CREATURE CATCHER | 148 |
| 20 | TOTALLY PAWSOME | 154 |
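
The clip and first-frame counts can be rechecked with a small script like the one below. It is a sketch that assumes each clip and its first frame share a file stem (for example `x.mp4` and `x.png`), which this document does not state; `check_pairs` is a hypothetical helper.

```python
from pathlib import Path


def check_pairs(pilot_dir: Path) -> dict:
    """Compare clips/*.mp4 against first_frames/*.png by file stem."""
    clips = {p.stem for p in (pilot_dir / "clips").glob("*.mp4")}
    frames = {p.stem for p in (pilot_dir / "first_frames").glob("*.png")}
    return {
        "clips": len(clips),
        "first_frames": len(frames),
        "clips_without_frame": sorted(clips - frames),
        "frames_without_clip": sorted(frames - clips),
    }
```

Calling `check_pairs(Path("materials/training-data/iiw-english-pilot"))` would be expected to report equal counts and two empty lists if the naming assumption holds.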

## Pilot Wan2.2 metadata scaffold

Tool:

```bash
python tools/build_iiw_pilot_wan22_metadata.py \
  --pilot-dir materials/training-data/iiw-english-pilot \
  --max-start-delta 3.0
```

Outputs:

- `materials/training-data/iiw-english-pilot/diffsynth_metadata.jsonl` — 607 rows
- `materials/training-data/iiw-english-pilot/wan21_metadata.json`
- `materials/training-data/iiw-english-pilot/wan2.1_metadata.json`
- `materials/training-data/iiw-english-pilot/metadata_summary.json`

Caption scaffold sources:

| Source | Clips |
| --- | ---: |
| nearest old reference manifest | 326 |
| generic uncaptioned IIW master prompt | 281 |

By episode:

| EP | Old-caption scaffold | Generic prompt |
| --- | ---: | ---: |
| 01 | 121 | 21 |
| 04 | 0 | 163 |
| 05 | 98 | 50 |
| 20 | 107 | 47 |
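
One plausible reading of `--max-start-delta 3.0` is a nearest-start-time match with a cutoff, sketched below. The matching key (clip start time in seconds within the same episode) and the function name are assumptions; the real scaffold tool may match differently.

```python
from bisect import bisect_left


def nearest_reference(start, ref_starts, max_start_delta=3.0):
    """Return the index of the closest reference start time, or None.

    ref_starts must be sorted ascending. None means no old reference is
    close enough, so the clip would fall back to the generic prompt.
    """
    if not ref_starts:
        return None
    pos = bisect_left(ref_starts, start)
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(ref_starts)]
    best = min(candidates, key=lambda i: abs(ref_starts[i] - start))
    return best if abs(ref_starts[best] - start) <= max_start_delta else None
```

Under this reading, an episode with no old reference rows, such as EP04, sends every clip to the generic prompt, which matches the 0/163 split in the table.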

## VLM caption pass and revalidation

Tool:

```bash
python tools/caption_iiw_pilot_generic_clips.py \
  --pilot-dir materials/training-data/iiw-english-pilot
```

Revalidation command for old-reference rows:

```bash
python tools/caption_iiw_pilot_generic_clips.py \
  --pilot-dir materials/training-data/iiw-english-pilot \
  --source nearest_old_reference_manifest
```

Implementation notes:

- Uses short contact sheets sampled from each clip.
- Calls Ollama through the HTTP API with base64 image payloads.
- Can target rows by `caption_source`.
- Persists after each clip and is resumable.
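
The HTTP call with a base64 image payload might look like the sketch below. It uses Ollama's `/api/generate` endpoint with the `images` field and `stream` set to false. The model name `llava`, the default host, and both function names are placeholders, not values taken from `caption_iiw_pilot_generic_clips.py`.

```python
import base64
import json
import urllib.request


def build_payload(image_bytes: bytes, prompt: str, model: str = "llava") -> dict:
    """Build an Ollama generate request carrying one contact sheet.

    Assumption: `model` is a placeholder; use whichever VLM is installed.
    """
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one JSON object instead of a stream
    }


def caption(payload: dict, host: str = "http://localhost:11434") -> str:
    """POST the payload to a local Ollama server and return the text reply."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["response"]
```

Writing each returned caption back to the metadata file before moving to the next clip is what would make a pass like this resumable after an interruption.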

Result after recaptioning both generic and old-reference scaffold rows:

| Caption source after pass | Clips |
| --- | ---: |
| Ollama VLM contact-sheet caption | 607 |
| nearest old reference manifest | 0 |
| remaining generic rows | 0 |

The VLM/QC pass marked 6 clips as `training_usable=false`, mainly opening/title-card material plus one QA-failed character-ID risk row.

## Wan2.2 pilot package metadata

Tool:

```bash
python tools/build_iiw_pilot_wan22_package.py \
  --pilot-dir materials/training-data/iiw-english-pilot
```

Outputs:

- `materials/training-data/iiw-english-pilot/diffsynth_metadata.jsonl`
- `materials/training-data/iiw-english-pilot/wan21_metadata.json`
- `materials/training-data/iiw-english-pilot/wan2.1_metadata.json`
- `materials/training-data/iiw-english-pilot/wan22_pilot_package_manifest.json`

Training metadata now excludes the 6 clips marked `training_usable=false`.

| Metric | Value |
| --- | ---: |
| Manifest clips | 607 |
| DiffSynth training rows | 601 |
| Rejected from training metadata | 6 |
| Reviewed usable identity anchors referenced for eval/reference | 76 |
| Strict TRAIN identity anchors within companion manifest | 22 |
| Low-weight/eval identity references within companion manifest | 54 |
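
The exclusion step amounts to a filter on the `training_usable` flag before the JSONL is written, roughly as sketched below. Apart from `training_usable`, the row fields and function names are hypothetical, and rows without the flag are assumed to be usable.

```python
import json


def training_rows(manifest_rows):
    """Keep rows unless they are explicitly marked training_usable=false."""
    return [r for r in manifest_rows if r.get("training_usable", True)]


def to_jsonl(rows) -> str:
    """Serialize rows as one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in rows)
```

Filtering at package time rather than deleting media keeps the manifest at 607 clips while the training metadata drops to 601 rows.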

Training metadata episode counts:

| EP | Training rows |
| --- | ---: |
| 01 | 141 |
| 04 | 162 |
| 05 | 146 |
| 20 | 152 |

Training metadata caption source counts:

| Source | Rows |
| --- | ---: |
| Ollama VLM contact-sheet caption | 601 |

## Video-only smoke subset

Tool:

```bash
python tools/build_iiw_video_smoke_subset.py \
  --per-episode 40 \
  --force
```

Output:

- `materials/training-data/iiw-english-smoke-video-only/manifest.json`
- `materials/training-data/iiw-english-smoke-video-only/diffsynth_metadata.jsonl`
- `materials/training-data/iiw-english-smoke-video-only/wan21_metadata.json`
- `materials/training-data/iiw-english-smoke-video-only/wan2.1_metadata.json`
- `materials/training-data/iiw-english-smoke-video-only/wan22_smoke_package_manifest.json`
- hardlinked media under `clips/` and `first_frames/`

Validation:

| Metric | Value |
| --- | ---: |
| Smoke rows | 160 |
| Episode balance | 40 each from EP01/04/05/20 |
| Caption source | 160 Ollama VLM contact-sheet captions |
| Video-only | yes |
| Identity plates in training | no |
| Total duration | 14.73 min |
| Media paths missing | 0 |

The smoke subset uses hardlinks for 160 MP4 clips and 160 PNG first frames, so it avoids duplicating the pilot media payload on disk.
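
A minimal sketch of the hardlinking step follows. `link_media` is a hypothetical helper, and its `force` behaviour is only a guess at what `--force` does in `build_iiw_video_smoke_subset.py`. Hardlinks require source and destination to be on the same filesystem.

```python
import os
from pathlib import Path


def link_media(src: Path, dst: Path, force: bool = False) -> None:
    """Hardlink src to dst so both names share one copy of the data."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    if dst.exists():
        if not force:
            raise FileExistsError(dst)
        dst.unlink()  # assumption: force replaces an existing link
    os.link(src, dst)
```

Because both paths point at the same file data, deleting the smoke subset later leaves the pilot clips intact.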

Training wrapper validation:

```bash
python tools/run_wan22_train.py \
  --training-data-dir materials/training-data/iiw-english-smoke-video-only \
  --output-dir materials/training-data/iiw-english-smoke-video-only/wan22_checkpoints \
  --model-variant ti2v-5b \
  --lora-rank 16 \
  --epochs 1 \
  --dataset-repeat 20 \
  --learning-rate 2e-5 \
  --num-frames 81 \
  --height 480 \
  --width 832 \
  --gradient-accumulation-steps 4 \
  --dry-run
```

Dry-run status: passed locally; all 160 metadata rows and media paths validated, and the Accelerate command was rendered without starting training.
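
As a rough sizing aid for the eventual GPU run, the flags above imply the step count computed below. The sketch assumes a per-device batch size of 1, a single process, and that `--dataset-repeat` multiplies the number of samples seen per epoch; none of these is confirmed by the wrapper, and `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(rows, dataset_repeat, epochs, grad_accum,
                    batch_size=1, num_processes=1):
    """Estimate optimizer updates for a run (assumptions in the lead-in)."""
    samples_per_epoch = rows * dataset_repeat
    batches_per_epoch = math.ceil(samples_per_epoch / (batch_size * num_processes))
    return epochs * math.ceil(batches_per_epoch / grad_accum)


# 160 rows x 20 repeats = 3200 samples; 3200 / 4 accumulation steps = 800
print(optimizer_steps(rows=160, dataset_repeat=20, epochs=1, grad_accum=4))  # 800
```

With more processes or a larger batch size the estimate shrinks proportionally.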

Nix helper status: `runWan22Train` builds successfully and exposes `spies-run-wan22-train`.

## Recommended next step

Run a video-only Wan2.2 smoke test from `materials/training-data/iiw-english-smoke-video-only` on a GPU host. Do not train with strict identity anchors yet; use identity plates as eval/reference only until the identity manifest is tightened. Keep the old YouTube-derived `materials/training-data/clips/` untouched.

## Do not do yet

- Do not overwrite `materials/training-data/clips/`.
- Do not train from both English and French title variants.
- Do not use `03_PROPS/03_PROPS/` on its own as the prop source. Use `03_PROPS.zip` for the complete props package.
- Do not commit large extracted clip folders until the extraction settings are validated.
