# Training Data

## Contents

| File | Description |
|---|---|
| `manifest.json` | 1,551 clips with structured captions, bible metadata, episode/shot IDs |
| `clips/` | 720p H.264 MP4 clips (avg 4.1s, 105 min total, 611 MB) |
| `first_frames/` | First frame PNG per clip |

## Source

The current `clips/` and `first_frames/` were extracted from official S7 YouTube downloads and are for reference and validation only.
They are NOT the final licensed training media.

Licensed IIW/Banijay English production masters are now indexed in:

- `materials/training-data/iiw_english_source_manifest.json`
- `docs/internal/iiw-english-episode-source-manifest.csv`

Next rebuild (status so far):
1. A four-episode licensed pilot has been extracted into the isolated directory `materials/training-data/iiw-english-pilot/` using tuned settings: threshold `0.60`, duration `3.0–7.0s`, exclude first `12s`, exclude final `45s`, dedupe threshold `4`.
2. Wan2.2/DiffSynth metadata is available in `iiw-english-pilot/diffsynth_metadata.jsonl` with 601 video training rows; 6 VLM/QA-flagged unusable clips are excluded from training metadata.
3. All 607 pilot manifest rows have been captioned with the Ollama VLM contact-sheet workflow; no old-reference scaffold captions remain in training metadata.
4. Package metadata references the reviewed usable character identity anchor manifest for eval/reference use only. It lists 76 usable anchors, 22 of which are strict TRAIN anchors that still need tightening before training use.
5. Video-only smoke subset prepared under `iiw-english-smoke-video-only/` with 160 balanced rows (40 each from EP01/04/05/20); Wan2.2 wrapper dry-run validation passed. Run this subset on a GPU host before full-pilot training and do not train with strict identity anchors until the identity manifest is tightened.
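
The window settings in step 1 and the balanced smoke subset in step 5 can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Shot` type, the `episode` row field, and the rule that a shot is dropped when any part of it overlaps the first 12 s or the final 45 s are assumptions. The detection threshold `0.60` and the dedupe threshold `4` are not modelled here.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    """Hypothetical shot record; times are seconds from episode start."""
    start: float
    end: float


def keep_shot(shot, episode_duration, min_len=3.0, max_len=7.0,
              skip_head=12.0, skip_tail=45.0):
    """Return True if the shot passes the duration and head/tail filters."""
    length = shot.end - shot.start
    if not (min_len <= length <= max_len):
        return False
    # Assumption: reject a shot if any part of it falls in an excluded zone.
    if shot.start < skip_head:
        return False
    if shot.end > episode_duration - skip_tail:
        return False
    return True


def balanced_subset(rows, per_episode=40, seed=0, key="episode"):
    """Sample the same number of rows from every episode (fixed seed)."""
    by_episode = defaultdict(list)
    for row in rows:
        by_episode[row[key]].append(row)  # `key` is a hypothetical field name
    rng = random.Random(seed)
    subset = []
    for episode in sorted(by_episode):
        pool = by_episode[episode]
        if len(pool) < per_episode:
            raise ValueError(f"{episode}: only {len(pool)} rows, need {per_episode}")
        subset.extend(rng.sample(pool, per_episode))
    return subset
```

With four episodes and `per_episode=40`, `balanced_subset` returns 160 rows, which matches the smoke subset size.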

## Character identity plates

Licensed IIW character plate derivatives are kept outside the legacy `clips/` set:

- `iiw-character-plates-pilot/png_2048/` — main trio PNG exports
- `iiw-character-plates-secondary/png_2048/` — supporting character/entity PNG exports
- `iiw-character-identity/manifest.json` — 80 candidate identity-anchor PNGs
- `iiw-character-identity/review/ollama_vlm_identity_plate_review.csv` — Ollama VLM review
- `iiw-character-identity/review/train_identity_manifest.vlm_reviewed.json` — 22 strict TRAIN identity anchors
- `iiw-character-identity/review/usable_identity_manifest.vlm_reviewed.json` — 76 usable rows (strict TRAIN plus low-weight EVAL_ONLY), with EXCLUDE rows removed
- `iiw-character-identity/review/contact_sheets_decision/` — decision-overlaid contact sheets for human spot-check

Use these as controlled identity anchors/references only; do not let them dominate episode-frame or video training.

## Caption status

- `structured_caption`: 1,551 / 1,551 ✅ (assembled from bible data)
- `caption` (VLM): 0 / 1,551 — run `step2-caption` to generate
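
The counts above could be recomputed from the manifest with a short script. This is a sketch that assumes `manifest.json` is a top-level JSON array of entries and that an empty or whitespace-only string means the caption is missing; `caption_coverage` and `load_entries` are illustrative names, not part of the repository.

```python
import json
from pathlib import Path


def caption_coverage(entries, fields=("structured_caption", "caption")):
    """Count entries that hold a non-empty string for each caption field."""
    counts = {}
    for field in fields:
        counts[field] = sum(
            1 for entry in entries
            if isinstance(entry.get(field), str) and entry[field].strip()
        )
    return counts, len(entries)


def load_entries(path="manifest.json"):
    """Assumption: the manifest is a JSON array of clip entries."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Example usage (requires manifest.json in the working directory):
# counts, total = caption_coverage(load_entries())
# for field, filled in counts.items():
#     print(f"{field}: {filled:,} / {total:,}")
```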

## Dataset format

Each manifest entry has the following fields:

```json
{
  "clip": "clip_00000.mp4",
  "duration": 4.1,
  "episode_id": "7lA-b6ou8yc",
  "shot_id": "s0007",
  "characters": ["Sam", "Clover"],
  "outfits": {"Sam": "green catsuit", "Clover": "red catsuit"},
  "location": "WOOHP HQ",
  "scene_type": "dialogue",
  "transcript": "Sam: We gotta get back in spy shape.",
  "structured_caption": "A dialogue scene from Totally Spies...",
  "story_context": "Episode context for VLM captioning...",
  "caption": ""
}
```
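
A lightweight check of an entry against this shape might look like the sketch below. The field types are inferred from the single example above, and the rule that every key in `outfits` must also appear in `characters` is an assumption rather than a documented constraint.

```python
# Field names come from the example entry; types are inferred, not specified.
EXPECTED_FIELDS = {
    "clip": str,
    "duration": (int, float),
    "episode_id": str,
    "shot_id": str,
    "characters": list,
    "outfits": dict,
    "location": str,
    "scene_type": str,
    "transcript": str,
    "structured_caption": str,
    "story_context": str,
    "caption": str,
}


def validate_entry(entry):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif not isinstance(entry[name], expected):
            problems.append(
                f"wrong type for {name}: {type(entry[name]).__name__}"
            )
    characters = entry.get("characters")
    outfits = entry.get("outfits")
    # Assumption: outfits should only describe characters listed in the clip.
    if isinstance(characters, list) and isinstance(outfits, dict):
        for who in outfits:
            if who not in characters:
                problems.append(f"outfit for unlisted character: {who}")
    return problems
```

Running `validate_entry` over every manifest row before training is a cheap way to catch missing fields early.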
