# Training Dataset — Reference
*Single source of truth for the current YouTube-derived reference dataset. Last updated: 2026-05-03.*

---

## 2026-05-03 licensed-source update

The 1,551-clip dataset described below was built from official YouTube reference material and remains valuable for metadata, captions, evaluation, and bootstrap labels. It is no longer the final training source.

Licensed IIW/Banijay English production masters are now available and indexed in `materials/training-data/iiw_english_source_manifest.json` and `docs/internal/iiw-english-episode-source-manifest.csv`. Rebuild training clips from those English masters before the next Wan2.2 training run. Keep this dataset as a reference/evaluation set and caption scaffold.

---

## Numbers at a glance

| Metric | Value |
|---|---|
| Total clips | 1,551 |
| Duration | 106 minutes of 720p video |
| Reference frames | 6,645 |
| `training_caption` | 1,551 / 1,551 — 100% |
| Characters named | 1,183 / 1,551 — 76% |
| Outfit data | 856 / 1,551 — 55% |
| Villain named | 354 / 1,551 — 23% |
| Location known | 1,551 / 1,551 — 100% |
| Scene type | 1,551 / 1,551 — 100% |
| Dialogue attributed (clip) | 605 / 1,551 — 39% |
| Dialogue attributed (segment) | 3,918 / 5,888 — 67% |
| Wiki-named gadgets | 79 / 1,551 — 5% |
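
The clip-level rows above can be recomputed directly from `manifest.json`. A minimal sketch, assuming records loaded with `json.load` and the field names documented in the record structure below (the helper name is illustrative):

```python
def coverage_stats(records):
    """Recompute clip-level coverage rows from a list of manifest.json records."""
    n = len(records)

    def pct(k):
        return f"{k:,} / {n:,} — {round(100 * k / n)}%"

    def present(key):
        # A field counts as covered when it is non-empty (non-empty list/dict/string).
        return sum(1 for r in records if r.get(key))

    return {
        "training_caption": pct(present("training_caption")),
        "Characters named": pct(present("characters")),
        "Outfit data":      pct(present("outfits")),
        "Location known":   pct(present("location")),
        "Scene type":       pct(present("scene_type")),
    }
```

Run it against `manifest.json` after any rebuild to confirm the table stays in sync with the data.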

---

## Clip record structure

Each entry in `manifest.json`:

```json
{
  "clip":             "clip_00048.mp4",
  "episode":          "Frankenpanda",
  "episode_id":       "7lA-b6ou8yc",
  "shot_key":         "7lA-b6ou8yc:s0080",
  "start_s":          42.16,
  "duration":         3.84,
  "width":            720,
  "height":           406,
  "fps":              25.0,

  "scene_type":       "dialogue",
  "shot_framing":     "medium-wide",
  "location":         "Singapore City",
  "location_raw":     "Singapore City",

  "characters":       ["Clover", "Sam", "Alex"],
  "outfits": {
    "Sam":    "green catsuit with white trim",
    "Clover": "red catsuit with silver accents"
  },
  "transcript":       "Sam: And now we're stranded in Singapore with this totes gross claw! Sam: Ugh!",

  "caption":          "Three spies stand on a Singapore City waterfront dock...",
  "structured_caption": "A dialogue scene from Totally Spies Season 7...",
  "training_caption": "Three spies stand on a Singapore City waterfront dock as Clover (blonde hair, red catsuit) kicks toward a large red mechanical claw while Sam (red/orange hair, green catsuit) and Alex (short black hair, yellow/gold catsuit) observe... [Dialogue scene | Location: Singapore City | Characters: Sam (red-orange hair, green catsuit), Clover (blonde hair, red catsuit), Alex (short black hair, yellow catsuit) | Dialogue: Sam: \"And now we're stranded in Singapore with this totes gross claw!\"]",

  "shot_annotation": {
    "shot_size":    "medium-wide",
    "camera_angle": "eye-level",
    "composition":  ["group shot"],
    "motion":       ["static frame"]
  },
  "caption_entities": {
    "characters":          ["Clover", "Sam", "Alex"],
    "locations":           ["Singapore City"],
    "gadgets":             ["large red mechanical claw on the ground"],
    "villain_description": ""
  },
  "confidence_notes": ["Claw function not wiki-verified"]
}
```

---

## training_caption — the field to use for training

Combines:
1. **VLM visual description** (accurate, conservative, no hallucinated names)
2. **Context tag** in square brackets: `[Scene | Location | Characters | Dialogue]`

Format:
```
{visual description} [{scene label} | Location: {canonical} | Characters: {name (hair, suit)}, ... | Dialogue: {speaker}: "{line}"]
```

Character names always include visual anchors: `Sam (red-orange hair, green catsuit)`. This ensures the model learns the visual identity, not just the abstract name.
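
Assembling the field from manifest data can be sketched as follows; the helper name and signature are illustrative, but the output follows the format above:

```python
def build_training_caption(visual_desc, scene_label, location,
                           characters=None, dialogue=None):
    """Combine a VLM visual description with the bracketed context tag.

    `characters` is a list of (name, visual_anchor) pairs, e.g.
    ("Sam", "red-orange hair, green catsuit"); `dialogue` is an
    optional (speaker, line) pair.
    """
    parts = [scene_label, f"Location: {location}"]
    if characters:
        parts.append("Characters: " +
                     ", ".join(f"{name} ({anchor})" for name, anchor in characters))
    if dialogue:
        speaker, line = dialogue
        parts.append(f'Dialogue: {speaker}: "{line}"')
    return f"{visual_desc} [{' | '.join(parts)}]"
```

Empty optional fields simply drop out of the tag, so the same helper covers clips without named characters or attributed dialogue.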

---

## Dataset files

| File | Format | Use |
|---|---|---|
| `manifest.json` | JSON array | Complete metadata, source of truth |
| `ltx2_dataset.json` | JSON array | LTX-2 format: `{caption, media_path, location, scene_type}` |
| `ltx2_dataset.jsonl` | JSONL | Same, line-delimited (for streaming) |
| `wan21_metadata.json` | JSON array | Legacy Wan2.2 metadata for the YouTube-derived reference dataset: `{caption, media_path, first_frame, duration, location, scene_type}` |
| `iiw_english_source_manifest.json` | JSON object | Canonical licensed English production-master source manifest for the next rebuild |

The derived dataset files (`ltx2_dataset.json`, `ltx2_dataset.jsonl`, `wan21_metadata.json`) use `training_caption` as their `caption` field; `manifest.json` keeps all three caption variants as separate fields. The IIW source manifest indexes replacement source videos, not extracted clips.
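
Regenerating the LTX-2 JSONL from the manifest is a straightforward projection. A sketch, assuming the field names above; the `clips/` prefix for `media_path` is an assumption, not confirmed by this doc:

```python
import json

def manifest_to_ltx2_jsonl(manifest_path, out_path):
    """Project manifest records onto {caption, media_path, location, scene_type}.

    training_caption becomes the caption field, matching the derived files.
    """
    with open(manifest_path) as f:
        records = json.load(f)
    with open(out_path, "w") as out:
        for r in records:
            out.write(json.dumps({
                "caption":    r["training_caption"],
                "media_path": f"clips/{r['clip']}",   # path prefix assumed
                "location":   r["location"],
                "scene_type": r["scene_type"],
            }) + "\n")
```

The line-delimited output streams cleanly into training loaders that consume one JSON object per line.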

---

## Caption methodology

### Why two passes

Pass 1 injected the episode gadget list into every VLM prompt to help name gadgets. Result: 664 gadgets named, but flagged as uncertain by the VLM itself, which had pattern-matched bible names onto ambiguous visuals.

Pass 2 (current) uses a conservative prompt:
- No gadget list injected
- No villain names injected
- Describe only what is clearly visible
- Empty over guess

Two-stage gadget naming then applies:
1. VLM returns: `"green spray bottle with orange nozzle held in right hand"`
2. Python F1 token-overlap matching against `gadget-lockdown-v3.json` appends: `(Ultra-fixative structural foam)`

Two false-positive guards apply: context tokens (screen, background, surface, floating, table) are excluded from the matching vocabulary, and every match must include at least one distinctive token, i.e. a token appearing in ≤20% of gadget descriptions.
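
A minimal sketch of the token-overlap F1 matcher. The context-token exclusion and the ≤20% distinctive-token rule are from this doc; the F1 threshold, function names, and the assumption that `gadget-lockdown-v3.json` maps gadget name to visual description are illustrative:

```python
CONTEXT_TOKENS = {"screen", "background", "surface", "floating", "table"}

def tokenize(text):
    """Lowercase, split on whitespace/hyphens, drop context tokens."""
    return {t for t in text.lower().replace("-", " ").split()
            if t not in CONTEXT_TOKENS}

def f1_match(vlm_desc, gadget_db, min_f1=0.5, rare_frac=0.20):
    """Return (gadget_name, f1) for the best bible match, or None."""
    desc_tokens = tokenize(vlm_desc)
    if not desc_tokens:
        return None
    token_sets = {name: tokenize(d) for name, d in gadget_db.items()}
    # Document frequency of each token across the gadget bible
    df = {}
    for toks in token_sets.values():
        for t in toks:
            df[t] = df.get(t, 0) + 1
    n = len(gadget_db)
    best = None
    for name, toks in token_sets.items():
        overlap = desc_tokens & toks
        if not overlap:
            continue
        # Distinctive-token requirement: ≥1 token in ≤20% of entries
        if not any(df[t] <= max(1, rare_frac * n) for t in overlap):
            continue
        p = len(overlap) / len(toks)
        r = len(overlap) / len(desc_tokens)
        f1 = 2 * p * r / (p + r)
        if f1 >= min_f1 and (best is None or f1 > best[1]):
            best = (name, f1)
    return best
```

On a match, the gadget name is appended to the VLM description in parentheses, as in the two-stage example above.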

### Character identification

Anchor rules (applied in order):
1. **Catsuit anchor**: `red/orange hair + green catsuit` → Sam
2. **Hair-only anchor**: `red/orange hair` alone (any outfit) → Sam
3. Caption text scan for hair+outfit co-occurrence

The hair-only fallback recovered 591 clips in which characters wear civilian clothes; the catsuit requirement alone had blocked them.
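
A sketch of rules 1 and 2 (the caption co-occurrence scan of rule 3 is omitted). Sam's anchors come from the rules above; Clover's and Alex's entries follow the same pattern and are inferred from the caption examples in this doc:

```python
# (character, hair phrases, catsuit phrase) — assumed anchor table
ANCHORS = [
    ("Sam",    ("red hair", "orange hair", "red-orange hair"), "green catsuit"),
    ("Clover", ("blonde hair",),                               "red catsuit"),
    ("Alex",   ("black hair",),                                "yellow catsuit"),
]

def identify_characters(caption):
    """Apply anchor rules in order: catsuit anchor first, then hair-only."""
    text = caption.lower()
    found = {}
    for name, hair_phrases, suit in ANCHORS:
        if any(h in text for h in hair_phrases):
            found[name] = "catsuit-anchor" if suit in text else "hair-only"
    return found
```

Keeping the rule that fired alongside each name makes it easy to audit how many identifications rely on the weaker hair-only fallback.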

### Speaker attribution

Pipeline: pyannote/speaker-diarization-3.1 + PLDA → SPEAKER_N labels → character name map → transcript segment matching.

Attribution uses max-overlap: for each transcript segment, pick the speaker turn with the most seconds of overlap, and accept it only if the overlap covers at least 30% of the segment duration. Force-alignment with faster-whisper word-level timestamps improves the matches further.

Final: 67% of segments attributed (3,918 / 5,888).
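
The max-overlap step can be sketched as below; names and signatures are illustrative, the 30% threshold is from the description above:

```python
def attribute_segment(segment, speaker_turns, min_frac=0.30):
    """Assign a transcript segment to the diarization speaker with max overlap.

    segment is (start_s, end_s); speaker_turns maps SPEAKER_N labels to lists
    of (start_s, end_s) turns. Returns the label, or None when no speaker's
    overlap reaches min_frac of the segment duration.
    """
    seg_start, seg_end = segment
    seg_dur = seg_end - seg_start
    best_label, best_overlap = None, 0.0
    for label, turns in speaker_turns.items():
        total = sum(max(0.0, min(seg_end, e) - max(seg_start, s))
                    for s, e in turns)
        if total > best_overlap:
            best_label, best_overlap = label, total
    if best_overlap >= min_frac * seg_dur:
        return best_label
    return None
```

The label returned here is a `SPEAKER_N` identifier; the character name map in `speaker-character-maps.json` resolves it to a character per episode.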

---

## Location categories

| Category | Count | % |
|---|---|---|
| WOOHP HQ | 409 | 26% |
| Singapore City | 200 | 13% |
| Snowy Environment | 189 | 12% |
| Villain Lair | 110 | 7% |
| AIYA Academy | 109 | 7% |
| Other | 76 | 5% |
| Bubble Spy Café | 63 | 4% |
| Clothing Store | 55 | 4% |
| Spies Apartment | 50 | 3% |
| Other Indoor | 43 | 3% |
| Vehicle Interior | 40 | 3% |
| Stage / Performance | 39 | 3% |
| Beach / Waterfront | 36 | 2% |
| Space | 25 | 2% |
| Construction / Industrial | 24 | 2% |
| Museum / Cultural | 24 | 2% |
| Other Outdoor | 22 | 1% |
| Forest / Nature | 15 | 1% |
| Restaurant / Food | 13 | 1% |
| Desert | 9 | 1% |

---

## Scene type distribution

| Type | Count | % |
|---|---|---|
| dialogue | 556 | 36% |
| comedy-reaction | 343 | 22% |
| gadget-reveal | 237 | 15% |
| action | 182 | 12% |
| location-establish | 142 | 9% |
| transformation | 59 | 4% |
| chase | 20 | 1% |
| transition | 8 | 1% |
| other | 4 | <1% |

---

## Character distribution

| Character | Clips | % |
|---|---|---|
| Clover | 598 | 39% |
| Sam | 582 | 38% |
| Alex | 444 | 29% |
| Mandy | 137 | 9% |
| Toby | 128 | 8% |
| Zerlina | 104 | 7% |
| Jerry | 80 | 5% |

Note: Alex at 29% reflects a genuine show bias (she is less present in S7 than Sam/Clover), not a dataset error.

---

## Remaining gaps

| Gap | Count | Can improve? |
|---|---|---|
| Anonymous clips | 164 (11%) | No — these are prop/hand/effect shots with no visible faces |
| Multi-villain uncertain | 83 | Marginally — group shots without distinguishing features |
| Dialogue unattributed (clip) | 946 (61%) | Partially — GPU force-alignment would push further |
| Licensed clip rebuild | 26 English masters | Yes — source masters are available and indexed, but clips still need re-extraction |
| New production episodes without old bible metadata | 13 eps | Yes — generate new captions/story metadata from IIW masters |
| Alex under-represented in YouTube clips | 29% vs 39% | Motion screen-time remains show-biased, but IIW design sheets can improve Alex identity balance in image/keyframe training |

---

## Cross-reference files

| File | Contents |
|---|---|
| `gadget-lockdown-v3.json` | 136 gadget entries, 13 episodes, visual descriptions + F1 matching |
| `villain-database-final.json` | 19 villain entries, per-episode |
| `villain-visual-db.json` | Villain visual descriptions + episode defaults |
| `speaker-character-maps.json` | SPEAKER_N → character name, all 13 episodes |
| `locations-canonical.json` | 18-category canonical map |
