# Totally Spies S7 — Current Understanding
*Authoritative reference. Last updated: April 2026.*

---

## Project goal

Build a franchise-specific AI model (Wan2.2 LoRA) that generates Totally Spies Season 7 content on demand — keyframes, storyboards, and eventually animated clips — from a text brief or storyboard frame. Single remaining dependency: licensed episode files from Banijay.

---

## Season 7 facts

- **Setting:** Singapore (2024). AIYA Academy replaces Beverly Hills school.
- **Episodes:** 26 total. 13 available (official YouTube). 13 behind paywall.
- **Format:** 2D digital animation — clean vector linework, flat colour fills, anime-influenced character design. NOT cel-shaded CG.
- **Network:** Cartoon Network / Max (USA), France TV (France)
- **First aired:** 2024 (France), 2025 (USA)

---

## Character anchors — wiki-verified, do not change

| Character | Hair | Outfit | Role |
|---|---|---|---|
| **Sam** (Samantha) | Red / orange | GREEN catsuit | Spy, smart/logical |
| **Clover** | Blonde | RED catsuit | Spy, fashion-forward |
| **Alex** (Alexandra) | Short black | YELLOW / gold catsuit | Spy, athletic |
| **Jerry Lewis** | Grey | Dark business suit | WOOHP founder, consultant |
| **Zerlina Lewis** | Dark brown | Red blazer, professional attire | WOOHP World president |
| **Toby** | Black, dark complexion | Blue/casual lab attire | Gadget engineer |
| **Mandy** | Dark | Fashionable (not a spy) | School rival |
| **Cyberchac** | — | Panda-themed emoji visor, high-tech suit | S7 AI overarching villain |
| **Glitterstar / Mei Lin** | — | Café aesthetic | Bubble Spy Café manager |

**Critical:** Sam and Clover share the same body proportions. The only reliable visual differentiators are hair colour and suit colour. Without explicit anchoring, vision models confuse them ~30% of the time.

---

## S7 villain database — wiki-verified

| Episode | Villain(s) |
|---|---|
| Frankenpanda | Maya |
| Totally Pawsome | Cleocatra |
| The DAH | Piper Maverick |
| Undercover Supervillains | Caitlin, Dominique, Number 14 |
| Over | Shmagi, Ramesh |
| It Takes A Slob | Slob, Bernice |
| Totally Talented | Pink Ice, Muscles Malone |
| Creepy Crawly Creature Catcher | Cyberchac |
| Totally Vintage | Flambe |
| It's Totally a Test | Bjorn |
| Totally Trolling, Much? | Marco Lumiere, Shirley Lumiere |
| Mega Moon Cheese | Jacques Montague |
| What Woolly Mammoth? | Yanni Cross-stitch |

---

## Key gadgets — wiki-verified

| Gadget | Visual |
|---|---|
| WOOHP-e | The spies' new car in S7 (white vehicle) |
| Ultra-fixative Structural Foam | Green spray bottle with orange nozzle |
| Digitized Atomic Bangle | White fluffy wristband |
| Electromagnetic Hair Straightener | Green oval-shaped goggles (worn on head) |
| Yo-yo Lasso | Small white projectile on a rope |
| Spiked Heels | Red high-heeled shoes (worn by spy) |
| Ballpoint pen | Small pink/purple pen-like device |
| Bluetooth High-Tech Moo Box | Small cylindrical blue/white device |
| Collapsible Electronic Egg Whisker | Transforming silver/metallic device |

Full gadget DB: `materials/benchmark/youtube-s7-validation/bible/cross-reference/gadget-lockdown-v3.json`

---

## Canonical locations (18 categories)

WOOHP HQ · Singapore City · AIYA Academy · Bubble Spy Café · Villain Lair · Spies Apartment · Vehicle Interior · Beach / Waterfront · Space · Snowy Environment · Stage / Performance · Restaurant / Food · Museum / Cultural · Forest / Nature · Desert · Clothing Store · Construction / Industrial · Other Indoor · Other Outdoor

---

## Training dataset — final state (April 2026)

### Source material
- 13 S7 full episodes (official YouTube, reference only)
- 13 S7 episodes blocked behind paywall (Cartoon Network / Max)
- Licensed episode files: **not yet received from Banijay** — single remaining dependency

### Quantitative summary

| Metric | Value |
|---|---|
| Shots catalogued in bible | 2,852 |
| Training clips extracted | 1,551 (106 min, 720p) |
| Reference frames | 6,645 |
| Transcript segments | 5,888 (29,731 words) |
| Attributed segments | 3,918 / 5,888 (67%) |
| `training_caption` field | 1,551 / 1,551 (100%) |
| Characters named | 1,183 / 1,551 (76%) |
| Outfit data | 856 / 1,551 (55%) |
| Villain named | 354 / 1,551 (23%) |
| Location known | 1,551 / 1,551 (100%) |
| Scene type | 1,551 / 1,551 (100%) |
| Wiki-named gadgets | 79 / 1,551 (5%) |

### training_caption format

The primary training field. It combines a VLM visual description with bracketed narrative context:

```
{VLM visual description} [Scene: {type} | Location: {canonical} | Characters: Sam (red-orange hair, green catsuit), ... | Dialogue: Sam: "..."]
```

Example:
```
Three spies stand on a Singapore City waterfront dock as Clover (blonde hair, red catsuit) kicks toward a large red mechanical claw while Sam (red/orange hair, green catsuit) and Alex (short black hair, yellow/gold catsuit) observe; modern glass tower skyline behind them. [Dialogue scene | Location: Singapore City | Characters: Sam (red-orange hair, green catsuit), Clover (blonde hair, red catsuit), Alex (short black hair, yellow catsuit) | Dialogue: Sam: "And now we're stranded in Singapore with this totes gross claw!"]
```
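The format above can be assembled mechanically. A minimal sketch of a caption builder (the function name and argument shapes are illustrative, not the manifest's actual schema):

```python
def build_training_caption(visual, scene_type, location, characters, dialogue=None):
    """Compose a training_caption: VLM visual description plus bracketed context.

    characters: list of (name, anchor) pairs, e.g. ("Sam", "red-orange hair, green catsuit")
    dialogue:   optional (speaker, quoted line) pair
    """
    char_part = ", ".join(f"{name} ({anchor})" for name, anchor in characters)
    parts = [scene_type, f"Location: {location}", f"Characters: {char_part}"]
    if dialogue:
        speaker, line = dialogue
        parts.append(f"Dialogue: {speaker}: {line}")
    return f"{visual} [{' | '.join(parts)}]"
```

Keeping the character anchors inside the bracketed block (rather than only in the visual description) is what lets the model learn the hair/suit pairings explicitly.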

### Caption methodology

**Two-pass VLM strategy:**

Pass 1 (structured_caption): Bible context injected — wiki character anchors, episode villain/gadget lists, location hierarchy, shot taxonomy. Produced narrative-rich captions but caused ~49% gadget over-identification (VLM pattern-matched bible names to ambiguous visuals).

Pass 2 (caption — current): Conservative prompt — no gadget list, no villain names injected. The VLM describes only what it can clearly see. Gadget naming is two-stage: the VLM describes the object visually, then a Python pass matches that description against wiki names by token-overlap F1 and appends the wiki name only where the match is confident.
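The second stage can be sketched as plain token-overlap F1 between the VLM's description and each wiki gadget's visual entry. A minimal sketch (the 0.5 threshold and the bare lowercase/split normalisation are illustrative, not the pipeline's tuned values):

```python
def f1_token_overlap(description, wiki_visual):
    """F1 score of the token-set overlap between two strings (lowercased, whitespace-split)."""
    a = set(description.lower().split())
    b = set(wiki_visual.lower().split())
    overlap = len(a & b)
    if not a or not b or overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)

def match_gadget(description, gadget_db, threshold=0.5):
    """Return the best-matching wiki gadget name, or None when no match is confident."""
    best_name, best_score = None, 0.0
    for name, visual in gadget_db.items():
        score = f1_token_overlap(description, visual)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

Because the wiki name is only appended post hoc, a vague VLM description simply yields no name instead of a hallucinated one.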

**Key improvements applied:**
- Hair-only character inference for civilian clothes (+591 IDs)
- Villain episode-defaults for 8 single-villain episodes
- Multi-villain keyword + VLM matching for 5 multi-villain episodes
- Force-alignment (faster-whisper word timestamps) for attribution: 52% → 67% at segment level
- Location VLM pass: 39% Unknown → 0% Unknown
- Context-token stop list to eliminate gadget false positives
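The hair-only fallback for civilian-clothes shots can be sketched as a lookup against the wiki hair anchors above. A minimal sketch (the keyword sets are illustrative; the real pass presumably handles more phrasings):

```python
# Hair keywords derived from the wiki character table; used when no catsuit is visible
HAIR_ANCHORS = {
    "Sam": {"red", "orange", "red-orange", "redhead"},
    "Clover": {"blonde", "blond"},
    "Alex": {"black"},
}

def infer_spy_from_hair(vlm_description):
    """Return the spy whose hair keywords match the description, if exactly one does.

    Ambiguous (multiple matches) or unmatched descriptions stay anonymous (None).
    """
    tokens = set(vlm_description.lower().replace(",", " ").split())
    hits = [name for name, kws in HAIR_ANCHORS.items() if tokens & kws]
    return hits[0] if len(hits) == 1 else None
```

Requiring exactly one match is the conservative choice: a group shot mentioning two hair colours yields no name rather than a guess.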

### Training files

| File | Description |
|---|---|
| `materials/training-data/manifest.json` | Full metadata, 1,551 clips |
| `materials/training-data/ltx2_dataset.json` | LTX-2 training format |
| `materials/training-data/ltx2_dataset.jsonl` | Same, line-delimited |
| `materials/training-data/wan21_metadata.json` | Wan2.2 training format ← primary |

### Remaining gaps

| Gap | Count | Reason |
|---|---|---|
| Anonymous clips | 164 (11%) | Prop shots, hand close-ups, gadget holograms — no face visible |
| Multi-villain uncertain | 83 | Group shots with no distinguishing individual features |
| Dialogue unattributed (clip) | 946 (61%) | Lines during music or in gaps between diarization turns |
| Episodes 14–26 | 13 episodes | Behind Cartoon Network / Max paywall |

---

## Model selection — Wan2.2

**Why not the alternatives (verified from source license docs):**

| Model | Blocker |
|---|---|
| FLUX.1 / FLUX.2 [dev] | Non-commercial weights. Direct quote: *"use for revenue-generating activity is NOT a Non-Commercial Purpose."* Commercial use needs paid BFL agreement. |
| LTX-Video (Lightricks) | $10M annual revenue threshold. Banijay ~€3B/year triggers it. Penalty: double damages. |
| HunyuanVideo (Tencent) | License text: *"this agreement does not apply in the European Union, United Kingdom and South Korea."* Cannot be used in France. |
| Wan2.7 | No open weights. API-only via Atlas Cloud. Cannot be fine-tuned. |
| FLUX.1 [schnell] | Apache 2.0 ✅ but lowest quality tier (4-step distilled, not production-grade) |

**Wan2.2 — Apache 2.0, no caveats:**
- No revenue threshold
- No geographic restriction
- *"We claim no rights over your generated contents"* (Wan-AI README)
- Genuine hybrid: same weights → images and video
- `TI2V-5B`: image+video, runs on single RTX 4090
- `T2V-A14B`: higher quality video, needs A100 80GB

---

## Pipeline — ready to run

### What runs locally (CPU)
- `devenv tasks run train:step1-scenes` ✅ — scene detection, clip extraction
- `devenv tasks run train:step2-bible` ✅ — story bible build
- `devenv tasks run train:step2-shot-reference` ✅ — shot taxonomy

### What requires GPU builder
- `train:step2-transcript-base` — whisper ASR (CUDA)
- `train:step2-caption` — Qwen2.5-VL captioning (CUDA)
- `train:step5-wan21-train` — Wan2.2 LoRA training (CUDA)

### Models ingested (Nix store)
- `asrModel`: whisper-large-v3-turbo
- `omniModel`: Qwen2.5-Omni-7B
- `qwenModel`: Qwen2.5-VL-7B-Instruct
- `wan21Model`: Wan2.2-TI2V-5B ← **not yet downloaded, will fetch on builder**

### When licensed files arrive

```bash
# 1. Provision GPU builder
devenv tasks run builder:server:order:execute-cheapest

# 2. Extract clips from licensed material
devenv tasks run train:step1-scenes

# 3. Caption (Qwen2.5-VL locally or reuse existing training_caption)
# Existing captions are high quality — skip step2-caption if acceptable

# 4. Package
devenv tasks run train:step3-wan21-package

# 5. Train
devenv tasks run train:step5-wan21-train

# 6. Decommission builder when done
devenv tasks run builder:server:cancellation:execute-now
```

Full details: `docs/research/gpu-training-runbook.md`

---

## Bible structure

```
materials/benchmark/youtube-s7-validation/bible/
  master-index.json                  — episode index
  catalog-shots-patched.jsonl        — 2,852 shots with parsed VLM data
  reid-merged.json                   — character IDs per shot key
  outfit-reid-results.jsonl          — outfit descriptions per shot
  final/
    locations-canonical.json         — 18-category location map
    speaker-character-maps.json      — SPEAKER_N → character name per episode
  cross-reference/
    gadget-lockdown-v3.json          — wiki gadget visuals (136 entries, 13 episodes)
    villain-database-final.json      — villain names per episode
    villain-visual-db.json           — villain visual descriptions + episode defaults
  episodes/{id}/
    shots.json                       — shot boundaries
    frames/                          — 720p keyframes
    transcript.json                  — raw whisper transcript
    transcript-with-speakers.json    — speaker-attributed transcript
    diarization.json                 — pyannote speaker turns (PLDA)
```

---

## Key decisions log

| Decision | Rationale |
|---|---|
| Wan2.2 over FLUX [dev] | Apache 2.0 vs non-commercial; also hybrid image+video in one model |
| Conservative VLM prompt | No gadget list injection → eliminated 664 false gadget identifications |
| Two-stage gadget naming | VLM visual description + Python token-overlap matching → no hallucinated names |
| Hair-only character inference | Catsuit-only anchor was too strict; blocked 487 civilian clips from being named |
| Max-overlap attribution | Better than midpoint containment for transcript → speaker matching |
| Force-alignment (faster-whisper) | Word-level timestamps: 52% → 67% segment attribution |
| Episode-default villain matching | 8 single-villain episodes: any villain description → episode villain name |
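Max-overlap attribution from the table above can be sketched as: for each transcript segment, pick the diarization turn with the largest temporal intersection, instead of the turn that happens to contain the segment midpoint. A minimal sketch (data shapes are illustrative):

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def attribute_segment(seg, turns):
    """Assign a transcript segment to the speaker turn with maximal temporal overlap.

    seg:   (start, end) in seconds
    turns: list of (start, end, speaker_label) diarization turns
    Returns the speaker label, or None when no turn overlaps the segment.
    """
    best_label, best_ov = None, 0.0
    for t_start, t_end, label in turns:
        ov = overlap(seg[0], seg[1], t_start, t_end)
        if ov > best_ov:
            best_label, best_ov = label, ov
    return best_label
```

Unlike midpoint containment, this still attributes a segment that straddles a turn boundary or falls mostly, but not centrally, inside one turn.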

---

## Related docs

| Doc | Contents |
|---|---|
| `docs/research/model-licensing.md` | Full license analysis with source citations |
| `docs/research/dataset-state.md` | Complete manifest field reference |
| `docs/research/gpu-training-runbook.md` | Step-by-step training guide |
| `docs/research/meeting-talking-points.md` | Meeting talking points (Banijay/Laurent) |
| `docs/research/meeting-executive-summary.md` | One-page executive summary |
| `docs/strategy/monthly-content-system.md` | Content production strategy |
