# Totally Spies AI Pipeline — Meeting Talking Points
*Quick-scan format — follow the flow top to bottom*
*Model licensing verified April 2026 — see `model-licensing.md` for full source citations*

---

## 1. Why generic AI tools won't work

- Foundation models (Sora, Kling, Midjourney, Wan2.1 base) can make *some* animation — not *this* one
- **Sam, Clover, Alex share the same body proportions** — the only differentiators are hair colour and suit colour
  - Without explicit anchoring, a 235B VLM confuses Sam and Clover in ~30% of shots
  - A generation model will do the same → brand error on every frame
- **S7 is a visual reboot** — different art style from S1–6 (cleaner linework, flat colour, anime-influenced)
  - A model trained on the back catalogue produces the *wrong* S7 look
- **New S7 characters exist nowhere in public training data**
  - Zerlina Lewis, Toby, Cyberchac, WOOHP World Singapore, the WOOHP-e device
  - You cannot prompt a foundation model into knowing what these look like
  - Only path: fine-tuning on franchise material

---

## 2. The method — Bible-First Fine-Tuning

**Three phases, deliberately sequenced:**

### Phase 1 — Build the visual bible ✅ Done
- Catalogued every shot across 13 S7 episodes before touching any generation model
- Cross-validated everything against the official Totally Spies Wiki
  - Corrected villain identities for all 13 episodes
  - Anchored Sam = red-orange hair + green suit, Clover = blonde + red suit, Alex = black hair + yellow suit
  - Named ~30 gadgets by their exact wiki names (e.g. WOOHP-e, Ultra-fixative Structural Foam)
  - Built 18-category canonical location hierarchy (Singapore city, WOOHP HQ, AIYA Academy, etc.)
- The bible is the **training signal** — caption quality determines output quality
  - "Woman in green suit" → model learns vague style
  - "Sam (red-orange hair, green catsuit) in WOOHP HQ, gadget insert, WOOHP-e device" → model learns the franchise grammar

### Phase 2 — Caption-supervised training clips ✅ Done
- Extracted **1,551 clips** from 13 episodes — 106 minutes of 720p footage
- Generated a VLM caption for every clip using the bible as context
  - Character visual anchors injected into every prompt → VLM cannot swap Sam and Clover
  - Episode-specific villains and gadgets injected per clip
  - Speaker-attributed dialogue (51% of 29,731 transcript words) used as grounding
- Result: 96% of clips correctly identify characters, 49% name gadgets by wiki name
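
In spirit, the per-clip prompt assembly looks like the sketch below; the wording, field names, and placeholder values are illustrative, not the production prompt.

```python
# Illustrative sketch of per-clip VLM prompt assembly; not the production prompt wording.
ANCHOR_BLOCK = (
    "Sam (red-orange hair, green catsuit); "
    "Clover (blonde hair, red catsuit); "
    "Alex (black hair, yellow catsuit)"
)

def build_vlm_prompt(villain: str, gadgets: list[str], dialogue: str) -> str:
    """Assemble a captioning prompt with the bible injected as context."""
    return (
        "You are captioning a Totally Spies S7 clip.\n"
        f"Character anchors (never confuse them): {ANCHOR_BLOCK}.\n"
        f"This episode's villain: {villain}. Gadgets that may appear: {', '.join(gadgets)}.\n"
        "Art style: 2D digital, flat colour, anime-influenced.\n"
        f"Speaker-attributed dialogue for grounding:\n{dialogue}\n"
        "Describe characters, location, action, gadgets, and shot type using the names above."
    )

prompt = build_vlm_prompt(
    villain="<this episode's villain from the bible>",
    gadgets=["WOOHP-e", "Ultra-fixative Structural Foam"],
    dialogue="CLOVER: <speaker-attributed line from the transcript>",
)
```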

### Phase 3 — LoRA fine-tuning ⏳ Ready to run
- One model targeted: **Wan2.2** (Wan-AI / Apache 2.0)
  - Genuinely hybrid: generates images AND video from the same weights
  - T2V: text prompt → video clip
  - I2V: reference frame → animated clip (character identity locked by input)
  - T2I / image mode: text → still (720p, marketing-ready quality)
  - Runs on a single RTX 4090 (5B model) or multi-GPU (14B MoE model)
- **LoRA** = Low-Rank Adaptation — a thin franchise-specific layer on top of the existing model (concept sketch after this list)
  - Base model keeps its knowledge of motion, physics, animation
  - LoRA layer learns the S7 style, characters, locations
  - Training time: ~8–12 hours on one GPU (not weeks, not six figures)
  - Output: small portable weights (a few hundred MB), versioned, updatable
- **Single remaining dependency: licensed episode files from Banijay**
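
As noted above, here is a minimal PyTorch sketch of the low-rank adapter idea: frozen base weights plus a small trainable update. Rank and scaling values are illustrative; this is the concept, not the Wan2.2 training code.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer + trainable low-rank update: y = W x + (alpha / r) * B(A x)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                            # base model knowledge stays intact
        self.A = nn.Linear(base.in_features, r, bias=False)    # down-projection to rank r
        self.B = nn.Linear(r, base.out_features, bias=False)   # up-projection back to full width
        nn.init.zeros_(self.B.weight)                          # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))
```

Only the A and B matrices are trained and shipped, which is why the franchise layer comes out at a few hundred MB rather than the size of the full model.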

---

## 3. The specific problems we've already solved

| Problem | What we did |
|---|---|
| Character confusion (Sam/Clover/Alex) | Wiki-anchored every caption; hair + suit colour explicit in every VLM prompt |
| S7 art style vs earlier seasons | S7-only analysis; style tagged as "2D digital, flat colour, anime-influenced" in every caption |
| New S7 characters not in any dataset | Built descriptions from scratch via bible; fine-tuning teaches the model their appearance |
| Singapore setting | 18-category location hierarchy with full scene descriptions per location |
| WOOHP-e device (was misidentified as a car/tablet) | Wiki-corrected in the bible; correct in every caption where it appears |
| 13 different villain designs | Per-episode villain database; injected into captions for that episode's clips |
| Marketing shot grammar | Shot taxonomy built from the reference material; encoded into every caption |

---

## 4. What the pipeline produces in practice

- Writer/director provides a **brief** → "trio confronts villain on Singapore rooftop, Clover leads, mid-morning"
- Brief is translated into a structured prompt using our prompt bible
- **Wan2.2 T2V + LoRA** generates a 5–8 second clip directly from the prompt
- **Wan2.2 I2V + LoRA** animates from a storyboard frame or key art reference — character identity locked by the input image
- Clips are assembled into a marketing sequence
- The model knows the franchise — it doesn't need to be told "make it look like Totally Spies"
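
A sketch of the brief-to-prompt step, assuming a prompt grammar like the one in the caption example above; the function and wording are illustrative, not the actual prompt bible.

```python
# Illustrative brief-to-prompt translation; anchors mirror the Phase 1 sketch.
CAST_ANCHORS = {
    "Sam": "Sam (red-orange hair, green catsuit)",
    "Clover": "Clover (blonde hair, red catsuit)",
    "Alex": "Alex (black hair, yellow catsuit)",
}

def brief_to_prompt(brief: str, characters: list[str], location: str, shot: str) -> str:
    """Turn a plain-language brief into a structured, anchor-rich generation prompt."""
    cast = ", ".join(CAST_ANCHORS[c] for c in characters)
    return (
        "Totally Spies S7 style, 2D digital, flat colour, anime-influenced. "
        f"{shot}. {cast} on {location}. {brief}"
    )

prompt = brief_to_prompt(
    brief="The trio confronts the villain, Clover leads, mid-morning light.",
    characters=["Clover", "Sam", "Alex"],
    location="a Singapore rooftop",
    shot="Wide establishing shot",
)
```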

---

## 5. Pipeline state right now

| What | Status |
|---|---|
| Visual bible — 2,852 shots, 13 episodes | ✅ Done |
| Wiki cross-validation — villains, gadgets, characters | ✅ Done |
| Training captions — 1,551 clips, wiki-anchored | ✅ Done |
| Dataset packages — Wan2.2 format | ✅ Ready |
| Training infrastructure — GPU builder provisioned | ✅ Ready |
| Fine-tuning run | ⏳ Waiting for licensed episode files |
| Episodes 14–26 | ⏳ Behind paywall — monitored for YouTube release |

**When licensed files arrive → pipeline runs in 12–16 hours, end to end.**
Only the clips change; the captions, the pipeline, and the infrastructure all stay the same.

---

## 6. Why not the alternatives

| Alternative | Why not |
|---|---|
| Prompt-only (Midjourney / Sora / ChatGPT) | Character consistency impossible; new S7 characters unknown to any model; no franchise control |
| Off-the-shelf animation tool | No AI motion generation; labour-intensive; not scalable |
| Full model training from scratch | Weeks of compute, hundreds of thousands of dollars; unnecessary — base models already know motion |
| Licensing a competitor's animation AI | No franchise specificity; still confuses Sam and Clover; still needs the same fine-tuning |
| Fine-tuning without a bible | Generic inaccurate captions → lower output quality → brand errors |
| FLUX [dev] variants | ❌ Non-commercial license on the weights — commercial use requires a paid BFL agreement |
| LTX-Video (Lightricks) | ❌ $10M revenue threshold in the license — Banijay triggers it; double-damage penalty for breach |
| HunyuanVideo (Tencent) | ❌ License explicitly excludes EU, UK, South Korea — cannot be used legally in France |

---

## 7. Model choice — why Wan2.2

- **Only two models** are commercially clean for a French production, with no revenue threshold, no geographic restriction, and open fine-tuneable weights: **Wan2.1** and **Wan2.2**
- Every other candidate has a verified legal blocker:
  - **FLUX.1 [dev] and FLUX.2 [dev]** — non-commercial license on the weights. Direct quote: *"use for revenue-generating activity is NOT a Non-Commercial Purpose."* Commercial use requires a paid BFL agreement.
  - **LTX-Video (Lightricks)** — $10M annual revenue threshold in the license. Banijay at ~€3B/year triggers it. Penalty for breach: double damages. Cannot use without a Lightricks commercial agreement.
  - **HunyuanVideo (Tencent)** — license directly states: *"this agreement does not apply in the European Union, United Kingdom and South Korea."* Using it in France is unauthorized. Full stop.
  - **Wan2.7** — no open weights. API-only via Atlas Cloud. Cannot be fine-tuned by anyone.
- **Wan2.2 over Wan2.1** because it is a strict upgrade:
  - MoE architecture: more model capacity at the same compute cost
  - +65% more image training data → better stills quality
  - +83% more video training data → better motion quality
  - Cinematic-level aesthetic training with lighting, composition, colour tone labels
  - Same Apache 2.0 license, same fine-tuning approach, drop-in replacement
- **Wan2.2 is a genuine hybrid model** — same weights produce both images and video
  - `TI2V-5B`: Text-and-Image-to-Video — images now, video later, one training run
  - `T2V-A14B`: higher quality video generation (MoE 14B)
  - `I2V-A14B`: animate from a reference frame or storyboard
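
For orientation, a hedged sketch of driving the 5B checkpoint through Hugging Face diffusers; the pipeline class availability, repo id, LoRA path, and parameter values are assumptions to verify against current documentation, not confirmed details of our setup.

```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# Repo id and parameters are assumptions -- check the Wan-AI Hub page and diffusers docs.
pipe = WanPipeline.from_pretrained("Wan-AI/Wan2.2-TI2V-5B-Diffusers", torch_dtype=torch.bfloat16)
pipe.load_lora_weights("totally-spies-s7-lora")   # hypothetical path to the fine-tuned franchise LoRA
pipe.enable_model_cpu_offload()                   # keeps the 5B model within a single-4090 budget

video = pipe(
    prompt="Wide establishing shot. Clover (blonde hair, red catsuit) on a Singapore rooftop, mid-morning.",
    num_frames=81,          # a few seconds of footage; exact count depends on the model's fps
    guidance_scale=5.0,     # illustrative value
).frames[0]
export_to_video(video, "clip.mp4", fps=24)
```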

---

## 8. Budget fallback — image-only path

**If video budget is cut, the image path costs nothing extra to enable — the data work is already done.**

- Use Wan2.2's image mode: same fine-tuned LoRA, generate stills instead of clips
- Or use **FLUX.1 [schnell]** (Apache 2.0, no caveats) as a dedicated image model (sketch after this list)
  - Note: [schnell] is the *only* FLUX variant that is commercially clean — all [dev] variants are non-commercial
  - Lower quality ceiling than [dev] but fully usable
- Training data is identical for both paths: 6,645 frames + 1,551 VLM captions already built
- **Wan2.2 image mode is the better budget fallback** — one model, one training run, upgrade to video later without starting over
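
For the FLUX.1 [schnell] option above, a hedged sketch via diffusers; the model id and few-step settings follow its public model card, and the LoRA path is hypothetical.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
pipe.load_lora_weights("totally-spies-s7-image-lora")  # hypothetical fine-tuned franchise LoRA
pipe.enable_model_cpu_offload()

# [schnell] is step-distilled: a handful of inference steps, guidance effectively off.
image = pipe(
    prompt="Key art: Sam (red-orange hair, green catsuit) at WOOHP HQ, flat colour, anime-influenced.",
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("key_art_still.png")
```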

### What image-only produces
- Key art and promotional stills
- Character reference sheets — any character, outfit, location
- Social media posts (static)
- Storyboard frames for human animators
- Episode thumbnails, product / merchandise mockups

### What it doesn't produce
- Motion / video — Reels, TikTok, YouTube Shorts, animated promos

### The upgrade path
> *Fine-tune Wan2.2 once. Use image mode now. Switch to video mode when budget allows. Same weights, no restart.*

---

## 9. The two-tier offer

| | Image tier | Video tier |
|---|---|---|
| Model | Wan2.2 (image mode) | Wan2.2 (video mode) |
| Training | Same LoRA fine-tune | Same LoRA fine-tune |
| Training time | ~4–6 h (image data only) | ~8–12 h (clips + images) |
| Output | Stills, key art, social images | Clips, promos, animated content |
| Data prep | ✅ Done | ✅ Done |
| Licensed files needed | Yes | Yes |
| License | Apache 2.0 — no caveats | Apache 2.0 — no caveats |
| Upgrade path | → video mode when ready | — |

---

## 10. Key numbers

- **13 episodes** analysed (S7 first half, 14–26 behind paywall)
- **2,852 shots** catalogued
- **1,551 training clips** — 106 minutes at 720p
- **6,645 frames** extracted for image training
- **29,731 words** transcribed, 51% speaker-attributed
- **1,551 / 1,551 VLM captions** — 100%, wiki-anchored
- **~30 gadgets** named by exact wiki name
- **18 canonical locations** defined
- **1 model** targeted: Wan2.2 (Apache 2.0)
- **LoRA training**: ~8–12 h on one GPU once licensed material arrives
- **Full licensing research**: `docs/research/model-licensing.md`
