# Totally Spies S7 — model selection and training priorities

## Purpose

This memo turns the benchmark-pack work into practical guidance for:

1. selecting future generation models / pipelines
2. prioritizing what to train or fine-tune first
3. deciding what to evaluate before claiming the system matches S7 style

## Evidence base

This memo is based on:

- downloaded official S7 corpus:
  - `materials/reference/totally-spies/youtube-official/inventory.json`
- expanded corpus validation:
  - `docs/research/s7-animation-validation-expanded.md`
  - `materials/benchmark/youtube-s7-validation/expanded-validation-qwen2.5vl-7b.json`
- benchmark packs:
  - `docs/research/s7-benchmark-packs.md`
- earlier second-opinion pass:
  - `docs/research/s7-animation-style-second-opinion-gemma3-12b.md`
  - `materials/benchmark/official-vimeo-trailer/ts-vision-second-opinion-gemma3-12b.json`
- pack-level evaluation on `qwen2.5vl:7b`:
  - `materials/benchmark/youtube-s7-validation/packs-eval-qwen2.5vl-7b.json`
- pack-level evaluation on `gemma3:12b`:
  - `materials/benchmark/youtube-s7-validation/packs-eval-gemma3-12b.json`
  - `docs/research/s7-gemma-pack-confirmation.md`
- pack-level evaluation on `gemma4:26b`:
  - `materials/benchmark/youtube-s7-validation/packs-eval-gemma4-26b-chat.json`
  - `docs/research/s7-gemma4-pack-confirmation.md`

## What the evidence consistently says

Across the trailer-only analysis, the expanded corpus analysis, the benchmark
packs, and the Gemma 3 second-opinion and Gemma 4 pack-confirmation passes,
the stable conclusion is:

- S7 reads as **digital 2D cutout / rigged animation**
- the visuals are **clean, vector-like, flat-filled, and production-consistent**
- the motion discipline is **restrained**, with many held or near-held beats
- even action remains **controlled**, not densely redrawn in a full-animation
  sense
- therefore, the primary AI-video problem is **temporal consistency under a
  restrained cutout-style motion regime**, not realistic motion physics

## What matters most for model selection

### 1. The best model is not the one that makes the prettiest still

For this project, still-image richness is not enough.

A candidate model or pipeline should be preferred if it can do these things
well:

- keep faces stable across frames
- keep costume details stable across frames
- keep linework / edge behavior stable across frames
- avoid adding unnecessary secondary motion
- avoid adding texture / shading creep
- preserve static or near-static backgrounds in held beats
- keep gadgets readable and attached consistently during action

### 2. Temporal restraint matters more than realistic motion sophistication

A strong candidate model for S7 should be biased toward:

- low inter-frame drift
- strong identity persistence
- stable flat-colored surfaces
- coherent motion during small pose changes
- consistent cutout-like geometry under motion

A weaker candidate model will typically fail by:

- over-animating holds
- adding body sway, hair physics, cloth dynamics, or camera drift
- changing faces between frames
- changing costume shape / color between frames
- softening the flat 2D look into something more generic or more painterly
- introducing **texture boiling / shimmering** on static lines and surfaces
- making supposedly static backgrounds **breathe** or drift
- causing **silhouette degradation** under motion or lighting change
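Several of these failure modes (inter-frame drift, texture boiling, background breathing) share a crude but useful numeric proxy: the mean absolute pixel difference between consecutive frames of a supposedly held shot. A minimal sketch, assuming frames are already decoded into numpy arrays (the decoder and any region masking are out of scope here):

```python
import numpy as np

def inter_frame_drift(frames):
    """Mean absolute pixel difference between consecutive frames.

    frames: sequence of HxWxC uint8 arrays from a single shot.
    Returns one drift score per frame transition. Near-zero scores
    indicate a clean hold; persistent non-zero scores on a "held"
    beat suggest boiling, breathing, or invented secondary motion.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        # Promote to int16 so the uint8 subtraction cannot wrap around.
        diffs.append(float(np.abs(curr.astype(np.int16) - prev.astype(np.int16)).mean()))
    return diffs

# A perfectly held three-frame shot scores zero drift on both transitions.
held = [np.zeros((4, 4, 3), dtype=np.uint8)] * 3
assert inter_frame_drift(held) == [0.0, 0.0]
```

This is deliberately naive: it flags global camera moves as "drift" too, so in practice it is only meaningful on shots the benchmark packs label as holds.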

### 3. Motion-sheet performance matters more than still-sheet performance

One recurring lesson from the validation passes:

- still-only sheets can cause a model to over-read the style as more fluid or
  more elaborate than it really is
- motion sheets with sequential frames are much better at revealing whether a
  model understands the actual animation discipline

So when comparing future candidate models, weight these more heavily than
single-frame aesthetic judgment alone:

- `dialogue-hold` motion-sheet performance
- `action-gadget` motion-sheet performance
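One way to operationalize that weighting is a simple weighted score per candidate. The weights and key names below are hypothetical placeholders, not calibrated values from the evaluation runs:

```python
def rank_score(scores, weights=None):
    """Weighted candidate score favoring motion-sheet performance.

    scores: dict with keys 'dialogue_hold', 'action_gadget',
    'still_sheet', each a normalized score in [0, 1].
    The default weights are illustrative only: motion-sheet
    performance dominates single-frame aesthetics.
    """
    weights = weights or {
        "dialogue_hold": 0.4,
        "action_gadget": 0.4,
        "still_sheet": 0.2,
    }
    return sum(weights[k] * scores[k] for k in weights)

# A candidate that aces motion sheets but fails stills still ranks well.
motion_strong = rank_score(
    {"dialogue_hold": 1.0, "action_gadget": 1.0, "still_sheet": 0.0}
)
```

Whatever the exact weights, the design point is the ordering: a candidate cannot buy its way up the ranking with still-frame quality alone.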

## Training priorities

### Priority 1 — character / costume consistency

Pack:

- `materials/benchmark/youtube-s7-validation/packs/character-costume-consistency/`

Why first:

- if the characters are unstable, nothing else matters
- this is the highest likely approval risk
- identity drift is the fastest way to lose franchise trust

What to optimize:

- face consistency
- silhouette consistency
- costume / wardrobe stability
- recurring supporting cast stability
- trio coherence in shared shots

Success looks like:

- Sam looks like Sam across contexts
- Clover looks like Clover across contexts
- Alex looks like Alex across contexts
- wardrobe and hero-mission outfits do not mutate between shots
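Those success criteria can be spot-checked numerically by comparing per-frame character embeddings against a reference frame. The sketch below assumes embeddings already exist from some face or character encoder (the encoder choice is an open decision, not specified in this memo):

```python
import numpy as np

def identity_persistence(embeddings):
    """Minimum cosine similarity of per-frame character embeddings
    against the first frame's embedding.

    embeddings: array of shape (n_frames, d), one row per frame, from
    any character/face encoder. A score near 1.0 means the character
    reads as the same across the shot; a dip flags identity drift in
    the frames where it occurs.
    """
    e = np.asarray(embeddings, dtype=np.float64)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)  # unit-normalize rows
    sims = e[1:] @ e[0]                               # cosine vs. frame 0
    return float(sims.min())
```

Tracking the minimum rather than the mean matters: a single off-model frame is an approval risk even if the average similarity looks fine.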

### Priority 2 — dialogue / hold temporal consistency

Pack:

- `materials/benchmark/youtube-s7-validation/packs/dialogue-hold/`

Why second:

- this is where generic AI video most obviously breaks the S7 style
- low-motion scenes are supposed to feel stable, not busy
- if the model cannot hold still, it will not feel like the show

What to optimize:

- low inter-frame drift
- stable heads / torsos during held beats
- small controlled facial or mouth changes
- static or near-static backgrounds
- no invented secondary motion
- no **texture boiling** on lines, collars, or facial features
- no **background breathing** in screens, walls, consoles, or cockpit elements
- no **motion bleed** where the system invents movement inside a supposed hold

Success looks like:

- held beats remain visually stable
- changes happen only where intended
- no flicker or breathing artifacts in the body / face
- props and screens remain locked unless the shot explicitly animates them

### Priority 3 — action / gadget control

Pack:

- `materials/benchmark/youtube-s7-validation/packs/action-gadget/`

Why third:

- action is important, but the show's action is still controlled rather than
  hyper-fluid
- gadget readability is commercially important for promo beats
- this is where many models will add too much motion or lose attachment
  continuity

What to optimize:

- gadget readability
- hand / gadget continuity
- controlled motion under higher energy
- no transition from crisp 2D cutout feel into generic AI action mush
- restrained action timing rather than full-physics spectacle
- no **gadget melting** into hands, sleeves, or nearby effects
- no **color bleeding** across costume boundaries during motion
- no edge-smear that destroys the cutout / layered look

Success looks like:

- gadgets stay recognizable and legible
- action beats remain sparse and stylized
- character identity remains stable under movement pressure
- costume edges, gadget geometry, and silhouette anchors stay readable even in high-energy poses

## Recommended evaluation order for future generation tests

When testing any future candidate generation workflow, evaluate in this order:

1. **character-costume-consistency**
2. **dialogue-hold**
3. **action-gadget**

Reason:

- if identity fails, stop
- if holds fail, stop
- only then test action pressure

## Practical scoring rubric

Use a simple gate before any deeper investment:

### Gate 1 — identity

Reject if:

- faces drift across adjacent frames
- costume details mutate
- trio coherence breaks in shared shots
- silhouettes degrade across angle / lighting changes

### Gate 2 — holds

Reject if:

- held shots shimmer, breathe, or jitter
- backgrounds drift during supposed holds
- facial changes propagate unnecessarily into the whole body
- static linework shows texture boiling or crawling

### Gate 3 — action / gadgets

Reject if:

- gadgets become unreadable
- motion becomes too fluid / realistic for S7 style
- character shapes deform unpredictably under action
- props melt into hands or costume edges bleed together
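The three gates above are a stop-on-first-failure pipeline, which can be sketched directly. The gate checks themselves are hypothetical callables standing in for the real evaluators:

```python
def evaluate_candidate(checks):
    """Run the rubric gates in order; stop at the first failure.

    checks: dict mapping gate name -> zero-argument callable returning
    True if the candidate passes that gate. Gate names and order mirror
    the rubric: identity, then holds, then action/gadgets.
    Returns (passed, failed_gate_or_None).
    """
    for gate in ("identity", "holds", "action_gadgets"):
        if not checks[gate]():
            return False, gate  # reject immediately; later gates are skipped
    return True, None

# Illustrative run: a candidate that keeps identity but breaks on holds.
result = evaluate_candidate({
    "identity": lambda: True,
    "holds": lambda: False,          # e.g. texture boiling detected
    "action_gadgets": lambda: True,  # never reached
})
# result == (False, "holds")
```

Encoding the order in code keeps the evaluation discipline honest: a candidate never earns action-gadget testing time if it has already failed identity or holds.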

## Current recommendation

### Best current selection principle

For this project, prefer models / pipelines that optimize for:

- **temporal consistency**
- **identity persistence**
- **flat 2D cutout-style stability**
- **restrained motion discipline**

rather than models that simply score well on general video realism.

### Current local-evaluation takeaway

Based on the available local validation runs:

- `qwen2.5vl:7b` is good enough to identify the key failure modes when shown
  motion-oriented benchmark material
- `gemma3:12b` confirmed the broader conclusion that S7 is a restrained
  cutout-style system and that temporal consistency matters more than
  realistic physics
- `gemma4:26b` confirmed the same ordering and sharpened the failure-mode
  language around texture boiling, background breathing, gadget melting,
  color bleeding, and silhouette degradation
- but none of these analysis models should be confused with the eventual
  generation model choice; they are evaluation aids

## Bottom line

If we want a generation system that feels like Totally Spies S7, the training
and selection priorities should be:

1. **lock identity**
2. **lock holds**
3. **lock controlled action and gadget readability**

That is the shortest path to validating the earlier conclusion in operational
terms.
