> **⚠️ SUPERSEDED** — This document is a historical receipt. See `s7-current-understanding.md` for the authoritative current position.

# Totally Spies S7 animation style vs. video diffusion model relevance

## Purpose

This note assesses whether the architectural differences between image
diffusion and video diffusion models are material for the Totally Spies
marketing video pipeline.

The analysis was performed using local Ollama `qwen2.5vl:7b` vision inference
against actual Season 7 frames from the official Banijay trailer, combined
with published video diffusion architecture research.

## Production provenance

Season 7 (26 × 22′) is produced by Zodiak Kids & Family (part of Banijay Kids
& Family) with animation by Ollenom Studio. The Totally Spies Wiki and
Animation Magazine both confirm the series is animated using **Toon Boom
Harmony**, which is the industry-standard software for digital cutout /
puppet-rig 2D animation. The show uses Harmony's cutout animation workflow: character
models are broken down into separate drawing layers for each movable part
(head, torso, arms, legs, hands, mouth shapes), arranged in a peg-based
hierarchy, and animated by setting keyframes on transforms (position, rotation,
scale) with Harmony interpolating the in-betweens automatically.
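The transform-keyframe workflow can be illustrated with a minimal sketch:
interpolate a keyed rotation value across frames, then apply it about a peg
pivot. All coordinates, frame numbers, and angles below are hypothetical, and
Harmony's real interpolation uses configurable ease curves rather than the
linear case shown here.

```python
import math

def lerp(a, b, t):
    """Linear interpolation between two keyed values. Harmony's actual
    interpolation uses user-configurable ease curves; linear is the
    simplest case."""
    return a + t * (b - a)

def rotate_about_pivot(point, pivot, angle_deg):
    """Rotate a 2D point about a peg pivot -- the core transform a cutout
    rig applies to a body-part drawing layer."""
    px, py = pivot
    x, y = point[0] - px, point[1] - py
    a = math.radians(angle_deg)
    return (px + x * math.cos(a) - y * math.sin(a),
            py + x * math.sin(a) + y * math.cos(a))

# Hypothetical forearm layer keyed at 0 degrees on frame 1 and 90 degrees
# on frame 9; the software fills frames 2-8 automatically.
def forearm_angle(frame):
    return lerp(0.0, 90.0, (frame - 1) / (9 - 1))

# A hand drawing at (120, 80) following its elbow pivot at (100, 80):
mid_pose = rotate_about_pivot((120, 80), (100, 80), forearm_angle(5))
```

Every drawing on the layer follows the same transform, which is why cutout
motion reads as rigid pieces pivoting rather than redrawn poses.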

## Sources

### Frame sources

- Scene frames: `materials/benchmark/official-vimeo-trailer/scene_frames/`
- Motion analysis sequences: extracted at native framerate from the official
  S7 trailer at ~20s (action) and ~40s (dialogue/hold), 12 consecutive frames
  each
- Raw Qwen2.5-VL results:
  `materials/benchmark/official-vimeo-trailer/ts-vision-results.json`
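For reproducibility, the extraction step can be sketched as an ffmpeg
invocation built in Python. The trailer filename and output pattern below are
placeholders, and the exact flags used in the actual pipeline are not recorded
here; the point is that omitting any `-r` flag preserves the native framerate.

```python
def ffmpeg_extract_cmd(src, start_s, n_frames, out_pattern):
    """Build an ffmpeg argv that decodes n consecutive frames at native
    framerate starting from a timestamp (no -r flag, so no resampling).
    Uses input-side -ss for fast seeking."""
    return [
        "ffmpeg",
        "-ss", str(start_s),        # seek to the sequence start (seconds)
        "-i", src,                  # placeholder source filename
        "-frames:v", str(n_frames), # stop after n decoded frames
        out_pattern,                # e.g. numbered PNGs
    ]

# Hypothetical invocation for the ~20s action sequence:
cmd = ffmpeg_extract_cmd("trailer.mp4", 20, 12, "seq_a_%03d.png")
```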

### Architecture research

- LTX-Video (arXiv:2501.00103) — 3D VAE with 32×32×8 spatiotemporal
  compression
- Wan2.1 (arXiv:2503.20314) — Wan-VAE with unified spatiotemporal latent
  space
- IV-VAE (arXiv:2411.06449, CVPR 2025) — split keyframe + temporal branch
  architecture
- Stable Video Diffusion (arXiv:2311.15127) — temporal convolution and
  attention layers inserted after spatial layers
- Lumiere (arXiv:2401.12945) — Space-Time UNet for full-duration generation
- Sora technical report (OpenAI) — spacetime patch tokenization + full
  self-attention
- Video LDM (arXiv:2304.08818) — temporal layer insertion into pretrained
  2D UNet
- Empirical survey of temporal layer design (arXiv:2502.07001)

## What Qwen2.5-VL sees in the S7 frames

### Visual style (6 diverse scene frames)

| Property | Finding |
|---|---|
| Line art | Clean vector lines, uniform stroke weight, crisp geometric edges. No hand-drawn wobble or variable line texture. Consistent with Toon Boom Harmony vector output. |
| Coloring | Flat color fills (the traditional cel-painting look). Solid uniform areas with minimal gradient use (skies, water only). No painted textures, no soft airbrush rendering. Note: this is not "cel shading" in the 3D NPR sense (toon shaders on 3D geometry). The show is native 2D; the flat fill approach is inherited from classical cel-painting conventions applied in a digital ink-and-paint pipeline. |
| Character rendering | Digital cutout rigs (Toon Boom Harmony puppet-style). Character models are broken into hierarchical drawing layers per body part, animated via transform keyframes with software-interpolated in-betweens. This is the standard Harmony cutout workflow, distinct from traditional frame-by-frame hand-drawn animation. Consistent limb proportions and clean joint articulation across the entire season. |
| Backgrounds | Flat vector or lightly gradient-shaded. Static across all motion pairs. Treated as reusable environment plates, not per-frame painted artwork. |

### Motion behavior

#### Seq A — action sequence (~20s mark, 12 frames at 25 fps)

Frames compared in pairs spaced 3 frames apart (~120 ms gap).

- Characters shift stance between frames with all body parts moving in
  coordinated synchrony. The vision model interpreted this as the whole
  figure being "repositioned" or "redrawn" as a unit. In practice, Harmony
  cutout rigs animate each body-part layer on its own interpolation curve,
  but when the animator keys multiple parts on the same frames, the rendered
  output appears as a coordinated whole-body shift. The vision model cannot
  distinguish "rig transforms applied" from "character fully redrawn" based
  on rendered pixels alone.
- Background is **completely static** across all pairs.
- Movement is coordinated but simplified — low drawing count between key
  poses, consistent with cutout-rig interpolation rather than frame-by-frame
  redraw.
- Vision model identified this as **limited animation** in 2 of 3 comparisons,
  full animation in 1. The "full animation" classification in one pair
  likely reflects the vision model's inability to distinguish cutout-rig
  repositioning from traditional full redraw.

#### Seq B — dialogue / hold sequence (~40s mark, 12 frames at 25 fps)

- **Near-identical frames** across all 3 comparison pairs.
- Minimal changes: subtle expression shifts, minor hand and posture
  adjustments.
- Background, limbs, and faces **held completely still**.
- All 3 pairs identified as **limited animation** with emphasis on expression
  over motion.
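The hold-vs-motion distinction in these pairs can be approximated numerically:
mean absolute pixel difference between frames, with a small tolerance because
H.264 decoding makes even true holds non-identical. A minimal numpy sketch on
synthetic frames; the threshold value is an assumption that would need
calibration on real decoded footage.

```python
import numpy as np

def classify_pair(frame_a, frame_b, hold_threshold=2.0):
    """Classify a frame pair as 'hold' or 'motion' by mean absolute pixel
    difference. The threshold absorbs compression noise; 2.0 is a
    placeholder, not a calibrated value."""
    diff = np.abs(frame_a.astype(np.float32) - frame_b.astype(np.float32))
    return "hold" if diff.mean() < hold_threshold else "motion"

# Synthetic example: a held frame with faint compression-like noise
# vs. a frame where content has visibly shifted.
rng = np.random.default_rng(0)
base = rng.integers(0, 255, (90, 160, 3), dtype=np.uint8)
noise = rng.integers(-1, 2, base.shape)               # per-pixel jitter in {-1, 0, 1}
held = np.clip(base.astype(np.int16) + noise, 0, 255).astype(np.uint8)
moved = np.roll(base, 8, axis=1)                      # crude stand-in for a pose shift
```

This is the pixel-level signal the vision model is implicitly reacting to; it
cannot see rig structure, only inter-frame difference.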

### Production technique summary

- **Digital cutout animation** pipeline built in Toon Boom Harmony. Characters
  are rigged as cutout puppets with hierarchical peg structures; motion is
  created by setting keyframes on transforms and letting Harmony interpolate
  the in-betweens (often called "tweens" in this context, equivalent to the
  traditional "inbetweening" process but automated by software).
- **Limited animation**: heavy use of held drawings (also called "holds" or
  "extended exposures" — a single drawing is displayed across many consecutive
  frames). The show frequently animates on threes or fours during dialogue
  and holds, switching to twos during action beats. At the show's 25 fps PAL
  frame rate, on twos yields 12.5 unique drawings per second; on threes
  yields ~8.3; on fours yields 6.25. This is consistent with standard
  TV-budget limited animation practice since Hanna-Barbera.
- **Modular compositing structure**: static background plates layered
  separately from animated character layers. Background art does not change
  frame to frame within a shot. This layer separation is fundamental to the
  cutout pipeline and visually obvious in the frame analysis.
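The drawing counts quoted above follow directly from the frame rate and the
exposure cadence; a trivial helper makes the arithmetic explicit:

```python
def drawings_per_second(fps, frames_per_drawing):
    """Unique drawings per second when each drawing is exposed for
    frames_per_drawing consecutive frames ('on twos' = 2, 'on threes' = 3,
    'on fours' = 4)."""
    return fps / frames_per_drawing

# At S7's 25 fps PAL rate:
on_twos = drawings_per_second(25, 2)    # 12.5
on_threes = drawings_per_second(25, 3)  # ~8.33
on_fours = drawings_per_second(25, 4)   # 6.25
```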

### Methodological limitations

- Qwen2.5-VL analyzes rendered composite frames and **cannot see the
  underlying rig structure**. It cannot distinguish between a cutout rig
  whose parts were transformed (Harmony's actual workflow) and a character
  that was fully redrawn frame to frame. All conclusions about "limited" vs
  "full" animation are inferred from pixel-level similarity between frames,
  not from production metadata.
- The Qwen2.5-VL "production technique summary" (analyzing three still frames
  in isolation) characterized the show as "high-framerate fluid" with "no
  visible holds" — **directly contradicting** the motion-pair analysis where
  all sequence B pairs showed near-identical held frames. This discrepancy
  demonstrates that single-frame style assessment is unreliable for temporal
  behavior; the motion-pair comparisons are the more trustworthy signal.
- Frames were extracted from H.264-compressed video. Compression artifacts
  mean that even truly held drawings will not be literally pixel-identical
  in the decoded frames.

## Relevance to video diffusion architecture

### Why temporal architecture matters for S7

The prior analysis established that video diffusion models (LTX-Video, Wan2.1,
Sora) use 3D spatiotemporal latent spaces where temporal relationships are
baked into the representation — fundamentally different from generating frames
independently.

For Totally Spies S7, this architecture is well-matched, though for
counterintuitive reasons:

1. **The temporal patterns are simple and learnable.** S7 uses limited
   animation: most frames are near-identical holds with occasional coordinated
   pose shifts. A video model's temporal compression (e.g., LTX-Video's
   8-frame temporal tokens) naturally captures this — hold frames compress to
   near-zero temporal variation in latent space, and the model learns that
   this is the expected behavior. It does not need complex physics simulation.
   It needs to learn that flat-colored shapes hold still for extended
   exposures and then transition via smooth software-interpolated in-betweens.

2. **The cutout-rig motion language is structurally regular.** Unlike
   traditional full animation (animating "on ones" with unique drawings per
   frame, as in classic Disney features), S7's character motion follows the
   predictable patterns of digital cutout animation: individual body-part
   layers rotate or translate at their peg pivot points, figures slide on
   static backgrounds, and expression changes happen via drawing substitution
   in the mouth or eye layer while the rest of the body holds. This
   regularity — geometric transforms on flat vector shapes — is exactly what
   temporal attention layers can capture efficiently.

3. **The risk is misfire, not inability.** A general-purpose video model
   trained on live-action and 3D-rendered footage will produce excessive
   inter-frame variation by default — adding physics-driven secondary motion
   (hair sway, cloth dynamics), parallax shifts, and sub-pixel noise where
   the show deliberately uses flatness and extended holds. The custom
   pipeline needs to teach the model that stillness is correct and that the
   animation density (drawings per second) is intentionally low — the show
   operates on threes, fours, or higher for much of its runtime.

4. **Frame-by-frame image generation would fail here** — not because the
   motion is complex, but because it would introduce per-frame stochastic
   variation (flickering line weights, shifting color fills, drifting
   proportions) that S7's style does not exhibit. The show's visual language
   demands that consecutive held frames be visually identical and that
   movement between key poses be smoothly interpolated. Only a model
   with temporal awareness can enforce this level of inter-frame consistency.
   (Note: "identical" here means derived from the same held drawing in the
   production pipeline. In the final H.264-encoded video, minor compression
   artifacts may introduce sub-pixel differences even within true holds.)
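The contrast drawn in points 1 and 4 can be made concrete statistically: after
temporal differencing, a held clip carries almost no new information per frame,
while even simple motion does. This is why a spatiotemporally compressing VAE
represents holds cheaply and why independent per-frame generation breaks them.
A numpy illustration on synthetic data, not any model's actual VAE:

```python
import numpy as np

rng = np.random.default_rng(1)
frame = rng.random((64, 64))

# A 12-frame hold: the same drawing repeated, plus tiny decode-like noise.
hold_clip = np.stack(
    [frame + 1e-3 * rng.standard_normal(frame.shape) for _ in range(12)]
)

# A moving clip: content shifts by one column each frame.
move_clip = np.stack([np.roll(frame, i, axis=1) for i in range(12)])

# Temporal first differences: the "new information" per frame.
hold_residual = np.abs(np.diff(hold_clip, axis=0)).mean()
move_residual = np.abs(np.diff(move_clip, axis=0)).mean()
```

The hold clip's temporal residual sits orders of magnitude below the moving
clip's, which is the regularity a temporal architecture can exploit and a
per-frame image model cannot.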

## Summary assessment

| Factor | Assessment |
|---|---|
| Does S7's style need a custom video model? | Yes — but to learn restraint (limited animation, extended holds, smooth cutout interpolation), not physical complexity |
| Is temporal architecture relevant? | Yes — hold-frame consistency (extended exposure) and smooth interpolated transitions are temporal phenomena that require cross-frame awareness |
| Would image-only generation plus stitching work? | No — per-frame variation would break the show's characteristic stillness |
| What is the training signal? | The model needs to learn: static backgrounds, cutout-rig pivot-based motion, deliberate extended holds, flat color fills, and modular composited shot structure |
| Primary risk | Excessive inter-frame variation. A pretrained video model's prior toward fluid realistic motion will add secondary dynamics, sub-pixel drift, and frame-to-frame noise that S7 deliberately avoids. The LoRA / fine-tune must suppress this bias toward high drawing counts and continuous motion. |

## Terminology note

This document uses animation industry terminology as follows:

- **Limited animation**: an animation process that reuses drawings, uses fewer
  unique drawings per second, and relies on held cels / extended exposures.
  Distinct from full animation where every frame is uniquely drawn. Standard
  reference: Wikipedia "Limited animation"; originated with Hanna-Barbera in
  the 1950s.
- **Cutout animation** / **puppet rig**: Toon Boom Harmony's standard
  character animation workflow where models are broken into separate drawing
  layers and animated via transform keyframes. Officially documented by Toon
  Boom as "cut-out animation." Not the same as physical paper cutout
  (Terry Gilliam style).
- **Hold** / **extended exposure**: a single drawing displayed across multiple
  consecutive frames. In dope-sheet / x-sheet terminology, this is
  "extending the exposure" of a drawing.
- **On ones / on twos / on threes**: industry-standard terms for how many
  frames each unique drawing is held. "On twos" = each drawing shown for 2
  frames = 12 unique drawings per second at 24 fps (or 12.5 at S7's 25 fps
  PAL rate). S7 frequently operates on threes or fours during dialogue.
- **Tween / in-between**: in digital cutout animation, the
  software-interpolated frames between keyframes. In traditional animation,
  "inbetweening" is the manual drawing process; in Harmony, it is automated.
- **Flat color fill**: solid uniform color areas without gradient, as in
  classical cel painting. This is not "cel shading" in the modern 3D NPR
  sense (toon shaders applied to 3D geometry).
- **Drawing count** (our shorthand: "animation density"): the number of
  unique drawings or significant pose changes per second. This is not a
  standard industry term; the industry describes this via "on twos / on
  threes" notation or simply as the drawing count per second. We use
  "animation density" as informal shorthand in this document.

## Open questions

- How many training clips of deliberate hold sequences (extended exposures)
  are needed to teach the model that near-zero inter-frame variation is the
  expected default?
- Does the temporal compression ratio of the chosen video VAE align well with
  S7's typical hold duration (often 8–12+ frames at 25 fps during dialogue
  holds)?
- Can LoRA training on a limited-animation dataset effectively suppress the
  base model's bias toward high animation density, or does the base model's
  prior toward fluid motion overwhelm the fine-tune?
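The second question can be given a first-order numeric framing. With a VAE
temporal compression factor of 8 (LTX-Video's published figure, per the
architecture sources above), a hold of h frames occupies ceil(h / 8) latent
timesteps. The hold durations below are the rough estimates from the frame
analysis, not measured statistics:

```python
import math

def latent_timesteps(hold_frames, temporal_factor=8):
    """Latent timesteps a hold of hold_frames frames spans under a VAE
    with the given temporal compression factor (8 for LTX-Video)."""
    return math.ceil(hold_frames / temporal_factor)

# Estimated S7 dialogue holds of 8-12+ frames span only 1-2 latent
# timesteps under factor-8 temporal compression.
spans = {h: latent_timesteps(h) for h in (8, 10, 12, 16)}
```

If typical holds collapse into one or two latent timesteps, the VAE already
represents the show's dominant temporal pattern very cheaply, which supports
the alignment hypothesis but does not settle it.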
