# Model comparison: Gemma 4 26B local vs Qwen3-VL 235B cloud
# Frame corpus: 25 stratified frames (22 S7 episodes + 3 S1-6 comparison)

## Verdict: Qwen3-VL 235B cloud

**Use Qwen3-VL 235B cloud for all further visual analysis on this project.**

---

## Head-to-head results

| Metric | Gemma 4 26B (local) | Qwen3-VL 235B (cloud) |
|---|---|---|
| Avg latency per frame | 89s | 20s |
| Empty / blank responses | **25 / 25** | 0 / 25 |
| Structured format compliance | 0% (all empty) | 100% |
| Character identifications | n/a | present but unreliable (see below) |

**Gemma 4 local was completely non-functional for this task.** Every single frame
produced an empty `content` field. The model exhausted its token budget in
`thinking` and returned nothing. This is the same thinking-mode behavior we
found earlier when using the `/api/generate` endpoint — the chat API nominally
returns a response but the visible content is blank when the model runs out
of predicted tokens before finishing its internal reasoning trace.

**Qwen3-VL 235B cloud answered every frame** with the full structured format
in ~20 seconds — 4.5× faster than Gemma 4 and actually functional.

---

## Critical finding from the Qwen3-VL outputs

Beyond model selection, the frame analysis revealed something more important
about the project itself.

### S7 character designs are not reliably identifiable as Totally Spies

Qwen3-VL misidentified characters in **6 of 25 frames (24%)**:

| Frame | Correctly: Totally Spies S7 | Model identified as |
|---|---|---|
| f00236 | Sam, Clover, Alex | *Penny Proud, Trudy Proud* |
| f01075 | S7 character | *Starfire (Teen Titans)* |
| f00205 | S7 characters | *Blossom, Bubbles, Buttercup (Powerpuff Girls)* |
| f00566 | S7 character | *Starfire (Teen Titans)* |
| f01110 | S7 characters | *Blossom, Buttercup (Powerpuff Girls 2016)* |
| f00348 | S7 character | *Trudy Proud (The Proud Family)* |

These are not random guesses. The model is assigning the frames to other
**contemporary rigged 2D TV animation shows** — Powerpuff Girls 2016,
Teen Titans, The Proud Family. These are all shows with similar:

- modern digital cutout/rigged production pipelines
- flat coloring with subtle shadows
- medium-thick consistent outlines
- stylized character proportions

### What this means

**Season 7's visual style has converged toward a generic modern rigged 2D
TV animation aesthetic.** It is no longer as visually distinct from other
contemporary Western animated series as the earlier traditionally-animated
seasons were.

A 235-billion-parameter frontier vision model, seeing the frames cold, cannot
reliably distinguish S7 Totally Spies characters from characters in other
shows of the same production era and technique.

This is a direct consequence of the S1–6 → S7 production shift (traditional
→ rigged), which we confirmed from research. When you move from hand-drawn
frame-by-frame animation to rigged digital cutouts, the resulting output
tends to look more similar to other shows using the same pipeline.

### Design era classification: Qwen3-VL's readings

Even when not misidentifying specific characters, Qwen3-VL attributed S7
frames to other shows' eras or reboot productions:

- "2023 My Adventures with Superman"
- "2016 Powerpuff Girls reboot"
- "2015 Jem and the Holograms"
- "Teen Titans mid-2000s"

Only when characters were unambiguously in their hero outfits was the model
certain it was seeing Totally Spies.

### Implications for the Cultshot pipeline

This finding revises the project risk profile upward:

1. **Identity stability is harder than we said.** We described it as "the
   first gate." It is actually *harder than first-gate* — the franchise's
   visual identity in S7 is not robustly distinct from peer shows in the
   same production style. A generic model will confuse these characters
   with characters from other rigged 2D shows.

2. **Fine-tuning on character identity is not optional.** Without it, any
   video generation system will drift toward generic modern rigged animation
   aesthetics — which is exactly what Qwen3-VL is doing when it reads
   these frames as other shows.

3. **The specific visual markers that make this Totally Spies** — the
   exact eye style, the exact proportions, the specific costume design —
   need to be the primary training signal, not just "clean 2D cutout
   animation."

4. **S1–6 traditional animation is more visually distinctive.** The model
   correctly identified early-season frames as "early 2000s Totally Spies."
   The franchise had a stronger visual fingerprint under traditional animation.

---

## For the remaining 478-frame stratified run

**Use Qwen3-VL 235B cloud** via the same chat API endpoint with
`model: qwen3-vl:235b-cloud`.

Resume command:
```
python3 /tmp/ts_s7_stratified_run.py
```
(script already saves progress and will skip the 25 done frames)

Update the model name in the script from `gemma4:26b` to `qwen3-vl:235b-cloud`
and re-run.