# Proposal feasibility review — 2026-04-01

Purpose: internal first-principles review of the external `Totally Spies` marketing-model proposal.

This note pressure-tests the proposal technically and commercially before it is used in any external discussion.

## Executive summary

- **Text agent:** high feasibility, low technical risk.
- **Image model:** medium to high feasibility for solo characters, closeups, props, and locations.
- **Video model:** medium feasibility for short anchored motion clips; low to medium for multi-character motion and prop interaction.
- **Main conclusion:** the proposal is directionally valid, but the hardest outputs should be treated as explicit validation gates rather than assumed baseline deliverables.
- **Most important commercial adjustment:** asking for `300 minutes` of source runtime is materially safer than `100 minutes`.

## First-principles assessment

### 1) Text model / agent

This is the safest part of the proposal.

What it really is:
- a prompt and workflow orchestration layer
- a brief-to-shot-structure interpreter
- a naming / prompt-rules / storyboard support system

What makes it feasible:
- this is standard LLM agent engineering, not frontier model research
- the hard part is not whether it can exist, but whether its outputs map cleanly into the image and video generation workflows
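The output contract matters more than the agent itself: what the agent emits must slot directly into the image and video workflows. A minimal sketch of that brief-to-shot-structure contract is below; every field name and the rule-based expansion are hypothetical stand-ins for what the real LLM-driven agent would produce, not part of the proposal.

```python
from dataclasses import dataclass, field

@dataclass
class ShotSpec:
    """One shot the image/video stack can consume. All fields are hypothetical."""
    shot_id: str
    subject: str                  # character or prop identifier
    framing: str                  # e.g. "closeup", "medium", "full"
    medium: str                   # "still" or "motion"
    prompt_tags: list[str] = field(default_factory=list)

def brief_to_shots(brief: dict) -> list[ShotSpec]:
    """Expand a campaign brief into per-subject shot specs.

    A real agent would do this with an LLM plus prompt recipes; this
    rule-based stand-in only shows the shape of the output contract.
    """
    shots = []
    for i, subject in enumerate(brief.get("subjects", [])):
        for framing in brief.get("framings", ["closeup"]):
            shots.append(ShotSpec(
                shot_id=f"{brief['campaign']}-{i:02d}-{framing}",
                subject=subject,
                framing=framing,
                medium=brief.get("medium", "still"),
                prompt_tags=[subject, framing] + brief.get("style_tags", []),
            ))
    return shots

shots = brief_to_shots({
    "campaign": "teaser",
    "subjects": ["hero_a", "hero_b"],
    "framings": ["closeup", "full"],
    "style_tags": ["key_art_palette"],
})
print(len(shots))  # 2 subjects x 2 framings = 4 specs
```

If the downstream image and video pipelines only ever accept this shape, the agent can be evaluated on whether its structures are valid, independent of generation quality.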

Main risk:
- the agent is only as good as the prompt recipes and evaluation loops behind it
- if the image or video stack is unstable, the agent will look less useful externally than it really is

Assessment:
- high feasibility
- low compute risk
- moderate integration risk

### 2) Image model

This is feasible within the current scope if the source material is clean and well distributed across the chosen assets.

What is realistically strong:
- hero closeups
- medium character shots
- full-figure hero stills
- key art
- location beauty frames
- prop / gadget beauty shots

What is plausible but needs tuning:
- duo compositions
- character plus gadget interaction
- repeated campaign families with controlled framing

What is hardest:
- trio compositions with all three leads reading cleanly in one frame
- object interaction where hand-to-prop logic must look correct
- maintaining the same character identity across multiple framing families without drift

Why this is hard:
- multi-character composition is still an active research problem
- stacked LoRAs or multi-concept conditioning introduce concept confusion and weaken identity fidelity
- hand-object fidelity is still a persistent failure mode across image models

Assessment:
- solo character stills: high feasibility
- duo stills: medium to high feasibility
- trio stills: low to medium feasibility
- prop interaction stills: medium feasibility

### 3) Video model

This is the highest-risk track in the proposal.

What is realistically achievable:
- short motion clips
- image-anchored motion from approved keyframes
- modular editorial elements that can be cut together into marketing pieces

What is risky:
- multi-character motion with identity lock
- trio shots in motion
- prop interaction in motion
- dialogue closeups in motion where face, hair, and mouth shapes must stay stable
- matching image-model look exactly inside the video model

First-principles reason:
- video generation combines all the hard parts of image generation with temporal consistency
- if identity is unstable in stills, it gets worse in motion
- if object interaction is unstable in stills, it gets worse once hands and props move
- video models are better treated as short-shot engines than long-sequence engines in this context
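The "short-shot engine" framing implies an assembly plan in which no generated unit exceeds a few seconds and continuity comes from editing, not generation. A minimal planner sketch (the 4-second cap and all names are assumptions for illustration):

```python
def plan_shot_units(total_s: float, max_unit_s: float = 4.0) -> list[float]:
    """Split a target marketing runtime into short generatable units.

    Video models in this context are treated as short-shot engines, so
    nothing longer than max_unit_s is ever requested from the model;
    long-form coherence is an editorial problem, not a generation problem.
    """
    units = []
    remaining = total_s
    while remaining > 1e-9:
        unit = min(max_unit_s, remaining)
        units.append(round(unit, 3))
        remaining -= unit
    return units

print(plan_shot_units(15.0))  # [4.0, 4.0, 4.0, 3.0]
```

Planning this way also makes failures cheap: a bad 4-second unit is regenerated in isolation instead of invalidating a whole sequence.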

Assessment:
- anchored single-character short clips: medium to high feasibility
- short multi-shot marketing assembly: medium feasibility
- trio motion and prop interaction: low to medium feasibility

## Scope review

Original proposal ask:
- 6 characters
- 5 gadgets / props / objects
- 3 locations
- 100 minutes of runtime

Internal conclusion:
- the asset counts are acceptable
- the runtime ask was too thin
- `300 minutes` is a materially better minimum because it increases:
  - usable frame diversity
  - outfit and expression coverage
  - overlap between characters, props, and locations
  - the odds of extracting enough clean material for the hardest validations
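The runtime argument can be grounded with rough arithmetic. The frame rate, retention fraction, and even-coverage assumption below are illustrative placeholders, not measurements; the point is only that per-concept budgets get thin fast:

```python
FPS = 24              # typical animation frame rate (assumption)
USABLE = 0.15         # fraction of frames surviving dedup + curation (assumption)
CONCEPTS = 6 + 5 + 3  # characters + props + locations in the proposed scope

def usable_frames_per_concept(runtime_min: float) -> int:
    """Rough per-concept frame budget if coverage were spread evenly.

    Real coverage is never even, which is exactly why the raw runtime
    number needs headroom: at 100 minutes there is little margin left
    once any single concept turns out to be underrepresented.
    """
    total_frames = runtime_min * 60 * FPS
    return int(total_frames * USABLE / CONCEPTS)

for minutes in (100, 300):
    print(minutes, usable_frames_per_concept(minutes))
# 100 -> 1542, 300 -> 4628 under these assumptions
```

Tripling the runtime roughly triples the per-concept budget, which is the margin the hardest validations depend on.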

Recommendation:
- keep the asset counts
- raise runtime from `100` to `300` minutes
- ask for scripts and storyboards for that runtime where available
- ask for clean environment plates wherever they exist

## Timeline review

### Proposed timeline
- first 3 weeks: first agent, first image pass, first video generation pass, output tests
- following 3 to 6 weeks: tuning, validation, final delivery

Assessment:
- image and agent tracks fit this timeline
- video track may fit only if:
  - the selected assets are well covered in the source runtime
  - the first video pass is image-anchored rather than purely prompt-driven
  - the hardest shot families are validated early

Main blind spot:
- the proposal should internally assume that **video is the pacing item**
- the first three weeks should be used to prove or disprove the video path quickly

## Budget review

### Proposal frame
- total: `$25,000`
- 4 equal payments

Internal assessment:
- commercially reasonable if scope stays constrained
- labor, not compute, likely dominates cost
- compute cost is manageable if training remains LoRA-based and the infra stays efficient
- the budget becomes fragile if video tuning requires many retries

Main commercial blind spot:
- the proposal does not say what happens if the video model underperforms while image + agent outputs are strong

Internal recommendation:
- define fallback internally before kickoff:
  - if video quality fails the bar after first-pass validation, move to an image-first + anchored-motion delivery posture rather than burning the schedule on open-ended retries
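The fallback is most useful if expressed as an explicit gate agreed before kickoff, so the decision is mechanical rather than renegotiated mid-project. A sketch with illustrative check names and posture labels:

```python
def delivery_posture(first_pass: dict) -> str:
    """Map first-pass video validation results onto a delivery posture.

    first_pass holds booleans for the hard video checks. The rule
    mirrors the internal recommendation: open-ended video retries are
    never one of the outcomes.
    """
    if first_pass.get("anchored_motion") and first_pass.get("identity_lock"):
        return "full video track"
    if first_pass.get("anchored_motion"):
        return "image-first + anchored-motion"
    return "image + agent only"

print(delivery_posture({"anchored_motion": True, "identity_lock": False}))
# image-first + anchored-motion
```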

## Blind spots to keep in mind

### Technical blind spots
- trio compositions are the hardest still-image deliverable
- hand / prop interaction remains a persistent failure mode
- image-model look and video-model look may not match exactly
- multi-concept conditioning weakens identity fidelity
- caption quality and curation quality strongly affect final outputs
- motion clips should be treated as modular shot units, not continuous scenes

### Data blind spots
- runtime quantity alone is not enough; per-concept coverage matters
- if the 300 minutes do not materially cover the chosen 6 + 5 + 3 scope, the dataset will still be weak
- scripts and storyboards are valuable because they help:
  - align captions
  - identify priority shots
  - map intended action to actual frames
  - reduce ambiguity during curation
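Per-concept coverage can be checked mechanically during curation rather than discovered late in training. A sketch using hypothetical clip tags and an assumed 5-minute-per-concept threshold:

```python
from collections import defaultdict

def coverage_report(clips, required_min=5.0):
    """Sum tagged runtime per concept and flag anything under the
    required_min minutes threshold (the threshold is an assumption)."""
    minutes = defaultdict(float)
    for tags, duration_min in clips:
        for tag in tags:
            minutes[tag] += duration_min
    return {tag: (m, m >= required_min) for tag, m in minutes.items()}

clips = [
    (["hero_a", "lab"], 6.0),       # one clip can cover several concepts
    (["hero_b"], 3.5),
    (["hero_a", "gadget_x"], 2.0),
]
print(coverage_report(clips))
```

Running this over the full curated set before training starts turns "the dataset will still be weak" from a risk into a measurable go/no-go check.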

### Commercial blind spots
- the proposal does not define acceptance criteria explicitly
- the proposal does not state ongoing inference / hosting expectations
- the proposal does not state who owns the trained weights and workflow artifacts
- the proposal does not spell out what is outside scope in post-production beyond the high-level wording

## Recommended validation priorities

These are the outputs most worth validating first because they are both commercially important and technically difficult:

1. trio hero grouping
2. prop or gadget interaction
3. dialogue / reaction closeup
4. short anchored motion clip from an approved keyframe

If these work, the rest of the marketing system is much more believable.

## Suggested internal qualification language

The proposal is strongest if treated as:
- a first-cycle marketing model package
- designed for short-form campaign outputs
- with image deliverables as the base certainty layer
- and video deliverables as a validated short-shot layer, not an unlimited motion promise

## Bottom line

From first principles, the proposal is **credible** if qualified correctly.

It is strongest as:
- a text agent + image model package with a video extension
- focused on short modular marketing outputs
- backed by a larger runtime ask (`300 minutes`)
- with explicit attention on the hardest validations early

It is weakest if interpreted as:
- guaranteed multi-character motion reliability across many shots
- guaranteed prop interaction fidelity in motion
- guaranteed image/video style lock without iteration

## Research references used for this review

- **Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V** — arXiv:2510.27364
- **ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models** — arXiv:2505.07652 / CVPR 2025 poster
- **The Chosen One: Consistent Characters in Text-to-Image Diffusion Models** — arXiv:2311.10093
- **LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models** — arXiv:2403.11627
- **Multi-LoRA Composition for Image Generation** — arXiv:2402.16843
- **MC-LoRA: Fast Modular Composition for Multi-Character Diffusion Generation** — OpenReview 2025
- **Controllable Video Generation: A Survey** — arXiv:2507.16869
- **Identity-Preserving Text-to-Video Generation by Frequency Decomposition** — arXiv:2411.17440
- **Anchored Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models** — arXiv:2512.16371
