# IIW English pilot QA / revalidation summary

Date: 2026-05-04

Scope: revalidation after recaptioning all old-reference scaffold rows with the Ollama VLM contact-sheet workflow.

## Recaption status

All old-reference scaffold rows have been replaced with VLM contact-sheet captions.

| Metric | Value |
| --- | ---: |
| Pilot manifest clips | 607 |
| `ollama_vlm_contact_sheet` captions | 607 |
| Remaining `nearest_old_reference_manifest` captions | 0 |
| VLM caption status `ok` | 607 |
| Clips marked `training_usable=false` | 6 |
| DiffSynth training rows after filtering | 601 |
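
The counts in this table can be recomputed from the pilot manifest. The following is a minimal sketch rather than the project's actual tooling: it assumes the manifest loads as a list of dicts, and `caption_source` is a hypothetical key name for wherever the caption origin is recorded. Only `training_usable` and the caption source labels come from this report.

```python
from collections import Counter


def summarize_manifest(rows):
    """Recompute the status-table counts from a list of manifest row dicts."""
    # "caption_source" is an assumed key name; adjust to the real manifest schema.
    sources = Counter(row.get("caption_source", "unknown") for row in rows)
    # Rows that lack the flag are treated as usable (an assumption).
    unusable = sum(1 for row in rows if row.get("training_usable", True) is False)
    return {
        "clips": len(rows),
        "caption_sources": dict(sources),
        "unusable": unusable,
        "training_rows": len(rows) - unusable,
    }


# Synthetic three-row manifest, for illustration only.
status_demo_rows = [
    {"caption_source": "ollama_vlm_contact_sheet", "training_usable": True},
    {"caption_source": "ollama_vlm_contact_sheet", "training_usable": False},
    {"caption_source": "ollama_vlm_contact_sheet"},
]
summary = summarize_manifest(status_demo_rows)
print(summary)
```

Run against the real manifest, the values to compare with the table would be 607 clips, 6 unusable, and 601 training rows.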

Tooling used:

```bash
python tools/caption_iiw_pilot_generic_clips.py \
  --pilot-dir materials/training-data/iiw-english-pilot \
  --source nearest_old_reference_manifest

python tools/build_iiw_pilot_wan22_package.py \
  --pilot-dir materials/training-data/iiw-english-pilot
```

## Revalidation QA command

```bash
python tools/qa_iiw_pilot_dataset.py \
  --pilot-dir materials/training-data/iiw-english-pilot \
  --identity-manifest materials/training-data/iiw-character-identity/review/train_identity_manifest.vlm_reviewed.json \
  --old-sample 0 \
  --vlm-sample 40 \
  --identity-sample 10 \
  --force
```

Outputs:

- `materials/training-data/iiw-english-pilot/qa/pilot_qa_review.json`
- `materials/training-data/iiw-english-pilot/qa/pilot_qa_review.csv`
- `materials/training-data/iiw-english-pilot/qa/pilot_qa_summary.json`
- `materials/training-data/iiw-english-pilot/qa/contact_sheets/*.png`

## Revalidation QA counts

Video QA sample:

| Bucket | PASS | WARN | FAIL | Total |
| --- | ---: | ---: | ---: | ---: |
| Flagged unusable clips | 1 | 1 | 3 | 5 |
| VLM-caption sample | 29 | 10 | 1 | 40 |
| **Total video** | **30** | **11** | **4** | **45** |

Identity-anchor QA sample:

| Bucket | PASS | WARN | FAIL | Total |
| --- | ---: | ---: | ---: | ---: |
| Strict TRAIN identity anchor sample | 1 | 6 | 3 | 10 |
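
The PASS/WARN/FAIL tables could be re-derived from `pilot_qa_review.csv` with a small tally script. This is a sketch under assumptions: the column names `bucket` and `verdict` are hypothetical, since the actual CSV schema is not shown in this report, and the sample data is synthetic.

```python
import csv
import io
from collections import Counter

VERDICTS = ("PASS", "WARN", "FAIL")


def tally_verdicts(csv_text):
    """Return {bucket: Counter of verdicts} for a review CSV given as text."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # "bucket" and "verdict" are assumed column names.
        counts.setdefault(row["bucket"], Counter())[row["verdict"]] += 1
    return counts


def format_row(bucket, counter):
    """Render one bucket as a markdown table row in the layout used above."""
    cells = [str(counter.get(v, 0)) for v in VERDICTS]
    total = sum(counter.get(v, 0) for v in VERDICTS)
    return f"| {bucket} | " + " | ".join(cells) + f" | {total} |"


# Synthetic sample, not real QA output.
demo_csv = """bucket,verdict
vlm_sample,PASS
vlm_sample,PASS
vlm_sample,WARN
flagged_unusable,FAIL
"""
demo_counts = tally_verdicts(demo_csv)
for bucket, counter in demo_counts.items():
    print(format_row(bucket, counter))
```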

## Interpretation

1. Caption quality improved materially after replacing old-reference scaffolds: the sampled FAIL rate fell from 30% to 2.5% and the WARN rate from 50% to 25%.
   - Before: old-reference sample had 6 FAIL / 20 and 10 WARN / 20.
   - After: VLM-caption sample has 1 FAIL / 40 and 10 WARN / 40.

2. Three of the four remaining video FAILs are on rows that were already excluded from training.
   - 3 FAILs are clips already flagged `training_usable=false` and excluded from DiffSynth metadata.
   - 1 FAIL is a sampled training row with likely character misidentification: `ep05_clip_0074.mp4`.

3. The identity-anchor strict TRAIN set is still too permissive: only 1 of the 10 sampled items passed (6 WARN, 3 FAIL).
   - Many sampled strict TRAIN items are turnarounds/reference sheets.
   - QA recommends downgrading several strict TRAIN items to `EVAL_ONLY`.
   - One Clover strict TRAIN sample has visible `SEASON 8` text and should be removed/downgraded for S7 identity training.
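
The downgrade recommended in item 3 could be applied mechanically once QA verdicts are joined to the identity manifest. This is only a sketch under stated assumptions: the keys `id`, `split`, and `downgrade_reason` are hypothetical, as is the choice of which verdicts trigger a downgrade. Only the `TRAIN` and `EVAL_ONLY` labels come from this report.

```python
def downgrade_identity_items(items, verdicts, downgrade_on=("FAIL",)):
    """Return a copy of items with flagged TRAIN entries set to EVAL_ONLY.

    items: list of dicts with assumed keys "id" and "split".
    verdicts: dict mapping item id to a QA verdict string.
    """
    updated = []
    for item in items:
        new_item = dict(item)  # copy so the input stays untouched for audit
        verdict = verdicts.get(item["id"])
        if item.get("split") == "TRAIN" and verdict in downgrade_on:
            new_item["split"] = "EVAL_ONLY"
            new_item["downgrade_reason"] = f"qa_{verdict.lower()}"
        updated.append(new_item)
    return updated


# Synthetic entries; the ids are made up for illustration.
demo_items = [
    {"id": "anchor_a", "split": "TRAIN"},
    {"id": "anchor_b", "split": "TRAIN"},
    {"id": "anchor_c", "split": "TRAIN"},
]
demo_verdicts = {"anchor_a": "PASS", "anchor_b": "WARN", "anchor_c": "FAIL"}

fail_only = downgrade_identity_items(demo_items, demo_verdicts)
warn_or_fail = downgrade_identity_items(
    demo_items, demo_verdicts, downgrade_on=("WARN", "FAIL")
)
print([item["split"] for item in fail_only])
print([item["split"] for item in warn_or_fail])
```

Whether WARN items (for example turnarounds or reference sheets) should also be downgraded is a review decision, which is why the trigger set is left as a parameter here.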

## Patch applied after revalidation

`ep05_clip_0074.mp4` was conservatively excluded from training metadata after QA flagged likely character misidentification.

Patch details:

- manifest row retained for audit
- `training_usable=false`
- `training_exclusion_reason=qa_revalidation_fail_wrong_character_risk`
- rebuilt `diffsynth_metadata.jsonl`, `wan21_metadata.json`, `wan2.1_metadata.json`, and `wan22_pilot_package_manifest.json`
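
The shape of this patch can be illustrated as follows. It is a hedged sketch, not the script that was actually run: the manifest is assumed to be a list of dicts, and the `clip` key is a hypothetical name for the clip file name. The two `training_*` fields and the reason string are the ones listed above.

```python
def exclude_clip(rows, clip_name, reason):
    """Flag exactly one manifest row as unusable, keeping it for audit."""
    # "clip" is an assumed key holding the clip file name.
    matches = [row for row in rows if row.get("clip") == clip_name]
    if len(matches) != 1:
        raise ValueError(f"expected 1 row for {clip_name}, found {len(matches)}")
    matches[0]["training_usable"] = False
    matches[0]["training_exclusion_reason"] = reason
    return matches[0]


def training_rows(rows):
    """Rows that would be written to the rebuilt training metadata."""
    return [row for row in rows if row.get("training_usable", True)]


# Two-row synthetic manifest; "example_clip.mp4" is a placeholder name.
demo_manifest = [
    {"clip": "example_clip.mp4", "training_usable": True},
    {"clip": "ep05_clip_0074.mp4", "training_usable": True},
]
patched = exclude_clip(
    demo_manifest,
    "ep05_clip_0074.mp4",
    "qa_revalidation_fail_wrong_character_risk",
)
print(len(demo_manifest), len(training_rows(demo_manifest)))  # 2 1
```

The row stays in the manifest and only drops out of the filtered view, which matches the "retained for audit" behavior described above.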

Current training metadata after patch:

| Metric | Value |
| --- | ---: |
| Manifest clips | 607 |
| Training metadata rows | 601 |
| Excluded rows | 6 |

Training rows by episode:

| EP | Rows |
| --- | ---: |
| 01 | 141 |
| 04 | 162 |
| 05 | 146 |
| 20 | 152 |

## Remaining issues before smoke training

- Use the video dataset first; do not train with the strict identity-anchor manifest as-is.
- Treat identity plates as eval/reference only until that manifest is tightened.
- Run a small smoke subset before spending on the full 601-row pilot.
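
One way to pick the smoke subset is a fixed-seed sample with an equal number of usable clips per episode. This is a sketch under assumptions: the `episode` and `clip` keys are hypothetical, the rows are synthetic, and the value of 8 clips per episode is an arbitrary illustration rather than a recommendation from QA.

```python
import random
from collections import defaultdict


def pick_smoke_subset(rows, per_episode, seed=0):
    """Sample up to per_episode usable rows from each episode, reproducibly."""
    rng = random.Random(seed)
    by_episode = defaultdict(list)
    for row in rows:
        # Skip anything flagged training_usable=false.
        if row.get("training_usable", True):
            by_episode[row["episode"]].append(row)
    subset = []
    for episode in sorted(by_episode):
        candidates = by_episode[episode]
        subset.extend(rng.sample(candidates, min(per_episode, len(candidates))))
    return subset


# Synthetic rows sized like the per-episode training counts reported above.
episode_counts = {"01": 141, "04": 162, "05": 146, "20": 152}
smoke_demo_rows = [
    {"episode": ep, "clip": f"ep{ep}_demo_{i:04d}.mp4"}
    for ep, count in episode_counts.items()
    for i in range(count)
]
smoke = pick_smoke_subset(smoke_demo_rows, per_episode=8)
print(len(smoke))  # 32
```

Fixing the seed keeps the subset stable across reruns, so a failed smoke run can be repeated on the same clips.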

## Recommendation

The video-caption gate is now clear enough for a video-only smoke test. The identity-anchor gate is not cleared for strict TRAIN use; use identity plates as eval/reference only until that manifest is tightened.
