# Progress Report — 2026-04-04

## Session scope

This session picked up from the archived session `50f08833` (which died at 31 MB
from accumulated image payloads). We've been:

1. Diagnosing and archiving the dead session
2. Building a GPU transcript pipeline (ASR + multimodal disambiguation)
3. Provisioning a fresh Hetzner RTX 6000 Ada builder
4. Running the full training data preparation pipeline end-to-end
5. Fixing runtime issues as they surfaced

## Pipeline DAG status

| Step | Type | Status | Store path / notes |
|------|------|--------|------------|
| step1-scenes | CPU | ✅ Built | `2gdznav0…-spies-step1-scenes-0.1.0` |
| step2-transcript-base | GPU | ✅ Built | `2j5fdbzj…-spies-step2-transcript-base-0.1.0` |
| step2-bible | CPU | ✅ Built | `31ji0g9j…-spies-step2-bible-0.1.0` |
| step2-shot-reference | CPU | ✅ Built | `34qnkrnz…-spies-step2-shot-reference-0.1.0` |
| step2-transcript-disambiguate | GPU | ✅ Built | `cahb2fkv…-spies-step2-transcript-disambiguate-0.1.0` |
| step2-scene-context | CPU | ✅ Built | `i1dc7qjp…-spies-step2-scene-context-0.1.0` |
| step2-caption | GPU | ✅ Built (earlier run, needs rebuild with latest source) | |
| step2-reviewed | CPU | ✅ Built (earlier run, needs rebuild with latest source) | |
| step3-ltx2-data | CPU | ✅ Built (earlier run, needs rebuild with latest source) | |
| step3-wan21-package | CPU | ❌ Bug fixed, not yet rebuilt | zip path issue (fixed in `0c77dd6`) |
| step4-ltx2-preprocess | GPU | ⏳ Not started | requires step3-ltx2-data |
| step5-ltx2-train | GPU | ⏳ Not started | requires step4 |
| step5-wan21-train | GPU | ⏳ Not started | requires step2-reviewed |

## Commits this session (15 total)

```
0c77dd6 Fix wan21 zip: don't cd work (persists to installPhase)
8f65a19 Fix wan21 packaging: mkdir work before cp, use /* glob
8075281 Add dontUnpack to step3 packaging derivations (no src needed)
cf651b3 Fix permission error in step2-reviewed: chmod after cp from store
d702570 Add ffmpeg to step2-reviewed nativeBuildInputs
50f7f39 Make flash_attention_2 best-effort, not required
f609e3e Don't force CPU steps to GPU server; build them locally
10d3fbb Maximize remote builder utilization and add progress monitoring
a6c3f0e Make JSON parser robust against malformed model output
ac7bb9b Detect model architecture from config.json for correct class selection
a18b4ec Strip unsupported processor kwargs before generate()
b5147d1 Fix Omni generation crash: use Thinker (B) with VL fallback (C)
b994d3e Improve transcript disambiguation quality and audit detail
a003cac Make transcript review a GPU pipeline with ASR + Omni disambiguation
df63d66 Archive session 50f08833: root cause analysis and continuation plan
```

## Builder status

- **Server**: Hetzner Auction #2958772, NBG1-DC6
- **Hardware**: Xeon Gold 5412U (24c/48t), 128 GB DDR5 ECC, RTX 6000 Ada 48 GB, 2× 1.92 TB NVMe RAID0
- **Nix builder spec**: `24 maxJobs / 48 speedFactor` with `cuda` feature
- **Status**: Active, not cancelled
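
The builder spec above corresponds to a single `builders` line in
`builder-remote.conf`. A sketch of that line, with the SSH key field elided
(`-`) and the host behind an alias:

```
# URI           platform       ssh-key  maxJobs  speedFactor  supportedFeatures
builders = ssh-ng://builder x86_64-linux - 24 48 cuda
```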

## Resource utilization — are we using it correctly?

### What's working well

1. **Parallel independent steps**: With `maxJobs=24`, Nix schedules step1-scenes +
   step2-bible + step2-shot-reference + step2-transcript-base concurrently. Confirmed
   via progress check showing 4 builds running simultaneously.

2. **GPU dispatching**: Only `requiredSystemFeatures = ["cuda"]` derivations go remote.
   CPU steps build wherever Nix schedules them (see the sketch after this list).

3. **Model loading**: bf16 precision on Ada architecture is correct. VLM loads on GPU
   with `device_map="auto"`.
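
To make the dispatch rule from item 2 concrete, here is a minimal sketch of a
GPU-step derivation (the pname and phases are illustrative, not the real
pipeline code):

```nix
pkgs.stdenv.mkDerivation {
  pname = "spies-step2-transcript-base";  # illustrative name
  version = "0.1.0";
  dontUnpack = true;                      # sketch only; no src needed
  # Only builders advertising the `cuda` system feature may run this build;
  # everything without this attribute builds wherever the scheduler decides.
  requiredSystemFeatures = [ "cuda" ];
  installPhase = "mkdir -p $out";
}
```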

### What's NOT working well — and how to fix it

1. **CPU steps still go to the GPU server unnecessarily**.

   The `max-jobs = 0` setting was removed from `builder-remote.conf`, but Nix is
   still sending CPU derivations to the builder (visible in the build logs:
   step2-bible and step2-shot-reference building on `ssh-ng://builder`).

   **Root cause**: Without `max-jobs = 0`, Nix uses a heuristic — it sends work to
   the fastest available builder based on `speedFactor`. Our builder has speedFactor=48
   while local is 1, so Nix _prefers_ the remote for everything.

   **Fix options**:
   - Set the local `system-features` to everything except `cuda`, and make
     CPU derivations explicitly _not_ require `cuda`. Architecturally this
     already works; the problem is that Nix's scheduler optimizes for speed,
     not cost. That isn't wrong, just wasteful when the GPU server bills by
     the hour.
   - Accept it for now — the CPU steps are fast (seconds) and the real cost is
     GPU time. The overhead of sending CPU work to the remote builder is small
     compared to the value of getting the pipeline working.
   - For production: add `preferLocalBuild = true` to CPU-only derivations so Nix
     prefers local even when a faster remote is available (see the sketch below).
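
   A minimal sketch of that last option (derivation name illustrative):

   ```nix
   pkgs.stdenv.mkDerivation {
     pname = "spies-step2-bible";  # illustrative name
     version = "0.1.0";
     dontUnpack = true;            # sketch only; no src needed
     # preferLocalBuild tells the scheduler to keep this build local even
     # when a remote builder with a higher speedFactor is available.
     preferLocalBuild = true;
     installPhase = "mkdir -p $out";
   }
   ```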

2. **flash_attn not available in nixpkgs GPU env**.

   The GPU Python env (transformers 5.3.0 + torch 2.11.0) doesn't include
   `flash-attn`. Models fall back to SDPA attention, which is decent but roughly
   20-30% slower than flash_attention_2 on Ada at large sequence lengths.

   **Fix**: Package `flash-attn` as a Nix overlay or add it to the gpuPython
   env. This is a worthwhile optimization but not a blocker.
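
   If we take the env route, a sketch (assuming `gpuPython` is assembled with
   `python3.withPackages`; `flashAttn` is a hypothetical derivation, since
   nixpkgs doesn't ship one and building it from source means fighting the
   CUDA toolchain):

   ```nix
   # Sketch only: `flashAttn` is a hypothetical, separately packaged derivation.
   gpuPython = pkgs.python3.withPackages (ps: [
     ps.torch
     ps.transformers
     flashAttn  # hypothetical flash-attn package; needs nvcc at build time
   ]);
   ```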

3. **Qwen2.5-Omni is incompatible with transformers 5.3.0**.

   The Omni model crashes on two separate code paths (talker generation, mrope_section).
   Option C fallback to Qwen2.5-VL works but loses audio grounding.

   **Fix**: Either pin transformers to the Omni-compatible version
   (`v4.51.3-Qwen2.5-Omni-preview`) in a separate Python env, or wait for an
   upstream fix. For now, VL fallback is functional.
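
   The pinned-env option could look like this sketch. The tag is the one named
   above; `pkgs.lib.fakeHash` is a placeholder, and the first failed build will
   print the real hash to substitute:

   ```nix
   # Sketch of a second Python env pinned to the Omni-compatible transformers tag.
   omniPython = pkgs.python3.withPackages (ps: [
     ps.torch
     (ps.transformers.overridePythonAttrs (old: {
       version = "4.51.3";
       src = pkgs.fetchFromGitHub {
         owner = "huggingface";
         repo = "transformers";
         rev = "v4.51.3-Qwen2.5-Omni-preview";
         hash = pkgs.lib.fakeHash;  # placeholder; swap in the hash Nix reports
       };
     }))
   ]);
   ```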

4. **Source code changes cause full pipeline rebuilds**.

   Every commit to any tool in the `src` filter causes all derivations to get new
   hashes, even if only one tool changed. This means fixing a wan21 zip bug
   triggers a full rebuild of step2-transcript-base (GPU ASR).

   **Fix**: Split the source filter per-derivation so each step only depends on
   its own tool script. This is a moderate refactor but would dramatically reduce
   rebuild waste.
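
   A sketch of the per-step filter using `lib.fileset` (the paths are
   illustrative, not the repo's actual layout):

   ```nix
   # Each step's src tracks only the files it actually uses, so editing
   # another tool no longer changes this derivation's hash.
   src = lib.fileset.toSource {
     root = ./.;
     fileset = lib.fileset.unions [
       ./tools/step3_wan21_package.py  # hypothetical path to this step's tool
       ./tools/common.py               # hypothetical shared helper
     ];
   };
   ```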

5. **`devenv tasks run` doesn't see new task names**.

   The new tasks (`train:step2-transcript-base`, `models:ingest:asr`, etc.) exist
   in `devenv.nix` but `devenv tasks run` doesn't load them. This forced us to use
   direct `nix build` commands instead.

   **Fix**: Likely needs `devenv shell` reload or a `devenv.lock` update. Low
   priority since direct `nix build` works.

## Immediate next actions

1. **Rebuild step3-wan21-package** with the zip path fix (`0c77dd6`)
2. **Confirm step2-caption → step2-reviewed → step3 chain** succeeds end-to-end
3. **Inspect the actual transcript and caption outputs** to assess quality
4. **Cancel the Hetzner server** when validation is complete to stop billing

## Estimated remaining GPU time

| Step | Estimated time |
|------|--------------------|
| step3-wan21-package rebuild | <1 min (CPU, zip only) |
| step4-ltx2-preprocess | ~30-60 min |
| step5-ltx2-train | ~4-8 hours |
| step5-wan21-train | ~2-4 hours |

Steps 1-3 (data prep) are essentially done. Steps 4-5 (actual model training)
are the bulk of remaining GPU time.
