# Training Architecture Proposal: Custom Nix Image + Remote Builder

This document evaluates the proposal to use a custom Nix image (like `cultguard-agents`) as a remote build target for ML training, enabling unified local/remote training workflows.

---

## The Proposal

**Goal:** Unified training workflow where local vs. remote is just a configuration flag:

```bash
# Local training
nix run .#train -- --config ltx2_video_lora.yaml --gpu local

# Remote training (Lambda Labs)
nix run .#train -- --config ltx2_video_lora.yaml --gpu lambda

# Remote training (RunPod)
nix run .#train -- --config ltx2_video_lora.yaml --gpu runpod
```

**Architecture:**
```
┌─────────────────────────────────────────────────────────────┐
│  Custom Nix Image (based on cultguard-agents pattern)       │
│                                                             │
│  - Nix + CUDA + PyTorch pre-installed                       │
│  - LTX-2 training dependencies                              │
│  - Training script as Nix derivation                        │
│  - Deployable to: Lambda Labs, RunPod, Local                │
└─────────────────────────────────────────────────────────────┘
```

---

## Analysis: Pros and Cons

### ✅ **Pros**

1. **Unified Workflow**
   - Same command for local and remote training
   - No mental context switching
   - Easier to document and teach

2. **Reproducible Environment**
   - Nix ensures identical dependencies everywhere
   - No "works on my machine" issues
   - Version-controlled environment

3. **Nix Caching**
   - Dependencies cached across instances
   - Faster instance spin-up (after the first build)
   - Can pre-build the environment with all dependencies (see the cache sketch after this list)

4. **Infrastructure as Code**
   - GPU instances are ephemeral, defined in Nix
   - Easy to replicate training setup
   - GitOps-friendly

5. **Leverages Existing Pattern**
   - Similar to `cultguard-agents`
   - Team already familiar with pattern
   - Reuses existing tooling
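
For the caching point (pro 3), a minimal sketch of how the pre-built environment could be shared across instances through a binary cache. The bucket name and signing key are placeholders; `nix copy` and the substituter options are standard Nix:

```bash
# Build once (in CI or on a dev machine) and push the closure to a shared
# cache. "example-nix-cache" and the public key are placeholders.
nix build .#ltx2-environment
nix copy --to 's3://example-nix-cache?region=us-east-1' ./result

# A fresh GPU instance then substitutes from the cache instead of rebuilding
nix build .#ltx2-environment \
  --option extra-substituters 's3://example-nix-cache?region=us-east-1' \
  --option extra-trusted-public-keys 'example-nix-cache:<public-key>'
```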

---

### ❌ **Cons and Challenges**

1. **Training ≠ Building**
   - **Nix builds:** Stateless, <30 min, reproducible
   - **ML training:** Stateful, 4-8 hours, checkpointed
   - **Fundamental mismatch** in abstraction

2. **Checkpoint/Resume**
   ```bash
   # What happens if Nix build fails at hour 6?
   nix build --builders 'ssh://user@a100-instance' .#train-model
   # ❌ Build fails, all progress lost
   # ❌ No automatic checkpoint resume
   # ❌ Pay for 6 hours of failed training
   ```

3. **Monitoring & Debugging**
   ```bash
   # Standard Nix build output:
   building '/nix/store/...-train-model.drv'...
   
   # What you actually need:
   - GPU utilization (nvidia-smi)
   - Loss curves (wandb/tensorboard)
   - Checkpoint status
   - ETA to completion
   - Ability to pause/resume
   ```

4. **Long-Running Process Issues**
   - SSH connection timeouts (hours-long builds)
   - No intermediate output streaming
   - Can't attach/detach like tmux
   - Hard to debug mid-training

5. **Cost Inefficiency**
   ```
   Nix build overhead: ~5-10 minutes
   Training time: 4-8 hours
   Cost per run: $10-20 (A100 at $2.50/hr)
   
   Overhead cost: ~$0.20-0.40 per training run
   (for Nix build setup that provides minimal value)
   ```

6. **GPU Access in Nix Builds**
   - Nix doesn't natively handle GPU resources
   - No standard way to request "1× A100 80GB"
   - Would need custom builder configuration
   - Breaks Nix's "pure build" model
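
The closest existing workaround for that last point is Nix's system-features machinery, sketched below: the builder entry tags the remote machine with a custom `cuda` feature, and the derivation opts in via `requiredSystemFeatures`. Note this is only a coarse label; there is still no way to express "1× A100 80GB":

```bash
# The "cuda" tag is an arbitrary label, not a real GPU reservation.

# In nix.conf on the coordinating machine (or inline via --builders):
#   builders = ssh://user@a100-instance x86_64-linux - 1 1 cuda

# The training derivation would declare:
#   requiredSystemFeatures = [ "cuda" ];

# Then the remote build is requested explicitly:
nix build .#train-ltx2 \
  --builders 'ssh://user@a100-instance x86_64-linux - 1 1 cuda'
```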

---

## Technical Feasibility

### Can It Be Done? **Yes, but...**

**Minimal Implementation:**

```nix
# flake.nix
packages.train-ltx2 = pkgs.stdenv.mkDerivation {
  name = "ltx2-training-job";

  buildInputs = [ ltx2Env pkgs.tmux ];

  # This is where it gets hacky...
  buildPhase = ''
    # Start training in a detached tmux session
    tmux new -d -s training \
      "${ltx2Env}/bin/python -m ltx_trainer.train --config ${configFile} --output ./checkpoints"

    # Poll until the tmux session exits (blocks for hours)
    while tmux has-session -t training 2>/dev/null; do
      sleep 60
    done

    # Copy whatever checkpoints were written (|| true keeps partial results)
    mkdir -p $out/checkpoints
    cp -r ./checkpoints/* $out/checkpoints/ 2>/dev/null || true
  '';

  # buildPhase already populates $out, so skip the install phase
  dontInstall = true;
};
```

**Problems with This Approach:**

1. **tmux in a Nix Build?** - Very non-standard
2. **Polling Loop** - Defeats the purpose of Nix's build model
3. **No Checkpoint Resume** - If the build fails, training starts over from scratch
4. **Output Handling** - Logs are buried in the build log, not streamed
5. **Timeout** - Nix's `max-silent-time` and `timeout` settings can kill long builds

---

## Better Approach: Hybrid Pattern

### ✅ **Recommended: Nix for Environment, Shell for Training**

Instead of making training a Nix **build**, make it a Nix **development environment** + shell script:

```nix
# flake.nix
{
  # Training ENVIRONMENT (builds in ~30 min)
  packages.ltx2-environment = pkgs.stdenv.mkDerivation {
    name = "ltx2-environment";
    buildInputs = [ ltx2Env ];

    buildPhase = ''
      # Create the runner script via an unquoted heredoc: the ltx2Env store
      # path is interpolated at build time, while \$@ survives to runtime
      mkdir -p $out/bin
      cat > $out/bin/run-ltx2-training <<EOF
      #!/usr/bin/env bash
      set -euo pipefail

      # Point at the Nix-built environment
      export PYTHONPATH=${ltx2Env}/lib/python3.12/site-packages

      # Start training (checkpointing is handled by the trainer itself)
      exec ${ltx2Env}/bin/python -m ltx_trainer.train "\$@"
      EOF
      chmod +x $out/bin/run-ltx2-training
    '';
  };
  
  # Training ORCHESTRATOR (not a build, just a script)
  apps.train = {
    type = "app";
    program = "${trainingOrchestrator}/bin/train-ltx2";
  };
}
```

**Training Orchestrator Script:**

```bash
#!/usr/bin/env bash
# bin/train-ltx2

set -euo pipefail

# Parse arguments
GPU_TARGET="${GPU_TARGET:-local}"
CONFIG_FILE="${1:?usage: train-ltx2 <config.yaml>}"

case "$GPU_TARGET" in
  local)
    # Run locally
    exec ./run-ltx2-training --config "$CONFIG_FILE"
    ;;
    
  lambda|lambdalabs)
    # Deploy to Lambda Labs
    INSTANCE_ID=$(create-lambda-instance)
    scp -r ./training-data user@$INSTANCE_ID:~/
    scp ./run-ltx2-training user@$INSTANCE_ID:~/
    ssh user@$INSTANCE_ID "
      cd ~
      tmux new -d -s training './run-ltx2-training --config $CONFIG_FILE'
      echo 'Training started on $INSTANCE_ID'
      echo 'Attach with: ssh user@$INSTANCE_ID tmux attach -t training'
    "
    ;;
    
  runpod)
    # Deploy to RunPod
    POD_ID=$(create-runpod-pod)
    # Similar pattern...
    ;;
    
  *)
    echo "Unknown GPU target: $GPU_TARGET"
    exit 1
    ;;
esac
```

**Usage:**
```bash
# Local training
nix run .#train -- ltx2_video_lora.yaml

# Remote training (Lambda)
GPU_TARGET=lambda nix run .#train -- ltx2_video_lora.yaml

# Remote training (RunPod)
GPU_TARGET=runpod nix run .#train -- ltx2_video_lora.yaml
```

---

## Comparison: Pure Nix vs Hybrid

| Aspect | Pure Nix Build | Hybrid (Recommended) |
|--------|----------------|---------------------|
| **Training as** | Nix derivation | Shell script |
| **Environment** | Built into derivation | Nix package |
| **Checkpointing** | ❌ Not supported | ✅ Native (training script) |
| **Monitoring** | ❌ Build logs only | ✅ tmux, wandb, nvidia-smi |
| **Resume** | ❌ Start over | ✅ From checkpoint |
| **SSH Timeout** | ❌ Build fails | ✅ tmux survives |
| **Cost on Failure** | ❌ Lose all progress | ✅ Keep checkpoints |
| **Unified Workflow** | ✅ Yes | ✅ Yes |
| **Nix Purity** | ✅ Yes | ⚠️ Environment only |
| **Implementation** | ❌ Complex hacks | ✅ Straightforward |

---

## cultguard-agents Pattern: What Actually Works

Looking at the `cultguard-agents` pattern, here is what made it successful:

### ✅ **Good Patterns to Reuse:**

1. **Environment as Nix Package**
   ```nix
   packages.agent-env = mkDerivation { ... };
   ```
   - Builds reproducibly
   - Can deploy to any machine
   - Works great for agents

2. **Runner Script**
   ```bash
   nix run .#run-agent -- --config my-agent.yaml
   ```
   - Unified local/remote
   - Not a "build", just environment + script

3. **Remote Deployment**
   ```bash
   # Script handles:
   # 1. Spin up instance
   # 2. Copy environment
   # 3. Run agent in tmux
   # 4. Stream logs back
   ```
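
A concrete version of those four steps, as a hedged sketch: `launch-instance` and the `run-agent` entrypoint are hypothetical placeholders, while `nix copy` and `tmux pipe-pane` are standard tools:

```bash
# 1. Spin up instance (launch-instance is a hypothetical provider wrapper)
INSTANCE=$(launch-instance --gpu a100)

# 2. Copy the Nix-built environment into the remote store
nix copy --to ssh://user@$INSTANCE .#agent-env

# 3. Run the agent detached in tmux (run-agent is the env's entrypoint)
ssh user@$INSTANCE "tmux new -d -s agent 'run-agent --config my-agent.yaml'"

# 4. Stream logs back: pipe the pane to a file, then tail it
ssh user@$INSTANCE 'tmux pipe-pane -t agent -o "cat >> ~/agent.log"'
ssh user@$INSTANCE 'tail -f ~/agent.log'
```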

### ❌ **What Doesn't Translate to ML Training:**

1. **Agent Runtime ≠ Training Job**
   - Agents: Long-running, but stateless
   - Training: Long-running, **stateful** (checkpoints)

2. **Lifecycle**
   - Agents: Deploy once, run indefinitely
   - Training: Deploy, run 4-8 hours, shut down

3. **Failure Mode**
   - Agents: Restart from scratch (stateless)
   - Training: **Must resume from checkpoint**
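
In practice that means the orchestrator, not Nix, owns the resume logic. A sketch, assuming the `--resume-from` flag and `step-*.safetensors` checkpoint layout used in the Week 3 examples below:

```bash
# Resume from the newest checkpoint if one exists; otherwise start fresh.
# The flag and file layout are assumptions about ltx_trainer's CLI.
LATEST=$(ls -t checkpoints/step-*.safetensors 2>/dev/null | head -n 1)
if [ -n "$LATEST" ]; then
  exec python -m ltx_trainer.train --config "$CONFIG_FILE" --resume-from "$LATEST"
else
  exec python -m ltx_trainer.train --config "$CONFIG_FILE"
fi
```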

---

## Recommended Architecture for Totally Spies

### **Phase 1: Environment as Nix Package**

```nix
# flake.nix
{
  # Training environment (builds in 30 min)
  packages.ltx2-environment = pkgs.stdenv.mkDerivation {
    name = "ltx2-environment";
    # ... PyTorch, CUDA, dependencies
  };
  
  # Training runner (shell script)
  apps.train-ltx2 = {
    type = "app";
    program = "${trainingScript}/bin/train-ltx2";
  };
}
```

### **Phase 2: Orchestrator Script**

```bash
#!/usr/bin/env bash
# tools/train-ltx2

set -euo pipefail

GPU_TARGET="${GPU_TARGET:-local}"

case "$GPU_TARGET" in
  local)
    nix develop .#ltx2 --command python -m ltx_trainer.train "$@"
    ;;

  lambda)
    # 1. Build environment once
    nix build .#ltx2-environment

    # 2. Launch Lambda instance with base Ubuntu
    #    (lambda-launch is a placeholder for the provider API call)
    INSTANCE=$(lambda-launch --gpu a100-80gb)

    # 3. Copy training data
    scp -r training-data/ user@$INSTANCE:~/

    # 4. Install Nix on the instance (one-time, ~30 min with env setup)
    ssh user@$INSTANCE 'sh <(curl -L https://nixos.org/nix/install) --daemon'

    # 5. Copy the pre-built environment into the remote Nix store
    nix copy --to ssh://user@$INSTANCE ./result
    STORE_PATH=$(readlink ./result)

    # 6. Install the environment and start training in tmux
    #    (unquoted heredoc: $STORE_PATH and $INSTANCE expand locally)
    ssh user@$INSTANCE <<ENDSSH
      nix profile install $STORE_PATH
      tmux new -d -s training 'run-ltx2-training --config ltx2_video_lora.yaml'

      echo "Training started!"
      echo "Attach:  ssh user@$INSTANCE tmux attach -t training"
      echo "Monitor: ssh user@$INSTANCE watch -n 30 nvidia-smi"
ENDSSH
    ;;

  runpod)
    # Similar pattern for RunPod
    ;;
esac
```

### **Phase 3: Custom Image (Optional, for Scale)**

Only worth it if you expect to run 5+ training jobs:

```bash
# 1. Do first training with Phase 2
# 2. If successful, snapshot the instance
# 3. Upload as custom image to Lambda
# 4. Use custom image for subsequent runs

lambda create-image --instance $INSTANCE --name totally-spies-ltx2
```

---

## Implementation Plan

### Week 1: Base Infrastructure

```bash
# 1. Create ltx2-environment package
nix build .#ltx2-environment

# 2. Test locally
nix develop .#ltx2 --command python -m ltx_trainer.train --config test.yaml

# 3. Create orchestrator script
./tools/train-ltx2 --help
```

### Week 2: Remote Integration

```bash
# 1. Lambda Labs integration
GPU_TARGET=lambda ./tools/train-ltx2 --config ltx2_video_lora.yaml

# 2. Test end-to-end
# - Instance launch
# - Environment setup
# - Training start
# - Checkpoint save
# - Instance shutdown
```

### Week 3: Production Hardening

```bash
# 1. Checkpoint resume
./tools/train-ltx2 --resume-from ./checkpoints/step-5000.safetensors

# 2. Monitoring integration
./tools/train-ltx2 --monitor wandb

# 3. Custom image (if needed)
./tools/train-ltx2 --create-image
```

---

## Verdict

### **Should You Use Custom Nix Image as Remote Builder for Training?**

**Answer: ⚠️ Partially**

| Component | Use Nix? | Why |
|-----------|----------|-----|
| **Environment** | ✅ **Yes** | Reproducible, cached, versioned |
| **Training Job** | ❌ **No** | Stateful, long-running, needs checkpointing |
| **Orchestration** | ⚠️ **Hybrid** | Nix for env, shell for deployment |
| **Custom Image** | ✅ **Yes** (optional) | For repeated training runs |

---

### **Recommended Pattern:**

```bash
# ✅ DO: Use Nix for environment
nix build .#ltx2-environment

# ✅ DO: Use shell script for orchestration
./tools/train-ltx2 --gpu lambda --config ltx2_video_lora.yaml

# ✅ DO: Use tmux for long-running training
ssh user@instance 'tmux attach -t training'

# ❌ DON'T: Make training a Nix build
nix build --builders 'ssh://user@instance' .#train-model  # Will have problems

# ✅ DO: Create custom image for repeated runs
./tools/train-ltx2 --create-image  # After validating setup
```

---

### **Feels Like cultguard-agents Because:**

- ✅ Same Nix environment pattern
- ✅ Same unified local/remote workflow
- ✅ Same deployment orchestration

**But Different Because:**

- ❌ Training is stateful (agents are stateless)
- ❌ Training needs checkpointing (agents restart fresh)
- ❌ Training is finite (agents run forever)

---

## Related Files

- `flake.nix` - Base Nix configuration
- `docs/internal/nix-cloud-jobs.md` - Cloud job architecture
- `docs/internal/cloud-provider-recommendation.md` - Provider comparison
- `../cultguard-agents/` - Reference implementation (environment pattern)
