# Cloud Training Package (Pure Nix Approach)

Since setting up CUDA locally on NixOS is complex, this approach prepares a **self-contained training package** that can be uploaded to any cloud GPU provider.

## What's Included

The package (`cloud-training-package/wan22-smoke-training.tar.gz`) contains:
- 160 training video clips (smoke test subset)
- Training metadata (DiffSynth format)
- Training script (`train.py`)
- Launch script (`launch.sh`)
- Python dependencies (`requirements.txt`)
- Usage instructions (`README.md`)

**Package size:** ~600 MB

## How It Works

```
┌─────────────────────────────────────┐
│  NixOS Machine (Your Bastion)       │
│  - Pure Nix data preparation        │
│  - No CUDA needed                   │
└──────────────┬──────────────────────┘
               │
               │ Upload tarball
               ▼
┌─────────────────────────────────────┐
│  Cloud GPU Provider                 │
│  - Ubuntu/Debian with CUDA drivers  │
│  - PyTorch + DiffSynth installed    │
│  - RTX 4090 / A100 GPU              │
└──────────────┬──────────────────────┘
               │
               │ Download trained LoRA
               ▼
┌─────────────────────────────────────┐
│  NixOS Machine                      │
│  - Use trained weights for inference│
└─────────────────────────────────────┘
```

## Recommended Providers

| Provider | GPU | Hourly Rate (USD) | Est. Cost (smoke test) |
|----------|-----|-------------------|------------------------|
| **RunPod** | RTX 4090 | $0.70 | $2-3 |
| **Vast.ai** | RTX 4090 | $0.40-0.60 | $1.50-2.50 |
| **Lambda Labs** | A100 40GB | $1.50 | $4-6 |

**Expected training time:** 2-3 hours for smoke test

## Step-by-Step Guide

### 1. Create Training Package

```bash
cd /home/workspaces/totally-spies-cultshot
nix-shell -p python3 --run "python3 tools/prepare-cloud-training-package.py"
```

Output: `cloud-training-package/wan22-smoke-training.tar.gz`
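
To sanity-check the package before uploading it, you can list the archive (optional; the exact file names inside depend on the packaging script):

```bash
# Confirm the tarball exists and peek at its contents
ls -lh cloud-training-package/wan22-smoke-training.tar.gz
tar tzf cloud-training-package/wan22-smoke-training.tar.gz | head -20
```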

### 2. Rent GPU Instance

**Example: RunPod**

1. Go to https://runpod.io/console/
2. Select an RTX 4090 (24GB) instance
3. Choose the PyTorch 2.5 + CUDA 12.4 template
4. Deploy

You'll get:
- SSH access
- Jupyter notebook
- ~50GB disk space
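
Before uploading anything, it's worth confirming the instance actually exposes the GPU you paid for (a quick check over SSH; `<gpu-ip>` is the address the provider shows you):

```bash
# Verify the GPU, driver, and CUDA versions on the fresh instance
ssh root@<gpu-ip> nvidia-smi
```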

### 3. Upload Package

```bash
# From your bastion machine
scp cloud-training-package/wan22-smoke-training.tar.gz root@<gpu-ip>:~/

# SSH into GPU instance
ssh root@<gpu-ip>

# Extract
tar xzf wan22-smoke-training.tar.gz
cd wan22-smoke-training
```
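
If the upload is interrupted or you want to rule out corruption, comparing checksums on both ends is a cheap safeguard (optional):

```bash
# On the bastion machine
sha256sum cloud-training-package/wan22-smoke-training.tar.gz

# On the GPU instance -- the two hashes should match
sha256sum ~/wan22-smoke-training.tar.gz
```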

### 4. Download Wan2.2 Model

```bash
# Install huggingface CLI
pip install huggingface-hub

# Login (accept terms first at https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B)
huggingface-cli login

# Download model (~15GB)
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir /models/Wan2.2-TI2V-5B
```
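
The download should come to roughly the advertised ~15GB; a quick way to confirm it completed and still fits in the instance's ~50GB disk (this only checks overall size, not individual files):

```bash
# Check the downloaded model size and remaining disk space
du -sh /models/Wan2.2-TI2V-5B
df -h /
```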

### 5. Run Training

```bash
bash launch.sh
```

Training will:
- Install Python dependencies (~5 min)
- Load Wan2.2 model (~2 min)
- Train for 1 epoch on 160 clips (~2-3 hours)
- Save LoRA checkpoints to `./output/`
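
Because a 2-3 hour job can outlive an SSH session, it is safer to launch it detached and follow the log instead; a minimal sketch (the `train.log` filename is arbitrary):

```bash
# Run training detached so an SSH disconnect doesn't kill it
nohup bash launch.sh > train.log 2>&1 &

# Follow progress
tail -f train.log

# In another terminal: watch GPU utilization and memory
watch -n 5 nvidia-smi
```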

### 6. Download Results

```bash
# From your local machine
scp root@<gpu-ip>:~/wan22-smoke-training/output/*.safetensors ./materials/trained-loras/

# Terminate GPU instance to stop billing
```
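
Before terminating the instance, confirm the checkpoints actually arrived intact, since the originals disappear with the instance (optional check):

```bash
# Verify the LoRA checkpoints landed locally and look plausible in size
ls -lh ./materials/trained-loras/*.safetensors

# Optional: compare checksums against the instance copies before terminating
ssh root@<gpu-ip> 'sha256sum ~/wan22-smoke-training/output/*.safetensors'
sha256sum ./materials/trained-loras/*.safetensors
```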

## Nix Integration

This approach is **pure Nix** because:
- ✅ Data preparation happens in Nix shell
- ✅ Package is reproducible (same inputs → same tarball)
- ✅ No CUDA/NVIDIA dependencies on NixOS
- ✅ Training environment is disposable cloud instance
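
For stronger reproducibility of the data-prep step itself, the `nix-shell` invocation from Step 1 can be pinned to an exact nixpkgs revision; a hedged sketch (the commit placeholder below is not a tested pin):

```bash
# Pin the data-prep environment to a specific nixpkgs revision (placeholder commit)
nix-shell -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/<nixpkgs-commit>.tar.gz \
  -p python3 --run "python3 tools/prepare-cloud-training-package.py"
```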

## Cost Tracking

Create a simple cost tracker:

```bash
cat > track-costs.sh << 'SCRIPT'
#!/bin/bash
START_TIME=$(date +%s)
HOURLY_RATE=${1:-0.70}  # Default $0.70/hr for RTX 4090

echo "Training started at $(date)"
echo "Hourly rate: \$${HOURLY_RATE}"
echo ""

while true; do
    NOW=$(date +%s)
    ELAPSED=$((NOW - START_TIME))
    HOURS=$(echo "scale=2; $ELAPSED / 3600" | bc)
    COST=$(echo "scale=2; $HOURS * $HOURLY_RATE" | bc)
    echo -ne "\rElapsed: ${HOURS}h  Cost: \$${COST}   "
    sleep 60
done
SCRIPT
chmod +x track-costs.sh

# Run in background while training
./track-costs.sh 0.70 &
```

## Next Steps After Smoke Test

1. **Validate LoRA quality** - Generate test videos
2. **Full pilot training** - Upload full 601-clip dataset
3. **Production training** - Scale to full episode pool

## Troubleshooting

### CUDA Out of Memory
Reduce batch size in `train.py`:
```python
batch_size = 1  # Already minimal
gradient_accumulation_steps = 8  # Increase from 4
```

### Download Too Slow
Use `aria2c` for parallel downloads:
```bash
# aria2c is a system package; "pip install aria2p" only installs a Python API wrapper
apt-get update && apt-get install -y aria2

# Download with up to 16 parallel connections/segments
aria2c -x 16 -s 16 <huggingface-url>
```
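
Alternatively, `huggingface_hub` ships an optional Rust-based transfer backend that often speeds up Hub downloads without needing direct file URLs (enabled via an environment variable):

```bash
# Use the hf_transfer backend for faster downloads from the Hub
pip install hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir /models/Wan2.2-TI2V-5B
```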

### Training Crashes
Check GPU memory:
```bash
watch -n 5 nvidia-smi
```

If memory usage stays above 95%, reduce the training resolution:
```bash
--height 384 --width 640
```
