# ML Fail-Fast Validation

Proof-of-concept (POC) validation patterns that catch issues before you commit to long-running ML experiments.

## When to Use This Skill

Use this skill when:

- Starting a new ML experiment that will run for hours
- Validating model architecture before full training
- Checking gradient flow and data pipeline integrity
- Implementing POC validation checklists
- Debugging prediction collapse or gradient explosion issues
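
To make the last item concrete, here is a minimal sketch of how prediction collapse and gradient explosion might be flagged. The function names and the thresholds (`min_std`, `max_norm`) are illustrative assumptions, not values prescribed by this skill; tune them for your model and loss scale.

```python
import math
import statistics


def check_prediction_collapse(predictions, min_std=1e-6):
    """Return True if predictions are effectively constant.

    min_std is a hypothetical threshold, not a recommended value.
    """
    if len(predictions) < 2:
        raise ValueError("Need at least two predictions")
    # Population standard deviation near zero means every output is the same.
    return statistics.pstdev(predictions) < min_std


def check_gradient_explosion(grad_norms, max_norm=1e3):
    """Return True if any gradient norm is NaN/Inf or exceeds max_norm.

    max_norm is a hypothetical threshold, not a recommended value.
    """
    return any((not math.isfinite(g)) or g > max_norm for g in grad_norms)
```

Both helpers take plain Python floats, so they can be fed values such as `out.flatten().tolist()` or a list of per-parameter gradient norms.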

---

## 1. Why Fail-Fast?

| Without Fail-Fast         | With Fail-Fast         |
| ------------------------- | ---------------------- |
| Discover crash 4 hours in | Catch in 30 seconds    |
| Debug from cryptic error  | Clear error message    |
| Lose GPU time             | Validate before launch |
| Silent data issues        | Explicit schema checks |

**Principle**: Validate everything that can go wrong BEFORE the expensive computation.
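
As one illustration of an explicit schema check with a clear error message, the sketch below validates a list of row dictionaries before any training starts. The column names in `REQUIRED_COLUMNS` and the helper `validate_rows` are hypothetical; substitute the fields your own pipeline expects.

```python
import math

# Hypothetical schema: column name -> expected Python type.
REQUIRED_COLUMNS = {"timestamp": str, "feature_a": float, "target": float}


def validate_rows(rows, schema=REQUIRED_COLUMNS):
    """Raise ValueError with a clear message on the first schema violation.

    Returns the number of validated rows.
    """
    if not rows:
        raise ValueError("Dataset is empty")
    for i, row in enumerate(rows):
        # Fail on missing columns before looking at values.
        missing = set(schema) - set(row)
        if missing:
            raise ValueError(f"Row {i}: missing columns {sorted(missing)}")
        for name, expected in schema.items():
            value = row[name]
            if not isinstance(value, expected):
                raise ValueError(
                    f"Row {i}: column '{name}' expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
            # NaN/Inf values would otherwise surface much later as a bad loss.
            if isinstance(value, float) and not math.isfinite(value):
                raise ValueError(
                    f"Row {i}: column '{name}' is not finite: {value}"
                )
    return len(rows)
```

Running a check like this on a small sample of the data costs seconds and turns a silent data issue into an immediate, readable failure.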

---

## 2. POC Validation Checklist

### Minimum Viable POC (5 Checks)

```python
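import numpy as np
import torch
import torch.nn.functional as F


def run_poc_validation():
    """Fast validation before the full experiment.

    Assumes create_model, architecture, seq_len, n_features, and device
    are defined by the surrounding experiment code.
    """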

    print("=" * 60)
    print("FAIL-FAST POC VALIDATION")
    print("=" * 60)

    # [1/5] Model instantiation
    print("\n[1/5] Model instantiation...")
    # Keep the model and the input on the same device to avoid a device mismatch.
    model = create_model(architecture, input_size=n_features).to(device)
    x = torch.randn(32, seq_len, n_features, device=device)
    out = model(x)
    assert out.shape == (32, 1), f"Output shape wrong: {out.shape}"
    print(f"   Input: (32, {seq_len}, {n_features}) -> Output: {out.shape}")
    print("   Status: PASS")

    # [2/5] Gradient flow
    print("\n[2/5] Gradient flow...")
    y = torch.randn(32, 1).to(device)
    loss = F.mse_loss(out, y)
    loss.backward()
    grad_norms = [p.grad.norm().item() for p in model.parameters() if p.grad is not None]
    assert len(grad_norms) > 0, "No gradients!"
    assert all(np.isfinite(g) for g in grad_norms), "NaN/Inf gradients!"
    print(f"   Max grad norm: {max(grad_norms):.4f}")
    print("   Status: PASS")

    # [3/5] NDJSON artifact validation
    print("\n[3/5] NDJSON artifact validation...")
```