gpu-optimization

Status: verified

Use this skill when optimizing GPU training efficiency. Covers memory optimization, mixed precision, gradient accumulation, model parallelism (TP/PP/DP), DeepSpeed, and FSDP integration.

Marketplace: everything-claude-code (yxbian23/ai-research-claude-code)
Plugin: everything-claude-code (workflow)
Repository: yxbian23/ai-research-claude-code
File: skills/gpu-optimization/SKILL.md
Last Verified: January 25, 2026

Install (via the add-skill CLI):

npx add-skill https://github.com/yxbian23/ai-research-claude-code/blob/main/skills/gpu-optimization/SKILL.md -a claude-code --skill gpu-optimization

Installation path (Claude): .claude/skills/gpu-optimization/

Instructions

# GPU Optimization

This skill provides comprehensive guidance for optimizing GPU training efficiency and handling large models.

## When to Activate

- Training runs out of GPU memory (see the memory check sketch after this list)
- Need to scale training to multiple GPUs
- Optimizing training throughput
- Implementing model parallelism
- Using DeepSpeed or FSDP
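
If out-of-memory errors are suspected but not yet confirmed, the built-in `torch.cuda` counters give a quick picture of how close training is to the device limit. A minimal sketch, assuming a single-GPU setup:

```python
import torch

# Compare what this process has allocated/reserved against the device total
device = torch.device("cuda")
allocated_gb = torch.cuda.memory_allocated(device) / 1024**3
reserved_gb = torch.cuda.memory_reserved(device) / 1024**3
total_gb = torch.cuda.get_device_properties(device).total_memory / 1024**3
print(f"allocated {allocated_gb:.1f} GB | reserved {reserved_gb:.1f} GB | total {total_gb:.1f} GB")
```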

## Memory Optimization Techniques

### 1. Mixed Precision Training

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# GradScaler applies dynamic loss scaling, which fp16 needs to avoid
# gradient underflow; bf16 does not need it (see the sketch below)
scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()

    # Forward pass in fp16 (matmuls and convs run in half precision inside autocast)
    with autocast(dtype=torch.float16):
        outputs = model(batch["input"])
        loss = criterion(outputs, batch["target"])

    # Backward pass with scaling
    scaler.scale(loss).backward()

    # Unscale for gradient clipping
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Optimizer step
    scaler.step(optimizer)
    scaler.update()
```
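
On GPUs that support bf16 (Ampere and newer), the loss scaler can be dropped entirely, since bf16 keeps fp32's exponent range. A minimal sketch, assuming the same `model`, `optimizer`, `criterion`, and `dataloader` objects as above:

```python
import torch
from torch.cuda.amp import autocast

for batch in dataloader:
    optimizer.zero_grad()

    # bf16 has fp32's exponent range, so no GradScaler / loss scaling is needed
    with autocast(dtype=torch.bfloat16):
        outputs = model(batch["input"])
        loss = criterion(outputs, batch["target"])

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```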

### 2. Gradient Checkpointing

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint, checkpoint_sequential

# TransformerBlock is assumed to be defined elsewhere in the codebase

class CheckpointedTransformer(nn.Module):
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerBlock(dim) for _ in range(num_layers)
        ])
        self.gradient_checkpointing = False

    def enable_gradient_checkpointing(self):
        self.gradient_checkpointing = True

    def forward(self, x):
        if self.gradient_checkpointing and self.training:
            # Checkpoint every layer
            for layer in self.layers:
                x = checkpoint(layer, x, use_reentrant=False)
        else:
            for layer in self.layers:
                x = layer(x)
        return x

# Alternative, inside forward(): checkpoint the stack in a few coarse segments
# instead of every layer (less recompute, slightly more memory)
x = checkpoint_sequential(self.layers, segments=4, input=x)
```
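
To confirm that checkpointing actually lowers peak memory, the `torch.cuda` peak-memory counters can be compared with checkpointing off and on. A rough sketch, assuming the `CheckpointedTransformer` above and a hypothetical `run_one_step()` closure that runs one forward/backward pass:

```python
import torch

def peak_memory_mb(step_fn) -> float:
    """Run one training step and return the peak GPU memory it used, in MB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    step_fn()
    return torch.cuda.max_memory_allocated() / 1024**2

baseline = peak_memory_mb(run_one_step)      # run_one_step is hypothetical
model.enable_gradient_checkpointing()
checkpointed = peak_memory_mb(run_one_step)
print(f"peak memory: {baseline:.0f} MB -> {checkpointed:.0f} MB")
```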

### 3. Gradient Accumulation

Accumulating gradients over several micro-batches simulates a larger effective batch size (micro-batch size × accumulation steps) without the extra memory cost.

```python
accumulation_steps = 4
optimizer.zero_grad()

for i, batch in enumerate(dataloader):
    # Forward pass; scale the loss so gradients average over micro-batches
    outputs = model(batch["input"])
    loss = criterion(outputs, batch["target"]) / accumulation_steps
    loss.backward()

    # Optimizer step only every accumulation_steps micro-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
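
Gradient accumulation combines naturally with the fp16 recipe from section 1: the scaled backward pass runs on every micro-batch, but unscaling, clipping, and the optimizer step happen only on accumulation boundaries. A minimal sketch, assuming the same `scaler`, `model`, `optimizer`, `criterion`, and `dataloader` objects as in the earlier examples:

```python
import torch
from torch.cuda.amp import autocast

accumulation_steps = 4
optimizer.zero_grad()

for i, batch in enumerate(dataloader):
    with autocast(dtype=torch.float16):
        outputs = model(batch["input"])
        loss = criterion(outputs, batch["target"]) / accumulation_steps

    # Accumulate scaled gradients across micro-batches
    scaler.scale(loss).backward()

    if (i + 1) % accumulation_steps == 0:
        # Unscale once per effective batch so gradient clipping sees true values
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```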
