Use this skill when optimizing GPU training efficiency. Covers memory optimization, mixed precision, gradient accumulation, model parallelism (TP/PP/DP), DeepSpeed, and FSDP integration.
# GPU Optimization
This skill provides comprehensive guidance for optimizing GPU training efficiency and handling large models.
## When to Activate
- Training runs out of GPU memory
- Need to scale training to multiple GPUs
- Optimizing training throughput
- Implementing model parallelism
- Using DeepSpeed or FSDP
## Memory Optimization Techniques
### 1. Mixed Precision Training
```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()

    # Forward pass in reduced precision (fp16/bf16)
    with autocast(dtype=torch.bfloat16):
        outputs = model(batch["input"])
        loss = criterion(outputs, batch["target"])

    # Backward pass with loss scaling (required for fp16; optional for bf16)
    scaler.scale(loss).backward()

    # Unscale before gradient clipping
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Optimizer step
    scaler.step(optimizer)
    scaler.update()
```
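GradScaler exists to compensate for fp16's narrow dynamic range; bf16 keeps fp32's exponent range, so loss scaling can usually be skipped. A minimal bf16-only sketch, assuming the same `model`, `optimizer`, `criterion`, and `dataloader` as above:

```python
# Simpler variant when bf16 is supported (e.g. Ampere or newer GPUs):
# no GradScaler, gradients stay in their natural scale.
for batch in dataloader:
    optimizer.zero_grad()
    with autocast(dtype=torch.bfloat16):
        loss = criterion(model(batch["input"]), batch["target"])
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```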
### 2. Gradient Checkpointing
```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint, checkpoint_sequential

class CheckpointedTransformer(nn.Module):
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        # TransformerBlock is assumed to be defined elsewhere
        self.layers = nn.ModuleList([
            TransformerBlock(dim) for _ in range(num_layers)
        ])
        self.gradient_checkpointing = False

    def enable_gradient_checkpointing(self):
        self.gradient_checkpointing = True

    def forward(self, x):
        if self.gradient_checkpointing and self.training:
            # Checkpoint every layer: activations are recomputed in backward
            for layer in self.layers:
                x = checkpoint(layer, x, use_reentrant=False)
            # Alternative: checkpoint coarser segments instead of every layer
            # x = checkpoint_sequential(self.layers, segments=4, input=x)
        else:
            for layer in self.layers:
                x = layer(x)
        return x
```
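A short usage sketch for the module above; the layer count and width are illustrative only. Checkpointing trades compute for memory: each checkpointed segment's forward is re-run during backward.

```python
# Hypothetical sizes, for illustration
model = CheckpointedTransformer(num_layers=24, dim=1024).cuda()
model.enable_gradient_checkpointing()

# Activation memory no longer scales with full depth; backward pays
# roughly one extra forward pass of compute in exchange.
```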
### 3. Gradient Accumulation
```python
accumulation_steps = 4
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    # Forward pass; divide the loss so gradients average over micro-batches
    outputs = model(batch["input"])
    loss = criterion(outputs, batch["target"]) / accumulation_steps
    loss.backward()

    # Step only once every accumulation_steps micro-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
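Gradient accumulation composes naturally with the mixed precision loop from section 1. A sketch of the combined pattern, assuming the same training objects as above (fp16 with GradScaler shown; with bf16 the scaler can typically be omitted):

```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
accumulation_steps = 4
optimizer.zero_grad()

for i, batch in enumerate(dataloader):
    with autocast(dtype=torch.float16):
        loss = criterion(model(batch["input"]), batch["target"])

    # Scale for fp16 safety and average over micro-batches
    scaler.scale(loss / accumulation_steps).backward()

    if (i + 1) % accumulation_steps == 0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```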