Optimize CUDA kernels and GPU code for performance. Use when reviewing CUDA code, analyzing performance, or suggesting GPU optimizations.
February 1, 2026
Installation:

```shell
npx add-skill https://github.com/keith-mvs/plugin-nsight-copilot/blob/main/./skills/cuda-optimization/SKILL.md -a claude-code --skill cuda-optimization
```

Installation path: `.claude/skills/cuda-optimization/`

# CUDA Optimization Skill
This skill helps optimize CUDA and GPU code for better performance.
## When I activate
I automatically activate when you:
- Review CUDA kernel code (`.cu` files)
- Ask about GPU performance or optimization
- Mention memory coalescing, occupancy, or shared memory
- Request profiling analysis or bottleneck identification
- Discuss parallel algorithm efficiency
## What I do
### Performance Analysis
I analyze CUDA code for:
- **Memory access patterns**: Detect uncoalesced accesses, bank conflicts
- **Thread configuration**: Evaluate block size, grid size, occupancy
- **Synchronization**: Check for unnecessary synchronization points
- **Memory hierarchy**: Assess use of shared memory, constant memory, texture memory
- **Warp efficiency**: Identify divergence and suboptimal thread utilization
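
As one illustration of the shared-memory and bank-conflict checks above, here is a minimal sketch of the classic padded-tile transpose (the `TILE` size and kernel name are illustrative, not part of this skill's fixed output):

```cuda
#define TILE 32

// Transposing a tile through shared memory: without the +1 padding,
// the column-wise reads below would hit the same bank for every
// thread in a warp, serializing the accesses.
__global__ void transposeTile(const float* in, float* out, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column breaks bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;  // swap block indices for transpose
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```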
### Optimization Suggestions
I provide specific recommendations:
- Coalescing memory accesses
- Optimal thread block configurations
- Shared memory usage patterns
- Reduction strategies
- Stream parallelism opportunities
- Compute-bound vs. memory-bound classification
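
The reduction and grid-stride suggestions above can be sketched with warp shuffle intrinsics (a sketch, assuming compute capability 3.0+; the kernel and helper names are illustrative):

```cuda
// Warp-level sum using shuffle intrinsics: no shared memory or
// __syncthreads() is needed for communication within a warp.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp's sum
}

// Block-level reduction built on the warp primitive above.
__global__ void reduceSum(const float* in, float* out, int n) {
    float sum = 0.0f;
    // Grid-stride loop: a fixed-size grid covers any input length
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += blockDim.x * gridDim.x)
        sum += in[i];
    sum = warpReduceSum(sum);
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, sum);  // one atomic per warp, not per thread
}
```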
### Code Examples
I show before/after code examples demonstrating:
```cuda
// ❌ Uncoalesced access: adjacent threads write elements `stride` apart,
// so each warp's writes span many memory segments
__global__ void slow(float* data, int stride) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    data[idx * stride] = idx;  // poor pattern
}

// ✅ Coalesced access: adjacent threads write adjacent elements,
// so each warp's writes combine into few memory transactions
__global__ void fast(float* data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    data[idx] = idx;  // sequential pattern
}
```
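For stream-parallelism opportunities, a typical before/after looks like the host-side sketch below, which overlaps host-to-device copies with kernel work (the names `process`, `d_buf`, `h_buf`, `numChunks`, and `chunkSize` are illustrative; `h_buf` must be pinned via `cudaMallocHost` for the async copies to actually overlap):

```cuda
cudaStream_t s[2];
for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

for (int chunk = 0; chunk < numChunks; ++chunk) {
    int i = chunk % 2;  // alternate streams so copy and compute overlap
    cudaMemcpyAsync(d_buf[i], h_buf + chunk * chunkSize,
                    chunkSize * sizeof(float),
                    cudaMemcpyHostToDevice, s[i]);
    process<<<blocks, threads, 0, s[i]>>>(d_buf[i], chunkSize);
}
cudaDeviceSynchronize();  // wait for all streams to finish
```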
## Tools I use
I leverage:
- NVIDIA Nsight Copilot (GPT-OSS-120B) for deep CUDA expertise
- Static code analysis for pattern detection
- Architecture-specific optimization knowledge (Ampere, Hopper, Ada)
- Best practices from CUDA Programming Guide
## Output format
My suggestions include:
1. **Issue identification** with line numbers
2. **Performance impact** estimate (low/medium/high)
3. **Specific fix** with code example
4. **Architecture notes** if relevant
I focus on actionable recommendations.