Provide architecture-specific GPU optimization advice for NVIDIA GPUs (Ampere, Hopper, Ada). Use when discussing GPU architecture features, compute capability, or hardware-specific optimizations.
View on GitHub: keith-mvs/plugin-nsight-copilot
nsight-copilot
February 1, 2026
Install with:

```shell
npx add-skill https://github.com/keith-mvs/plugin-nsight-copilot/blob/main/./skills/gpu-architecture-advisor/SKILL.md -a claude-code --skill gpu-architecture-advisor
```

Installation path:
`.claude/skills/gpu-architecture-advisor/`

# GPU Architecture Advisor Skill

This skill provides architecture-specific optimization guidance for NVIDIA GPUs.

## When I activate

I automatically activate when you:

- Mention specific GPU architectures (Ampere, Hopper, Ada, Turing)
- Ask about compute capability requirements
- Discuss Tensor Cores, RT Cores, or specialized hardware
- Need architecture-specific optimization advice
- Compare performance across GPU generations

## GPU Architectures I Know

### Hopper (Compute 9.0)

- **Tensor Cores**: 4th gen, FP8 support, Transformer Engine
- **Thread Block Clusters**: New hierarchy level between thread blocks and the grid
- **DPX instructions**: Dynamic programming acceleration
- **Async execution**: Enhanced asynchronous pipeline
- **L2 cache**: Larger and more configurable
- **Target GPUs**: H100, H200

### Ada Lovelace (Compute 8.9)

- **Tensor Cores**: 4th gen with FP8
- **Shader Execution Reordering**: Dynamic scheduling of divergent work
- **DLSS 3**: Optical flow acceleration
- **RT Cores**: 3rd gen ray tracing
- **Target GPUs**: RTX 4090, RTX 4080, L40

### Ampere (Compute 8.0, 8.6)

- **Tensor Cores**: 3rd gen, TF32 and BF16 support
- **Unified memory**: Improved page migration
- **Multi-Instance GPU**: Hardware partitioning
- **Async copy**: `cp.async` instructions that copy global memory to shared memory without staging through registers
- **Target GPUs**: A100, A30, RTX 3090, RTX 3080

### Turing (Compute 7.5)

- **Tensor Cores**: 2nd gen, INT8/INT4 support
- **RT Cores**: 1st gen ray tracing
- **Target GPUs**: RTX 2080, T4

## What I provide

### Architecture-Specific Optimization

I suggest optimizations leveraging:

- Tensor Core operations (WMMA, CUTLASS)
- Occupancy and launch-configuration tuning per architecture (warp size itself is fixed at 32)
- Cache hierarchy utilization
- Compute capability-specific features
- Memory bandwidth characteristics

### Code Examples

```cuda
// Ampere+ TF32 automatic conversion
#if __CUDA_ARCH__ >= 800
// TF32 mode automatically accelerates FP32 matrix math on Tensor Cores
#endif

// Asynchronous pipeline (requires #include <cuda/pipeline>;
// available from Ampere, compute 8.0, and enhanced on Hopper).
// Inside a kernel body:
#if __CUDA_ARCH__ >= 800
auto pipe = cuda::make_pipeline(); // per-thread (thread_scope_thread) pipeline
#endif
```
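The asynchronous-copy pattern can be expanded into a fuller kernel. Below is a minimal sketch for Ampere or newer (compute 8.0+) that stages one float per thread through shared memory with `cuda::memcpy_async`; the kernel name, `alpha` parameter, and sizing are illustrative, and launch configuration (grid/block sizes, dynamic shared memory) is assumed to be set by the caller:

```cuda
#include <cuda/pipeline>

// Stage one element per thread from global into shared memory via
// cp.async, then operate on the staged tile.
__global__ void scale_via_smem(const float* __restrict__ in,
                               float* __restrict__ out,
                               float alpha) {
    extern __shared__ float tile[];
    const size_t i = blockIdx.x * blockDim.x + threadIdx.x;

    // Per-thread pipeline: each thread manages its own async copy.
    auto pipe = cuda::make_pipeline();
    pipe.producer_acquire();
    cuda::memcpy_async(&tile[threadIdx.x], &in[i], sizeof(float), pipe);
    pipe.producer_commit();
    pipe.consumer_wait();   // wait for this thread's copy to land
    __syncthreads();        // make the whole tile visible to the block

    out[i] = alpha * tile[threadIdx.x];
    pipe.consumer_release();
}
```

On compute 8.0+ the copy bypasses registers entirely; on older architectures `cuda::memcpy_async` still compiles but falls back to ordinary loads and stores.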
### Feature Detection