Guide incident response from detection to post-mortem using SRE principles, severity classification, on-call management, blameless culture, and communication protocols. Use when setting up incident processes, designing escalation policies, or conducting post-mortems.
backend-ai-skills
February 1, 2026
npx add-skill https://github.com/ancoleman/ai-design-components/blob/main/skills/managing-incidents/SKILL.md -a claude-code --skill managing-incidents

Installation paths:
.claude/skills/managing-incidents/

# Incident Management

Provide end-to-end incident management guidance covering detection, response, communication, and learning. Emphasizes SRE culture, blameless post-mortems, and structured processes for high-reliability operations.

## When to Use This Skill

Apply this skill when:

- Setting up incident response processes for a team
- Designing on-call rotations and escalation policies
- Creating runbooks for common failure scenarios
- Conducting blameless post-mortems after incidents
- Implementing incident communication protocols (internal and external)
- Choosing incident management tooling and platforms
- Improving MTTR and incident frequency metrics

## Core Principles

### Incident Management Philosophy

**Declare Early and Often:** Do not wait for certainty. Declaring an incident enables coordination, can be downgraded if needed, and prevents delayed response.

**Mitigation First, Root Cause Later:** Stop customer impact immediately (rollback, disable feature, failover). Debug and fix root cause after stability restored.

**Blameless Culture:** Assume good intentions. Focus on how systems failed, not who failed. Create psychological safety for honest learning.

**Clear Command Structure:** Assign Incident Commander (IC) to own coordination. IC delegates tasks but does not do hands-on debugging.

**Communication is Critical:** Internal coordination via dedicated channels, external transparency via status pages. Update stakeholders every 15-30 minutes during critical incidents.
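
The 15-30 minute stakeholder update cadence can be tracked with a small helper. The following is a minimal sketch, not part of the skill itself: the names `MAX_UPDATE_INTERVAL`, `next_update_due`, and `is_update_overdue` are hypothetical, and the 30-minute default is an assumption taken from the upper bound of the stated range.

```python
from datetime import datetime, timedelta, timezone

# Assumption: use the upper bound of the 15-30 minute cadence as the default.
MAX_UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(
    last_update: datetime,
    interval: timedelta = MAX_UPDATE_INTERVAL,
) -> datetime:
    """Return the time by which the next stakeholder update should be sent."""
    return last_update + interval


def is_update_overdue(
    last_update: datetime,
    now: datetime,
    interval: timedelta = MAX_UPDATE_INTERVAL,
) -> bool:
    """Return True if more than `interval` has passed since the last update."""
    return now > next_update_due(last_update, interval)


if __name__ == "__main__":
    # Hypothetical timestamp for the last update sent during an incident.
    last = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    now = last + timedelta(minutes=40)
    print("Next update due:", next_update_due(last).isoformat())
    print("Overdue:", is_update_overdue(last, now))
```

An Incident Commander or a chat bot could call `is_update_overdue` on a timer and prompt for a status post. A tighter interval such as `timedelta(minutes=15)` may suit the most severe incidents.
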
## Severity Classification

Standard severity levels with response times:

**SEV0 (P0) - Critical Outage:**

- Impact: Complete service outage, critical data loss, payment processing down
- Response: Page immediately 24/7, all hands on deck, executive notification
- Example: API completely down, entire customer base affected

**SEV1 (P1) - Major Degradation:**

- Impact: Major functionality degraded, significant customer subset affected
- Response: Page during business hours, escalate