Use when responding to production incidents following SRE principles and best practices.
View on GitHub: TheBushidoCollective/han
do/do-site-reliability-engineering/skills/sre-incident-response/SKILL.md
January 24, 2026
npx add-skill https://github.com/TheBushidoCollective/han/blob/main/do/do-site-reliability-engineering/skills/sre-incident-response/SKILL.md -a claude-code --skill sre-incident-response

Installation paths:
.claude/skills/sre-incident-response/

# SRE Incident Response

Managing incidents and conducting effective postmortems.

## Incident Severity Levels

### P0 - Critical

- **Impact**: Service completely down or major functionality unavailable
- **Response**: Immediate, all-hands
- **Communication**: Every 30 minutes
- **Examples**: Complete outage, data loss, security breach

### P1 - High

- **Impact**: Significant degradation affecting many users
- **Response**: Immediate, primary on-call
- **Communication**: Every hour
- **Examples**: Elevated error rates, slow response times

### P2 - Medium

- **Impact**: Minor degradation or single component affected
- **Response**: Next business day
- **Communication**: Daily updates
- **Examples**: Single region issue, non-critical feature down

### P3 - Low

- **Impact**: No user impact yet, potential future issue
- **Response**: Track in backlog
- **Communication**: Async
- **Examples**: Monitoring gaps, capacity warnings

## Incident Response Process

### 1. Detection

```
Alert fires → On-call acknowledges → Initial assessment
```

### 2. Triage

```
- Assess severity
- Page additional responders if needed
- Establish incident channel
- Assign incident commander
```

### 3. Mitigation

```
- Identify mitigation options
- Execute fastest safe mitigation
- Monitor for improvement
- Escalate if not improving
```

### 4. Resolution

```
- Verify service health
- Communicate resolution
- Document actions taken
- Schedule postmortem
```

### 5. Follow-up

```
- Conduct postmortem
- Identify action items
- Track completion
- Update runbooks
```

## Incident Roles

### Incident Commander (IC)

- Owns incident response
- Makes decisions
- Coordinates responders
- Manages communication
- Declares incident resolved

### Operations Lead

- Executes technical remediation
- Proposes mitigation strategies
- Implements fixes
- Tests changes

### Communications Lead

- Updates status page
- Posts to incident channel
- Notifies stakeholders
- Prepares external messaging

### P