
sre-incident-response

verified

Use when responding to production incidents following SRE principles and best practices.


Marketplace

han

TheBushidoCollective/han

Plugin

do-site-reliability-engineering

Discipline

Repository

TheBushidoCollective/han
60 stars

do/do-site-reliability-engineering/skills/sre-incident-response/SKILL.md

Last Verified

January 24, 2026


# SRE Incident Response

Guidance for managing production incidents and conducting effective postmortems.

## Incident Severity Levels

### P0 - Critical

- **Impact**: Service completely down or major functionality unavailable
- **Response**: Immediate, all-hands
- **Communication**: Status updates every 30 minutes
- **Examples**: Complete outage, data loss, security breach

### P1 - High

- **Impact**: Significant degradation affecting many users
- **Response**: Immediate, primary on-call
- **Communication**: Status updates every hour
- **Examples**: Elevated error rates, slow response times

### P2 - Medium

- **Impact**: Minor degradation or single component affected
- **Response**: Next business day
- **Communication**: Daily updates
- **Examples**: Single region issue, non-critical feature down

### P3 - Low

- **Impact**: No user impact yet, but a potential future issue
- **Response**: Track in backlog
- **Communication**: Async
- **Examples**: Monitoring gaps, capacity warnings
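
The severity levels above can be encoded as data so that tooling can look up the expected response and update cadence. The following is a minimal sketch, not a definitive implementation: `Severity`, `SeverityPolicy`, `POLICIES`, and `next_update_due` are hypothetical names introduced for illustration, and the intervals simply mirror the Communication lines above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    P0 = "critical"
    P1 = "high"
    P2 = "medium"
    P3 = "low"


@dataclass(frozen=True)
class SeverityPolicy:
    response: str
    update_interval: Optional[timedelta]  # None means async updates, no fixed cadence
    pages_immediately: bool


# Values copied from the severity table above; the structure itself is an assumption.
POLICIES = {
    Severity.P0: SeverityPolicy("Immediate, all-hands", timedelta(minutes=30), True),
    Severity.P1: SeverityPolicy("Immediate, primary on-call", timedelta(hours=1), True),
    Severity.P2: SeverityPolicy("Next business day", timedelta(days=1), False),
    Severity.P3: SeverityPolicy("Track in backlog", None, False),
}


def next_update_due(severity: Severity, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None for async severities."""
    interval = POLICIES[severity].update_interval
    return None if interval is None else last_update + interval
```

For example, `next_update_due(Severity.P0, last_update)` returns a time 30 minutes after `last_update`, while a P3 incident returns `None` because its updates are async.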

## Incident Response Process

### 1. Detection

```
Alert fires → On-call acknowledges → Initial assessment
```

### 2. Triage

```
- Assess severity
- Page additional responders if needed
- Establish incident channel
- Assign incident commander
```
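
The "assess severity" step of triage could be sketched as a small classifier over the impact descriptions in the severity table. This is an illustration under stated assumptions: the `Signals` fields, the `assess_severity` function, and the `MANY_USERS_PCT` threshold are all hypothetical, and a real team would tune the threshold to its own services.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    service_down: bool = False
    data_loss: bool = False
    security_breach: bool = False
    users_affected_pct: float = 0.0  # share of users seeing errors or slowness


# Hypothetical cutoff for "affecting many users"; not taken from the source.
MANY_USERS_PCT = 10.0


def assess_severity(s: Signals) -> str:
    """Map observed signals to a P0-P3 level, most severe condition first."""
    if s.service_down or s.data_loss or s.security_breach:
        return "P0"
    if s.users_affected_pct >= MANY_USERS_PCT:
        return "P1"
    if s.users_affected_pct > 0:
        return "P2"
    return "P3"  # no user impact yet
```

Checking the most severe conditions first means an incident is never under-classified just because a milder condition also holds.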

### 3. Mitigation

```
- Identify mitigation options
- Execute fastest safe mitigation
- Monitor for improvement
- Escalate if not improving
```
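
The mitigation steps above form a loop: apply the fastest safe option, watch for improvement, and escalate if nothing helps. A minimal sketch follows; `run_mitigations` and its callback parameters are hypothetical, and a real implementation would wait between health checks instead of polling back to back.

```python
from typing import Callable, Iterable, Optional, Tuple

# A mitigation is a (name, action) pair; callers order them fastest safe option first.
Mitigation = Tuple[str, Callable[[], None]]


def run_mitigations(
    options: Iterable[Mitigation],
    is_improving: Callable[[], bool],
    escalate: Callable[[], None],
    checks: int = 3,
) -> Optional[str]:
    """Apply options in order; return the name of the first one that helps.

    Calls escalate() and returns None if no option leads to improvement.
    """
    for name, apply in options:
        apply()
        # Monitor for improvement; production code would sleep between checks.
        for _ in range(checks):
            if is_improving():
                return name
    escalate()
    return None
```

Passing the health check and escalation as callbacks keeps the loop independent of any particular monitoring or paging system.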

### 4. Resolution

```
- Verify service health
- Communicate resolution
- Document actions taken
- Schedule postmortem
```

### 5. Follow-up

```
- Conduct postmortem
- Identify action items
- Track completion
- Update runbooks
```
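
Tracking completion of postmortem action items can be as simple as a list of records with an owner and a due date. The sketch below is one possible shape, with `ActionItem`, `overdue`, and `completion_rate` as hypothetical names; it is not tied to any particular ticketing system.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Return unfinished items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


def completion_rate(items: List[ActionItem]) -> float:
    """Fraction of items completed; an empty list counts as fully complete."""
    if not items:
        return 1.0
    return sum(item.done for item in items) / len(items)
```

Reviewing the `overdue` list on a regular schedule is one way to keep follow-up work from stalling after the incident is closed.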

## Incident Roles

### Incident Commander (IC)

- Owns incident response
- Makes decisions
- Coordinates responders
- Manages communication
- Declares incident resolved

### Operations Lead

- Executes technical remediation
- Proposes mitigation strategies
- Implements fixes
- Tests changes

### Communications Lead

- Updates status page
- Posts to incident channel
- Notifies stakeholders
- Prepares external messaging

### P
