
runbooks-incident-response

verified

Use when creating incident response procedures and on-call playbooks. Covers incident management, communication protocols, and post-mortem documentation.


Marketplace

han

TheBushidoCollective/han

Plugin

jutsu-runbooks

Technique

Repository

TheBushidoCollective/han
60 stars

jutsu/jutsu-runbooks/skills/incident-response/SKILL.md

Last Verified

January 24, 2026

Install Skill

npx add-skill https://github.com/TheBushidoCollective/han/blob/main/jutsu/jutsu-runbooks/skills/incident-response/SKILL.md -a claude-code --skill runbooks-incident-response

Installation paths:

Claude
.claude/skills/runbooks-incident-response/

Instructions

# Runbooks - Incident Response

How to create effective incident response procedures for production incidents and on-call scenarios.

## Incident Response Framework

### Incident Severity Levels

**SEV-1 (Critical)**

- Complete service outage
- Data loss or security breach
- Major customer impact (>50% of users)
- **Response Time:** Immediate
- **Escalation:** Page on-call + manager

**SEV-2 (High)**

- Partial service degradation
- Affects a significant share of users (10-50%)
- Performance issues (>50% slower)
- **Response Time:** Within 15 minutes
- **Escalation:** Page on-call

**SEV-3 (Medium)**

- Minor degradation
- Affects few users (<10%)
- Non-critical features broken
- **Response Time:** Within 1 hour
- **Escalation:** On-call handles during business hours

**SEV-4 (Low)**

- Cosmetic issues
- Internal tools affected
- No customer impact
- **Response Time:** Next business day
- **Escalation:** Create ticket, no page
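
The user-impact thresholds above can be sketched as a small shell helper. This is a minimal illustration, not part of the skill itself: the function name `classify_severity` and the optional `critical` flag are hypothetical, and it assumes the only inputs are the percentage of affected users and whether the incident is a complete outage, data loss, or security breach. Real triage also weighs factors such as performance degradation and which features are broken.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original skill): map the share of
# affected users to a severity level using the thresholds listed above.
# Usage: classify_severity <percent_users_affected> [critical]
classify_severity() {
  local pct="$1"
  local critical="${2:-}"

  # Accept only non-negative integers.
  if ! [[ "$pct" =~ ^[0-9]+$ ]]; then
    echo "usage: classify_severity <0-100> [critical]" >&2
    return 1
  fi

  # Force base 10 so inputs like "08" are not parsed as octal.
  pct=$((10#$pct))
  if (( pct > 100 )); then
    echo "percent must be between 0 and 100" >&2
    return 1
  fi

  # Complete outage, data loss, or security breach: SEV-1 regardless of reach.
  if [[ "$critical" == "critical" ]] || (( pct > 50 )); then
    echo "SEV-1"
  elif (( pct >= 10 )); then
    echo "SEV-2"   # significant share of users (10-50%)
  elif (( pct > 0 )); then
    echo "SEV-3"   # few users (<10%)
  else
    echo "SEV-4"   # no customer impact
  fi
}
```

For example, `classify_severity 30` prints `SEV-2`, and `classify_severity 5 critical` prints `SEV-1` because the critical flag overrides the percentage.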

## Incident Response Template

```markdown
# Incident Response: [Alert/Issue Name]

**Severity:** SEV-1/SEV-2/SEV-3/SEV-4
**Response Time:** Immediate / 15 min / 1 hour / Next day
**Owner:** On-call Engineer

## Incident Detection

**This runbook is triggered by:**
- PagerDuty alert: `api_error_rate_high`
- Customer report in #support
- Monitoring dashboard showing anomaly

## Initial Response (First 5 Minutes)

### 1. Acknowledge & Assess

```bash
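# Check the API health endpoint (-f: fail on HTTP errors, -sS: quiet, but still print errors)
curl -fsS https://api.example.com/health
# List pods in the production namespace to spot crashes or restarts
kubectl get pods -n production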
```

**Determine severity:**

- All requests failing → SEV-1
- Partial failures → SEV-2
- Performance degraded → SEV-3

### 2. Notify Stakeholders

**SEV-1:**

- Create Slack incident channel: `/incident create SEV-1 API Outage`
- Page engineering manager
- Notify customer success team

**SEV-2:**

- Post in #incidents channel
- Tag on-call team

**SEV-3:**

- Post in #engineering channel
- No pages needed

### 3. Start Incident Timeline

Create incident doc (copy template):

```
Incident: API Outage
Started: 20
