Use when designing incident management processes, creating runbooks, or establishing on-call practices. Covers incident lifecycle, communication, and postmortems.
View on GitHubmelodic-software/claude-code-plugins
systems-design
January 21, 2026
Select agents to install to:
npx add-skill https://github.com/melodic-software/claude-code-plugins/blob/main/plugins/systems-design/skills/incident-response/SKILL.md -a claude-code --skill incident-responseInstallation paths:
.claude/skills/incident-response/# Incident Response Patterns and practices for effective incident management, from detection through postmortem. ## When to Use This Skill - Designing incident response processes - Creating incident runbooks - Establishing on-call rotations - Running effective postmortems - Improving mean time to recovery (MTTR) ## Incident Lifecycle ```text ┌─────────────────────────────────────────────────────────┐ │ INCIDENT LIFECYCLE │ │ │ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │ │ Detect │─►│ Respond │─►│ Recover │─►│ Learn │ │ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ │ │ Alerting Triage & Mitigation Postmortem │ │ Monitoring Diagnosis Remediation Action Items │ └─────────────────────────────────────────────────────────┘ ``` ### Key Metrics ```text MTTD - Mean Time to Detect └── Time from incident start to detection MTTA - Mean Time to Acknowledge └── Time from alert to human acknowledgment MTTR - Mean Time to Recover └── Time from detection to resolution MTTF - Mean Time to Failure └── Time between incidents (reliability) Goal: Minimize MTTD + MTTA + MTTR ``` ## Incident Severity ### Severity Levels ```text SEV 1 - Critical ├── Complete outage ├── Data loss or security breach ├── All/most users affected ├── Response: Immediate (24/7) └── Example: Production database down SEV 2 - High ├── Major functionality impaired ├── Significant user impact ├── Workaround may exist ├── Response: Urgent (business hours++) └── Example: Payment processing degraded SEV 3 - Medium ├── Partial functionality affected ├── Limited user impact ├── Workaround available ├── Response: Normal priority └── Example: Report generation slow SEV 4 - Low ├── Minor issue ├── Minimal user impact ├── Response: Best effort └