Use when: "incident", "outage", "production issue", "debug", "logs", "error", "postmortem", "root cause", "triage", "escalation", "P0", "P1", "on-call", "mitigation", "rollback".
View on GitHubboshu2/agentops
domain-kit
January 24, 2026
Select agents to install to:
npx add-skill https://github.com/boshu2/agentops/blob/main/plugins/domain-kit/skills/operations/SKILL.md -a claude-code --skill operationsInstallation paths:
.claude/skills/operations/# Operations Skill Incident response, debugging, and postmortem patterns. ## Quick Reference | Area | Key Patterns | When to Use | |------|--------------|-------------| | **Incident Response** | Triage, mitigation, communication | Production issues | | **Structured Response** | Procedures, escalation | Incident management | | **Postmortems** | Root cause, blameless | After incidents | | **Error Detection** | Log analysis, patterns | Debugging | --- ## Incident Response ### Severity Levels | Level | Impact | Response Time | Examples | |-------|--------|---------------|----------| | **P0** | Complete outage | Immediate | Site down, data loss | | **P1** | Major functionality broken | < 1 hour | Core feature broken | | **P2** | Significant issues | < 4 hours | Degraded performance | | **P3** | Minor issues | Next business day | UI bugs, minor errors | ### Immediate Actions (First 5 Minutes) 1. **Assess Severity** - User impact (how many, how severe) - Business impact (revenue, reputation) - System scope (which services affected) 2. **Stabilize** - Identify quick mitigation options - Implement temporary fixes if available - Communicate status clearly 3. **Gather Data** - Recent deployments or changes - Error logs and metrics - Similar past incidents ### Investigation Protocol ```markdown ## Log Analysis 1. Start with error aggregation 2. Identify error patterns 3. Trace to root cause 4. Check cascading failures ## Quick Fixes (in order of preference) 1. Rollback if recent deployment 2. Increase resources if load-related 3. Disable problematic features 4. Implement circuit breakers ## Communication - Brief status updates every 15 minutes - Technical details for engineers - Business impact for stakeholders - ETA when reasonable to estimate ``` ### Fix Implementation 1. Minimal viable fix first 2. Test in staging if possible 3. Roll out with monitoring 4. Prepare rollback plan 5. Document changes made --- ## Structured Incident