operations

verified

Use when: "incident", "outage", "production issue", "debug", "logs", "error", "postmortem", "root cause", "triage", "escalation", "P0", "P1", "on-call", "mitigation", "rollback".

Marketplace

agentops-marketplace

boshu2/agentops

Plugin

domain-kit

development

Repository

boshu2/agentops
6 stars

plugins/domain-kit/skills/operations/SKILL.md

Last Verified

January 24, 2026

Install Skill

npx add-skill https://github.com/boshu2/agentops/blob/main/plugins/domain-kit/skills/operations/SKILL.md -a claude-code --skill operations

Installation paths:

Claude
.claude/skills/operations/

Instructions

# Operations Skill

Incident response, debugging, and postmortem patterns.

## Quick Reference

| Area | Key Patterns | When to Use |
|------|--------------|-------------|
| **Incident Response** | Triage, mitigation, communication | Production issues |
| **Structured Response** | Procedures, escalation | Incident management |
| **Postmortems** | Root cause, blameless | After incidents |
| **Error Detection** | Log analysis, patterns | Debugging |

---

## Incident Response

### Severity Levels

| Level | Impact | Response Time | Examples |
|-------|--------|---------------|----------|
| **P0** | Complete outage | Immediate | Site down, data loss |
| **P1** | Major functionality broken | < 1 hour | Core feature broken |
| **P2** | Significant issues | < 4 hours | Degraded performance |
| **P3** | Minor issues | Next business day | UI bugs, minor errors |
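
The table above can be encoded as data so that tooling and humans agree on the response target. The following is a minimal sketch, not part of the skill itself: the `Severity` enum, the `RESPONSE_TARGET` mapping, and the boolean inputs to `classify` are all hypothetical names chosen for illustration.

```python
from datetime import timedelta
from enum import Enum
from typing import Dict, Optional


class Severity(Enum):
    P0 = "P0"  # Complete outage
    P1 = "P1"  # Major functionality broken
    P2 = "P2"  # Significant issues
    P3 = "P3"  # Minor issues


# Response targets taken from the table. "Immediate" is modeled as zero delay.
# "Next business day" is not a fixed duration, so it is left as None here.
RESPONSE_TARGET: Dict[Severity, Optional[timedelta]] = {
    Severity.P0: timedelta(0),
    Severity.P1: timedelta(hours=1),
    Severity.P2: timedelta(hours=4),
    Severity.P3: None,
}


def classify(complete_outage: bool, core_feature_broken: bool, degraded: bool) -> Severity:
    """Pick the most severe level whose condition holds (assumed inputs)."""
    if complete_outage:
        return Severity.P0
    if core_feature_broken:
        return Severity.P1
    if degraded:
        return Severity.P2
    return Severity.P3
```

Checking the conditions from most to least severe means an incident that matches several rows is always assigned the highest applicable level.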

### Immediate Actions (First 5 Minutes)

1. **Assess Severity**
   - User impact (how many, how severe)
   - Business impact (revenue, reputation)
   - System scope (which services affected)

2. **Stabilize**
   - Identify quick mitigation options
   - Implement temporary fixes if available
   - Communicate status clearly

3. **Gather Data**
   - Recent deployments or changes
   - Error logs and metrics
   - Similar past incidents
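
For the "Gather Data" step, one common first question is which deployments landed on the affected services shortly before the incident began. The sketch below assumes a hypothetical `Deployment` record and a 24-hour lookback window; neither comes from the skill, and a real deployment history would come from your own CI/CD system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Deployment:
    # Hypothetical record shape, for illustration only.
    service: str
    version: str
    deployed_at: datetime


def recent_changes(
    deployments: Iterable[Deployment],
    affected_services: Set[str],
    incident_start: datetime,
    lookback: timedelta = timedelta(hours=24),  # assumed window
) -> List[Deployment]:
    """Return deployments to affected services inside the lookback window,
    newest first, since the most recent change is the usual first suspect."""
    window_start = incident_start - lookback
    suspects = [
        d
        for d in deployments
        if d.service in affected_services
        and window_start <= d.deployed_at <= incident_start
    ]
    return sorted(suspects, key=lambda d: d.deployed_at, reverse=True)
```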

### Investigation Protocol

```markdown
## Log Analysis
1. Start with error aggregation
2. Identify error patterns
3. Trace to root cause
4. Check cascading failures

## Quick Fixes (in order of preference)
1. Rollback if recent deployment
2. Increase resources if load-related
3. Disable problematic features
4. Implement circuit breakers

## Communication
- Brief status updates every 15 minutes
- Technical details for engineers
- Business impact for stakeholders
- ETA when reasonable to estimate
```
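
The first two log-analysis steps (aggregate errors, then identify patterns) can be sketched as follows. This assumes plain-text log lines that contain the literal marker `ERROR`, and it groups lines by replacing every run of digits with a placeholder; both are simplifying assumptions, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

_NUMBER = re.compile(r"\d+")


def error_signature(line: str) -> str:
    """Collapse variable parts (IDs, durations, counts) so that similar
    errors share one signature."""
    return _NUMBER.sub("<n>", line.strip())


def aggregate_errors(lines: Iterable[str], top: int = 5) -> List[Tuple[str, int]]:
    """Count error lines by signature and return the most frequent ones."""
    counts = Counter(error_signature(line) for line in lines if "ERROR" in line)
    return counts.most_common(top)


logs = [
    "ERROR timeout after 30 ms on shard 4",
    "INFO request served",
    "ERROR timeout after 45 ms on shard 7",
    "ERROR disk full",
]
for signature, count in aggregate_errors(logs):
    print(count, signature)
```

Sorting by frequency surfaces the dominant pattern first, which is usually where tracing to root cause should begin.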

### Fix Implementation

1. Minimal viable fix first
2. Test in staging if possible
3. Prepare rollback plan
4. Roll out with monitoring
5. Document changes made
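
A staged rollout that monitors an error rate and rolls back when it crosses a threshold might look like the sketch below. The callables `apply_stage`, `error_rate`, and `rollback` are hypothetical hooks that you would wire to your own deployment and metrics tooling, and the 1% threshold is an assumed default, not a recommendation from the skill.

```python
from typing import Callable, List, Sequence, Tuple


def rollout_with_monitoring(
    stages: Sequence[int],                # traffic percentages, e.g. [10, 50, 100]
    apply_stage: Callable[[int], None],   # hypothetical: shift traffic to the fix
    error_rate: Callable[[], float],      # hypothetical: current error fraction
    rollback: Callable[[], None],         # hypothetical: revert to the last good state
    max_error_rate: float = 0.01,         # assumed threshold
) -> Tuple[bool, List[str]]:
    """Apply each stage in order, check the error rate after each one, and
    roll back at the first stage that exceeds the threshold. Returns a
    success flag and a log of what happened, for later documentation."""
    log: List[str] = []
    for pct in stages:
        apply_stage(pct)
        rate = error_rate()
        log.append(f"{pct}%: error rate {rate:.3f}")
        if rate > max_error_rate:
            rollback()
            log.append("rolled back")
            return False, log
    return True, log
```

Returning the log alongside the result keeps a record of the changes made during the rollout.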

---

## Structured Incident 

Validation Details

Instruction Length: 6719 chars