Back to Skills

monitoring

verified

Use when: "monitoring", "observability", "metrics", "tracing", "OpenTelemetry", "alerts", "runbooks", "SLO", "SLI", "dashboards", "Prometheus", "Grafana", "performance", "latency", "throughput", "Core Web Vitals", "load testing".

View on GitHub

Marketplace

agentops-marketplace

boshu2/agentops

Plugin

domain-kit

development

Repository

boshu2/agentops
6stars

plugins/domain-kit/skills/monitoring/SKILL.md

Last Verified

January 24, 2026

Install Skill

Select agents to install to:

Scope:
npx add-skill https://github.com/boshu2/agentops/blob/main/plugins/domain-kit/skills/monitoring/SKILL.md -a claude-code --skill monitoring

Installation paths:

Claude
.claude/skills/monitoring/
Powered by add-skill CLI

Instructions

# Monitoring Skill

Observability, alerting, and performance engineering patterns.

## Quick Reference

| Area | Key Patterns | When to Use |
|------|--------------|-------------|
| **Alerts & Runbooks** | Actionable alerts, runbook linking | Alert setup |
| **Performance** | OpenTelemetry, tracing, load testing | Optimization |

---

## Alerting Best Practices

### Alert Design Principles

| Principle | Good | Bad |
|-----------|------|-----|
| **Actionable** | "Database connections > 90%" | "Something wrong" |
| **Specific** | "Order service latency p99 > 500ms" | "Slow" |
| **Documented** | Links to runbook | No context |
| **Appropriate** | Pages for real issues | Alert fatigue |

### Alert Severity Levels

| Level | Response | Examples |
|-------|----------|----------|
| **Critical** | Page immediately | Service down, data loss |
| **Warning** | Check within 1 hour | Degraded performance |
| **Info** | Review daily | Capacity planning |

### Alert Template

```yaml
# Prometheus alert example
groups:
  - name: service-alerts
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.99, http_request_duration_seconds_bucket) > 0.5
        for: 5m
        labels:
          severity: warning
          service: api
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "p99 latency is {{ $value }}s (threshold: 0.5s)"
          runbook_url: "https://wiki/runbooks/high-latency"
          dashboard_url: "https://grafana/d/api-latency"
```

### Runbook-Linked Alerts

Every alert should link to:
1. **Runbook** - What to do
2. **Dashboard** - Where to look
3. **Escalation** - Who to contact

```markdown
## Alert: HighLatency

### Quick Check
- Dashboard: [link]
- Recent deploys: `git log --since="1 hour ago"`

### Common Causes
1. **High traffic** → Scale horizontally
2. **Database slow** → Check connection pool
3. **Upstream delay** → Check dependency

### Resolution
[Step-by-step instructions]

### Escalation

Validation Details

Front Matter
Required Fields
Valid Name Format
Valid Description
Has Sections
Allowed Tools
Instruction Length:
6557 chars