# Monitoring Skill

Observability, alerting, and performance engineering patterns.

## Quick Reference

| Area | Key Patterns | When to Use |
|------|--------------|-------------|
| **Alerts & Runbooks** | Actionable alerts, runbook linking | Alert setup |
| **Performance** | OpenTelemetry, tracing, load testing | Optimization |

---

## Alerting Best Practices

### Alert Design Principles

| Principle | Good | Bad |
|-----------|------|-----|
| **Actionable** | "Database connections > 90%" | "Something wrong" |
| **Specific** | "Order service latency p99 > 500ms" | "Slow" |
| **Documented** | Links to runbook | No context |
| **Appropriate** | Pages only for issues that need immediate action | Pages for every blip, causing alert fatigue |

### Alert Severity Levels

| Level | Response | Examples |
|-------|----------|----------|
| **Critical** | Page immediately | Service down, data loss |
| **Warning** | Check within 1 hour | Degraded performance |
| **Info** | Review daily | Capacity planning |
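
One way to enforce these levels is to route on the `severity` label in Alertmanager. The fragment below is a minimal sketch, not a complete configuration: the receiver names (`oncall-pager`, `team-chat`, `daily-digest`, `team-inbox`) are hypothetical, the receivers have no integrations attached, and the intervals are assumptions chosen to roughly match the response times in the table.

```yaml
# Alertmanager routing sketch: one child route per severity level.
route:
  receiver: team-inbox              # fallback for alerts without a known severity
  group_by: ['alertname', 'service']
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall-pager        # page immediately
      group_wait: 10s
      repeat_interval: 1h
    - matchers:
        - severity="warning"
      receiver: team-chat           # check within 1 hour
      repeat_interval: 4h
    - matchers:
        - severity="info"
      receiver: daily-digest        # review daily
      group_interval: 24h
      repeat_interval: 24h

# Hypothetical receivers; attach pager, chat, or email integrations as needed.
receivers:
  - name: team-inbox
  - name: oncall-pager
  - name: team-chat
  - name: daily-digest
```

Keeping the severity decision in the alert rule's labels and the delivery decision in the routing tree means an alert can be promoted or demoted by changing a single label.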

### Alert Template

```yaml
# Prometheus alert example
groups:
  - name: service-alerts
    rules:
      - alert: HighLatency
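        # Bucket series are cumulative counters, so take rate() over a window and
        # aggregate by `le` before computing the quantile. Assumes the metric
        # carries a `service` label.
        expr: histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5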
        for: 5m
        labels:
          severity: warning
          service: api
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "p99 latency is {{ $value }}s (threshold: 0.5s)"
          runbook_url: "https://wiki/runbooks/high-latency"
          dashboard_url: "https://grafana/d/api-latency"
```

### Runbook-Linked Alerts

Every alert should link to:
1. **Runbook** - What to do
2. **Dashboard** - Where to look
3. **Escalation** - Who to contact
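
As a rough illustration, all three links can travel with the alert as annotations. This is a sketch under assumptions: the `escalation` annotation key is a convention invented here rather than a Prometheus built-in, the URLs and contact are placeholders, and the scrape job is assumed to be named `api`.

```yaml
# Sketch of a critical alert that carries runbook, dashboard, and escalation links.
groups:
  - name: service-availability
    rules:
      - alert: ServiceDown
        # `up` is recorded by Prometheus for every scrape target; 0 means the last scrape failed.
        expr: up{job="api"} == 0
        for: 2m
        labels:
          severity: critical
          service: api
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"
          runbook_url: "https://wiki/runbooks/service-down"     # hypothetical URL
          dashboard_url: "https://grafana/d/api-availability"   # hypothetical URL
          escalation: "#api-oncall, then the API team lead"     # hypothetical contact
```

Annotations are free-form key/value pairs, so notification templates can surface any of them alongside the summary.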

```markdown
## Alert: HighLatency

### Quick Check
- Dashboard: [link]
- Recent deploys: `git log --since="1 hour ago"`

### Common Causes
1. **High traffic** → Scale horizontally
2. **Database slow** → Check connection pool
3. **Upstream delay** → Check dependency

### Resolution
[Step-by-step instructions]

### Escalation
[Who to contact and when]
```