Use when: "monitoring", "observability", "metrics", "tracing", "OpenTelemetry", "alerts", "runbooks", "SLO", "SLI", "dashboards", "Prometheus", "Grafana", "performance", "latency", "throughput", "Core Web Vitals", "load testing".
View on GitHubboshu2/agentops
domain-kit
January 24, 2026
Select agents to install to:
npx add-skill https://github.com/boshu2/agentops/blob/main/plugins/domain-kit/skills/monitoring/SKILL.md -a claude-code --skill monitoringInstallation paths:
.claude/skills/monitoring/# Monitoring Skill
Observability, alerting, and performance engineering patterns.
## Quick Reference
| Area | Key Patterns | When to Use |
|------|--------------|-------------|
| **Alerts & Runbooks** | Actionable alerts, runbook linking | Alert setup |
| **Performance** | OpenTelemetry, tracing, load testing | Optimization |
---
## Alerting Best Practices
### Alert Design Principles
| Principle | Good | Bad |
|-----------|------|-----|
| **Actionable** | "Database connections > 90%" | "Something wrong" |
| **Specific** | "Order service latency p99 > 500ms" | "Slow" |
| **Documented** | Links to runbook | No context |
| **Appropriate** | Pages for real issues | Alert fatigue |
### Alert Severity Levels
| Level | Response | Examples |
|-------|----------|----------|
| **Critical** | Page immediately | Service down, data loss |
| **Warning** | Check within 1 hour | Degraded performance |
| **Info** | Review daily | Capacity planning |
### Alert Template
```yaml
# Prometheus alert example
groups:
- name: service-alerts
rules:
- alert: HighLatency
expr: histogram_quantile(0.99, http_request_duration_seconds_bucket) > 0.5
for: 5m
labels:
severity: warning
service: api
annotations:
summary: "High latency on {{ $labels.service }}"
description: "p99 latency is {{ $value }}s (threshold: 0.5s)"
runbook_url: "https://wiki/runbooks/high-latency"
dashboard_url: "https://grafana/d/api-latency"
```
### Runbook-Linked Alerts
Every alert should link to:
1. **Runbook** - What to do
2. **Dashboard** - Where to look
3. **Escalation** - Who to contact
```markdown
## Alert: HighLatency
### Quick Check
- Dashboard: [link]
- Recent deploys: `git log --since="1 hour ago"`
### Common Causes
1. **High traffic** → Scale horizontally
2. **Database slow** → Check connection pool
3. **Upstream delay** → Check dependency
### Resolution
[Step-by-step instructions]
### Escalation