Use when building comprehensive monitoring and observability systems.
GitHub: TheBushidoCollective/han
do-site-reliability-engineering
January 24, 2026
npx add-skill https://github.com/TheBushidoCollective/han/blob/main/do/do-site-reliability-engineering/skills/sre-monitoring/SKILL.md -a claude-code --skill sre-monitoring-and-observability
Installation paths:
.claude/skills/sre-monitoring-and-observability/

# SRE Monitoring and Observability
Building comprehensive monitoring and observability systems.
## Four Golden Signals
### Latency
Time to process requests:
```prometheus
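# Histogram metric that records request duration in seconds
http_request_duration_seconds
# 95th-percentile latency over the last 5 minutes, aggregated across series
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)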
```
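To make the percentile query less of a black box, here is a minimal Python sketch of the bucket interpolation that `histogram_quantile` performs. It is a simplification under stated assumptions, not Prometheus's implementation: the function name `estimate_quantile` and the sample bucket counts are hypothetical, buckets are assumed cumulative and sorted, and the lowest bucket is assumed to start at 0.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # assumption: lowest bucket starts at 0
    for bound, count in buckets:
        # Pick the first non-empty bucket whose cumulative count reaches the rank.
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical data: 100 requests, 90 of them under 0.5 s, all under 1 s.
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample)
```

Because the estimate is interpolated within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.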
### Traffic
Demand on the system:
```prometheus
# Requests per second
rate(http_requests_total[5m])
# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
```
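The traffic queries rely on `rate()` turning a monotonically increasing counter into a per-second value. The sketch below illustrates that idea in Python, including the handling of counter resets. It is a rough approximation: the name `simple_rate` and the sample values are hypothetical, and Prometheus additionally extrapolates to the window boundaries, which this version omits.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset (for example a restart).
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # After a reset the counter restarts from zero, so the whole
            # current value counts as new increase.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None

# Hypothetical scrape every 15 s, with a restart between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
requests_per_second = simple_rate(scrapes)
```

In this sample the increases are 30, 30, 10, and 30 requests over 60 seconds, so the reset does not produce a negative rate.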
### Errors
Rate of failed requests:
```prometheus
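# Error ratio: share of requests that returned 5xx.
# sum() drops the status label so both sides of the division match.
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# SLI compliance (pseudocode: error_rate is the ratio above; slo_target
# is assumed to be the allowed error ratio, e.g. 0.001 for a 99.9% SLO)
1 - (error_rate / slo_target)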
```
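As a worked example of the compliance arithmetic, the following Python sketch computes how much of an error budget is left. It assumes one particular reading of the expression above, in which the denominator is the allowed error ratio (1 minus the availability objective). The function name `error_budget_remaining`, its parameters, and the request counts are all hypothetical.

```python
def error_budget_remaining(failed, total, availability_slo):
    """Fraction of the error budget still unspent (negative when overspent).

    availability_slo: target success ratio, e.g. 0.999 for "three nines".
    """
    if not 0.0 < availability_slo < 1.0:
        raise ValueError("availability_slo must be between 0 and 1")
    if total == 0:
        return 1.0  # no traffic, so no budget has been spent
    error_rate = failed / total
    allowed_error_ratio = 1.0 - availability_slo
    return 1.0 - error_rate / allowed_error_ratio

# Hypothetical window: 25 failures in 100,000 requests against a 99.9% SLO.
remaining = error_budget_remaining(25, 100_000, 0.999)
```

With these numbers the observed error ratio is 0.025% against an allowance of 0.1%, so roughly three quarters of the budget remains.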
### Saturation
Resource utilization:
```prometheus
# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
```
## Service Level Indicators (SLIs)
### Availability SLI
```prometheus
# Successful requests / Total requests
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total[30d]))
```
### Latency SLI
```prometheus
# Requests faster than threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
```
### Throughput SLI
```prometheus
# Requests processed within capacity
clamp_max(
rate(http_requests_total[5m]) / capacity_requests_per_second,
1.0
)
```
## Alerting
### Alert Severity Levels
**P0 - Critical**: Service down or severe degradation
**P1 - High**: Significant impact, error budget at risk
**P2 - Medium**: Degradation, not user-facing yet
**P3 - Low**: Awareness, no immediate action needed
### Example Alerts
```yaml
# High error rate
groups:
  - name: sre
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
            > 0.05
```