sre-monitoring-and-observability

verified

Use when building comprehensive monitoring and observability systems.

Marketplace

han

TheBushidoCollective/han

Plugin

do-site-reliability-engineering

Discipline

Repository

TheBushidoCollective/han
60 stars

do/do-site-reliability-engineering/skills/sre-monitoring/SKILL.md

Last Verified

January 24, 2026

Install Skill

npx add-skill https://github.com/TheBushidoCollective/han/blob/main/do/do-site-reliability-engineering/skills/sre-monitoring/SKILL.md -a claude-code --skill sre-monitoring-and-observability

Installation paths:

Claude
.claude/skills/sre-monitoring-and-observability/

Instructions

# SRE Monitoring and Observability

A reference for building monitoring and observability systems around the four golden signals, service level indicators (SLIs), and alerting.

## Four Golden Signals

### Latency

Time to process requests:

```prometheus
# Request duration
http_request_duration_seconds

# 95th percentile latency over the last 5 minutes, aggregated across all series
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```
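
To make the quantile query less of a black box, here is a minimal Python sketch of the linear interpolation that `histogram_quantile` applies to cumulative buckets. The function and the sample bucket values are illustrative assumptions, not part of the skill, and the sketch ignores edge cases that Prometheus handles, such as negative bucket bounds and native histograms.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                # Empty bucket (only reachable when rank is 0): nothing to interpolate.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for le="0.1", "0.5", "1.0", "+Inf".
sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)  # lands inside the (0.5, 1.0] bucket
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.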

### Traffic

Demand on the system:

```prometheus
# Requests per second
rate(http_requests_total[5m])

# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
```

### Errors

Rate of failed requests:

```prometheus
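# Error ratio: 5xx responses / all responses
# (aggregate both sides so the division does not match series label-by-label)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))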

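# Error budget remaining (error_rate and slo_target are placeholders,
# e.g. recording rules; the budget is 1 - slo_target)
1 - (error_rate / (1 - slo_target))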
```
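
The arithmetic behind the error queries can be sketched in Python. This is a minimal illustration, assuming the error budget is defined as `1 - slo_target` (a common convention, not stated in the source); the function names and sample numbers are hypothetical.

```python
def error_ratio(failed, total):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


def error_budget_remaining(ratio, slo_target):
    """Share of the error budget still unspent.

    Assumes the budget is 1 - slo_target, so a 99.9% target allows
    0.1% of requests to fail. A negative result means the budget is overspent.
    """
    budget = 1.0 - slo_target
    return 1.0 - ratio / budget


# Hypothetical window: 30 failures out of 100,000 requests against a 99.9% target.
ratio = error_ratio(30, 100_000)
remaining = error_budget_remaining(ratio, 0.999)  # about 0.7, i.e. 70% of budget left
```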

### Saturation

Resource utilization:

```prometheus
# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) 
/ node_memory_MemTotal_bytes * 100
```

## Service Level Indicators (SLIs)

### Availability SLI

```prometheus
# Successful requests / Total requests
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total[30d]))
```
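
As a rough illustration of what the availability query counts as "successful", the following Python sketch applies the same `[23]..` pattern to per-status request counts. The function name and the sample counts are assumptions made for the example.

```python
import re

# Same pattern as the status matcher in the query; PromQL regexes are fully
# anchored, so fullmatch is the equivalent check.
GOOD_STATUS = re.compile(r"[23]..")


def availability(requests_by_status):
    """Return good / total requests, or None when there was no traffic."""
    total = sum(requests_by_status.values())
    if total == 0:
        return None
    good = sum(
        count
        for status, count in requests_by_status.items()
        if GOOD_STATUS.fullmatch(status)
    )
    return good / total


# Hypothetical counts over the SLI window.
counts = {"200": 9900, "302": 50, "404": 30, "500": 20}
sli = availability(counts)  # 9950 good requests out of 10000
```

Note that under this pattern 4xx responses count against availability. Whether client errors should be treated as failures is a design choice to settle per service.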

### Latency SLI

```prometheus
# Requests faster than threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
```
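
The latency SLI only works when the threshold is an actual bucket boundary, because `le="0.5"` selects one specific bucket series. A minimal Python sketch of the same ratio, with a hypothetical function name and made-up bucket counts:

```python
import math


def latency_sli(cumulative_counts, threshold):
    """Fraction of requests at or below `threshold` seconds.

    `cumulative_counts` maps each bucket upper bound to its cumulative count
    and must include math.inf, which equals the histogram's _count.
    """
    if threshold not in cumulative_counts:
        # No bucket ends at this value, so the ratio cannot be read off directly.
        raise ValueError(f"{threshold} is not a bucket boundary")
    total = cumulative_counts[math.inf]
    if total == 0:
        return None
    return cumulative_counts[threshold] / total


# Hypothetical cumulative counts for le="0.1", "0.5", "1.0", "+Inf".
buckets = {0.1: 600, 0.5: 940, 1.0: 990, math.inf: 1000}
sli = latency_sli(buckets, 0.5)  # 940 of 1000 requests finished within 0.5 s
```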

### Throughput SLI

```prometheus
# Fraction of capacity in use, capped at 1.0
# (capacity_requests_per_second is assumed to be a single series)
clamp_max(
  sum(rate(http_requests_total[5m])) / scalar(capacity_requests_per_second),
  1.0
)
```

## Alerting

### Alert Severity Levels

- **P0 - Critical**: Service down or severe degradation
- **P1 - High**: Significant impact, error budget at risk
- **P2 - Medium**: Degradation, not user-facing yet
- **P3 - Low**: Awareness, no immediate action needed

### Example Alerts

```yaml
# High error rate
groups:
  - name: sre
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))
          > 0.05
```
