Use when building reliable and scalable distributed systems.
January 24, 2026
npx add-skill https://github.com/TheBushidoCollective/han/blob/main/do/do-site-reliability-engineering/skills/sre-reliability/SKILL.md -a claude-code --skill sre-reliability-engineering
Installation paths:
.claude/skills/sre-reliability-engineering/
# SRE Reliability Engineering
Building reliable and scalable distributed systems.
## Service Level Objectives (SLOs)
### Defining SLOs
```
SLI: Availability = successful requests / total requests
SLO: 99.9% availability (measured over 30 days)
Error Budget: 0.1% = ~43 minutes of downtime per 30-day window
```
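The 43-minute figure follows directly from the window length. A minimal sketch of that arithmetic, where `errorBudgetMinutes` is a hypothetical helper name and full outages are assumed to be the only source of errors:

```javascript
// Convert an availability SLO into an allowed-downtime budget.
// `errorBudgetMinutes` is a hypothetical helper, not part of the source.
function errorBudgetMinutes(sloTarget, windowDays = 30) {
  const windowMinutes = windowDays * 24 * 60; // 30 days = 43,200 minutes
  return (1 - sloTarget) * windowMinutes;
}

console.log(errorBudgetMinutes(0.999).toFixed(1)); // "43.2"
```

Tightening the target to 99.99% shrinks the same budget to roughly 4.3 minutes per 30 days.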
### SLO Document Template
````markdown
# API Service SLO
## Availability SLO
**Target**: 99.9% of requests succeed (measured over 30 days)
**SLI Definition**:
- Success: HTTP 200-399 responses
- Failure: HTTP 500-599 responses, timeouts
- Excluded: HTTP 400-499 (client errors)
**Measurement**:
```prometheus
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total{status!~"4.."}[30d]))
```
**Error Budget**: 0.1% = ~43 minutes/month
**Consequences**:
- Budget remaining > 0: Ship features fast
- Budget at 50%: Increase caution
- Budget exhausted: Feature freeze, focus on reliability
````
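To make the template's SLI definition concrete, here is a rough sketch that applies the same success, failure, and excluded rules to a list of request outcomes. The `classify` and `availability` names, the use of `null` for a timeout, and the handling of 1xx responses are all assumptions for illustration.

```javascript
// Classify one request outcome using the template's SLI definition.
// `status` is an HTTP status code, or null for a timeout (assumed convention).
function classify(status) {
  if (status === null) return 'failure'; // timeouts count as failures
  if (status >= 200 && status <= 399) return 'success';
  if (status >= 400 && status <= 499) return 'excluded'; // client errors
  if (status >= 500 && status <= 599) return 'failure';
  return 'excluded'; // 1xx and unknown codes: excluded here by assumption
}

// Availability = successes / (successes + failures); excluded requests are ignored.
function availability(statuses) {
  let success = 0;
  let failure = 0;
  for (const status of statuses) {
    const outcome = classify(status);
    if (outcome === 'success') success += 1;
    else if (outcome === 'failure') failure += 1;
  }
  const eligible = success + failure;
  // With no eligible traffic, report 1 (assumption: nothing failed).
  return eligible === 0 ? 1 : success / eligible;
}

console.log(availability([200, 301, 404, 503])); // 2 successes out of 3 eligible requests
```

The 404 in the example is dropped from both numerator and denominator, which is what the `status!~"4.."` matcher does in the query.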
## Error Budgets
### Tracking
```prometheus
# Error budget remaining
error_budget_remaining = 1 - (
  (1 - current_sli) / (1 - slo_target)
)
# Example: 99.9% SLO, currently at 99.95%
# Error budget remaining = 1 - ((1 - 0.9995) / (1 - 0.999))
# = 1 - (0.0005 / 0.001) = 0.5 (50% remaining)
```
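The same formula can be sketched as a small function. `errorBudgetRemaining` is a hypothetical name, and both arguments are assumed to be fractions between 0 and 1.

```javascript
// Fraction of the error budget still unspent.
// 1 = untouched, 0 = exhausted, negative = overspent.
function errorBudgetRemaining(currentSli, sloTarget) {
  if (sloTarget >= 1) {
    throw new RangeError('sloTarget must be below 1; a 100% SLO has no error budget');
  }
  return 1 - (1 - currentSli) / (1 - sloTarget);
}

console.log(errorBudgetRemaining(0.9995, 0.999).toFixed(2)); // "0.50"
console.log(errorBudgetRemaining(0.998, 0.999).toFixed(2));  // "-1.00"
```

A negative result means the service has already failed more than the SLO allows for the window; the second call shows a budget overspent by 100%.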
### Burn Rate
```prometheus
# How fast are we consuming error budget?
error_budget_burn_rate =
  (1 - current_sli_1h) / (1 - slo_target)
# Alert if burning budget 10x faster than sustainable
- alert: FastErrorBudgetBurn
  expr: error_budget_burn_rate > 10
  for: 1h
```
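A burn rate of 1 spends the budget exactly over the SLO window, so a higher rate shortens the time to exhaustion proportionally. The sketch below illustrates that relationship; `burnRate` and `hoursToExhaustion` are hypothetical helpers, and a 30-day window is assumed.

```javascript
// Burn rate: observed error rate divided by the error rate the SLO allows.
function burnRate(currentSli1h, sloTarget) {
  return (1 - currentSli1h) / (1 - sloTarget);
}

// Hours until a full budget is gone if the current burn rate holds.
function hoursToExhaustion(rate, windowDays = 30) {
  return rate > 0 ? (windowDays * 24) / rate : Infinity;
}

const rate = burnRate(0.98, 0.999); // 2% errors against a 0.1% allowance
console.log(rate.toFixed(1));                    // "20.0"
console.log(hoursToExhaustion(rate).toFixed(1)); // "36.0"
```

At the alert threshold of 10, a full 30-day budget would last about 72 hours.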
### Policy
```
Error Budget > 75%: Ship aggressively
Error Budget 25-75%: Normal velocity
Error Budget < 25%: Slow down, increase testing
Error Budget = 0%: Feature freeze, reliability only
```
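The policy table can be expressed as a lookup, for example to annotate a deploy pipeline. This is only a sketch: `releasePolicy` is a hypothetical name, the input is the remaining budget as a fraction, and treating an overspent (negative) budget the same as 0% is an assumption.

```javascript
// Map remaining error budget (fraction, 1 = 100%) to the policy tier above.
function releasePolicy(budgetRemaining) {
  if (budgetRemaining <= 0) return 'Feature freeze, reliability only';
  if (budgetRemaining < 0.25) return 'Slow down, increase testing';
  if (budgetRemaining <= 0.75) return 'Normal velocity';
  return 'Ship aggressively';
}

console.log(releasePolicy(0.5)); // "Normal velocity"
console.log(releasePolicy(0.1)); // "Slow down, increase testing"
```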
## Reliability Patterns
### Circuit Breaker
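```javascript
class CircuitBreaker {
  // `= {}` lets callers use `new CircuitBreaker()` with all defaults.
  constructor({ threshold = 5, timeout = 60000 } = {}) {
    this.state = 'CLOSED';
    this.failures = 0;
    this.threshold = threshold; // failures allowed before the circuit opens
    this.timeout = timeout;     // milliseconds
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      // ... (the source listing is cut off at this point)
    }
  }
}
```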
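The listing above breaks off inside `call`. As a rough, self-contained sketch of how a breaker with these fields is commonly finished: the `HALF_OPEN` state, the `openedAt` field, the injectable `now` clock, the error message, and the class name `CircuitBreakerSketch` are all assumptions, not taken from the source.

```javascript
class CircuitBreakerSketch {
  constructor({ threshold = 5, timeout = 60000, now = Date.now } = {}) {
    this.state = 'CLOSED';
    this.failures = 0;
    this.threshold = threshold; // consecutive failures before opening
    this.timeout = timeout;     // ms to stay open before a trial call
    this.now = now;             // injectable clock (assumption, eases testing)
    this.openedAt = 0;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      // Fail fast while the cool-down period has not elapsed.
      if (this.now() - this.openedAt < this.timeout) {
        throw new Error('Circuit breaker is open');
      }
      this.state = 'HALF_OPEN'; // allow one trial call through
    }
    try {
      const result = await fn();
      // Any success resets the breaker.
      this.failures = 0;
      this.state = 'CLOSED';
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.state === 'HALF_OPEN' || this.failures >= this.threshold) {
        this.state = 'OPEN';
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

A caller would wrap each outbound request, for example `await breaker.call(() => fetch(url))`, and treat the "open" error as a signal to fall back or shed load rather than retry immediately.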