sre-reliability-engineering

verified

Use when building reliable and scalable distributed systems.

Marketplace

han

TheBushidoCollective/han

Plugin

do-site-reliability-engineering

Discipline

Repository

TheBushidoCollective/han
60 stars

do/do-site-reliability-engineering/skills/sre-reliability/SKILL.md

Last Verified

January 24, 2026

Install Skill

npx add-skill https://github.com/TheBushidoCollective/han/blob/main/do/do-site-reliability-engineering/skills/sre-reliability/SKILL.md -a claude-code --skill sre-reliability-engineering

Installation paths:

Claude
.claude/skills/sre-reliability-engineering/

Instructions

# SRE Reliability Engineering

Practices for building reliable and scalable distributed systems: service level objectives, error budgets, and reliability patterns.

## Service Level Objectives (SLOs)

### Defining SLOs

```
SLI: Availability = successful requests / total requests
SLO: 99.9% availability (measured over 30 days)
Error Budget: 0.1% = ~43 minutes of downtime per 30 days
```
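
The arithmetic behind these numbers can be sketched in a few lines. This is a minimal illustration, not part of the skill itself: the function names `availability` and `errorBudgetMinutes` are hypothetical, and treating a window with no traffic as fully available is an assumption.

```javascript
// Hypothetical helpers; the names are illustrative and not defined by the skill.

// SLI: fraction of requests that succeeded.
function availability(successful, total) {
  if (total === 0) return 1; // assumption: no traffic counts as meeting the objective
  return successful / total;
}

// Error budget expressed as minutes of full downtime allowed in the window.
function errorBudgetMinutes(sloTarget, windowDays = 30) {
  const windowMinutes = windowDays * 24 * 60; // 43,200 minutes for 30 days
  return (1 - sloTarget) * windowMinutes;
}

console.log(errorBudgetMinutes(0.999).toFixed(1)); // prints 43.2
```

Expressing the budget in minutes assumes a total outage; a partial outage that fails only a fraction of requests spends the budget proportionally more slowly.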

### SLO Document Template

````markdown
# API Service SLO

## Availability SLO

**Target**: 99.9% of requests succeed (measured over 30 days)

**SLI Definition**:
- Success: HTTP 200-399 responses
- Failure: HTTP 500-599 responses, timeouts
- Excluded: HTTP 400-499 (client errors)

**Measurement**:
```prometheus
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total{status!~"4.."}[30d]))
```

**Error Budget**: 0.1% = ~43 minutes/month

**Consequences**:

- Budget remaining > 0: Ship features fast
- Budget at or below 50%: Increase caution
- Budget exhausted: Feature freeze, focus on reliability
````

## Error Budgets

### Tracking

```prometheus
# Error budget remaining
error_budget_remaining = 1 - (
  (1 - current_sli) / (1 - slo_target)
)

# Example: 99.9% SLO, currently at 99.95%
# Error budget remaining = 1 - ((1 - 0.9995) / (1 - 0.999))
# = 1 - (0.0005 / 0.001) = 0.5 (50% remaining)
```
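
The same formula can be written as a small function. This is a sketch under the definitions above; the name `errorBudgetRemaining` is hypothetical, and the guard against a 100% target is an added assumption (a 100% SLO leaves no budget to divide by).

```javascript
// Fraction of the error budget still unspent.
// 1 means untouched, 0 means exhausted, negative means overspent.
function errorBudgetRemaining(currentSli, sloTarget) {
  if (sloTarget >= 1) {
    throw new RangeError('sloTarget must be below 1');
  }
  return 1 - (1 - currentSli) / (1 - sloTarget);
}

console.log(errorBudgetRemaining(0.9995, 0.999).toFixed(2)); // prints 0.50
console.log(errorBudgetRemaining(0.998, 0.999).toFixed(2));  // prints -1.00
```

The second call shows the overspent case: an SLI of 99.8% against a 99.9% target has used twice the budget.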

### Burn Rate

```prometheus
# How fast are we consuming error budget?
# A burn rate of 1 spends the budget exactly over the SLO window.
error_budget_burn_rate =
  (1 - current_sli_1h) / (1 - slo_target)

# Alert if burning budget 10x faster than sustainable
- alert: FastErrorBudgetBurn
  expr: error_budget_burn_rate > 10
  for: 1h
```
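
To see what a given burn rate means in practice, it helps to convert it into time until the budget runs out. The sketch below assumes a 30-day SLO window as used earlier in this document; `burnRate` and `hoursToExhaustion` are hypothetical names.

```javascript
// Burn rate: observed error rate divided by the error rate the SLO allows.
function burnRate(sliOverWindow, sloTarget) {
  return (1 - sliOverWindow) / (1 - sloTarget);
}

// Hours until a full budget is spent if the given burn rate is sustained.
function hoursToExhaustion(rate, sloWindowDays = 30) {
  if (rate <= 0) return Infinity; // no errors: the budget is never spent
  return (sloWindowDays * 24) / rate;
}

console.log(hoursToExhaustion(10)); // prints 72
```

At the 10x threshold used by the alert, a full 30-day budget would last 72 hours, or three days.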

### Policy

```
Error Budget > 75%: Ship aggressively
Error Budget 25-75%: Normal velocity
Error Budget < 25%: Slow down, increase testing
Error Budget <= 0%: Feature freeze, reliability only
```
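
The policy tiers map directly onto a lookup function, which could feed a dashboard or a deploy gate. This is an illustrative sketch: `releasePolicy` is a hypothetical name, and treating a negative (overspent) budget the same as an exhausted one is an assumption.

```javascript
// Maps the remaining error budget (as a fraction, e.g. 0.5 for 50%)
// to the policy tiers listed above.
function releasePolicy(budgetRemaining) {
  if (budgetRemaining <= 0) return 'Feature freeze, reliability only';
  if (budgetRemaining < 0.25) return 'Slow down, increase testing';
  if (budgetRemaining <= 0.75) return 'Normal velocity';
  return 'Ship aggressively';
}

console.log(releasePolicy(0.5)); // prints Normal velocity
```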

## Reliability Patterns

### Circuit Breaker
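
A circuit breaker stops calling a failing dependency once failures cross a threshold, then lets a trial call through after a cool-down. The sketch below is one common way to express that and is not the skill's own implementation. It reuses the `threshold` and `timeout` options from the `CircuitBreaker` snippet in this section; the class name `SketchCircuitBreaker`, the `HALF_OPEN` state, the `openedAt` field, and the error message are assumptions made for illustration.

```javascript
// Illustrative sketch only; see the assumptions listed in the text above.
class SketchCircuitBreaker {
  constructor({ threshold = 5, timeout = 60000 } = {}) {
    this.state = 'CLOSED';
    this.failures = 0;
    this.threshold = threshold; // consecutive failures before opening
    this.timeout = timeout;     // ms to stay open before a trial call
    this.openedAt = 0;          // timestamp of the last transition to OPEN
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.timeout) {
        throw new Error('Circuit breaker is open'); // fail fast, skip fn
      }
      this.state = 'HALF_OPEN'; // cool-down elapsed: allow a trial call
    }
    try {
      const result = await fn();
      this.state = 'CLOSED'; // any success resets the breaker
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or too many failures, (re)opens the circuit.
      if (this.state === 'HALF_OPEN' || this.failures >= this.threshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

A caller would wrap each outbound request, for example `await breaker.call(() => fetch(url))`, and treat the fail-fast error as a signal to use a fallback. This version does not limit concurrent trial calls while half-open, which a production breaker would usually do.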

```javascript
class CircuitBreaker {
  constructor({ threshold = 5, timeout = 60000 }) {
    this.state = 'CLOSED';
    this.failures = 0;
    this.threshold = threshold;
    this.timeout = timeout;
  }
  
  async call(fn) {
    if (this.state === 'O
```

Validation Details

Front Matter
Required Fields
Valid Name Format
Valid Description
Has Sections
Allowed Tools
Instruction Length: 6471 chars

Issues Found:

- name_directory_mismatch