runbooks-troubleshooting-guides

# Runbooks - Troubleshooting Guides

Creating effective troubleshooting guides for diagnosing and resolving operational issues.

## Troubleshooting Framework

### The 5-Step Method

1. **Observe** - Gather symptoms and data
2. **Hypothesize** - Form theories about root cause
3. **Test** - Validate hypotheses with experiments
4. **Fix** - Apply solution
5. **Verify** - Confirm resolution

## Basic Troubleshooting Guide

```markdown
# Troubleshooting: [Problem Statement]

## Symptoms

What the user/system is experiencing:
- API returning 503 errors
- Response time > 10 seconds
- High CPU usage alerts

## Quick Checks (< 2 minutes)

### 1. Is the service running?
```bash
kubectl get pods -n production | grep api-server
```

**Expected:** STATUS = Running

### 2. Are recent deploys the cause?

```bash
kubectl rollout history deployment/api-server
```

**Check:** Did we deploy in the last 30 minutes?

### 3. Is this affecting all users?

Check error rate in Datadog:

- If < 5%: Isolated issue, may be client-specific
- If > 50%: Widespread issue, likely infrastructure

## Common Causes

| Symptom | Likely Cause | Quick Fix |
|---------|-------------|-----------|
| 503 errors | Pod crashlooping | Restart deployment |
| Slow responses | Database connection pool | Increase pool size |
| High memory | Memory leak | Restart pods |

## Detailed Diagnosis

### Hypothesis 1: Database Connection Issues

**Test:**

```bash
# Check database connections
kubectl exec -it api-server-abc -- psql -h $DB_HOST -c "SELECT count(*) FROM pg_stat_activity"
```

**If connections > 90:** Pool is saturated.
**Next step:** Increase pool size or investigate slow queries.

### Hypothesis 2: High Traffic Spike

**Test:**

```bash
# Check request rate
curl -H "Authorization: Bearer $DD_API_KEY" \
  "https://api.datadoghq.com/api/v1/query?query=sum:nginx.requests{*}"
```

**If requests 3x normal:** Traffic spike.
**Next step:** Scale up pods or enable rate limiting.

### Hypothesis 3: External Service
runbooks-troubleshooting-guides

Marketplace

Plugin

Repository

Last Verified

Install Skill

Instructions

Validation Details