Use when creating troubleshooting guides and diagnostic procedures for operational issues. Covers problem diagnosis, root cause analysis, and systematic debugging.
January 24, 2026
# Runbooks - Troubleshooting Guides
Creating effective troubleshooting guides for diagnosing and resolving operational issues.
## Troubleshooting Framework
### The 5-Step Method
1. **Observe** - Gather symptoms and data
2. **Hypothesize** - Form theories about root cause
3. **Test** - Validate hypotheses with experiments
4. **Fix** - Apply solution
5. **Verify** - Confirm resolution
## Basic Troubleshooting Guide
```markdown
# Troubleshooting: [Problem Statement]
## Symptoms
What the user/system is experiencing:
- API returning 503 errors
- Response time > 10 seconds
- High CPU usage alerts
## Quick Checks (< 2 minutes)
### 1. Is the service running?
```bash
kubectl get pods -n production | grep api-server
```
**Expected:** STATUS = Running
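The "STATUS = Running" check can be automated with a small filter. This is a minimal sketch, assuming the default `kubectl get pods` column layout (NAME, READY, STATUS, ...); the function name `not_running_pods` is hypothetical.

```bash
# Print the names of pods whose STATUS column is not "Running".
# Reads `kubectl get pods` table output on stdin (header line optional).
# Exits non-zero if at least one such pod is found.
not_running_pods() {
  awk '$1 != "NAME" && $3 != "Running" { print $1; bad = 1 } END { exit bad + 0 }'
}
```

Usage: `kubectl get pods -n production | grep api-server | not_running_pods`. A pod can report Running while its containers are not ready, so the READY column is still worth a glance.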
### 2. Are recent deploys the cause?
```bash
kubectl rollout history deployment/api-server -n production
```
**Check:** Did we deploy in the last 30 minutes?
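`kubectl rollout history` lists revisions and change causes but not timestamps, so the 30-minute question needs a deploy time from another source, such as a ReplicaSet's `creationTimestamp` or your CD tool's log. The sketch below assumes you already have that time as Unix epoch seconds; the function name `deployed_recently` is hypothetical.

```bash
# Succeeds when deploy_epoch falls within the last window_min minutes.
# Arguments: deploy time (epoch seconds), current time (default: now),
# window in minutes (default: 30, matching the check above).
deployed_recently() {
  local deploy_epoch=$1 now_epoch=${2:-$(date +%s)} window_min=${3:-30}
  local age=$(( now_epoch - deploy_epoch ))
  [ "$age" -ge 0 ] && [ "$age" -le $(( window_min * 60 )) ]
}
```

Usage: `deployed_recently "$deploy_epoch" && echo "Recent deploy: consider a rollback"`.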
### 3. Is this affecting all users?
Check error rate in Datadog:
- If < 5%: Isolated issue, may be client-specific
- If > 50%: Widespread issue, likely infrastructure
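The thresholds above can be encoded as a small helper so the triage decision is consistent across responders. This is a sketch that takes an integer percentage; the `partial` label for the 5-50% band is an assumption, since the guide does not name that range, and `classify_error_rate` is a hypothetical name.

```bash
# Classify an integer error-rate percentage using the thresholds above.
# "partial" (5-50%) is an assumed label for the band the guide leaves open.
classify_error_rate() {
  local rate=$1
  if [ "$rate" -lt 5 ]; then
    echo "isolated"      # may be client-specific
  elif [ "$rate" -gt 50 ]; then
    echo "widespread"    # likely infrastructure
  else
    echo "partial"
  fi
}
```

Usage: `classify_error_rate 72` prints `widespread`.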
## Common Causes
| Symptom | Likely Cause | Quick Fix |
|---------|-------------|-----------|
| 503 errors | Pod crashlooping | Restart deployment |
| Slow responses | Database connection pool exhausted | Increase pool size |
| High memory | Memory leak | Restart pods |
## Detailed Diagnosis
### Hypothesis 1: Database Connection Issues
**Test:**
```bash
# Check database connections (single quotes so $DB_HOST expands inside the pod, not locally)
kubectl exec api-server-abc -n production -- sh -c 'psql -h "$DB_HOST" -c "SELECT count(*) FROM pg_stat_activity"'
```
**If connections > 90:** Pool is likely saturated (this threshold assumes the PostgreSQL default `max_connections` of 100).
**Next step:** Increase pool size or investigate slow queries.
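A fixed cutoff of 90 only fits one connection limit, so it can help to express the check as a percentage of the configured maximum. The following is a minimal sketch, assuming you have the active count from `pg_stat_activity` and the limit from `SHOW max_connections`; the function name `pool_saturated` is hypothetical.

```bash
# Succeeds when active connections exceed the given percentage of the limit.
# Arguments: active count, max_connections, threshold percent (default: 90).
pool_saturated() {
  local active=$1 max=$2 threshold=${3:-90}
  # Integer math: active/max > threshold/100  <=>  active*100 > max*threshold
  [ $(( active * 100 )) -gt $(( max * threshold )) ]
}
```

Usage: `pool_saturated 95 100 && echo "Pool saturated: investigate slow queries"`.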
### Hypothesis 2: High Traffic Spike
**Test:**
```bash
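# Check request rate over the last hour ($DD_APP_KEY is an assumed variable name)
now=$(date +%s)
curl -sG "https://api.datadoghq.com/api/v1/query" \
  -H "DD-API-KEY: $DD_API_KEY" -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  --data-urlencode "from=$((now - 3600))" --data-urlencode "to=$now" \
  --data-urlencode "query=sum:nginx.requests{*}"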
```
**If requests 3x normal:** Traffic spike.
**Next step:** Scale up pods or enable rate limiting.
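The "3x normal" rule needs a baseline to compare against. The sketch below assumes you supply both the current and the baseline request rate as integers (for example, the same hour on the previous day); the function name `is_traffic_spike` and the choice of baseline are assumptions, not part of the guide.

```bash
# Succeeds when the current request rate is at least `factor` times the baseline.
# Arguments: current rate, baseline rate, factor (default: 3).
is_traffic_spike() {
  local current=$1 baseline=$2 factor=${3:-3}
  # A zero baseline gives no meaningful ratio, so treat it as "not a spike".
  [ "$baseline" -gt 0 ] && [ "$current" -ge $(( baseline * factor )) ]
}
```

Usage: `is_traffic_spike 3200 1000 && echo "Traffic spike: scale up or rate limit"`.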
### Hypothesis 3: External Service Issues