runbooks-troubleshooting-guides

verified

Use when creating troubleshooting guides and diagnostic procedures for operational issues. Covers problem diagnosis, root cause analysis, and systematic debugging.

Marketplace

han

TheBushidoCollective/han

Plugin

jutsu-runbooks

Technique

Repository

TheBushidoCollective/han
60 stars

jutsu/jutsu-runbooks/skills/troubleshooting-guides/SKILL.md

Last Verified

January 24, 2026

Install Skill

npx add-skill https://github.com/TheBushidoCollective/han/blob/main/jutsu/jutsu-runbooks/skills/troubleshooting-guides/SKILL.md -a claude-code --skill runbooks-troubleshooting-guides

Installation paths:

Claude
.claude/skills/runbooks-troubleshooting-guides/

Instructions

# Runbooks - Troubleshooting Guides

Creating effective troubleshooting guides for diagnosing and resolving operational issues.

## Troubleshooting Framework

### The 5-Step Method

1. **Observe** - Gather symptoms and data
2. **Hypothesize** - Form theories about root cause
3. **Test** - Validate hypotheses with experiments
4. **Fix** - Apply solution
5. **Verify** - Confirm resolution

## Basic Troubleshooting Guide

```markdown
# Troubleshooting: [Problem Statement]

## Symptoms

What the user/system is experiencing:
- API returning 503 errors
- Response time > 10 seconds
- High CPU usage alerts

## Quick Checks (< 2 minutes)

### 1. Is the service running?
```bash
kubectl get pods -n production | grep api-server
```

**Expected:** STATUS = Running
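
When many pods are listed, scanning the STATUS column by eye is error-prone. The following is a minimal sketch of a filter that prints only the pods that are not Running. The helper name `not_running_pods` is hypothetical, and it assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE).

```bash
# Print the names of api-server pods whose STATUS column is not "Running".
# Reads `kubectl get pods` output on stdin; assumes the default column order
# NAME READY STATUS RESTARTS AGE.
not_running_pods() {
  awk '$1 ~ /api-server/ && $3 != "Running" { print $1 }'
}

# Usage (namespace and name pattern taken from the check above):
#   kubectl get pods -n production --no-headers | not_running_pods
```

Empty output means every matching pod reports Running. It does not confirm readiness, so a pod showing `0/1` in the READY column still deserves a closer look.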

### 2. Are recent deploys the cause?

```bash
kubectl rollout history deployment/api-server -n production
```

**Check:** Did we deploy in the last 30 minutes?

### 3. Is this affecting all users?

Check error rate in Datadog:

- If < 5%: Isolated issue, may be client-specific
- If > 50%: Widespread issue, likely infrastructure
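
The thresholds above can be encoded so the triage decision is repeatable. This is a sketch only: the function name `classify_error_rate` is hypothetical, and rates between 5% and 50% are reported as `undetermined` because the guide does not classify that range.

```bash
# Classify an error rate (in percent) using the thresholds from this guide.
# $1: error rate as a percentage, e.g. 3.2
classify_error_rate() {
  awk -v rate="$1" 'BEGIN {
    if (rate < 5)       print "isolated"      # may be client-specific
    else if (rate > 50) print "widespread"    # likely infrastructure
    else                print "undetermined"  # not covered by the guide
  }'
}

# Usage:
#   classify_error_rate 3.2
```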

## Common Causes

| Symptom | Likely Cause | Quick Fix |
|---------|-------------|-----------|
| 503 errors | Pod crashlooping | Restart deployment |
| Slow responses | Database connection pool | Increase pool size |
| High memory | Memory leak | Restart pods |

## Detailed Diagnosis

### Hypothesis 1: Database Connection Issues

**Test:**

```bash
# Check database connections ($DB_HOST is expanded inside the pod, where it is assumed to be set)
kubectl exec -n production api-server-abc -- sh -c 'psql -h "$DB_HOST" -c "SELECT count(*) FROM pg_stat_activity"'
```

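**If connections > 90:** The pool is likely saturated. PostgreSQL's default `max_connections` is 100, so adjust this threshold to match your configured limit.
**Next step:** Increase pool size or investigate slow queries.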
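
The comparison against the threshold can be scripted so it fits into an automated check. A minimal sketch follows. `pool_saturated` is a hypothetical helper, and the default of 90 is the figure used in this guide rather than a universal limit.

```bash
# Succeed (exit 0) when the connection count exceeds the threshold.
# $1: current connection count
# $2: threshold (optional, defaults to 90 as in this guide)
pool_saturated() {
  [ "$1" -gt "${2:-90}" ]
}

# Usage, assuming psql is available in the pod and DB_HOST is set there
# (-tA makes psql print the bare number):
#   count=$(kubectl exec -n production api-server-abc -- \
#     sh -c 'psql -h "$DB_HOST" -tA -c "SELECT count(*) FROM pg_stat_activity"')
#   if pool_saturated "$count"; then echo "Pool is saturated"; fi
```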

### Hypothesis 2: High Traffic Spike

**Test:**

```bash
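# Check request rate over the last 15 minutes (Datadog expects API and application keys plus a time window)
NOW=$(date +%s)
curl -G "https://api.datadoghq.com/api/v1/query" \
  -H "DD-API-KEY: $DD_API_KEY" -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  --data-urlencode "query=sum:nginx.requests{*}" -d "from=$((NOW - 900))" -d "to=$NOW"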
```

**If requests are at least 3x the normal rate:** Traffic spike.
**Next step:** Scale up pods or enable rate limiting.
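
"3x normal" only works as a test if the baseline is written down. The sketch below compares a current request rate against a baseline that the operator supplies. `is_traffic_spike` is a hypothetical helper, and both rates are assumed to come from your own dashboards.

```bash
# Succeed (exit 0) when the current rate is at least `factor` times the baseline.
# $1: current request rate
# $2: baseline ("normal") request rate
# $3: factor (optional, defaults to 3 as in this guide)
is_traffic_spike() {
  awk -v cur="$1" -v base="$2" -v factor="${3:-3}" \
    'BEGIN { exit !(cur >= base * factor) }'
}

# Usage:
#   if is_traffic_spike 3200 1000; then echo "Traffic spike"; fi
```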

### Hypothesis 3: External Service 
