# Groq Incident Runbook

## Overview
Rapid incident response procedures for Groq-related outages.

## Prerequisites
- Access to Groq dashboard and status page
- kubectl access to production cluster
- Prometheus/Grafana access
- Communication channels (Slack, PagerDuty)

## Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | Groq API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |

## Quick Triage

```bash
# 1. Check Groq status (if the page serves HTML rather than JSON, drop `| jq` or open it in a browser)
curl -s https://status.groq.com | jq

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.groq'

# 3. Check error rate (last 5 min)
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=rate(groq_errors_total[5m])'

# 4. Recent error logs (--tail=-1 lifts the 10-line-per-pod default that applies with -l)
kubectl logs -l app=groq-integration --since=5m --tail=-1 | grep -i error | tail -20
```

## Decision Tree

```
Groq API returning errors?
├─ YES: Is status.groq.com showing incident?
│   ├─ YES → Wait for Groq to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
```
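
The tree above can be sketched as a small shell function, which may help when wiring triage into a script or chat-ops command. This is a minimal sketch, not part of the original runbook: the function name `triage_decision` and its yes/no argument convention are hypothetical, and the three answers still have to come from the Quick Triage checks.

```bash
#!/usr/bin/env bash
# Hypothetical helper: encodes the decision tree as a function.
# Each argument is "yes" or "no":
#   $1  Is the Groq API returning errors?
#   $2  Is status.groq.com showing an incident?
#   $3  Is our own service healthy?
triage_decision() {
  local api_errors="$1" groq_incident="$2" service_healthy="$3"

  if [ "$api_errors" = "yes" ]; then
    if [ "$groq_incident" = "yes" ]; then
      echo "Wait for Groq to resolve. Enable fallback."
    else
      echo "Our integration issue. Check credentials, config."
    fi
  elif [ "$service_healthy" = "yes" ]; then
    echo "Likely resolved or intermittent. Monitor."
  else
    echo "Our infrastructure issue. Check pods, memory, network."
  fi
}
```

For example, `triage_decision yes no yes` prints the "Our integration issue" branch, which leads into the error-specific actions that follow.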

## Immediate Actions by Error Type

### 401/403 - Authentication
```bash
# Verify API key is set (prints the decoded length in bytes, not the key itself)
kubectl get secret groq-secrets -o jsonpath='{.data.api-key}' | base64 -d | wc -c

# Check if key was rotated
# → Verify in Groq dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic groq-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/groq-integration
```
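
After rotating the key, a quick smoke test can confirm that Groq accepts it. The sketch below is illustrative only: the helper names `next_step_for_status` and `check_groq_key` are hypothetical, and it assumes the key under test is exported as `GROQ_API_KEY` and that the OpenAI-compatible `/openai/v1/models` endpoint is reachable from where you run it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an HTTP status code to the runbook step to follow.
next_step_for_status() {
  case "$1" in
    200)     echo "ok: key accepted" ;;
    401|403) echo "auth: see 401/403 - Authentication" ;;
    429)     echo "rate-limit: see 429 - Rate Limited" ;;
    5??)     echo "upstream: check Groq status page" ;;
    *)       echo "unknown: inspect response manually" ;;
  esac
}

# Hypothetical helper: call the API with the key and classify the result.
# Assumption: GROQ_API_KEY holds the key under test.
check_groq_key() {
  if [ -z "${GROQ_API_KEY:-}" ]; then
    echo "GROQ_API_KEY is not set" >&2
    return 1
  fi
  local code
  # -w '%{http_code}' prints only the status code; the body is discarded.
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer ${GROQ_API_KEY}" \
    https://api.groq.com/openai/v1/models)
  next_step_for_status "$code"
}
```

Running `check_groq_key` from a workstation tests the key itself; running it from inside a pod additionally exercises the cluster's egress path.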

### 429 - Rate Limited
```bash
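# Check rate limit headers (needs a valid key in GROQ_API_KEY; headers may only be returned by inference endpoints)
curl -s -D - -o /dev/null -H "Authorization: Bearer $GROQ_API_KEY" https://api.groq.com/openai/v1/models | grep -iE 'ratelimit|retry-after'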

# Enable request queuing
```