# Observability Engineer - Full-Stack Monitoring Expert

## ⚠️ Chunking Rule

A full monitoring stack (Prometheus + Grafana + OpenTelemetry + logs) can easily exceed 1000 lines of configuration. Generate ONE component per response, in this order: Metrics → Dashboards → Alerting → Tracing → Logs.

## Purpose

Design and implement comprehensive observability systems covering metrics, logs, traces, and reliability engineering.

## When to Use

- Set up Prometheus monitoring
- Create Grafana dashboards
- Implement distributed tracing (Jaeger, Tempo)
- Define SLIs/SLOs and error budgets
- Configure alerting systems
- Prevent alert fatigue
- Debug microservices latency

## Core Concepts

### Three Pillars of Observability

```
┌─────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY                             │
├─────────────────┬─────────────────┬─────────────────────────┤
│    METRICS      │     LOGS        │        TRACES           │
├─────────────────┼─────────────────┼─────────────────────────┤
│ Prometheus      │ Loki/ELK        │ Jaeger/Tempo            │
│ What happened?  │ Why it happened?│ How do requests flow?   │
│ Aggregated data │ Event details   │ Request journey         │
└─────────────────┴─────────────────┴─────────────────────────┘
```
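As a rough illustration of how the three signals relate, the sketch below records a metric, a log line, and a span for one request and ties them together with a shared trace ID. Everything here is hypothetical: `handle_request`, `request_counter`, `log_lines`, and `spans` are in-memory stand-ins for a metrics backend such as Prometheus, a log store such as Loki, and a tracing backend such as Jaeger, not real client APIs.

```python
import json
import time
import uuid
from collections import Counter

# In-memory stand-ins (assumptions for illustration only).
request_counter = Counter()  # metrics: aggregated counts per (route, status)
log_lines = []               # logs: one JSON event per notable occurrence
spans = []                   # traces: one timed span per request


def handle_request(route, work):
    """Run `work` and emit a metric, a log on failure, and a span."""
    trace_id = uuid.uuid4().hex  # shared ID that links the log to the span
    start = time.perf_counter()
    status = "ok"
    try:
        work()
    except Exception as exc:
        status = "error"
        # Log: the event detail that explains why the request failed.
        log_lines.append(json.dumps({
            "level": "error",
            "trace_id": trace_id,
            "route": route,
            "msg": str(exc),
        }))
    duration_s = time.perf_counter() - start
    # Metric: aggregated count, cheap to store and alert on.
    request_counter[(route, status)] += 1
    # Trace: the timed record of this one request's journey.
    spans.append({"trace_id": trace_id, "name": route,
                  "status": status, "duration_s": duration_s})
    return trace_id
```

The metric shows that errors are rising, the span shows which request was slow or failed, and the log line carrying the same `trace_id` explains the cause.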

### RED Method (Services)
- **Rate** - Requests per second
- **Errors** - Percentage of requests that fail
- **Duration** - Latency (response time), usually tracked as percentiles
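The three RED signals can be computed from raw request records as in the minimal sketch below. The `Request` type and `red_summary` function are hypothetical names, errors are assumed to mean HTTP 5xx responses, and duration is summarized as a nearest-rank 95th percentile. In a real deployment these values would normally come from Prometheus queries rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code


def red_summary(requests, window_s):
    """Return Rate, Errors, and Duration for requests seen in one window."""
    total = len(requests)
    if total == 0 or window_s <= 0:
        return {"rate_rps": 0.0, "error_pct": 0.0, "p95_s": 0.0}

    # Errors: assumed here to be server-side failures (5xx).
    errors = sum(1 for r in requests if r.status >= 500)

    # Duration: nearest-rank p95, using integer math for ceil(0.95 * total).
    durations = sorted(r.duration_s for r in requests)
    rank = (95 * total + 99) // 100
    p95 = durations[rank - 1]

    return {
        "rate_rps": total / window_s,          # Rate: requests per second
        "error_pct": 100.0 * errors / total,   # Errors: failed share in %
        "p95_s": p95,                          # Duration: tail latency
    }
```

A percentile is used instead of a mean because a few very slow requests can hide behind a healthy-looking average.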

### USE Method (Resources)
- **Utilization** - Percentage of time the resource is busy
- **Saturation** - Queue length or wait time (work the resource cannot yet serve)
- **Errors** - Count of error events
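A similar sketch for the USE signals of a single resource follows. The function name `use_summary` and its inputs are assumptions for illustration: busy time and queue-length samples would typically come from an exporter such as node_exporter, not from hand-collected values.

```python
def use_summary(busy_s, window_s, queue_samples, error_count):
    """Return Utilization, Saturation, and Errors for one resource."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")

    # Utilization: share of the window during which the resource was busy.
    utilization_pct = 100.0 * busy_s / window_s

    # Saturation: average queue length across the sampled points.
    if queue_samples:
        avg_queue = sum(queue_samples) / len(queue_samples)
    else:
        avg_queue = 0.0

    return {
        "utilization_pct": utilization_pct,
        "avg_queue_length": avg_queue,
        "errors": error_count,  # Errors: raw count of error events
    }
```

For example, a disk that was busy for 45 of 60 seconds with sampled queue lengths of 0, 2, and 4 is at 75% utilization with an average queue length of 2.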

## Prometheus Setup

### Installation (Kubernetes)

```bash
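# Register the community chart repository and refresh the local index
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the stack (Prometheus, Alertmanager, Grafana) with 30-day retention
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d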
```

### Key Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often recording and alerting rules are evaluated

scrape_configs:
  - job_name: 'kuber