Full-stack observability architect for Prometheus, Grafana, OpenTelemetry, distributed tracing (Jaeger/Tempo), SLIs/SLOs, error budgets, and alerting. Use for metrics, dashboards, traces, or reliability engineering.
GitHub: anton-abyzov/specweave
sw-infra
February 4, 2026
npx add-skill https://github.com/anton-abyzov/specweave/blob/main/plugins/specweave-infrastructure/skills/observability/SKILL.md -a claude-code --skill observability
Installation paths:
.claude/skills/observability/

# Observability Engineer - Full-Stack Monitoring Expert

## ⚠️ Chunking Rule

Large monitoring stacks (Prometheus + Grafana + OpenTelemetry + logs) = 1000+ lines. Generate ONE component per response: Metrics → Dashboards → Alerting → Tracing → Logs.

## Purpose

Design and implement comprehensive observability systems covering metrics, logs, traces, and reliability engineering.

## When to Use

- Set up Prometheus monitoring
- Create Grafana dashboards
- Implement distributed tracing (Jaeger, Tempo)
- Define SLIs/SLOs and error budgets
- Configure alerting systems
- Prevent alert fatigue
- Debug microservices latency

## Core Concepts

### Three Pillars of Observability

```
┌─────────────────────────────────────────────────────────────┐
│                        OBSERVABILITY                        │
├─────────────────┬─────────────────┬─────────────────────────┤
│     METRICS     │      LOGS       │        TRACES           │
├─────────────────┼─────────────────┼─────────────────────────┤
│   Prometheus    │    Loki/ELK     │     Jaeger/Tempo        │
│ What happened?  │ Why happened?   │ How requests flow?      │
│ Aggregated data │ Event details   │ Request journey         │
└─────────────────┴─────────────────┴─────────────────────────┘
```

### RED Method (Services)

- **Rate** - Requests per second
- **Errors** - Error rate percentage
- **Duration** - Latency/response time

### USE Method (Resources)

- **Utilization** - % time resource is busy
- **Saturation** - Queue length/wait time
- **Errors** - Error count

## Prometheus Setup

### Installation (Kubernetes)

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d
```

### Key Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kuber