observability

Full-stack observability architect for Prometheus, Grafana, OpenTelemetry, distributed tracing (Jaeger/Tempo), SLIs/SLOs, error budgets, and alerting. Use for metrics, dashboards, traces, or reliability engineering.

Marketplace: specweave (anton-abyzov/specweave)
Plugin: sw-infra (development)
Repository: anton-abyzov/specweave (31 stars)
Path: plugins/specweave-infrastructure/skills/observability/SKILL.md
Last Verified: February 4, 2026

Install Skill

npx add-skill https://github.com/anton-abyzov/specweave/blob/main/plugins/specweave-infrastructure/skills/observability/SKILL.md -a claude-code --skill observability

Installation paths:

Claude: .claude/skills/observability/

Instructions

# Observability Engineer - Full-Stack Monitoring Expert

## ⚠️ Chunking Rule

A full monitoring stack (Prometheus + Grafana + OpenTelemetry + logs) can exceed 1000 lines of configuration. Generate ONE component per response, in this order: Metrics → Dashboards → Alerting → Tracing → Logs.

## Purpose

Design and implement comprehensive observability systems covering metrics, logs, traces, and reliability engineering.

## When to Use

- Set up Prometheus monitoring
- Create Grafana dashboards
- Implement distributed tracing (Jaeger, Tempo)
- Define SLIs/SLOs and error budgets
- Configure alerting systems
- Prevent alert fatigue
- Debug microservices latency
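
For the SLI/SLO and error-budget item above, the underlying arithmetic can be sketched in a few lines. This is a minimal illustration, not part of the skill itself: the function name `error_budget`, the 99.9% target, and the request counts are all assumptions chosen for the example.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed failures, remaining failures, fraction of budget consumed).

    slo_target is the availability objective as a fraction, e.g. 0.999.
    All inputs are hypothetical; real values would come from your SLI metrics.
    """
    # The budget is the share of requests the SLO permits to fail.
    allowed = (1.0 - slo_target) * total_requests
    remaining = allowed - failed_requests
    consumed = failed_requests / allowed if allowed > 0 else float("inf")
    return allowed, remaining, consumed


# Assumed example: 99.9% SLO, one million requests in the window, 250 failures.
allowed, remaining, consumed = error_budget(0.999, 1_000_000, 250)
print(f"allowed={allowed:.0f} remaining={remaining:.0f} consumed={consumed:.0%}")
```

With these assumed numbers the budget is about 1000 failed requests, of which roughly a quarter has been spent. A consumed fraction above 1.0 means the SLO has been breached for the window.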

## Core Concepts

### Three Pillars of Observability

```
┌─────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY                             │
├─────────────────┬─────────────────┬─────────────────────────┤
│    METRICS      │     LOGS        │        TRACES           │
├─────────────────┼─────────────────┼─────────────────────────┤
│ Prometheus      │ Loki/ELK        │ Jaeger/Tempo            │
│ What happened?  │ Why it happened?│ How do requests flow?   │
│ Aggregated data │ Event details   │ Request journey         │
└─────────────────┴─────────────────┴─────────────────────────┘
```

### RED Method (Services)
- **Rate** - Requests per second
- **Errors** - Error rate percentage
- **Duration** - Latency/response time
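
The three RED signals can be derived from raw request records, as the sketch below illustrates. It is a simplified stand-in for what a metrics backend computes: the `Request` record, the `red_summary` helper, the choice of HTTP 5xx as "error", and the nearest-rank percentile are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # observed latency in seconds
    status: int        # HTTP status code


def red_summary(requests, window_s, quantile=0.95):
    """Return (requests per second, error percentage, latency at `quantile`)."""
    if not requests:
        return 0.0, 0.0, 0.0
    # Rate: request count divided by the observation window.
    rate = len(requests) / window_s
    # Errors: share of requests that failed (assumed here to mean HTTP 5xx).
    errors = sum(1 for r in requests if r.status >= 500)
    error_pct = 100.0 * errors / len(requests)
    # Duration: nearest-rank percentile over the sorted latencies.
    durations = sorted(r.duration_s for r in requests)
    rank = max(1, math.ceil(quantile * len(durations)))
    return rate, error_pct, durations[rank - 1]


# Hypothetical sample: four requests observed over a 2-second window.
sample = [Request(0.1, 200), Request(0.2, 200), Request(0.4, 500), Request(0.3, 200)]
rate, error_pct, p95 = red_summary(sample, window_s=2.0)
print(rate, error_pct, p95)
```

In practice Prometheus would compute these from counters and histograms rather than from individual records, but the definitions are the same.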

### USE Method (Resources)
- **Utilization** - Percentage of time the resource is busy
- **Saturation** - Queue length or wait time (work the resource cannot service yet)
- **Errors** - Error count
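
A comparable sketch for the USE signals follows, assuming you already have a busy-time measurement, a series of queue-length samples, and an error counter for one resource. The helper name `use_summary` and all input values are hypothetical.

```python
def use_summary(busy_s, window_s, queue_lengths, error_events):
    """Summarize one resource over an observation window using USE."""
    # Utilization: share of the window the resource spent doing work.
    utilization_pct = 100.0 * busy_s / window_s
    # Saturation: average queued work across the samples (0 if none taken).
    saturation = sum(queue_lengths) / len(queue_lengths) if queue_lengths else 0.0
    return {
        "utilization_pct": utilization_pct,
        "saturation_avg_queue": saturation,
        "errors": error_events,
    }


# Assumed example: a disk busy for 45 s of a 60 s window, four queue samples.
disk = use_summary(busy_s=45.0, window_s=60.0, queue_lengths=[0, 2, 4, 2], error_events=3)
print(disk)
```

High utilization with a near-zero queue usually means the resource is busy but keeping up. A growing queue indicates saturation even when utilization looks moderate.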

## Prometheus Setup

### Installation (Kubernetes)

```bash
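# Register the community chart repository and refresh the local index
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the kube-prometheus-stack chart with 30-day metric retention
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d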
```

### Key Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kuber
```

Validation Details

Instruction Length: 5620 chars