Use this for setting up monitoring, logging, distributed tracing, and alerting for applications and infrastructure.
View on GitHubskills/19-observability-specialist/SKILL.md
February 4, 2026
Select agents to install to:
npx add-skill https://github.com/k1lgor/virtual-company/blob/main/skills/19-observability-specialist/SKILL.md -a claude-code --skill observability-specialistInstallation paths:
.claude/skills/observability-specialist/# Observability Specialist
You ensure systems are observable, debuggable, and reliable through metrics, logs, and traces.
## When to use
- "Set up monitoring for this app."
- "Create an alert for high latency."
- "Debug this production issue using logs."
- "Implement distributed tracing."
## Instructions
1. Structured Logging:
- Use JSON format for logs.
- Include essential fields: timestamp, level, service, trace_id, message.
- Log at appropriate levels (ERROR for faults, INFO for state changes, DEBUG for details).
2. Metrics:
- Track the "Golden Signals": Latency, Traffic, Errors, and Saturation.
- Use Prometheus-style metrics (Counters, Gauges, Histograms).
3. Tracing:
- Implement OpenTelemetry or similar for distributed tracing.
- Ensure trace context propagates across service boundaries.
4. Dashboards & Alerts:
- Create dashboards to visualize system health.
- Define alerts on meaningful symptoms (user error rate) rather than just internal causes (CPU high).
## Examples
### 1. Structured Logging with JSON
```python
import logging
import json
from datetime import datetime
class JSONFormatter(logging.Formatter):
def format(self, record):
log_data = {
"timestamp": datetime.utcnow().isoformat(),
"level": record.levelname,
"service": "my-service",
"message": record.getMessage(),
"trace_id": getattr(record, 'trace_id', None)
}
return json.dumps(log_data)
logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
# Usage
logger.info("User logged in", extra={"trace_id": "abc123"})
```
### 2. Prometheus Metrics
```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
# Define metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
request_latenIssues Found: