Detect quiet failures in LLM agents - tool skipping, gibberish outputs, infinite loops, and degraded quality. Use when agents appear to work but produce incorrect results.
Install with `npx add-skill https://github.com/yonatangross/orchestkit/blob/main/plugins/ork-ai-observability/skills/silent-failure-detection/SKILL.md -a claude-code --skill silent-failure-detection`.

Installation path: `.claude/skills/silent-failure-detection/`

# Silent Failure Detection
Detect when LLM agents fail silently - appearing to work while producing incorrect results.
## Overview

This skill covers:
- Detecting when agents skip expected tool calls
- Identifying gibberish or degraded output quality
- Monitoring for infinite loops and token consumption spikes
- Setting up statistical baselines for anomaly detection (a minimal sketch follows this list)
- Alerting on non-error failures (service up but logic broken)
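For the token-consumption and baseline checks above, a minimal sketch might look like the following. The `TokenBaseline` class, the 200-run window, and the z-score threshold are illustrative assumptions, not part of this skill's API; the only requirement is that you can read a per-run token count from your tracing setup.

```python
import statistics
from collections import deque


class TokenBaseline:
    """Rolling statistical baseline for per-run token usage (illustrative sketch)."""

    def __init__(self, window: int = 200, z_threshold: float = 3.0):
        self.history: deque[int] = deque(maxlen=window)  # recent per-run token counts
        self.z_threshold = z_threshold

    def check(self, tokens_used: int) -> dict:
        """Flag runs whose token usage spikes far above the rolling baseline."""
        result = {"alert": False}
        if len(self.history) >= 30:  # wait for enough samples before alerting
            mean = statistics.mean(self.history)
            stdev = statistics.pstdev(self.history) or 1.0  # avoid division by zero
            z = (tokens_used - mean) / stdev
            if z > self.z_threshold:
                result = {
                    "alert": True,
                    "type": "token_spike",
                    "z_score": round(z, 2),
                    "message": f"Token usage {tokens_used} is {z:.1f} std devs above baseline",
                }
        self.history.append(tokens_used)
        return result
```

A sustained run of spikes from the same agent is a common signature of an infinite tool loop, so the same baseline can back both the loop and token-consumption checks.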
## Quick Reference
### Tool Skipping Detection
```python
from langfuse import Langfuse


def check_tool_usage(trace_id: str, expected_tools: list[str]) -> dict:
    """
    Detect when agent skips expected tool calls.

    Based on Akamai's middleware bug: agents stopped using tools
    when hidden middleware injected unexpected instructions.
    """
    langfuse = Langfuse()
    trace = langfuse.fetch_trace(trace_id)

    # Extract tool calls from trace
    actual_tools = [
        span.name for span in trace.observations
        if span.type == "tool"
    ]

    missing_tools = set(expected_tools) - set(actual_tools)
    if missing_tools:
        return {
            "alert": True,
            "type": "tool_skipping",
            "missing": list(missing_tools),
            "message": f"Agent skipped expected tools: {missing_tools}"
        }
    return {"alert": False}
```
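As a usage sketch, the check can run right after an agent finishes a task; the trace ID and the expected-tool list below are placeholders that depend on your own tracing setup and agent definition.

```python
# Hypothetical post-run hook: replace the trace ID and tool names with
# values from your own tracing setup and agent task definition.
result = check_tool_usage(
    trace_id="your-trace-id",
    expected_tools=["web_search", "calculator"],
)
if result["alert"]:
    print(result["message"])  # route this to your alerting channel instead of stdout
```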
### Gibberish/Quality Detection
```python
from langfuse.decorators import observe, langfuse_context


@observe(name="quality_check")
async def detect_gibberish(response: str) -> dict:
    """
    Detect low-quality or gibberish outputs using LLM-as-judge.
    """
    # Quick heuristics first
    if len(response) < 10:
        return {"alert": True, "type": "too_short"}
    words = response.split()
    if words and len(set(words)) / len(words) < 0.3:  # low unique-word ratio = repetition
        return {"alert": True, "type": "repetitive"}

    # LLM-as-judge for quality
    judge_prompt = f"""
    Rate this response quality (0-1):
    - 0: Gibberish, nonsensical, or completely wrong
    - 0.5: Partially correct but missing key information