Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:

- Implementing self-critique and reflection loops
- Building evaluator-optimizer pipelines for quality-critical generation
- Creating test-driven code refinement workflows
- Designing rubric-based or LLM-as-judge evaluation systems
- Adding iterative improvement to agent outputs (code, reports, analysis)
- Measuring and improving agent response quality
February 1, 2026
# Agentic Evaluation Patterns
Patterns for self-improvement through iterative evaluation and refinement.
## Overview
Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.
```
Generate → Evaluate → Critique → Refine → Output
   ↑                                │
   └────────────────────────────────┘
```
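The loop in the diagram can be sketched generically. This is a sketch only; `generate`, `evaluate`, `critique`, and `refine` are placeholder callables you would supply, not functions from any library:

```python
def refinement_loop(task, generate, evaluate, critique, refine, max_rounds=3):
    """Generic Generate → Evaluate → Critique → Refine loop from the diagram.

    `generate(task)` produces a first draft; `evaluate(output)` returns True
    when the draft is good enough; `critique(output)` explains what to fix;
    `refine(output, feedback)` produces the next draft.
    """
    output = generate(task)
    for _ in range(max_rounds):
        if evaluate(output):
            break  # good enough: exit the loop early
        output = refine(output, critique(output))
    return output
```

Capping the loop with `max_rounds` matters: without a clear stopping criterion, reflection loops can oscillate or burn tokens without converging.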
## When to Use
- **Quality-critical generation**: Code, reports, analysis requiring high accuracy
- **Tasks with clear evaluation criteria**: Defined success metrics exist
- **Content requiring specific standards**: Style guides, compliance, formatting
---
## Pattern 1: Basic Reflection
Agent evaluates and improves its own output through self-critique.
```python
import json

def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with a reflection loop. `llm` is a placeholder for your model call."""
    output = llm(f"Complete this task:\n{task}")
    for i in range(max_iterations):
        # Self-critique against the given criteria
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON.
        """)
        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output
        # Refine, addressing only the failing criteria
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")
    return output
```
**Key insight**: Use structured JSON output for reliable parsing of critique results.
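In practice, models often wrap the requested JSON in prose or Markdown code fences, so a tolerant parser makes the critique step more reliable. A minimal sketch; the `parse_critique` helper is hypothetical, not part of any library:

```python
import json
import re

def parse_critique(raw: str) -> dict:
    """Extract the first JSON object from an LLM response.

    Handles responses wrapped in ```json fences or surrounding prose.
    Returns {} if no valid JSON object is found.
    """
    # Prefer a fenced ```json ... ``` block if one is present
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    if fenced:
        candidate = fenced.group(1)
    else:
        # Otherwise fall back to the widest {...} span in the text
        brace = re.search(r"\{.*\}", raw, re.DOTALL)
        candidate = brace.group(0) if brace else raw
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return {}
```

Returning an empty dict on failure lets the caller treat an unparseable critique as "no verdict" and retry, rather than crashing the loop on `json.loads`.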
---
## Pattern 2: Evaluator-Optimizer
Separate generation and evaluation into distinct components for clearer responsibilities.
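One minimal end-to-end sketch of the loop, written function-style so it can run without a live model: `generate_fn` and `evaluate_fn` are hypothetical injected callables standing in for the generator and evaluator components.

```python
from typing import Callable

def optimize(
    task: str,
    generate_fn: Callable[[str, str], str],           # (task, feedback) -> draft
    evaluate_fn: Callable[[str], tuple[float, str]],  # draft -> (score, feedback)
    score_threshold: float = 0.8,
    max_iterations: int = 3,
) -> str:
    """Generate a draft, score it, and feed the critique back until it passes."""
    feedback = ""
    draft = ""
    for _ in range(max_iterations):
        draft = generate_fn(task, feedback)
        score, feedback = evaluate_fn(draft)
        if score >= score_threshold:
            break  # evaluator is satisfied; stop iterating
    return draft
```

Injecting the two callables keeps generation and evaluation decoupled, which is the point of the pattern: either side can be swapped (different model, rubric, or test harness) without touching the loop.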
```python
class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold

    def generate(self, task: str) -> str:
        ...
```