This skill should be used when the user asks "what evals can we create", "how do I evaluate this", "design an eval", "create evals for", "how do I know if my LLM is working", "measure quality", or mentions evals, evaluation, scoring rubrics, golden datasets, LLM-as-judge, quality metrics, or judge prompts.
# Eval Design
Guide users through designing production-quality LLM evaluations. Output: structured spec a coding agent can implement with Langfuse.
**Announce at start:** "I'm using the eval-design skill to help design your evaluation."
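The skill does not prescribe an exact spec format. As a purely illustrative sketch (all field names below are hypothetical, not mandated by the skill), the output could be captured as a small Python structure that a coding agent later wires up to Langfuse scores:

```python
# Hypothetical illustration of a structured eval spec; the skill does not
# mandate these fields or this format.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalSpec:
    name: str                            # e.g. "answer_faithfulness"
    failure_mode: str                    # the production failure this eval targets
    eval_type: str                       # "llm_judge", "heuristic", or "golden_dataset"
    score_definition: str                # what each score value means
    data_source: str                     # which Langfuse traces/spans to score
    judge_prompt: Optional[str] = None   # only for LLM-as-judge evals
    golden_examples: list[dict] = field(default_factory=list)
```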
## When to Use
```dot
digraph when_to_use {
    "User question" [shape=box];
    "About measuring LLM output quality?" [shape=diamond];
    "Use this skill" [shape=box, style=filled, fillcolor=lightgreen];
    "Not this skill" [shape=box];

    "User question" -> "About measuring LLM output quality?";
    "About measuring LLM output quality?" -> "Use this skill" [label="yes"];
    "About measuring LLM output quality?" -> "Not this skill" [label="no"];
}
```
**Use for:**
- Designing new evaluations
- Choosing between eval approaches
- Creating scoring rubrics or judge prompts
- Building golden datasets
- Questions like "how do I know if my LLM is working?"
**Not for:**
- Langfuse setup, tracing, or observability (use langfuse-cli skill)
- Implementing evals (output spec to coding agent)
- Generic testing that isn't LLM-specific
## The Process
```dot
digraph eval_design_flow {
    rankdir=TB;

    understand [label="Understand\nthe System" shape=box];
    failures [label="Identify\nFailure Modes" shape=box];
    match [label="Match Eval Type\nto Problem" shape=box];
    design [label="Design\nthe Eval" shape=box];
    output [label="Output\nSpec" shape=box];

    understand -> failures -> match -> design -> output;
}
```
Ask questions ONE AT A TIME. Adapt depth based on user's experience level.
## Phase 1: Understand the System
Ask about:
- What does the LLM application do? (summarisation, Q&A, agent, chat)
- What's already instrumented in Langfuse? (traces, spans, generations; see the sketch after this list)
- What does a "good" output look like for users?
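If the user is unsure what is already instrumented, pulling a handful of recent traces is often faster than guessing. A minimal sketch, assuming the Langfuse Python SDK v2 with credentials in the environment; other SDK versions name these calls differently, so treat the exact methods as an assumption:

```python
# Minimal sketch: inspect what is already instrumented in Langfuse.
# Assumes the v2 Python SDK and LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY /
# LANGFUSE_HOST in the environment; other SDK versions differ.
from langfuse import Langfuse

langfuse = Langfuse()

for trace in langfuse.fetch_traces(limit=10).data:
    detail = langfuse.fetch_trace(trace.id).data
    observations = detail.observations or []
    print(trace.name, trace.id)
    print("  observations:", len(observations))
    print("  types:", sorted({o.type for o in observations}))  # SPAN / GENERATION / EVENT
```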
## Phase 2: Identify Failure Modes
Key questions:
- What failures have you seen in production?
- Have you done error analysis on real traces?
- Which failures have the highest business impact?
**Critical:** Ground eval design in failure modes observed in real production traces, not hypothetical ones.
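Error analysis does not need tooling to get started: hand-label a few dozen real traces and tally the failure modes, so eval effort goes where failures actually cluster. A minimal, SDK-free sketch; the labels below are hypothetical examples, not a prescribed taxonomy:

```python
# Minimal sketch of error analysis: tally hand-applied failure-mode labels
# from real traces so the highest-frequency (or highest-impact) failures
# get evals first. Labels are hypothetical examples.
from collections import Counter

labelled_traces = [
    ("trace-001", "hallucinated citation"),
    ("trace-002", "missed user constraint"),
    ("trace-003", "hallucinated citation"),
    ("trace-004", "tone off-brand"),
    # ... label a few dozen production traces by hand
]

counts = Counter(label for _, label in labelled_traces)
for failure_mode, count in counts.most_common():
    print(f"{count:3d}  {failure_mode}")
```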