
eval-design (verified)

This skill should be used when the user asks "what evals can we create", "how do I evaluate this", "design an eval", "create evals for", "how do I know if my LLM is working", "measure quality", or mentions evals, evaluation, scoring rubrics, golden datasets, LLM-as-judge, quality metrics, or judge prompts.

Marketplace: ben-claude-plugins (tavva/ben-claude-plugins)
Plugin: eval-designer (development)
Repository: tavva/ben-claude-plugins
Path: plugins/eval-designer/skills/eval-design/SKILL.md
Last Verified: January 21, 2026

Install (via the add-skill CLI):

npx add-skill https://github.com/tavva/ben-claude-plugins/blob/main/plugins/eval-designer/skills/eval-design/SKILL.md -a claude-code --skill eval-design

Installation path (Claude): .claude/skills/eval-design/

Instructions

# Eval Design

Guide users through designing production-quality LLM evaluations. Output: structured spec a coding agent can implement with Langfuse.
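
The exact shape of that spec is up to the user and their codebase; as a minimal sketch, assuming a dict-style spec with illustrative field names (none of these are a fixed schema), it might look like:

```python
# Illustrative shape for an eval spec a coding agent could implement
# against Langfuse. All field names and values are assumptions, not a
# schema defined by this skill.
eval_spec = {
    "name": "summary-faithfulness",
    "target": "generation",             # which Langfuse object gets scored
    "dataset": "golden-summaries-v1",   # golden dataset to run against
    "method": "llm-as-judge",           # or "heuristic" / "human-annotation"
    "rubric": "Every claim in the summary must be supported by the source.",
    "score": {"name": "faithfulness", "scale": [0, 1]},  # written back to Langfuse
    "alert_threshold": 0.9,             # flag runs where pass rate drops below this
}
```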

**Announce at start:** "I'm using the eval-design skill to help design your evaluation."

## When to Use

```dot
digraph when_to_use {
  "User question" [shape=box];
  "About measuring LLM output quality?" [shape=diamond];
  "Use this skill" [shape=box, style=filled, fillcolor=lightgreen];
  "Not this skill" [shape=box];

  "User question" -> "About measuring LLM output quality?";
  "About measuring LLM output quality?" -> "Use this skill" [label="yes"];
  "About measuring LLM output quality?" -> "Not this skill" [label="no"];
}
```

**Use for:**
- Designing new evaluations
- Choosing between eval approaches
- Creating scoring rubrics or judge prompts
- Building golden datasets (a sketch of both follows below)
- Questions like "how do I know if my LLM is working?"

**Not for:**
- Langfuse setup, tracing, or observability (use langfuse-cli skill)
- Implementing evals (output spec to coding agent)
- Generic testing that isn't LLM-specific
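
For the rubric/judge-prompt and golden-dataset cases above, a minimal sketch of what the skill's output might contain follows. The prompt wording, score scale, and dataset fields are illustrative assumptions, not prescribed by this skill:

```python
# Hypothetical LLM-as-judge prompt plus two golden-dataset entries.
# Everything here is illustrative; adapt the criteria to your own system.
JUDGE_PROMPT = """You are grading an LLM-generated summary.

Source document:
{source}

Summary to grade:
{summary}

Score the summary for faithfulness:
- 1: every claim in the summary is supported by the source
- 0: any claim is unsupported or contradicted

Reply with one JSON object: {{"score": 0 or 1, "reason": "<one sentence>"}}"""

# Golden dataset: curated inputs with expected properties, re-run on every
# model or prompt change so scores stay comparable over time.
golden_dataset = [
    {
        "input": {"source": "Q3 revenue grew 12% year over year, driven by..."},
        "expected": {"must_mention": ["12%", "Q3"], "must_not_invent_figures": True},
    },
    {
        "input": {"source": "The incident was resolved in 43 minutes after..."},
        "expected": {"must_mention": ["43 minutes"], "must_not_invent_figures": True},
    },
]
```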

## The Process

```dot
digraph eval_design_flow {
  rankdir=TB;

  understand [label="Understand\nthe System" shape=box];
  failures [label="Identify\nFailure Modes" shape=box];
  match [label="Match Eval Type\nto Problem" shape=box];
  design [label="Design\nthe Eval" shape=box];
  output [label="Output\nSpec" shape=box];

  understand -> failures -> match -> design -> output;
}
```

Ask questions ONE AT A TIME. Adapt depth based on user's experience level.

## Phase 1: Understand the System

Ask about:
- What does the LLM application do? (summarisation, Q&A, agent, chat)
- What's already instrumented in Langfuse? (traces, spans, generations)
- What does a "good" output look like for users?
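
If the user is unsure what is already instrumented, pulling a few recent traces answers the question quickly. This sketch assumes the v2-style Langfuse Python SDK (`fetch_traces`) and the usual `LANGFUSE_*` environment variables; newer SDK versions may expose a different API, so check what is installed:

```python
# Sketch: list recent traces to see what is already instrumented.
# Assumes the v2-style Langfuse Python SDK and LANGFUSE_PUBLIC_KEY /
# LANGFUSE_SECRET_KEY / LANGFUSE_HOST set in the environment.
from langfuse import Langfuse

langfuse = Langfuse()
traces = langfuse.fetch_traces(limit=20).data

for trace in traces:
    # Trace names plus populated input/output show which parts of the
    # system an eval could attach to without extra instrumentation work.
    print(trace.name, bool(trace.input), bool(trace.output))
```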

## Phase 2: Identify Failure Modes

Key questions:
- What failures have you seen in production?
- Have you done error analysis on real traces?
- Which failures have the highest business impact?

**Critical:** Ground eval design in failures observed in real traces, not hypothetical ones.
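
Error analysis does not need tooling to start; counting hand-labelled failures across a sample of traces is enough to pick the first evals. A minimal sketch, where the failure-mode labels and data shape are assumptions:

```python
# Lightweight error analysis: tally hand-labelled failure modes from a
# sample of reviewed traces. Labels and structure are illustrative.
from collections import Counter

# Each entry: one trace reviewed by hand and the failure modes observed.
reviewed = [
    {"trace_id": "a1", "failures": ["hallucinated_fact"]},
    {"trace_id": "b2", "failures": []},
    {"trace_id": "c3", "failures": ["missed_key_point", "too_verbose"]},
    {"trace_id": "d4", "failures": ["hallucinated_fact"]},
]

counts = Counter(f for item in reviewed for f in item["failures"])

# The most frequent (or highest-impact) failure modes become the first evals.
for failure, count in counts.most_common():
    print(f"{failure}: {count}/{len(reviewed)} traces")
```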

Validation Details

Checks: Front Matter, Required Fields, Valid Name Format, Valid Description, Has Sections, Allowed Tools
Instruction Length: 5978 chars