
eval-design (verified)

This skill should be used when the user asks "what evals can we create", "how do I evaluate this", "design an eval", "create evals for", "how do I know if my LLM is working", "measure quality", or mentions evals, evaluation, scoring rubrics, golden datasets, LLM-as-judge, quality metrics, or judge prompts.

Marketplace: ben-claude-plugins (tavva/ben-claude-plugins)
Plugin: eval-designer (development)
Repository: tavva/ben-claude-plugins
Path: plugins/eval-designer/skills/eval-design/SKILL.md
Last Verified: January 21, 2026

Install (via the add-skill CLI):

npx add-skill https://github.com/tavva/ben-claude-plugins/blob/main/plugins/eval-designer/skills/eval-design/SKILL.md -a claude-code --skill eval-design

Installation path (Claude): .claude/skills/eval-design/

Instructions

# Eval Design

Guide users through designing production-quality LLM evaluations. Output: structured spec a coding agent can implement with Langfuse.
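
The exact shape of that spec is up to the user and their codebase; as a minimal sketch, assuming a dict-style spec with illustrative field names (none of these are a fixed schema), it might look like:

```python
# Illustrative shape for an eval spec a coding agent could implement
# against Langfuse. All field names and values are assumptions, not a
# schema defined by this skill.
eval_spec = {
    "name": "summary-faithfulness",
    "target": "generation",             # which Langfuse object gets scored
    "dataset": "golden-summaries-v1",   # golden dataset to run against
    "method": "llm-as-judge",           # or "heuristic" / "human-annotation"
    "rubric": "Every claim in the summary must be supported by the source.",
    "score": {"name": "faithfulness", "scale": [0, 1]},  # written back to Langfuse
    "alert_threshold": 0.9,             # flag runs where pass rate drops below this
}
```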

**Announce at start:** "I'm using the eval-design skill to help design your evaluation."

## When to Use

```dot
digraph when_to_use {
  "User question" [shape=box];
  "About measuring LLM output quality?" [shape=diamond];
  "Use this skill" [shape=box, style=filled, fillcolor=lightgreen];
  "Not this skill" [shape=box];

  "User question" -> "About measuring LLM output quality?";
  "About measuring LLM output quality?" -> "Use this skill" [label="yes"];
  "About measuring LLM output quality?" -> "Not this skill" [label="no"];
}
```

**Use for:**
- Designing new evaluations
- Choosing between eval approaches
- Creating scoring rubrics or judge prompts
- Building golden datasets (a sketch of both follows below)
- Questions like "how do I know if my LLM is working?"

**Not for:**
- Langfuse setup, tracing, or observability (use langfuse-cli skill)
- Implementing evals (output spec to coding agent)
- Generic testing that isn't LLM-specific
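
For the rubric/judge-prompt and golden-dataset cases above, a minimal sketch of what the skill's output might contain follows. The prompt wording, score scale, and dataset fields are illustrative assumptions, not prescribed by this skill:

```python
# Hypothetical LLM-as-judge prompt plus two golden-dataset entries.
# Everything here is illustrative; adapt the criteria to your own system.
JUDGE_PROMPT = """You are grading an LLM-generated summary.

Source document:
{source}

Summary to grade:
{summary}

Score the summary for faithfulness:
- 1: every claim in the summary is supported by the source
- 0: any claim is unsupported or contradicted

Reply with one JSON object: {{"score": 0 or 1, "reason": "<one sentence>"}}"""

# Golden dataset: curated inputs with expected properties, re-run on every
# model or prompt change so scores stay comparable over time.
golden_dataset = [
    {
        "input": {"source": "Q3 revenue grew 12% year over year, driven by..."},
        "expected": {"must_mention": ["12%", "Q3"], "must_not_invent_figures": True},
    },
    {
        "input": {"source": "The incident was resolved in 43 minutes after..."},
        "expected": {"must_mention": ["43 minutes"], "must_not_invent_figures": True},
    },
]
```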

## The Process

```dot
digraph eval_design_flow {
  rankdir=TB;

  understand [label="Understand\nthe System" shape=box];
  failures [label="Identify\nFailure Modes" shape=box];
  match [label="Match Eval Type\nto Problem" shape=box];
  design [label="Design\nthe Eval" shape=box];
  output [label="Output\nSpec" shape=box];

  understand -> failures -> match -> design -> output;
}
```

Ask questions ONE AT A TIME. Adapt depth based on user's experience level.

## Phase 1: Understand the System

Ask about:
- What does the LLM application do? (summarisation, Q&A, agent, chat)
- What's already instrumented in Langfuse? (traces, spans, generations)
- What does a "good" output look like for users?
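
If the user is unsure what is already instrumented, pulling a few recent traces answers the question quickly. This sketch assumes the v2-style Langfuse Python SDK (`fetch_traces`) and the usual `LANGFUSE_*` environment variables; newer SDK versions may expose a different API, so check what is installed:

```python
# Sketch: list recent traces to see what is already instrumented.
# Assumes the v2-style Langfuse Python SDK and LANGFUSE_PUBLIC_KEY /
# LANGFUSE_SECRET_KEY / LANGFUSE_HOST set in the environment.
from langfuse import Langfuse

langfuse = Langfuse()
traces = langfuse.fetch_traces(limit=20).data

for trace in traces:
    # Trace names plus populated input/output show which parts of the
    # system an eval could attach to without extra instrumentation work.
    print(trace.name, bool(trace.input), bool(trace.output))
```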

## Phase 2: Identify Failure Modes

Key questions:
- What failures have you seen in production?
- Have you done error analysis on real traces?
- Which failures have the highest business impact?

**Critical:** Ground eval design in failures observed in real traces, not hypothetical ones.
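
Error analysis does not need tooling to start; counting hand-labelled failures across a sample of traces is enough to pick the first evals. A minimal sketch, where the failure-mode labels and data shape are assumptions:

```python
# Lightweight error analysis: tally hand-labelled failure modes from a
# sample of reviewed traces. Labels and structure are illustrative.
from collections import Counter

# Each entry: one trace reviewed by hand and the failure modes observed.
reviewed = [
    {"trace_id": "a1", "failures": ["hallucinated_fact"]},
    {"trace_id": "b2", "failures": []},
    {"trace_id": "c3", "failures": ["missed_key_point", "too_verbose"]},
    {"trace_id": "d4", "failures": ["hallucinated_fact"]},
]

counts = Counter(f for item in reviewed for f in item["failures"])

# The most frequent (or highest-impact) failure modes become the first evals.
for failure, count in counts.most_common():
    print(f"{failure}: {count}/{len(reviewed)} traces")
```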

Validation Details

Checks: Front Matter, Required Fields, Valid Name Format, Valid Description, Has Sections, Allowed Tools
Instruction Length: 5978 chars