Back to Skills

grey-haven-evaluation

verified

Evaluate LLM outputs with multi-dimensional rubrics, handle non-determinism, and implement LLM-as-judge patterns. Essential for production LLM systems. Use when testing prompts, validating outputs, comparing models, or when user mentions 'evaluation', 'testing LLM', 'rubric', 'LLM-as-judge', 'output quality', 'prompt testing', or 'model comparison'.

View on GitHub

Marketplace

grey-haven-plugins

greyhaven-ai/claude-code-config

Plugin

core

Repository

greyhaven-ai/claude-code-config
17stars

grey-haven-plugins/core/skills/evaluation/SKILL.md

Last Verified

January 21, 2026

Install Skill

Select agents to install to:

Scope:
npx add-skill https://github.com/greyhaven-ai/claude-code-config/blob/main/grey-haven-plugins/core/skills/evaluation/SKILL.md -a claude-code --skill grey-haven-evaluation

Installation paths:

Claude
.claude/skills/grey-haven-evaluation/
Powered by add-skill CLI

Instructions

# Evaluation Skill

Evaluate LLM outputs systematically with rubrics, handle non-determinism, and implement LLM-as-judge patterns.

## Core Insight: The 95% Variance Finding

Research shows **95% of output variance** comes from just two sources:
- **80%** from prompt tokens (wording, structure, examples)
- **15%** from random seed/sampling

Temperature, model version, and other factors account for only 5%.

**Implication**: Focus evaluation on prompt quality, not model tweaking.

## What's Included

### Examples (`examples/`)
- **Prompt comparison** - A/B testing prompts with rubrics
- **Model evaluation** - Comparing outputs across models
- **Regression testing** - Detecting output degradation

### Reference Guides (`reference/`)
- **Rubric design** - Multi-dimensional evaluation criteria
- **LLM-as-judge** - Using LLMs to evaluate LLM outputs
- **Statistical methods** - Handling non-determinism

### Templates (`templates/`)
- **Rubric templates** - Ready-to-use evaluation criteria
- **Judge prompts** - LLM-as-judge prompt templates
- **Test case format** - Structured test case templates

### Checklists (`checklists/`)
- **Evaluation setup** - Before running evaluations
- **Rubric validation** - Ensuring rubric quality

## Key Concepts

### 1. Multi-Dimensional Rubrics

Don't use single scores. Break down evaluation into dimensions:

| Dimension | Weight | Criteria |
|-----------|--------|----------|
| Accuracy | 30% | Factually correct, no hallucinations |
| Completeness | 25% | Addresses all requirements |
| Clarity | 20% | Well-organized, easy to understand |
| Conciseness | 15% | No unnecessary content |
| Format | 10% | Follows specified structure |

### 2. Handling Non-Determinism

LLMs are non-deterministic. Handle with:

```
Strategy 1: Multiple Runs
- Run same prompt 3-5 times
- Report mean and variance
- Flag high-variance cases

Strategy 2: Seed Control
- Set temperature=0 for reproducibility
- Document seed for debugging
- Accept some variation is norma

Validation Details

Front Matter
Required Fields
Valid Name Format
Valid Description
Has Sections
Allowed Tools
Instruction Length:
5425 chars

Issues Found:

  • name_directory_mismatch