Evaluate LLM outputs with multi-dimensional rubrics, handle non-determinism, and implement LLM-as-judge patterns. Essential for production LLM systems. Use when testing prompts, validating outputs, comparing models, or when user mentions 'evaluation', 'testing LLM', 'rubric', 'LLM-as-judge', 'output quality', 'prompt testing', or 'model comparison'.
View on GitHub: greyhaven-ai/claude-code-config
core
January 21, 2026
npx add-skill https://github.com/greyhaven-ai/claude-code-config/blob/main/grey-haven-plugins/core/skills/evaluation/SKILL.md -a claude-code --skill grey-haven-evaluation
Installation paths:
.claude/skills/grey-haven-evaluation/

# Evaluation Skill

Evaluate LLM outputs systematically with rubrics, handle non-determinism, and implement LLM-as-judge patterns.

## Core Insight: The 95% Variance Finding

Research shows **95% of output variance** comes from just two sources:

- **80%** from prompt tokens (wording, structure, examples)
- **15%** from random seed/sampling

Temperature, model version, and other factors account for only 5%.

**Implication**: Focus evaluation on prompt quality, not model tweaking.

## What's Included

### Examples (`examples/`)

- **Prompt comparison** - A/B testing prompts with rubrics
- **Model evaluation** - Comparing outputs across models
- **Regression testing** - Detecting output degradation

### Reference Guides (`reference/`)

- **Rubric design** - Multi-dimensional evaluation criteria
- **LLM-as-judge** - Using LLMs to evaluate LLM outputs
- **Statistical methods** - Handling non-determinism

### Templates (`templates/`)

- **Rubric templates** - Ready-to-use evaluation criteria
- **Judge prompts** - LLM-as-judge prompt templates
- **Test case format** - Structured test case templates

### Checklists (`checklists/`)

- **Evaluation setup** - Before running evaluations
- **Rubric validation** - Ensuring rubric quality

## Key Concepts

### 1. Multi-Dimensional Rubrics

Don't use single scores. Break down evaluation into dimensions:

| Dimension | Weight | Criteria |
|-----------|--------|----------|
| Accuracy | 30% | Factually correct, no hallucinations |
| Completeness | 25% | Addresses all requirements |
| Clarity | 20% | Well-organized, easy to understand |
| Conciseness | 15% | No unnecessary content |
| Format | 10% | Follows specified structure |

### 2. Handling Non-Determinism

LLMs are non-deterministic.
Handle with:

```
Strategy 1: Multiple Runs
- Run same prompt 3-5 times
- Report mean and variance
- Flag high-variance cases

Strategy 2: Seed Control
- Set temperature=0 for reproducibility
- Document seed for debugging
- Accept that some variation is normal
```
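
The rubric weights and the "Multiple Runs" strategy can be combined in a few lines of Python. The following is a minimal sketch, not the skill's own implementation: the function names `weighted_score` and `summarize_runs`, the 0-10 per-dimension scale, and the `variance_threshold` default are all assumptions made for illustration. Only the weights come from the rubric table.

```python
from statistics import mean, variance

# Weights taken from the rubric table, expressed as fractions that sum to 1.0.
RUBRIC_WEIGHTS = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "clarity": 0.20,
    "conciseness": 0.15,
    "format": 0.10,
}


def weighted_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores into one weighted score.

    Assumes each dimension is scored on the same scale (0-10 here,
    which is an illustrative choice, not something the rubric specifies).
    """
    missing = RUBRIC_WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    return sum(RUBRIC_WEIGHTS[d] * dimension_scores[d] for d in RUBRIC_WEIGHTS)


def summarize_runs(
    run_scores: list[dict[str, float]],
    variance_threshold: float = 1.0,  # hypothetical cutoff; tune per project
) -> dict:
    """Aggregate repeated runs of one prompt: mean, variance, high-variance flag."""
    totals = [weighted_score(scores) for scores in run_scores]
    if len(totals) < 2:
        raise ValueError("need at least two runs to estimate variance")
    var = variance(totals)  # sample variance across runs
    return {
        "runs": len(totals),
        "mean": mean(totals),
        "variance": var,
        "high_variance": var > variance_threshold,
    }


if __name__ == "__main__":
    # Three hypothetical judged runs of the same prompt.
    runs = [
        {"accuracy": 9, "completeness": 8, "clarity": 8, "conciseness": 7, "format": 10},
        {"accuracy": 8, "completeness": 8, "clarity": 7, "conciseness": 7, "format": 10},
        {"accuracy": 9, "completeness": 7, "clarity": 8, "conciseness": 6, "format": 9},
    ]
    print(summarize_runs(runs))
```

Keeping the per-dimension scores alongside the aggregate makes it easier to see which dimension drives a high-variance flag, in line with the advice above against collapsing evaluation into a single score.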