Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
# LLM Evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
## When to Use This Skill
- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior
## Core Evaluation Types
### 1. Automated Metrics
Fast, repeatable, scalable evaluation using computed scores.
**Text Generation:**
- **BLEU**: N-gram overlap (translation)
- **ROUGE**: Recall-oriented (summarization)
- **METEOR**: Unigram matching with stemming and synonyms (translation)
- **BERTScore**: Embedding-based similarity
- **Perplexity**: How well the model predicts the text (lower is better)
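A minimal sketch of computing BLEU and ROUGE with the third-party `sacrebleu` and `rouge-score` packages (one option among several; BERTScore and perplexity need their own tooling):
```python
# pip install sacrebleu rouge-score  (third-party packages; shown as one possible option)
import sacrebleu
from rouge_score import rouge_scorer

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Corpus-level BLEU: sacrebleu takes the hypotheses and a list of reference sets
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-1 / ROUGE-L with stemming; score(target, prediction) returns P/R/F per type
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(references[0], predictions[0])
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```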
**Classification:**
- **Accuracy**: Percentage correct
- **Precision/Recall/F1**: Class-specific performance
- **Confusion Matrix**: Error patterns
- **AUC-ROC**: Ranking quality
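All of these are available in scikit-learn; a quick sketch, assuming binary labels and predicted positive-class probabilities:
```python
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support,
    confusion_matrix, roc_auc_score,
)

y_true = [1, 0, 1, 1, 0, 1]               # gold labels
y_pred = [1, 0, 1, 0, 0, 1]               # model's hard predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7]   # model's positive-class probabilities

print("Accuracy:", accuracy_score(y_true, y_pred))
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("AUC-ROC:", roc_auc_score(y_true, y_prob))
```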
**Retrieval (RAG):**
- **MRR**: Mean Reciprocal Rank
- **NDCG**: Normalized Discounted Cumulative Gain
- **Precision@K**: Fraction of the top K results that are relevant
- **Recall@K**: Fraction of all relevant items retrieved in the top K
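Precision@K, Recall@K, and MRR are simple enough to compute directly; a small sketch over ranked document IDs:
```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / max(len(relevant_ids), 1)

def mrr(queries):
    """Mean Reciprocal Rank over (ranked_ids, relevant_ids) pairs."""
    total = 0.0
    for ranked_ids, relevant_ids in queries:
        for rank, doc in enumerate(ranked_ids, start=1):
            if doc in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(queries)

# Example: one query where the first relevant document appears at rank 2
queries = [(["d3", "d1", "d7"], {"d1", "d9"})]
print(precision_at_k(*queries[0], k=3))  # 0.333...
print(recall_at_k(*queries[0], k=3))     # 0.5
print(mrr(queries))                      # 0.5
```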
### 2. Human Evaluation
Manual assessment for quality aspects difficult to automate.
**Dimensions:**
- **Accuracy**: Factual correctness
- **Coherence**: Logical flow
- **Relevance**: Answers the question
- **Fluency**: Natural language quality
- **Safety**: No harmful content
- **Helpfulness**: Useful to the user
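A lightweight way to structure this is a per-dimension rubric (e.g. 1-5 Likert scores) aggregated across annotators. A minimal sketch, where the dimension names mirror the list above and the rating records are illustrative:
```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = ["accuracy", "coherence", "relevance", "fluency", "safety", "helpfulness"]

# Each record: one annotator's 1-5 ratings for one model response
annotations = [
    {"response_id": "r1", "annotator": "a1", "accuracy": 4, "coherence": 5,
     "relevance": 4, "fluency": 5, "safety": 5, "helpfulness": 4},
    {"response_id": "r1", "annotator": "a2", "accuracy": 3, "coherence": 4,
     "relevance": 4, "fluency": 5, "safety": 5, "helpfulness": 3},
]

def aggregate(annotations):
    """Average each dimension across annotators, per response."""
    scores = defaultdict(lambda: defaultdict(list))
    for record in annotations:
        for dim in DIMENSIONS:
            scores[record["response_id"]][dim].append(record[dim])
    return {
        rid: {dim: mean(vals) for dim, vals in dims.items()}
        for rid, dims in scores.items()
    }

print(aggregate(annotations))  # {'r1': {'accuracy': 3.5, 'coherence': 4.5, ...}}
```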
### 3. LLM-as-Judge
Use a strong LLM to score or compare another model's outputs.
**Approaches:**
- **Pointwise**: Score individual responses
- **Pairwise**: Compare two responses
- **Reference-based**: Compare to gold standard
- **Reference-free**: Judge without ground truth
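A sketch of a pairwise judge, assuming a generic `call_judge` wrapper around whatever chat-completion client you use (the prompt and verdict parsing here are illustrative, not a fixed API):
```python
import random

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
decide which answer is better overall (accuracy, relevance, helpfulness).
Respond with exactly "A", "B", or "TIE".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def call_judge(prompt: str) -> str:
    """Hypothetical wrapper around your LLM client; replace with a real call."""
    raise NotImplementedError

def pairwise_judge(question, response_1, response_2):
    """Compare two responses, randomizing order to reduce position bias."""
    swap = random.random() < 0.5
    a, b = (response_2, response_1) if swap else (response_1, response_2)
    prompt = JUDGE_PROMPT.format(question=question, answer_a=a, answer_b=b)
    verdict = call_judge(prompt).strip().upper()
    if verdict == "TIE":
        return "tie"
    if verdict not in {"A", "B"}:
        return "invalid"
    winner_is_first = (verdict == "A") != swap  # undo the order swap
    return "response_1" if winner_is_first else "response_2"
```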
## Quick Start
```python
from llm_eval import EvaluationSuite, Metric

# Define evaluation suite
suite = EvaluationSuite([
    Metric.accuracy(),
])
```