
evaluating-llms (verified)

Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.


Marketplace: ai-design-components (ancoleman/ai-design-components)

Plugin: backend-ai-skills

Repository: ancoleman/ai-design-components (153 stars)

Skill file: skills/evaluating-llms/SKILL.md

Last Verified: February 1, 2026

Install Skill

npx add-skill https://github.com/ancoleman/ai-design-components/blob/main/skills/evaluating-llms/SKILL.md -a claude-code --skill evaluating-llms

Installation paths:

Claude: .claude/skills/evaluating-llms/

Instructions

# LLM Evaluation

Evaluate Large Language Model (LLM) systems using automated metrics, LLM-as-judge patterns, and standardized benchmarks to ensure production quality and safety.

## When to Use This Skill

Apply this skill when:

- Testing individual prompts for correctness and formatting
- Validating RAG (Retrieval-Augmented Generation) pipeline quality
- Measuring hallucinations, bias, or toxicity in LLM outputs
- Comparing different models or prompt configurations (A/B testing)
- Running benchmark tests (MMLU, HumanEval) to assess model capabilities
- Setting up production monitoring for LLM applications
- Integrating LLM quality checks into CI/CD pipelines (see the pytest sketch at the end of this section)

Common triggers:
- "How do I test if my RAG system is working correctly?"
- "How can I measure hallucinations in LLM outputs?"
- "What metrics should I use to evaluate generation quality?"
- "How do I compare GPT-4 vs Claude for my use case?"
- "How do I detect bias in LLM responses?"

## Evaluation Strategy Selection

### Decision Framework: Which Evaluation Approach?

**By Task Type:**

| Task Type | Primary Approach | Metrics | Tools |
|-----------|------------------|---------|-------|
| **Classification** (sentiment, intent) | Automated metrics | Accuracy, Precision, Recall, F1 | scikit-learn |
| **Generation** (summaries, creative text) | LLM-as-judge + automated | BLEU, ROUGE, BERTScore, Quality rubric | GPT-4/Claude for judging |
| **Question Answering** | Exact match + semantic similarity | EM, F1, Cosine similarity | Custom evaluators |
| **RAG Systems** | RAGAS framework | Faithfulness, Answer/Context relevance | RAGAS library |
| **Code Generation** | Unit tests + execution | Pass@K, Test pass rate | HumanEval, pytest |
| **Multi-step Agents** | Task completion + tool accuracy | Success rate, Efficiency | Custom evaluators |
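
To make the classification row concrete, the sketch below scores parsed model labels with scikit-learn. The label set and hand-written predictions are illustrative assumptions; in practice the predictions come from parsing your model's responses.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Gold labels and labels parsed out of LLM responses (illustrative data).
y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

For the generation and RAG rows, the same harness pattern applies, with the metric call swapped for an LLM-as-judge rubric or a RAGAS evaluation.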

**By Volume and Cost:**

| Samples | Speed | Cost | Recommended Approach |
|---------|-------|------|---------------------|
| 1,000+ | Immediate | $0 | Automated metrics (reg
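
At the high-volume, zero-cost end of this spectrum, automated metrics usually mean cheap programmatic checks such as exact match or regex-based format assertions. A minimal sketch, where the test cases and the JSON-shape pattern are illustrative assumptions:

```python
import re

def exact_match(expected: str, output: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return expected.strip().lower() == output.strip().lower()

def looks_like_json_object(output: str) -> bool:
    """Cheap structural check via regex (not a full JSON parse)."""
    return re.match(r"^\{.*\}$", output.strip(), flags=re.DOTALL) is not None

# (expected, model output) pairs; illustrative only.
cases = [("Paris", "paris"), ("42", "The answer is 42")]
rate = sum(exact_match(e, o) for e, o in cases) / len(cases)
print(f"exact-match rate: {rate:.0%}")            # 50%
print(looks_like_json_object('{"answer": 42}'))   # True
```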

Validation Details

- Front Matter
- Required Fields
- Valid Name Format
- Valid Description
- Has Sections
- Allowed Tools
- Instruction Length: 18047 chars