Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.
View on GitHub: ancoleman/ai-design-components
backend-ai-skills
February 1, 2026
npx add-skill https://github.com/ancoleman/ai-design-components/blob/main/skills/evaluating-llms/SKILL.md -a claude-code --skill evaluating-llms

Installation paths:
`.claude/skills/evaluating-llms/`

# LLM Evaluation

Evaluate Large Language Model (LLM) systems using automated metrics, LLM-as-judge patterns, and standardized benchmarks to ensure production quality and safety.

## When to Use This Skill

Apply this skill when:

- Testing individual prompts for correctness and formatting
- Validating RAG (Retrieval-Augmented Generation) pipeline quality
- Measuring hallucinations, bias, or toxicity in LLM outputs
- Comparing different models or prompt configurations (A/B testing)
- Running benchmark tests (MMLU, HumanEval) to assess model capabilities
- Setting up production monitoring for LLM applications
- Integrating LLM quality checks into CI/CD pipelines

Common triggers:

- "How do I test if my RAG system is working correctly?"
- "How can I measure hallucinations in LLM outputs?"
- "What metrics should I use to evaluate generation quality?"
- "How do I compare GPT-4 vs Claude for my use case?"
- "How do I detect bias in LLM responses?"

## Evaluation Strategy Selection

### Decision Framework: Which Evaluation Approach?

**By Task Type:**

| Task Type | Primary Approach | Metrics | Tools |
|-----------|------------------|---------|-------|
| **Classification** (sentiment, intent) | Automated metrics | Accuracy, Precision, Recall, F1 | scikit-learn |
| **Generation** (summaries, creative text) | LLM-as-judge + automated | BLEU, ROUGE, BERTScore, Quality rubric | GPT-4/Claude for judging |
| **Question Answering** | Exact match + semantic similarity | EM, F1, Cosine similarity | Custom evaluators |
| **RAG Systems** | RAGAS framework | Faithfulness, Answer/Context relevance | RAGAS library |
| **Code Generation** | Unit tests + execution | Pass@K, Test pass rate | HumanEval, pytest |
| **Multi-step Agents** | Task completion + tool accuracy | Success rate, Efficiency | Custom evaluators |

**By Volume and Cost:**

| Samples | Speed | Cost | Recommended Approach |
|---------|-------|------|---------------------|
| 1,000+ | Immediate | $0 | Automated metrics (reg
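
To make the **By Task Type** table concrete, here is a minimal sketch of the automated-metrics row for a classification task (sentiment or intent), using scikit-learn as the table suggests. The label sets and example data are hypothetical; only the `sklearn.metrics` calls are standard.

```python
# Minimal sketch: automated metrics for an LLM classification task.
# Assumes you have already run the model and collected its predicted labels
# alongside gold labels; the data below is purely illustrative.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

gold_labels  = ["positive", "negative", "neutral", "positive", "negative"]
model_labels = ["positive", "negative", "positive", "positive", "negative"]

accuracy = accuracy_score(gold_labels, model_labels)
precision, recall, f1, _ = precision_recall_fscore_support(
    gold_labels, model_labels, average="macro", zero_division=0
)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1 (macro): {f1:.2f}")
```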
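
For the generation row, the LLM-as-judge pattern asks a strong model (GPT-4 or Claude, per the table) to score an output against a rubric. The sketch below uses the OpenAI Python SDK; the model name, rubric wording, and 1-5 scale are illustrative assumptions, not part of this skill.

```python
# Minimal LLM-as-judge sketch: score one generated answer against a simple rubric.
# The rubric, the 1-5 scale, and the model name ("gpt-4o") are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, answer: str) -> str:
    prompt = (
        "You are an evaluation judge. Rate the answer to the question below "
        "on a 1-5 scale for correctness and clarity, then give a one-sentence reason.\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "Respond as:\nSCORE: <1-5>\nREASON: <one sentence>"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep judging as deterministic as possible
    )
    return response.choices[0].message.content

print(judge("What is the capital of France?", "Paris is the capital of France."))
```

In practice you would run the judge over a batch of outputs and aggregate the scores; parsing the `SCORE:` line into an integer makes the results easy to track alongside the automated metrics above.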