LLM evaluation frameworks, benchmarks, and quality metrics for production systems.
pluginagentmarketplace/custom-plugin-ai-engineer · ai-engineer-plugin
January 20, 2026

Install the skill with:

npx add-skill https://github.com/pluginagentmarketplace/custom-plugin-ai-engineer/blob/main/skills/evaluation-metrics/SKILL.md -a claude-code --skill evaluation-metrics

Installation path: `.claude/skills/evaluation-metrics/`

# Evaluation Metrics
Measure and improve LLM quality systematically.
## Quick Start
### Basic Evaluation with RAGAS
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": ["What is machine learning?"],
    "answer": ["ML is a subset of AI that learns from data."],
    "contexts": [["Machine learning is a field of AI..."]],
    "ground_truth": ["Machine learning is AI that learns patterns."],
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)
print(results)
```
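Beyond the aggregate printout, per-sample scores are often easier to inspect as a table. A minimal sketch, assuming the `results` object from the block above and a ragas version whose result object exposes `to_pandas()`:
```python
# Per-sample scores as a DataFrame (assumes the ragas result object
# provides to_pandas(), available in recent releases)
df = results.to_pandas()
print(df.head())
```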
### LangChain Evaluation
```python
from langchain.evaluation import load_evaluator

# Criteria-based evaluation
evaluator = load_evaluator("criteria", criteria="helpfulness")
result = evaluator.evaluate_strings(
    prediction="Paris is the capital of France.",
    input="What is the capital of France?",
)
print(f"Score: {result['score']}, Reasoning: {result['reasoning']}")
```
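When a reference answer is available, the same pattern extends to LangChain's labeled criteria evaluator. A minimal sketch; the criterion name and example strings are illustrative, and the available evaluators depend on your langchain version:
```python
from langchain.evaluation import load_evaluator

# Labeled criteria evaluation scores the prediction against a reference answer
labeled_evaluator = load_evaluator("labeled_criteria", criteria="correctness")
result = labeled_evaluator.evaluate_strings(
    prediction="Paris is the capital of France.",
    reference="The capital of France is Paris.",
    input="What is the capital of France?",
)
print(f"Score: {result['score']}, Reasoning: {result['reasoning']}")
```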
## Core Metrics
### Text Generation Metrics
```python
from evaluate import load
import numpy as np

class TextMetrics:
    """Reference-based text generation metrics (BLEU, ROUGE, BERTScore)."""

    def __init__(self):
        self.bleu = load("bleu")
        self.rouge = load("rouge")
        self.bertscore = load("bertscore")

    def evaluate(self, predictions: list, references: list) -> dict:
        metrics = {}

        # BLEU - Precision-based n-gram overlap
        bleu_result = self.bleu.compute(
            predictions=predictions,
            references=[[r] for r in references],
        )
        metrics['bleu'] = bleu_result['bleu']

        # ROUGE - Recall-based overlap
        rouge_result = self.rouge.compute(
            predictions=predictions,
            references=references,
        )
        metrics['rouge1'] = rouge_result['rouge1']
        metrics['rougeL'] = rouge_result['rougeL']

        # BERTScore - semantic similarity from contextual embeddings
        bert_result = self.bertscore.compute(
            predictions=predictions,
            references=references,
            lang="en",
        )
        metrics['bertscore_f1'] = float(np.mean(bert_result['f1']))

        return metrics
```
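A short usage sketch for the class above (the example strings are illustrative):
```python
# Compare a model output against a reference text
text_metrics = TextMetrics()
scores = text_metrics.evaluate(
    predictions=["Paris is the capital of France."],
    references=["The capital of France is Paris."],
)
print(scores)  # keys: bleu, rouge1, rougeL, bertscore_f1
```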