evaluation-metrics (verified)

LLM evaluation frameworks, benchmarks, and quality metrics for production systems.

Marketplace: pluginagentmarketplace-ai-engineer
Plugin: ai-engineer-plugin
Repository: pluginagentmarketplace/custom-plugin-ai-engineer (2 stars)
Skill file: skills/evaluation-metrics/SKILL.md
Last Verified: January 20, 2026

Install Skill

npx add-skill https://github.com/pluginagentmarketplace/custom-plugin-ai-engineer/blob/main/skills/evaluation-metrics/SKILL.md -a claude-code --skill evaluation-metrics

Installation path (Claude): .claude/skills/evaluation-metrics/

Instructions

# Evaluation Metrics

Measure and improve LLM quality systematically.

## Quick Start

### Basic Evaluation with RAGAS
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": ["What is machine learning?"],
    "answer": ["ML is a subset of AI that learns from data."],
    "contexts": [["Machine learning is a field of AI..."]],
    "ground_truth": ["Machine learning is AI that learns patterns."]
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ]
)

print(results)
```
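
To gate a build or deployment on these scores, one option (a sketch, assuming a ragas version where `evaluate` returns a result object exposing `to_pandas()`; the threshold values are illustrative) is to convert the per-sample scores to a DataFrame and assert minimums:

```python
# Per-sample scores as a pandas DataFrame (one column per metric)
df = results.to_pandas()

# Illustrative thresholds -- tune these for your own application
thresholds = {"faithfulness": 0.8, "answer_relevancy": 0.7}

for metric, minimum in thresholds.items():
    mean_score = df[metric].mean()
    assert mean_score >= minimum, (
        f"{metric} below threshold: {mean_score:.2f} < {minimum}"
    )
```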

### LangChain Evaluation
```python
from langchain.evaluation import load_evaluator

# Criteria-based evaluation
evaluator = load_evaluator("criteria", criteria="helpfulness")

result = evaluator.evaluate_strings(
    prediction="Paris is the capital of France.",
    input="What is the capital of France?"
)

print(f"Score: {result['score']}, Reasoning: {result['reasoning']}")
```
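
When a reference answer is available, the same interface also supports reference-based grading through the `labeled_criteria` evaluator. A minimal sketch (result keys such as `score` and `reasoning` may vary slightly across LangChain versions):

```python
from langchain.evaluation import load_evaluator

# Reference-based evaluation: grade the prediction against a known-good answer
labeled_evaluator = load_evaluator("labeled_criteria", criteria="correctness")

result = labeled_evaluator.evaluate_strings(
    prediction="Paris is the capital of France.",
    input="What is the capital of France?",
    reference="The capital of France is Paris."
)

print(f"Score: {result['score']}, Reasoning: {result['reasoning']}")
```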

## Core Metrics

### Text Generation Metrics
```python
from evaluate import load
import numpy as np

class TextMetrics:
    def __init__(self):
        self.bleu = load("bleu")
        self.rouge = load("rouge")
        self.bertscore = load("bertscore")

    def evaluate(self, predictions: list, references: list) -> dict:
        metrics = {}

        # BLEU - Precision-based n-gram overlap
        bleu_result = self.bleu.compute(
            predictions=predictions,
            references=[[r] for r in references]
        )
        metrics['bleu'] = bleu_result['bleu']

        # ROUGE - Recall-based overlap
        rouge_result = self.rouge.compute(
            predictions=predictions,
            references=references
        )
        metrics['rouge1'] = rouge_result['rouge1']
        metrics['rouge2'] = rouge_result['rouge2']
        metrics['rougeL'] = rouge_result['rougeL']

        # BERTScore - semantic similarity from contextual embeddings
        bert_result = self.bertscore.compute(
            predictions=predictions,
            references=references,
            lang="en"
        )
        metrics['bertscore_f1'] = float(np.mean(bert_result['f1']))

        return metrics
```
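
A quick usage sketch (the example sentences are toy data; expect lexical-overlap metrics like BLEU and ROUGE to penalize paraphrases more heavily than BERTScore does):

```python
metrics = TextMetrics()

scores = metrics.evaluate(
    predictions=["The cat sat on the mat."],
    references=["A cat was sitting on the mat."]
)

print(scores)  # {'bleu': ..., 'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'bertscore_f1': ...}
```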
