
model-evaluator

verified

Comprehensive ML model evaluation with multiple metrics, cross-validation, and statistical testing. Activates for "evaluate model", "model metrics", "model performance", "compare models", "validation metrics", "test accuracy", "precision recall", "ROC AUC". Generates detailed evaluation reports with visualizations and statistical significance tests, integrated with SpecWeave increment documentation.


Marketplace

specweave

anton-abyzov/specweave

Plugin

sw-ml

development

Repository

anton-abyzov/specweave
27 stars

plugins/specweave-ml/skills/model-evaluator/SKILL.md

Last Verified

January 25, 2026

Install Skill

npx add-skill https://github.com/anton-abyzov/specweave/blob/main/plugins/specweave-ml/skills/model-evaluator/SKILL.md -a claude-code --skill model-evaluator

Installation paths:

Claude
.claude/skills/model-evaluator/

Instructions

# Model Evaluator

## Overview

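Evaluates trained models following ML best practices. Rather than reporting accuracy alone, it measures a model along several dimensions (task-specific metrics, cross-validation stability, statistical significance, and calibration) so that deployment decisions rest on evidence instead of a single number.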

## Core Evaluation Framework

### 1. Classification Metrics
- Accuracy, Precision, Recall, F1-score
- ROC AUC, PR AUC
- Confusion matrix
- Per-class metrics (for multi-class)
- Class imbalance handling
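
As a rough illustration of the metrics listed above, the sketch below computes the binary confusion counts, accuracy, precision, recall, F1, and ROC AUC in plain Python. The function names and the toy labels are assumptions made for this example and are not part of the skill's API; in practice a library such as scikit-learn (`sklearn.metrics`) would normally do this work.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 derived from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores, positive=1):
    """ROC AUC as the probability that a random positive example is scored
    above a random negative one (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

print(classification_metrics(y_true, y_pred))
print("ROC AUC:", roc_auc(y_true, scores))
```

On imbalanced data, accuracy can look strong while recall on the minority class is poor, which is why the per-class view matters.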

### 2. Regression Metrics
- RMSE, MAE, MAPE
- R² score, Adjusted R²
- Residual analysis
- Prediction interval coverage
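
The regression metrics above follow directly from the residuals. The sketch below is a minimal, dependency-free version; `regression_metrics` and its `n_features` parameter are hypothetical names chosen for this example, and the data is invented.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Compute MAE, RMSE, MAPE (in percent), R^2 and adjusted R^2."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)

    # MAPE is undefined where the true value is zero, so those rows are skipped.
    nonzero = [(t, r) for t, r in zip(y_true, residuals) if t != 0]
    mape = 100 * sum(abs(r / t) for t, r in nonzero) / len(nonzero) if nonzero else float("nan")

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")

    # Adjusted R^2 penalises extra predictors; it needs n > n_features + 1.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"mae": mae, "rmse": rmse, "mape": mape, "r2": r2, "adj_r2": adj_r2}


# Toy data, made up for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 8.0, 12.0]
print(regression_metrics(y_true, y_pred, n_features=1))
```

RMSE weights large errors more heavily than MAE, so a gap between the two usually points to a few large residuals worth inspecting in the residual analysis.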

### 3. Ranking Metrics (Recommendations)
- Precision@K, Recall@K
- NDCG@K, MAP@K
- MRR (Mean Reciprocal Rank)
- Coverage, Diversity
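
To make the ranking metrics concrete, here is a small sketch using binary relevance (an item is either relevant or not). The helper names are assumptions for this example; graded relevance, MAP@K, coverage, and diversity are left out to keep it short.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


def mean_reciprocal_rank(rankings, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1 / rank
                break
    return total / len(rankings)


# Toy data, made up for illustration only.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"b", "d", "f"}
print("P@3:", precision_at_k(ranked, relevant, 3))
print("R@3:", recall_at_k(ranked, relevant, 3))
print("NDCG@3:", ndcg_at_k(ranked, relevant, 3))
print("MRR:", mean_reciprocal_rank([ranked, ["x", "y"]], [relevant, {"x"}]))
```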

### 4. Statistical Validation
- Cross-validation (K-fold, stratified, time-series)
- Confidence intervals
- Statistical significance testing
- Calibration curves
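
Two of the ideas above, confidence intervals and significance testing, can be sketched with the standard library alone. The example assumes a percentile bootstrap for the interval and a paired sign-flip permutation test for comparing two models on the same folds; the source does not say which procedures the skill actually uses, and the function names and data below are invented for illustration.

```python
import random
import statistics


def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    `correct` holds one 0/1 flag per test example (1 = predicted correctly).
    """
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def paired_permutation_test(scores_a, scores_b, n_permutations=5000, seed=0):
    """Two-sided sign-flip permutation test on paired per-fold scores.

    Returns an approximate p-value for the null hypothesis that the two
    models perform equally well.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(statistics.fmean(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed - 1e-12:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of zero.
    return (hits + 1) / (n_permutations + 1)


# Toy data, made up for illustration only: 87 of 100 predictions correct,
# plus per-fold accuracies for a candidate model and a hypothetical baseline.
correct = [1] * 87 + [0] * 13
model_folds = [0.85, 0.88, 0.84, 0.87, 0.86]
baseline_folds = [0.80, 0.83, 0.81, 0.82, 0.80]

print("95% CI for accuracy:", bootstrap_ci(correct))
print("p-value vs baseline:", paired_permutation_test(model_folds, baseline_folds))
```

With only five folds, a sign-flip test cannot produce a very small p-value even when every fold favours one model, so more folds or repeated cross-validation are needed before claiming strong significance.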

## Usage

```python
from specweave import ModelEvaluator

evaluator = ModelEvaluator(
    model=trained_model,
    X_test=X_test,
    y_test=y_test,
    increment="0042"
)

# Comprehensive evaluation
report = evaluator.evaluate_all()

# Generates:
# - .specweave/increments/0042.../evaluation-report.md
# - Visualizations (confusion matrix, ROC curves, etc.)
# - Statistical tests
```

## Evaluation Report Structure

```markdown
# Model Evaluation Report: XGBoost Classifier

## Overall Performance
- **Accuracy**: 0.87 ± 0.02 (95% CI: [0.85, 0.89])
- **ROC AUC**: 0.92 ± 0.01
- **F1 Score**: 0.85 ± 0.02

## Per-Class Performance
| Class   | Precision | Recall | F1   | Support |
|---------|-----------|--------|------|---------|
| Class 0 | 0.88      | 0.85   | 0.86 | 1000    |
| Class 1 | 0.84      | 0.87   | 0.86 | 800     |

## Confusion Matrix
[Visualization embedded]

## Cross-Validation Results
- 5-fold CV accuracy: 0.86 ± 0.03
- Fold scores: [0.85, 0.88, 0.84, 0.87, 0.86]
- No overfitting detected (train=0.89, val=0.86, gap=0.03)

## Statistical Tests
- Comparison vs baseline: p=0.001 (highly significant)
- Comparison vs previous model
```
