anthropic-evaluations (verified)

This skill should be used when the user asks to "create evals", "evaluate an agent", "build evaluation suite", or mentions agent testing, graders, or benchmarks. Also suggest it when building coding agents, conversational agents, or research agents that need quality assurance.

Marketplace: claude-toolkit
Plugin: toolkit
Repository: dwmkerr/claude-toolkit (4 stars)
Path: plugins/toolkit/skills/anthropic-evaluations/SKILL.md

Last Verified: January 20, 2026

Install:

npx add-skill https://github.com/dwmkerr/claude-toolkit/blob/main/plugins/toolkit/skills/anthropic-evaluations/SKILL.md -a claude-code --skill anthropic-evaluations

Installation path (Claude): .claude/skills/anthropic-evaluations/

Instructions

# Anthropic Evaluations

Build rigorous evaluations for AI agents using Anthropic's proven patterns.

## Quick Reference

You MUST read the reference files for detailed guidance:

- [Grader Types](./references/grader-types.md) - Code-based, model-based, human graders
- [Agent Type Patterns](./references/agent-type-patterns.md) - Coding, conversational, research, computer use
- [Roadmap](./references/roadmap.md) - Steps 0-8 for building evals from scratch
- [Frameworks](./references/frameworks.md) - Harbor, Promptfoo, Braintrust, etc.

**YAML Templates:**
- [coding-agent-eval.yaml](./references/coding-agent-eval.yaml) - Coding agent template
- [conversational-agent-eval.yaml](./references/conversational-agent-eval.yaml) - Support agent template

**Annotated Examples:**
- [Example: Coding Agent](./references/example-coding-agent.md) - Auth bypass fix walkthrough
- [Example: Conversational](./references/example-conversational.md) - Refund handling walkthrough

## Core Definitions

| Term | Definition |
|------|------------|
| **Task** | Single test with defined inputs and success criteria |
| **Trial** | One attempt at a task (run multiple for consistency) |
| **Grader** | Logic that scores agent performance; tasks can have multiple |
| **Transcript** | Complete record of a trial (outputs, tool calls, reasoning) |
| **Outcome** | Final state in environment (not just what agent said) |
| **Evaluation harness** | Infrastructure that runs evals end-to-end |
| **Agent harness** | System enabling model to act as agent (scaffold) |
| **Evaluation suite** | Collection of tasks measuring specific capabilities |
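
To make these terms concrete, here is a minimal sketch of how a task, trial, and grader might be represented in an eval harness. All names (`Task`, `Trial`, `score_trial`) are illustrative assumptions for this sketch, not the API of any specific framework.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    """Single test with defined inputs and success criteria."""
    task_id: str
    prompt: str
    # Tasks can have multiple graders; each maps a trial to a score.
    graders: list[Callable[[Trial], float]] = field(default_factory=list)


@dataclass
class Trial:
    """One attempt at a task; run several per task to measure consistency."""
    task: Task
    transcript: list[dict]  # complete record: outputs, tool calls, reasoning
    outcome: dict           # final environment state, not just what the agent said


def score_trial(trial: Trial) -> list[float]:
    """Apply every grader attached to the trial's task."""
    return [grader(trial) for grader in trial.task.graders]
```

Note that grading takes the whole trial, so a grader can inspect either the transcript (what the agent did) or the outcome (what actually changed in the environment).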

## Grader Types (Quick Reference)

| Type | Methods | Best For |
|------|---------|----------|
| **Code-based** | String match, unit tests, static analysis, state checks | Fast, cheap, objective verification |
| **Model-based** | Rubric scoring, assertions, pairwise comparison | Nuanced, open-ended tasks |
| **Human** | SME review, A/B testing, spot-check sampling | Gold-standard judgments on nuanced or high-stakes tasks |
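
To ground the table, here is a hedged sketch of the two automated grader styles, building on the `Trial` sketch above. The `tests_passed` outcome key and the `judge_model_score` helper are illustrative assumptions, not a real framework API.

```python
def tests_pass_grader(trial: Trial) -> float:
    """Code-based grader: objective check against the final outcome.
    Assumes the harness stored a test-suite result in the outcome dict."""
    return 1.0 if trial.outcome.get("tests_passed") else 0.0


RUBRIC = "Did the agent fix the bug without weakening authentication? Score 0-1."


def judge_model_score(rubric: str, transcript: list[dict]) -> float:
    """Placeholder for a judge-model call; a real implementation would send
    the rubric and transcript to an LLM and parse a numeric score."""
    raise NotImplementedError("wire up a judge model here")


def rubric_grader(trial: Trial) -> float:
    """Model-based grader: rubric scoring for nuanced, open-ended behavior."""
    return judge_model_score(RUBRIC, trial.transcript)
```

A common pattern is to run both: the code-based grader gates on objective correctness cheaply, while the model-based grader scores the qualities code cannot check.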

Validation Details

- Front matter
- Required fields
- Valid name format
- Valid description
- Has sections
- Allowed tools
- Instruction length: 3157 chars