Back to Skills

eval-harness

verified

为 Claude Code 会话提供的正式评测框架,实现了评测驱动开发(Eval-Driven Development,EDD)原则

View on GitHub

Marketplace

everything-claude-code

xu-xiang/everything-claude-code-zh

Plugin

everything-claude-code

workflow

Repository

xu-xiang/everything-claude-code-zh
25stars

skills/eval-harness/SKILL.md

Last Verified

February 5, 2026

Install Skill

Select agents to install to:

Scope:
npx add-skill https://github.com/xu-xiang/everything-claude-code-zh/blob/main/skills/eval-harness/SKILL.md -a claude-code --skill eval-harness

Installation paths:

Claude
.claude/skills/eval-harness/
Powered by add-skill CLI

Instructions

# 评测套件技能(Eval Harness Skill)

一个为 Claude Code 会话提供的正式评测框架,实现了评测驱动开发(Eval-Driven Development,EDD)原则。

## 核心理念(Philosophy)

评测驱动开发(EDD)将评测(Evals)视为“AI 开发的单元测试”:
- 在实现代码之“前”定义预期行为
- 在开发过程中持续运行评测
- 跟踪每次变更带来的回归(Regressions)
- 使用 pass@k 指标来衡量可靠性

## 评测类型

### 能力评测(Capability Evals)
测试 Claude 是否能够完成之前无法完成的任务:
```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of expected result
```

### 回归评测(Regression Evals)
确保变更不会破坏现有功能:
```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
  - existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```

## 评分器(Grader)类型

### 1. 基于代码的评分器(Code-Based Grader)
使用代码进行确定性检查:
```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
```

### 2. 基于模型的评分器(Model-Based Grader)
使用 Claude 评估开放式输出:
```markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
```

### 3. 人工评分器(Human Grader)
标记以供人工审查:
```markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
```

## 指标(Metrics)

### pass@k
“k 次尝试中至少成功一次”
- pass@1:首次尝试成功率
- pass@3:3 次尝试内成功
- 典型目标:pass@3 > 90%

### pass^k
“k 次试验全部成功”
- 更高的可靠性门槛
- pass^3:连续 3 次成功
- 用于关键路径(Critical Paths)

## 评测工作流

### 1. 定义(编码前)
```markdown
## EVAL DEFINITION: feature-xyz

### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely

##

Validation Details

Front Matter
Required Fields
Valid Name Format
Valid Description
Has Sections
Allowed Tools
Instruction Length:
4105 chars