Evaluation frameworks and assessment methodologies
View on GitHub: plugins/aai-quality/skills/evaluation-frameworks/SKILL.md
February 1, 2026
npx add-skill https://github.com/the-answerai/alphaagent-team/blob/main/plugins/aai-quality/skills/evaluation-frameworks/SKILL.md -a claude-code --skill evaluation-frameworks

Installation paths:
.claude/skills/evaluation-frameworks/

# Evaluation Frameworks Skill
Frameworks for evaluating software and AI systems.
## LLM Evaluation
### Response Quality
```markdown
## LLM Response Evaluation
### Accuracy
Does the response contain correct information?
Rubric (1-5):
5 - Completely accurate
4 - Mostly accurate, minor errors
3 - Partially accurate
2 - Significant errors
1 - Incorrect
### Relevance
Does the response address the question?
Rubric (1-5):
5 - Directly addresses all aspects
4 - Addresses main points
3 - Partially relevant
2 - Mostly off-topic
1 - Completely irrelevant
### Helpfulness
Does the response help the user?
Rubric (1-5):
5 - Extremely helpful, actionable
4 - Helpful with good guidance
3 - Somewhat helpful
2 - Minimally helpful
1 - Not helpful
```
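Each of the three rubrics above yields a score from 1 to 5. One way to record and combine those scores is sketched below. The `RubricScores` type, the `overallScore` function, and the equal weighting are illustrative assumptions, not something this skill prescribes.

```typescript
// Hypothetical shape for one evaluated response; field names mirror the rubric headings.
interface RubricScores {
  accuracy: number
  relevance: number
  helpfulness: number
}

// Reject anything that is not an integer on the 1-5 rubric scale.
function assertRubricScore(name: string, value: number): void {
  if (!Number.isInteger(value) || value < 1 || value > 5) {
    throw new RangeError(`${name} must be an integer from 1 to 5, got ${value}`)
  }
}

// Equal-weight mean of the three scores (the weighting is an assumption).
function overallScore(scores: RubricScores): number {
  const entries: [string, number][] = [
    ['accuracy', scores.accuracy],
    ['relevance', scores.relevance],
    ['helpfulness', scores.helpfulness],
  ]
  for (const [name, value] of entries) {
    assertRubricScore(name, value)
  }
  const total = entries.reduce((sum, [, value]) => sum + value, 0)
  return total / entries.length
}

console.log(overallScore({ accuracy: 5, relevance: 4, helpfulness: 3 })) // 4
```

A plain mean treats the criteria as equally important. If accuracy matters more for a given use case, a weighted mean would be a small change to `overallScore`.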
### LLM-as-Judge
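```typescript
// LLM-based evaluation
interface JudgePrompt {
  criteria: string
  rubric: string
  task: string
  response: string
}

interface EvaluationResult {
  score: number
  reasoning: string
}

// Assumed LLM client: substitute whatever client your project uses.
declare const llm: { complete(prompt: string): Promise<string> }

// Build the prompt in a function so the fields are interpolated at call time.
function buildJudgePrompt({ criteria, rubric, task, response }: JudgePrompt): string {
  return `
You are evaluating an AI response. Score it 1-5 based on the criteria.

## Criteria
${criteria}

## Rubric
${rubric}

## Task
${task}

## Response to Evaluate
${response}

## Instructions
1. Consider each aspect of the rubric
2. Identify strengths and weaknesses
3. Provide a score from 1-5
4. Explain your reasoning

Output format:
Score: [1-5]
Reasoning: [explanation]
`
}

// Parse the "Score: N" and "Reasoning: ..." lines requested in the prompt.
function parseJudgeResponse(output: string): EvaluationResult {
  const scoreMatch = output.match(/Score:\s*\[?([1-5])\]?/)
  if (!scoreMatch) {
    throw new Error('Judge response did not contain a score from 1-5')
  }
  const reasoningMatch = output.match(/Reasoning:\s*([\s\S]*)/)
  return {
    score: Number(scoreMatch[1]),
    reasoning: reasoningMatch ? reasoningMatch[1].trim() : '',
  }
}

async function evaluateWithJudge(
  response: string,
  task: string,
  criteria: string,
  rubric: string // passed explicitly so the prompt's Rubric section is filled in
): Promise<EvaluationResult> {
  const judgeResponse = await llm.complete(
    buildJudgePrompt({ criteria, rubric, task, response })
  )
  return parseJudgeResponse(judgeResponse)
}
```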
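To apply a judge across the Accuracy, Relevance, and Helpfulness criteria from the Response Quality section, one option is to run it once per criterion. The sketch below assumes a hypothetical `Judge` function type and a hypothetical `judgeAllCriteria` helper. Any function with that shape would fit, for example a thin wrapper around `evaluateWithJudge`. A stub judge stands in for the real LLM call so the example runs on its own.

```typescript
// Hypothetical types for illustration; they are not defined by the skill.
interface CriterionScore {
  criterion: string
  score: number
  reasoning: string
}

// A judge takes (response, task, criterion question) and resolves to a score with reasoning.
type Judge = (
  response: string,
  task: string,
  criteria: string
) => Promise<{ score: number; reasoning: string }>

// Criterion questions copied from the Response Quality rubric.
const CRITERIA: Record<string, string> = {
  accuracy: 'Does the response contain correct information?',
  relevance: 'Does the response address the question?',
  helpfulness: 'Does the response help the user?',
}

// Run the judge once per criterion, in parallel, and collect the labelled results.
async function judgeAllCriteria(
  judge: Judge,
  response: string,
  task: string
): Promise<CriterionScore[]> {
  return Promise.all(
    Object.keys(CRITERIA).map(async (criterion) => {
      const { score, reasoning } = await judge(response, task, CRITERIA[criterion])
      return { criterion, score, reasoning }
    })
  )
}

// Stub judge standing in for a real LLM call; it always returns a score of 4.
const stubJudge: Judge = async () => ({ score: 4, reasoning: 'stubbed' })

judgeAllCriteria(stubJudge, 'Paris', 'What is the capital of France?').then((results) => {
  console.log(results.map((r) => `${r.criterion}: ${r.score}`).join(', '))
})
```

Passing the judge in as a parameter keeps the per-criterion loop testable without network access, and lets the same loop be reused with different judge models.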
## Code Quality Evaluation
### Code Review Rubric
```markdown
## Code Review Evaluation
### Correctness
- Logic is sound
- Handles edge cases
- No obvious bugs
### Design
- Follows SOLID principles
- Appropriate abstractions
- Clean architecture
### Security
- No vulnerabilities
- Input validation
- Proper authentication
### Perfor