This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.
# Advanced Evaluation

This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.

**Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

## When to Activate

Activate this skill when:

- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments

## Core Concepts

### The Evaluation Taxonomy

Evaluation approaches fall into two primary categories with distinct reliability profiles:

**Direct Scoring**: A single LLM rates one response on a defined scale.

- Best for: Objective criteria (factual accuracy, instruction following, toxicity)
- Reliability: Moderate to high for well-defined criteria
- Failure modes: Score calibration drift, inconsistent scale interpretation

**Pairwise Comparison**: An LLM compares two responses and selects the better one.

- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences
- Failure modes: Position bias, length bias

Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.

### The Bias Landscape

LLM judges exhibit systematic biases that must be actively mitigated.
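Position bias, for example, is the judge's tendency to favor a response based on where it appears in the prompt; a common mitigation is to judge each pair in both orders and accept only a consistent verdict. The sketch below illustrates the two judge patterns from the taxonomy above together with that position-swap check. It is a minimal sketch only: the `judge` callable, prompt templates, function names, and tie rule are illustrative assumptions, not this skill's prescribed implementation.

```python
"""Minimal sketches of direct scoring and pairwise comparison with a
position-swap consistency check. The judge is abstracted as a callable
(prompt in, raw judge text out) so any LLM client can be plugged in."""
from typing import Callable

JudgeFn = Callable[[str], str]  # prompt string -> judge model's text output

# Illustrative prompt templates (assumptions, not prescribed wording).
DIRECT_PROMPT = """Rate the response below on factual accuracy from 1 to 5.
Reply with only the integer score.

Question: {question}
Response: {response}
Score:"""

PAIRWISE_PROMPT = """Which response better answers the question?
Reply with exactly "A" or "B".

Question: {question}
Response A: {a}
Response B: {b}
Better response:"""


def direct_score(judge: JudgeFn, question: str, response: str) -> int:
    """Direct scoring: one response rated on an absolute, fixed scale."""
    raw = judge(DIRECT_PROMPT.format(question=question, response=response))
    return int(raw.strip())  # raises if the judge strays from the format


def pairwise_prefer(judge: JudgeFn, question: str, a: str, b: str) -> str:
    """Pairwise comparison with a position swap to counter position bias.

    The pair is judged twice, once in each order; only a verdict that is
    consistent across both orderings is accepted, otherwise return a tie.
    """
    first = judge(PAIRWISE_PROMPT.format(question=question, a=a, b=b)).strip()
    second = judge(PAIRWISE_PROMPT.format(question=question, a=b, b=a)).strip()
    # Map the swapped-order verdict back to the original labels.
    second_unswapped = {"A": "B", "B": "A"}.get(second, second)
    if first == second_unswapped and first in ("A", "B"):
        return first
    return "tie"  # inconsistent across orderings: likely position bias


if __name__ == "__main__":
    # Stub judge for a dry run; replace with a real LLM client call.
    def stub_judge(prompt: str) -> str:
        return "A" if "Response A" in prompt else "4"

    print(direct_score(stub_judge, "What is 2 + 2?", "4"))           # -> 4
    print(pairwise_prefer(stub_judge, "What is 2 + 2?", "4", "5"))   # -> tie
```

In practice, `judge` would wrap a single call to whichever judge model and client library is in use; the stub above only exists so the sketch runs on its own.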