Use when creating or improving golden datasets for AI evaluation. Defines quality criteria, curation workflows, and multi-agent analysis patterns for test data.
View on GitHubyonatangross/orchestkit
ork
January 25, 2026
Select agents to install to:
npx add-skill https://github.com/yonatangross/orchestkit/blob/main/skills/golden-dataset-curation/SKILL.md -a claude-code --skill golden-dataset-curationInstallation paths:
.claude/skills/golden-dataset-curation/# Golden Dataset Curation **Curate high-quality documents for the golden dataset with multi-agent validation** ## Overview This skill provides patterns and workflows for **adding new documents** to the golden dataset with thorough quality analysis. It complements `golden-dataset-management` which handles backup/restore. **When to use this skill:** - Adding new documents to the golden dataset - Classifying content types and difficulty levels - Generating test queries for new documents - Running multi-agent quality analysis --- ## Content Types | Type | Description | Quality Focus | |------|-------------|---------------| | `article` | Technical articles, blog posts | Depth, accuracy, actionability | | `tutorial` | Step-by-step guides | Completeness, clarity, code quality | | `research_paper` | Academic papers, whitepapers | Rigor, citations, methodology | | `documentation` | API docs, reference materials | Accuracy, completeness, examples | | `video_transcript` | Transcribed video content | Structure, coherence, key points | | `code_repository` | README, code analysis | Code quality, documentation | --- ## Difficulty Levels | Level | Semantic Complexity | Expected Score | Characteristics | |-------|---------------------|----------------|-----------------| | **trivial** | Direct keyword match | >0.85 | Technical terms, exact phrases | | **easy** | Common synonyms | >0.70 | Well-known concepts, slight variations | | **medium** | Paraphrased intent | >0.55 | Conceptual queries, multi-topic | | **hard** | Multi-hop reasoning | >0.40 | Cross-domain, comparative analysis | | **adversarial** | Edge cases | Graceful degradation | Robustness tests, off-domain | --- ## Quality Dimensions | Dimension | Weight | Perfect | Acceptable | Failing | |-----------|--------|---------|------------|---------| | **Accuracy** | 0.25 | 0.95-1.0 | 0.70-0.94 | <0.70 | | **Coherence** | 0.20 | 0.90-1.0 | 0.60-0.89 | <0.60 | | **Depth** | 0.25 | 0.90-1.0 | 0.55-0.89 | <0.55 | | **Rel