Back to Skills

funsloth-check

verified

Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.

View on GitHub

Marketplace

funsloth

chrisvoncsefalvay/funsloth

Plugin

funsloth

machine-learning

Repository

chrisvoncsefalvay/funsloth
4stars

skills/funsloth-check/SKILL.md

Last Verified

January 20, 2026

Install Skill

Select agents to install to:

Scope:
npx add-skill https://github.com/chrisvoncsefalvay/funsloth/blob/main/skills/funsloth-check/SKILL.md -a claude-code --skill funsloth-check

Installation paths:

Claude
.claude/skills/funsloth-check/
Powered by add-skill CLI

Instructions

# Dataset Validation for Unsloth Fine-tuning

Validate datasets before fine-tuning with Unsloth.

## Quick Start

For automated validation, use the script:
```bash
python scripts/validate_dataset.py --dataset "dataset-id" --model llama-3.1-8b --lora-rank 16
```

## Workflow

### 1. Get Dataset Source
Ask for: HF dataset ID (e.g., `mlabonne/FineTome-100k`) or local path (e.g., `./data.jsonl`)

### 2. Load and Detect Format
Auto-detect format from structure. See [DATA_FORMATS.md](DATA_FORMATS.md) for details.

| Format | Detection | Key Fields |
|--------|-----------|------------|
| Raw | `text` only | `text` |
| Alpaca | `instruction` + `output` | `instruction`, `output` |
| ShareGPT | `conversations` array | `from`, `value` |
| ChatML | `messages` array | `role`, `content` |

### 3. Validate Schema
Check required fields exist. Report issues with fix suggestions.

### 4. Show Samples
Display 2-3 examples for visual verification.

### 5. Token Analysis
Report statistics: total tokens, min/max/mean/median sequence length.

Flag concerns:
- Sequences > 4096 tokens
- Sequences < 10 tokens

### 6. Chinchilla Analysis
Ask for target model and LoRA rank, then calculate:

| Chinchilla Fraction | Interpretation |
|--------------------|----------------|
| < 0.5x | Dataset may be too small |
| 0.5x - 2.0x | Good range |
| > 2.0x | Large dataset, may take longer |

### 7. Recommendations
Based on analysis, suggest:
- `standardize_sharegpt()` for ShareGPT data
- Sequence length adjustments
- Learning rate for small datasets

### 8. Optional: HF Upload
Offer to upload local datasets to Hub.

### 9. Handoff
Pass context to `funsloth-train`:
```yaml
dataset_id: "mlabonne/FineTome-100k"
format_type: "sharegpt"
total_tokens: 15000000
target_model: "llama-3.1-8b"
use_lora: true
lora_rank: 16
chinchilla_fraction: 1.2
```

## Bundled Resources

- [scripts/validate_dataset.py](scripts/validate_dataset.py) - Automated validation script
- [DATA_FORMATS.md](DATA_FORMATS.md) - Dataset format 

Validation Details

Front Matter
Required Fields
Valid Name Format
Valid Description
Has Sections
Allowed Tools
Instruction Length:
1991 chars