Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.
View on GitHubchrisvoncsefalvay/funsloth
funsloth
January 20, 2026
Select agents to install to:
npx add-skill https://github.com/chrisvoncsefalvay/funsloth/blob/main/skills/funsloth-check/SKILL.md -a claude-code --skill funsloth-checkInstallation paths:
.claude/skills/funsloth-check/# Dataset Validation for Unsloth Fine-tuning Validate datasets before fine-tuning with Unsloth. ## Quick Start For automated validation, use the script: ```bash python scripts/validate_dataset.py --dataset "dataset-id" --model llama-3.1-8b --lora-rank 16 ``` ## Workflow ### 1. Get Dataset Source Ask for: HF dataset ID (e.g., `mlabonne/FineTome-100k`) or local path (e.g., `./data.jsonl`) ### 2. Load and Detect Format Auto-detect format from structure. See [DATA_FORMATS.md](DATA_FORMATS.md) for details. | Format | Detection | Key Fields | |--------|-----------|------------| | Raw | `text` only | `text` | | Alpaca | `instruction` + `output` | `instruction`, `output` | | ShareGPT | `conversations` array | `from`, `value` | | ChatML | `messages` array | `role`, `content` | ### 3. Validate Schema Check required fields exist. Report issues with fix suggestions. ### 4. Show Samples Display 2-3 examples for visual verification. ### 5. Token Analysis Report statistics: total tokens, min/max/mean/median sequence length. Flag concerns: - Sequences > 4096 tokens - Sequences < 10 tokens ### 6. Chinchilla Analysis Ask for target model and LoRA rank, then calculate: | Chinchilla Fraction | Interpretation | |--------------------|----------------| | < 0.5x | Dataset may be too small | | 0.5x - 2.0x | Good range | | > 2.0x | Large dataset, may take longer | ### 7. Recommendations Based on analysis, suggest: - `standardize_sharegpt()` for ShareGPT data - Sequence length adjustments - Learning rate for small datasets ### 8. Optional: HF Upload Offer to upload local datasets to Hub. ### 9. Handoff Pass context to `funsloth-train`: ```yaml dataset_id: "mlabonne/FineTome-100k" format_type: "sharegpt" total_tokens: 15000000 target_model: "llama-3.1-8b" use_lora: true lora_rank: 16 chinchilla_fraction: 1.2 ``` ## Bundled Resources - [scripts/validate_dataset.py](scripts/validate_dataset.py) - Automated validation script - [DATA_FORMATS.md](DATA_FORMATS.md) - Dataset format