Vision model fine-tuning with FastVisionModel. Covers Pixtral, Ministral VL training, UnslothVisionDataCollator, image+text datasets, and vision-specific LoRA configuration.
View on GitHubatrawog/bazzite-ai-plugins
bazzite-ai-jupyter
bazzite-ai-jupyter/skills/vision/SKILL.md
January 21, 2026
Select agents to install to:
npx add-skill https://github.com/atrawog/bazzite-ai-plugins/blob/main/bazzite-ai-jupyter/skills/vision/SKILL.md -a claude-code --skill visionInstallation paths:
.claude/skills/vision/# Vision Model Fine-Tuning
## Overview
Unsloth provides `FastVisionModel` for fine-tuning vision-language models (VLMs) like Pixtral and Ministral with 2x faster training. This skill covers vision model loading, dataset preparation with images, and vision-specific LoRA configuration.
## Quick Reference
| Component | Purpose |
|-----------|---------|
| `FastVisionModel` | Load vision models with Unsloth optimizations |
| `UnslothVisionDataCollator` | Handle image+text modality in batches |
| `finetune_vision_layers` | Enable training of vision encoder |
| `finetune_language_layers` | Enable training of language model |
| `skip_prepare_dataset=True` | Required for vision datasets |
| `dataset_text_field=""` | Empty string for vision (not a field name) |
| List dataset format | Use `[convert(s) for s in dataset]`, not `.map()` |
## Critical Environment Setup
```python
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"
```
## Critical Import Order
```python
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch
```
## Supported Vision Models
| Model | Path | Parameters | Best For |
|-------|------|------------|----------|
| Pixtral-12B | `unsloth/pixtral-12b-2409-bnb-4bit` | 12.7B | High-quality vision tasks |
| Ministral-8B-Vision | `unsloth/Ministral-8B-Vision-2507-bnb-4bit` | 8B | Balanced quality/speed |
| Llama-3.2-11B-Vision | `unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit` | 11B | General vision tasks |
## Load Vision Model
```python
from unsloth import FastVisionModel, is_bf16_supported
model, tokenizer = FastVisionModel.from_pretrained(
"unsloth/pixtral-12b-2409-bnb-4bit",
load_in_4bit=True,
use_gradient_checkpointing="unsloth",
)
print(f"Model l