Use this skill when working with Vision-Language Models (VLMs). Covers visual encoder selection (CLIP/SigLIP/EVA), vision-language alignment, instruction tuning, multi-image conversations, and VLM evaluation.
# Vision-Language Model Workflow
This skill provides comprehensive guidance for building and fine-tuning vision-language models.
## When to Activate
- Building vision-language models from scratch
- Fine-tuning existing VLMs (LLaVA, Qwen-VL, etc.)
- Choosing visual encoders
- Implementing vision-language alignment
- Evaluating VLM capabilities
## Visual Encoder Selection
### Comparison Table
| Encoder | Resolution (px) | Tokens | Strengths | Use Case |
|---------|------------|--------|-----------|----------|
| CLIP ViT-L/14 | 224 | 256 | General vision-language | Standard VLM |
| CLIP ViT-L/14-336 | 336 | 576 | Better detail | Document/OCR |
| SigLIP | 384 | 729 | Better alignment | Modern VLMs |
| EVA-CLIP | 224-448 | 256-1024 | Strong features | Large VLMs |
| DINOv2 | 518 | 1369 | Dense features | Segmentation |
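The token counts follow directly from the patch grid: a ViT splits the image into `(resolution // patch_size)**2` patches, each of which becomes one visual token (CLIP additionally prepends a CLS token that is usually dropped before projection). A quick sanity check against the table, noting that every encoder listed uses a 14-pixel patch size:

```python
def num_visual_tokens(resolution: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT produces for a square input image."""
    return (resolution // patch_size) ** 2

assert num_visual_tokens(224) == 256   # CLIP ViT-L/14
assert num_visual_tokens(336) == 576   # CLIP ViT-L/14-336
assert num_visual_tokens(384) == 729   # SigLIP so400m-patch14-384
assert num_visual_tokens(518) == 1369  # DINOv2
```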
### Loading Visual Encoders
```python
import torch
import timm
from transformers import CLIPVisionModel, SiglipVisionModel

# CLIP ViT-L/14 at 336px (576 visual tokens)
clip_encoder = CLIPVisionModel.from_pretrained(
    "openai/clip-vit-large-patch14-336",
    torch_dtype=torch.float16,
)

# SigLIP (shape-optimized 400M variant, 384px)
siglip_encoder = SiglipVisionModel.from_pretrained(
    "google/siglip-so400m-patch14-384",
    torch_dtype=torch.float16,
)

# EVA-CLIP (via timm); num_classes=0 removes the classification head
eva_encoder = timm.create_model(
    "eva02_large_patch14_clip_224",
    pretrained=True,
    num_classes=0,
)
```
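Once loaded, an encoder maps an image to a sequence of patch embeddings. A minimal usage sketch with the CLIP encoder from above (the image path is a placeholder, and fp16 inference here assumes a CUDA device):

```python
from PIL import Image
from transformers import CLIPImageProcessor

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
image = Image.open("example.jpg")  # placeholder path

clip_encoder = clip_encoder.to("cuda")  # fp16 weights; assumes a CUDA device
inputs = processor(images=image, return_tensors="pt").to("cuda")

with torch.no_grad():
    out = clip_encoder(pixel_values=inputs.pixel_values.half())

# last_hidden_state: [batch, 1 + num_patches, hidden]; index 0 is the CLS token
patch_features = out.last_hidden_state[:, 1:, :]  # -> [1, 576, 1024]
```

These patch features (not the pooled CLS embedding) are what the projector consumes in the next section.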
## Vision-Language Alignment
### Two-Stage Training
**Stage 1: Feature Alignment (Projector Only)**
```python
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder, llm, projector_type="mlp"):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.llm = llm
        vision_dim = vision_encoder.config.hidden_size
        llm_dim = llm.config.hidden_size
        if projector_type == "mlp":
            # Two-layer MLP projector (LLaVA-1.5 style)
            self.projector = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
        elif projector_type == "resampler":
            # Assumed completion: the source text is truncated at this branch.
            # A resampler (Qwen-VL style) would cross-attend learned queries to
            # the patch features to compress them into fewer visual tokens.
            raise NotImplementedError("resampler projector not shown here")
        else:
            raise ValueError(f"unknown projector_type: {projector_type}")
```
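In Stage 1, only the projector is updated: the vision encoder and LLM stay frozen so the projector learns to map visual features into the LLM's embedding space on image-caption pairs. A minimal sketch of that setup, assuming `model` is a `VisionLanguageModel` instance (the learning rate is illustrative):

```python
import torch

# Freeze the vision encoder and the LLM; train only the projector
for param in model.vision_encoder.parameters():
    param.requires_grad = False
for param in model.llm.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    model.projector.parameters(),  # the only trainable parameters in Stage 1
    lr=1e-3,                       # illustrative; LLaVA-style recipes use ~1e-3 here
)
```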