
vlm-workflow

verified

Use this skill when working with Vision-Language Models. Covers visual encoder selection (CLIP/SigLIP/EVA), vision-language alignment, instruction tuning, multi-image conversation, and VLM evaluation.


Marketplace: everything-claude-code (yxbian23/ai-research-claude-code)
Plugin: everything-claude-code (workflow)
Repository: yxbian23/ai-research-claude-code
Path: skills/vlm-workflow/SKILL.md
Last Verified: January 25, 2026

Install Skill

npx add-skill https://github.com/yxbian23/ai-research-claude-code/blob/main/skills/vlm-workflow/SKILL.md -a claude-code --skill vlm-workflow

Installation paths:

Claude: .claude/skills/vlm-workflow/

Instructions

# Vision-Language Model Workflow

This skill provides comprehensive guidance for building and fine-tuning vision-language models.

## When to Activate

- Building vision-language models from scratch
- Fine-tuning existing VLMs (LLaVA, Qwen-VL, etc.)
- Choosing visual encoders
- Implementing vision-language alignment
- Evaluating VLM capabilities

## Visual Encoder Selection

### Comparison Table

| Encoder | Resolution | Tokens | Strengths | Use Case |
|---------|------------|--------|-----------|----------|
| CLIP ViT-L/14 | 224 | 256 | General vision-language | Standard VLM |
| CLIP ViT-L/14-336 | 336 | 576 | Better detail | Document/OCR |
| SigLIP | 384 | 729 | Better alignment | Modern VLMs |
| EVA-CLIP | 224-448 | 256-1024 | Strong features | Large VLMs |
| DINOv2 | 518 | 1369 | Dense features | Segmentation |
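
The token counts above follow directly from resolution and patch size: a ViT produces (resolution // patch_size)^2 patch tokens (every encoder in the table uses 14-pixel patches), plus, for CLIP-style encoders, a CLS token that VLM pipelines usually drop. A quick sanity check in Python:

```python
def num_patch_tokens(resolution: int, patch_size: int = 14) -> int:
    """Patch tokens produced by a ViT-style encoder (CLS token excluded)."""
    return (resolution // patch_size) ** 2

assert num_patch_tokens(224) == 256    # CLIP ViT-L/14
assert num_patch_tokens(336) == 576    # CLIP ViT-L/14-336
assert num_patch_tokens(384) == 729    # SigLIP so400m-patch14-384
assert num_patch_tokens(518) == 1369   # DINOv2
```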

### Loading Visual Encoders

```python
import torch
from transformers import CLIPVisionModel, SiglipVisionModel

# CLIP
clip_encoder = CLIPVisionModel.from_pretrained(
    "openai/clip-vit-large-patch14-336",
    torch_dtype=torch.float16,
)

# SigLIP
siglip_encoder = SiglipVisionModel.from_pretrained(
    "google/siglip-so400m-patch14-384",
    torch_dtype=torch.float16,
)

# EVA-CLIP (via timm)
import timm
eva_encoder = timm.create_model(
    "eva02_large_patch14_clip_224",
    pretrained=True,
    num_classes=0,
)
```
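
As a quick check that an encoder is wired up correctly, the sketch below runs one image through the CLIP encoder loaded above and inspects its patch features. The CUDA device and `example.jpg` path are placeholders, not part of the original skill:

```python
from PIL import Image
from transformers import CLIPImageProcessor

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values

clip_encoder = clip_encoder.to("cuda")  # fp16 weights, so run on GPU
pixel_values = pixel_values.to(device="cuda", dtype=torch.float16)

with torch.no_grad():
    out = clip_encoder(pixel_values)

# last_hidden_state: (1, 577, 1024) = 1 CLS token + 24x24 patch tokens
patch_tokens = out.last_hidden_state[:, 1:, :]  # what a VLM projector consumes
```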

## Vision-Language Alignment

### Two-Stage Training

**Stage 1: Feature Alignment (Projector Only)**
```python
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder, llm, projector_type="mlp"):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.llm = llm

        vision_dim = vision_encoder.config.hidden_size
        llm_dim = llm.config.hidden_size

        if projector_type == "mlp":
            self.projector = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
        elif projector_type == "resampler":
            # Hypothetical branch: a Q-Former / perceiver-style resampler that
            # compresses visual tokens with learned queries (not sketched here).
            raise NotImplementedError("resampler projector not implemented")
        else:
            raise ValueError(f"unknown projector_type: {projector_type}")
```
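
To make stage 1 concrete, here is a minimal sketch of the projector-only setup, reusing the `siglip_encoder` loaded earlier. The Vicuna checkpoint and the 1e-3 learning rate are illustrative LLaVA-style assumptions, not requirements of this skill:

```python
import torch
from transformers import AutoModelForCausalLM

# Example base LLM (assumption); any causal LM can be used here
llm = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5",
    torch_dtype=torch.float16,
)

model = VisionLanguageModel(siglip_encoder, llm, projector_type="mlp")

# Stage 1: freeze the vision tower and the LLM, train only the projector
for p in model.vision_encoder.parameters():
    p.requires_grad = False
for p in model.llm.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(model.projector.parameters(), lr=1e-3)
```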

Validation Details

Validation checks: Front Matter, Required Fields, Valid Name Format, Valid Description, Has Sections, Allowed Tools
Instruction Length: 9594 chars