vision-language-models

verified

GPT-5/4o, Claude 4.5, Gemini 2.5/3, Grok 4 vision patterns for image analysis, document understanding, and visual QA. Use when implementing image captioning, document/chart analysis, or multi-image comparison.

Marketplace

orchestkit

yonatangross/skillforge-claude-plugin

Plugin

ork

development

Repository

yonatangross/skillforge-claude-plugin
33 stars

skills/vision-language-models/SKILL.md

Last Verified

January 25, 2026

Install Skill

npx add-skill https://github.com/yonatangross/skillforge-claude-plugin/blob/main/skills/vision-language-models/SKILL.md -a claude-code --skill vision-language-models

Installation paths:

Claude
.claude/skills/vision-language-models/

Instructions

# Vision Language Models (2026)

Integrate vision capabilities from leading multimodal models for image understanding, document analysis, and visual reasoning.

## Overview

- Image captioning and description generation
- Visual question answering (VQA)
- Document/chart/diagram analysis with OCR
- Multi-image comparison and reasoning
- Bounding box detection and region analysis
- Video frame analysis

## Model Comparison (January 2026)

| Model | Context (tokens) | Strengths | Vision Input |
|-------|---------|-----------|--------------|
| **GPT-5.2** | 128K | Best general reasoning, multimodal | Up to 10 images |
| **Claude Opus 4.5** | 200K | Best coding, sustained agent tasks | Up to 100 images |
| **Gemini 2.5 Pro** | 1M+ | Longest context, video analysis | 3,600 images max |
| **Gemini 3 Pro** | 1M | Deep Think, 100% AIME 2025 | Enhanced segmentation |
| **Grok 4** | 2M | Real-time X integration, DeepSearch | Images + upcoming video |
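
The per-request image limits in the table can be enforced on the client before any API call is made. The following is a minimal sketch, assuming the limits listed above hold for your account and model version. The dictionary keys and the `batch_images` helper are hypothetical names chosen for illustration, not identifiers from any provider SDK.

```python
from typing import Iterator

# Per-request image limits copied from the comparison table above.
# The keys are illustrative labels, not official model identifiers.
MAX_IMAGES_PER_REQUEST = {
    "gpt-5.2": 10,
    "claude-opus-4.5": 100,
    "gemini-2.5-pro": 3600,
}


def batch_images(image_paths: list[str], model: str) -> Iterator[list[str]]:
    """Yield consecutive batches no larger than the model's image limit."""
    limit = MAX_IMAGES_PER_REQUEST[model]
    for start in range(0, len(image_paths), limit):
        yield image_paths[start:start + limit]
```

Each batch can then be sent as one request, with the results merged afterwards. Whether splitting is acceptable depends on the task: multi-image comparison generally needs the compared images in the same request.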

## Image Input Methods

### Base64 Encoding (All Providers)

```python
import base64
import mimetypes

def encode_image_base64(image_path: str) -> tuple[str, str]:
    """Encode local image to base64 with MIME type."""
    mime_type, _ = mimetypes.guess_type(image_path)
    mime_type = mime_type or "image/png"

    with open(image_path, "rb") as f:
        base64_data = base64.standard_b64encode(f.read()).decode("utf-8")

    return base64_data, mime_type
```
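
APIs that take an image as a URL can usually accept the encoded bytes as a data URL of the form `data:<mime>;base64,<data>`. The sketch below shows that step in isolation. `to_data_url` is a hypothetical helper name, and the sample bytes are only the PNG file signature, standing in for a real image.

```python
import base64


def to_data_url(base64_data: str, mime_type: str) -> str:
    """Build a data URL from base64 text and its MIME type."""
    return f"data:{mime_type};base64,{base64_data}"


# Stand-in bytes (the 8-byte PNG signature) instead of a real image file.
sample = base64.standard_b64encode(b"\x89PNG\r\n\x1a\n").decode("utf-8")
url = to_data_url(sample, "image/png")
```

Because `encode_image_base64` returns `(base64_data, mime_type)` in the same order, its result can be unpacked straight into the helper: `to_data_url(*encode_image_base64(path))`.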

### OpenAI GPT-5/4o Vision

```python
from openai import OpenAI

client = OpenAI()

def analyze_image_openai(image_path: str, prompt: str) -> str:
    """Analyze image using GPT-5 or GPT-4o."""
    base64_data, mime_type = encode_image_base64(image_path)

    response = client.chat.completions.create(
        model="gpt-5",  # or "gpt-4o", "gpt-4.1"
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
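                    "url": f"data:{mime_type};base64,{base64_data}"
                }},
            ],
        }],
    )
    # Assumed completion: the source is cut off after the image URL.
    return response.choices[0].message.content
```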

Validation Details

Front Matter
Required Fields
Valid Name Format
Valid Description
Has Sections
Allowed Tools
Instruction Length: 8154 chars