CLIP, SigLIP 2, Voyage multimodal-3 patterns for image+text retrieval, cross-modal search, and multimodal document chunking. Use when building RAG with images, implementing visual search, or hybrid retrieval.
Install with `npx add-skill https://github.com/yonatangross/orchestkit/blob/main/skills/multimodal-rag/SKILL.md -a claude-code --skill multimodal-rag`; the skill is placed at `.claude/skills/multimodal-rag/`.
# Multimodal RAG (2026)
Build retrieval-augmented generation systems that handle images, text, and mixed content.
## Overview
- Image + text retrieval (product search, documentation)
- Cross-modal search (text query -> image results)
- Multimodal document processing (PDFs with charts)
- Visual question answering with context
- Image similarity and deduplication
- Hybrid search pipelines
## Architecture Approaches
| Approach | Pros | Cons | Best For |
|----------|------|------|----------|
| **Joint Embedding** (CLIP) | Direct comparison | Limited context | Pure image search |
| **Caption-based** | Works with text LLMs | Lossy conversion | Existing text RAG |
| **Hybrid** | Best accuracy | More complex | Production systems |
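The caption-based row above can reuse an existing text-only RAG stack: describe each image with a vision-language model, then index and retrieve the caption as ordinary text. A minimal sketch, assuming a BLIP captioning checkpoint and a sentence-transformers embedder as illustrative choices (`caption_then_embed` is a hypothetical helper; any captioner and text embedding model you already run will do):
```python
from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import BlipForConditionalGeneration, BlipProcessor

# Illustrative model choices; swap in whatever captioner / text embedder your stack already uses.
caption_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
caption_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
text_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def caption_then_embed(image_path: str) -> tuple[str, list[float]]:
    """Caption an image, then embed the caption like any other text chunk."""
    image = Image.open(image_path).convert("RGB")
    inputs = caption_processor(images=image, return_tensors="pt")
    output_ids = caption_model.generate(**inputs, max_new_tokens=40)
    caption = caption_processor.decode(output_ids[0], skip_special_tokens=True)
    # Only the caption is indexed, so visual detail it omits is lost to retrieval
    # (the "lossy conversion" trade-off noted in the table above).
    embedding = text_embedder.encode(caption, normalize_embeddings=True).tolist()
    return caption, embedding
```
Because the output is plain text plus a text embedding, the caption can go straight into the same vector index and chunking pipeline as your existing documents.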
## Embedding Models (2026)
| Model | Context | Modalities | Best For |
|-------|---------|------------|----------|
| **Voyage multimodal-3** | 32K tokens | Text + Image | Long documents |
| **SigLIP 2** | Standard | Text + Image | Large-scale retrieval |
| **CLIP ViT-L/14** | 77 tokens | Text + Image | General purpose |
| **ImageBind** | Standard | 6 modalities | Audio/video included |
| **ColPali** | Document | Text + Image | PDF/document RAG |
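Voyage multimodal-3 differs from the CLIP-style models below in that it embeds interleaved text and images in a single call, which is what suits it to long, mixed-content chunks. A minimal sketch, assuming the `voyageai` Python client and its `multimodal_embed` endpoint with `VOYAGE_API_KEY` set in the environment; treat the exact parameter names as assumptions and confirm against the current Voyage docs (`embed_mixed_chunk` is a hypothetical helper):
```python
import voyageai
from PIL import Image

# Assumption: the client reads VOYAGE_API_KEY from the environment by default.
vo = voyageai.Client()

def embed_mixed_chunk(text: str, image_path: str) -> list[float]:
    """Embed an interleaved text + image chunk with Voyage multimodal-3."""
    image = Image.open(image_path).convert("RGB")
    # Each input is a list interleaving strings and PIL images.
    result = vo.multimodal_embed(
        inputs=[[text, image]],
        model="voyage-multimodal-3",
        input_type="document",  # use "query" when embedding the search-query side
    )
    return result.embeddings[0]
```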
## CLIP-Based Image Embeddings
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image_path: str) -> list[float]:
    """Generate a CLIP embedding for an image."""
    image = Image.open(image_path).convert("RGB")  # CLIP expects 3-channel RGB input
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)
    # Normalize for cosine similarity
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

def embed_text(text: str) -> list[float]:
    """Generate a CLIP embedding for a text query."""
    inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True)  # CLIP truncates text at 77 tokens