multimodal-rag

CLIP, SigLIP 2, Voyage multimodal-3 patterns for image+text retrieval, cross-modal search, and multimodal document chunking. Use when building RAG with images, implementing visual search, or hybrid retrieval.

Repository: yonatangross/orchestkit
Plugin: ork-rag-advanced
Path: plugins/ork-rag-advanced/skills/multimodal-rag/SKILL.md
Last Verified: January 25, 2026

Install with the add-skill CLI (Claude Code installs to .claude/skills/multimodal-rag/):

npx add-skill https://github.com/yonatangross/orchestkit/blob/main/plugins/ork-rag-advanced/skills/multimodal-rag/SKILL.md -a claude-code --skill multimodal-rag

Instructions

# Multimodal RAG (2026)

Build retrieval-augmented generation systems that handle images, text, and mixed content.

## Overview

- Image + text retrieval (product search, documentation)
- Cross-modal search (text query -> image results)
- Multimodal document processing (PDFs with charts)
- Visual question answering with context
- Image similarity and deduplication (see the sketch after this list)
- Hybrid search pipelines
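
A minimal deduplication sketch over precomputed, L2-normalized image embeddings (any joint-embedding model works; the `embeddings` array and the 0.95 threshold are illustrative assumptions, not values from this skill):

```python
import numpy as np

def find_duplicates(embeddings: np.ndarray, threshold: float = 0.95) -> list[tuple[int, int]]:
    """Return index pairs whose cosine similarity exceeds the threshold.

    Assumes each row is already L2-normalized, so the dot product is cosine similarity.
    """
    sims = embeddings @ embeddings.T  # pairwise cosine similarities
    dupes = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                dupes.append((i, j))
    return dupes
```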

## Architecture Approaches

| Approach | Pros | Cons | Best For |
|----------|------|------|----------|
| **Joint Embedding** (CLIP) | Direct comparison | Limited context | Pure image search |
| **Caption-based** | Works with text LLMs | Lossy conversion | Existing text RAG |
| **Hybrid** | Best accuracy | More complex | Production systems |
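
A minimal sketch of the hybrid pattern: run a joint-embedding retriever and a caption/text retriever separately, then fuse their scores with a weighted sum. The two score dicts and the 0.6/0.4 weights below are illustrative assumptions.

```python
def fuse_scores(
    image_scores: dict[str, float],  # doc_id -> similarity from the joint-embedding (e.g. CLIP) index
    text_scores: dict[str, float],   # doc_id -> similarity from the caption/text index
    image_weight: float = 0.6,
    text_weight: float = 0.4,
) -> list[tuple[str, float]]:
    """Weighted late fusion of two retrieval score sets, highest first."""

    def normalize(scores: dict[str, float]) -> dict[str, float]:
        # Min-max normalize so the two score ranges are comparable
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    img, txt = normalize(image_scores), normalize(text_scores)
    fused = {
        doc_id: image_weight * img.get(doc_id, 0.0) + text_weight * txt.get(doc_id, 0.0)
        for doc_id in set(img) | set(txt)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```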

## Embedding Models (2026)

| Model | Context | Modalities | Best For |
|-------|---------|------------|----------|
| **Voyage multimodal-3** | 32K tokens | Text + Image | Long documents |
| **SigLIP 2** | Standard | Text + Image | Large-scale retrieval |
| **CLIP ViT-L/14** | 77 tokens | Text + Image | General purpose |
| **ImageBind** | Standard | 6 modalities | Audio/video included |
| **ColPali** | Document | Text + Image | PDF/document RAG |
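
As a sketch of swapping in SigLIP 2 through Hugging Face `transformers` (the checkpoint id `google/siglip2-base-patch16-224` and the processor defaults are assumptions; verify them against the model card before use):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

CKPT = "google/siglip2-base-patch16-224"  # assumed checkpoint id; check the model card
model = AutoModel.from_pretrained(CKPT)
processor = AutoProcessor.from_pretrained(CKPT)

def siglip_embed_image(image_path: str) -> list[float]:
    """Image embedding via SigLIP 2, normalized for cosine similarity."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb[0].tolist()
```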

## CLIP-Based Image and Text Embeddings

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image_path: str) -> list[float]:
    """Generate CLIP embedding for an image."""
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)

    # Normalize for cosine similarity
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

def embed_text(text: str) -> list[float]:
    """Generate CLIP embedding for text query."""
    inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True)

    with torch.no_grad():
        embeddings = model.get_text_features(**inputs)

    # Normalize for cosine similarity
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()
```

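With both embedders in place, cross-modal search (text query -> image results) reduces to a cosine-similarity lookup. A usage sketch over a small in-memory index (the image paths are placeholders):

```python
import numpy as np

# Index a handful of images (paths are placeholders)
image_paths = ["products/chair.jpg", "products/lamp.jpg", "products/table.jpg"]
index = np.array([embed_image(p) for p in image_paths])  # shape: (n_images, dim)

def search_images(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Return the top-k image paths for a text query, with cosine scores."""
    q = np.array(embed_text(query))
    scores = index @ q  # embeddings are normalized, so dot product = cosine similarity
    ranked = np.argsort(scores)[::-1][:top_k]
    return [(image_paths[i], float(scores[i])) for i in ranked]

# Example: search_images("a wooden dining table")
```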