Use this skill when implementing unified understanding and generation models. Covers multi-task architectures, autoregressive vs diffusion approaches, multimodal tokenization, and next-token prediction paradigms.
# Unified Understanding and Generation
This skill provides guidance for implementing models that unify visual understanding and generation in a single framework.
## When to Activate
- Implementing unified vision-language models
- Designing multi-task architectures
- Choosing between autoregressive and diffusion approaches
- Working with multimodal tokenization
- Building systems that both understand and generate images
## Paradigm Overview
### Key Approaches
| Approach | Understanding | Generation | Examples |
|----------|---------------|------------|----------|
| **Separate Models** | CLIP + LLM | Diffusion | SD + LLaVA |
| **Unified AR** | Next-token | Next-token | Chameleon, Emu3 |
| **Unified Diffusion** | Encoder | Diffusion | DALL-E 3 |
| **Hybrid** | AR | AR + Diffusion | Show-o, Janus |
### Unified Autoregressive Approach
```
Text: "A cat on a sofa" → [text tokens] → Transformer → [image tokens] → Decoder → Image
Image: 🖼️ → Encoder → [image tokens] → Transformer → [text tokens] → "A cat on a sofa"
```
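Conceptually, both directions reduce to next-token prediction over one interleaved sequence drawn from a shared text + image vocabulary. The sketch below illustrates the idea with a toy transformer; the vocabulary sizes, the offsetting of image codes past the text ids, and the `UnifiedARSketch` name are illustrative assumptions, not the configuration of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedARSketch(nn.Module):
    """Toy unified AR model: one transformer, one shared text+image vocabulary."""

    def __init__(self, text_vocab=32000, image_vocab=16384,
                 dim=512, n_layers=4, n_heads=8, max_len=1024):
        super().__init__()
        self.vocab_size = text_vocab + image_vocab   # image codes live after the text ids
        self.tok_emb = nn.Embedding(self.vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, self.vocab_size)

    def forward(self, tokens):
        # tokens: (B, L) mixed text ids and offset image ids
        _, L = tokens.shape
        pos = torch.arange(L, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(L).to(tokens.device)
        x = self.blocks(x, mask=causal)
        return self.head(x)                          # (B, L, vocab) next-token logits

# One training step: the same cross-entropy covers text tokens and image tokens.
model = UnifiedARSketch()
text_ids = torch.randint(0, 32000, (2, 16))              # caption tokens
image_ids = torch.randint(0, 16384, (2, 256)) + 32000    # VQ codes, offset into shared vocab
seq = torch.cat([text_ids, image_ids], dim=1)
logits = model(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, model.vocab_size), seq[:, 1:].reshape(-1))
```

Because generation is just continued sampling from the same head, a text prompt can be followed by exactly `h * w` sampled image tokens, which are then mapped back to pixels by the visual decoder.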
### Hybrid Approach (AR + Diffusion)
```
Understanding: Image → Visual Encoder → LLM → Text
Generation: Text → LLM → Diffusion Decoder → Image
```
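In code, the hybrid design is mostly a routing decision: the autoregressive LLM always sits in the middle, but its output is decoded as text for understanding and handed to a diffusion decoder for generation. The wrapper below is a minimal sketch under that assumption; `visual_encoder`, `llm`, and `diffusion_decoder` are placeholder modules, and the `embed`/`decode`/`encode_text` helpers are hypothetical interfaces, not the API of any specific library.

```python
import torch
import torch.nn as nn

class HybridUnifiedSketch(nn.Module):
    """Routing sketch: AR path for understanding, diffusion path for generation."""

    def __init__(self, visual_encoder: nn.Module, llm: nn.Module,
                 diffusion_decoder: nn.Module, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.visual_encoder = visual_encoder            # e.g. a ViT producing patch features
        self.projector = nn.Linear(vis_dim, llm_dim)    # maps visual features into LLM space
        self.llm = llm                                  # autoregressive language model
        self.diffusion_decoder = diffusion_decoder      # feature-conditioned image decoder

    def understand(self, image: torch.Tensor, prompt_ids: torch.Tensor) -> torch.Tensor:
        # AR path: visual features are prepended to the prompt; the LLM predicts text tokens.
        vis = self.projector(self.visual_encoder(image))       # (B, N, llm_dim)
        txt = self.llm.embed(prompt_ids)                        # (B, T, llm_dim) -- hypothetical helper
        return self.llm.decode(torch.cat([vis, txt], dim=1))    # text logits     -- hypothetical helper

    def generate(self, prompt_ids: torch.Tensor) -> torch.Tensor:
        # Diffusion path: LLM hidden states condition the diffusion decoder instead of
        # the LLM emitting discrete image tokens itself.
        cond = self.llm.encode_text(prompt_ids)                 # (B, T, llm_dim) -- hypothetical helper
        return self.diffusion_decoder(cond)                     # image (or latent to be decoded)
```

The trade-off is that understanding reuses the LLM's autoregressive pretraining directly, while generation keeps the sample quality of a diffusion decoder, at the cost of maintaining two decoding paths.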
## Visual Tokenization
### VQ-VAE / VQ-GAN Tokenizer
```python
import torch
import torch.nn as nn
from einops import rearrange


class VQTokenizer(nn.Module):
    """Vector-quantized visual tokenizer."""

    def __init__(
        self,
        vocab_size: int = 16384,
        embed_dim: int = 256,
        img_size: int = 256,
        patch_size: int = 16,
    ):
        super().__init__()
        # CNNEncoder / CNNDecoder are assumed to be defined elsewhere in this skill.
        self.encoder = CNNEncoder(out_dim=embed_dim)
        self.decoder = CNNDecoder(in_dim=embed_dim)
        self.codebook = nn.Embedding(vocab_size, embed_dim)
        self.vocab_size = vocab_size

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        """Encode image to discrete tokens."""
        # x: (B, 3, H, W) -> z: (B, D, h, w)
        z = self.encoder(x)
        # Reshape for codebook lookup
        z = rearrange(z, 'b d h w -> b (h w) d')
        # Find nearest codebook entries