
unified-generation (verified)

Use this skill when implementing unified understanding and generation models. Covers multi-task architectures, autoregressive vs diffusion approaches, multimodal tokenization, and next-token prediction paradigms.

Marketplace: everything-claude-code (yxbian23/ai-research-claude-code)
Plugin: everything-claude-code (workflow)
Repository: yxbian23/ai-research-claude-code
Path: skills/unified-generation/SKILL.md
Last Verified: January 25, 2026

Install (via the add-skill CLI):

npx add-skill https://github.com/yxbian23/ai-research-claude-code/blob/main/skills/unified-generation/SKILL.md -a claude-code --skill unified-generation

Installation path (Claude): .claude/skills/unified-generation/

Instructions

# Unified Understanding and Generation

This skill provides guidance for implementing models that unify visual understanding and generation in a single framework.

## When to Activate

- Implementing unified vision-language models
- Designing multi-task architectures
- Choosing between autoregressive and diffusion approaches
- Working with multimodal tokenization
- Building systems that both understand and generate images

## Paradigm Overview

### Key Approaches

| Approach | Understanding | Generation | Examples |
|----------|---------------|------------|----------|
| **Separate Models** | CLIP + LLM | Diffusion | SD + LLaVA |
| **Unified AR** | Next-token | Next-token | Chameleon, Emu3 |
| **Unified Diffusion** | Diffusion | Diffusion | UniDiffuser |
| **Hybrid** | AR | AR + Diffusion | Show-o, Janus |

### Unified Autoregressive Approach

```
Text: "A cat on a sofa" → [text tokens] → Transformer → [image tokens] → Decoder → Image

Image: 🖼️ → Encoder → [image tokens] → Transformer → [text tokens] → "A cat on a sofa"
```
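In code, the unified AR paradigm reduces to ordinary causal language modeling over a shared text+image vocabulary. Below is a minimal sketch assuming a decoder-only `transformer` that maps token IDs to logits; the `BOI`/`EOI` sentinel IDs and the `build_sequence` helper are hypothetical illustrations, not any specific model's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sentinel IDs marking an image-token span in the shared vocab
BOI, EOI = 50000, 50001  # "begin of image" / "end of image"

def build_sequence(text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
    """Interleave modalities: [text ... BOI image ... EOI].

    For captioning, the order is reversed (image span first, text after),
    so one model covers both directions.
    """
    boi = torch.tensor([BOI], dtype=torch.long)
    eoi = torch.tensor([EOI], dtype=torch.long)
    return torch.cat([text_tokens, boi, image_tokens, eoi])

def next_token_loss(transformer: nn.Module, seq: torch.Tensor) -> torch.Tensor:
    """Standard causal LM loss over the mixed text+image sequence."""
    logits = transformer(seq[None, :-1])     # (1, T-1, vocab_size)
    targets = seq[None, 1:]                  # (1, T-1)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```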

### Hybrid Approach (AR + Diffusion)

```
Understanding: Image → Visual Encoder → LLM → Text
Generation: Text → LLM → Diffusion Decoder → Image
```
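A minimal sketch of this routing, assuming three pre-built components (`visual_encoder`, `llm`, `diffusion_decoder`); the method names on those components are hypothetical interfaces for illustration, not a real library's API.

```python
import torch
import torch.nn as nn

class HybridUnifiedModel(nn.Module):
    """Routes understanding through the LLM and generation through a
    diffusion decoder conditioned on LLM hidden states."""

    def __init__(self, visual_encoder, llm, diffusion_decoder, hidden_dim: int = 4096):
        super().__init__()
        self.visual_encoder = visual_encoder          # e.g. a ViT
        self.llm = llm                                # causal LM backbone
        self.diffusion_decoder = diffusion_decoder
        # Project LLM states into the diffusion conditioning space
        self.cond_proj = nn.Linear(hidden_dim, hidden_dim)

    def understand(self, image: torch.Tensor, prompt_ids: torch.Tensor) -> torch.Tensor:
        """Understanding path: visual features prefix the LLM, which emits text."""
        vis_embeds = self.visual_encoder(image)                   # (B, N, D)
        return self.llm.generate(prefix_embeds=vis_embeds,        # hypothetical
                                 input_ids=prompt_ids)            # interface

    def generate(self, prompt_ids: torch.Tensor) -> torch.Tensor:
        """Generation path: LLM hidden states condition the diffusion sampler."""
        hidden = self.llm.last_hidden_state(prompt_ids)           # (B, T, D)
        return self.diffusion_decoder.sample(self.cond_proj(hidden))
```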

## Visual Tokenization

### VQ-VAE / VQ-GAN Tokenizer

```python
import torch
import torch.nn as nn
from einops import rearrange


class VQTokenizer(nn.Module):
    """Vector-quantized visual tokenizer.

    `CNNEncoder` / `CNNDecoder` are assumed convolutional backbones
    that downsample / upsample by a factor of `patch_size`.
    """

    def __init__(
        self,
        vocab_size: int = 16384,
        embed_dim: int = 256,
        img_size: int = 256,
        patch_size: int = 16,
    ):
        super().__init__()
        self.encoder = CNNEncoder(out_dim=embed_dim)
        self.decoder = CNNDecoder(in_dim=embed_dim)
        self.codebook = nn.Embedding(vocab_size, embed_dim)
        self.vocab_size = vocab_size

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        """Encode image to discrete tokens."""
        # x: (B, 3, H, W) -> z: (B, D, h, w)
        z = self.encoder(x)

        # Reshape for codebook lookup
        z = rearrange(z, 'b d h w -> b (h w) d')

        # Find nearest codebook entry for each spatial position
        # (squared L2 distance from each latent to every codebook vector)
        dists = (z.pow(2).sum(-1, keepdim=True)
                 - 2 * z @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(-1))   # (B, hw, vocab)
        return dists.argmin(dim=-1)                       # (B, hw) token IDs
```
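A hypothetical round trip, assuming the encoder downsamples by `patch_size = 16`: a 256x256 image becomes a 16x16 latent grid, i.e. 256 discrete tokens the transformer can treat exactly like text.

```python
tokenizer = VQTokenizer(vocab_size=16384, embed_dim=256)
img = torch.randn(1, 3, 256, 256)       # dummy image batch
tokens = tokenizer.encode(img)          # (1, 256) integer IDs in [0, 16384)
print(tokens.shape)
```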
