Use this skill when implementing unified understanding and generation models. Covers multi-task architectures, autoregressive vs diffusion approaches, multimodal tokenization, and next-token prediction paradigms.
# Unified Understanding and Generation
This skill provides guidance for implementing models that unify visual understanding and generation in a single framework.
## When to Activate
- Implementing unified vision-language models
- Designing multi-task architectures
- Choosing between autoregressive and diffusion approaches
- Working with multimodal tokenization
- Building systems that both understand and generate images
## Paradigm Overview
### Key Approaches
| Approach | Understanding | Generation | Examples |
|----------|---------------|------------|----------|
| **Separate Models** | CLIP + LLM | Diffusion | SD + LLaVA |
| **Unified AR** | Next-token | Next-token | Chameleon, Emu3 |
| **Unified Diffusion** | Encoder | Diffusion | DALL-E 3 |
| **Hybrid** | AR | AR + Diffusion | Show-o, Janus |
### Unified Autoregressive Approach
```
Text: "A cat on a sofa" → [text tokens] → Transformer → [image tokens] → Decoder → Image
Image: 🖼️ → Encoder → [image tokens] → Transformer → [text tokens] → "A cat on a sofa"
```
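Conceptually, both directions reduce to next-token prediction over one interleaved sequence drawn from a shared text + image vocabulary. The sketch below illustrates the idea with a toy transformer; the vocabulary sizes, the offsetting of image codes past the text ids, and the `UnifiedARSketch` name are illustrative assumptions, not the configuration of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedARSketch(nn.Module):
    """Toy unified AR model: one transformer, one shared text+image vocabulary."""

    def __init__(self, text_vocab=32000, image_vocab=16384,
                 dim=512, n_layers=4, n_heads=8, max_len=1024):
        super().__init__()
        self.vocab_size = text_vocab + image_vocab   # image codes live after the text ids
        self.tok_emb = nn.Embedding(self.vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, self.vocab_size)

    def forward(self, tokens):
        # tokens: (B, L) mixed text ids and offset image ids
        _, L = tokens.shape
        pos = torch.arange(L, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(L).to(tokens.device)
        x = self.blocks(x, mask=causal)
        return self.head(x)                          # (B, L, vocab) next-token logits

# One training step: the same cross-entropy covers text tokens and image tokens.
model = UnifiedARSketch()
text_ids = torch.randint(0, 32000, (2, 16))              # caption tokens
image_ids = torch.randint(0, 16384, (2, 256)) + 32000    # VQ codes, offset into shared vocab
seq = torch.cat([text_ids, image_ids], dim=1)
logits = model(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, model.vocab_size), seq[:, 1:].reshape(-1))
```

Because generation is just continued sampling from the same head, a text prompt can be followed by exactly `h * w` sampled image tokens, which are then mapped back to pixels by the visual decoder.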
### Hybrid Approach (AR + Diffusion)
```
Understanding: Image → Visual Encoder → LLM → Text
Generation: Text → LLM → Diffusion Decoder → Image
```
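In code, the hybrid design is mostly a routing decision: the autoregressive LLM always sits in the middle, but its output is decoded as text for understanding and handed to a diffusion decoder for generation. The wrapper below is a minimal sketch under that assumption; `visual_encoder`, `llm`, and `diffusion_decoder` are placeholder modules, and the `embed`/`decode`/`encode_text` helpers are hypothetical interfaces, not the API of any specific library.

```python
import torch
import torch.nn as nn

class HybridUnifiedSketch(nn.Module):
    """Routing sketch: AR path for understanding, diffusion path for generation."""

    def __init__(self, visual_encoder: nn.Module, llm: nn.Module,
                 diffusion_decoder: nn.Module, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.visual_encoder = visual_encoder            # e.g. a ViT producing patch features
        self.projector = nn.Linear(vis_dim, llm_dim)    # maps visual features into LLM space
        self.llm = llm                                  # autoregressive language model
        self.diffusion_decoder = diffusion_decoder      # feature-conditioned image decoder

    def understand(self, image: torch.Tensor, prompt_ids: torch.Tensor) -> torch.Tensor:
        # AR path: visual features are prepended to the prompt; the LLM predicts text tokens.
        vis = self.projector(self.visual_encoder(image))       # (B, N, llm_dim)
        txt = self.llm.embed(prompt_ids)                        # (B, T, llm_dim) -- hypothetical helper
        return self.llm.decode(torch.cat([vis, txt], dim=1))    # text logits     -- hypothetical helper

    def generate(self, prompt_ids: torch.Tensor) -> torch.Tensor:
        # Diffusion path: LLM hidden states condition the diffusion decoder instead of
        # the LLM emitting discrete image tokens itself.
        cond = self.llm.encode_text(prompt_ids)                 # (B, T, llm_dim) -- hypothetical helper
        return self.diffusion_decoder(cond)                     # image (or latent to be decoded)
```

The trade-off is that understanding reuses the LLM's autoregressive pretraining directly, while generation keeps the sample quality of a diffusion decoder, at the cost of maintaining two decoding paths.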
## Visual Tokenization
### VQ-VAE / VQ-GAN Tokenizer
```python
import torch
import torch.nn as nn
from einops import rearrange


class VQTokenizer(nn.Module):
    """Vector-quantized visual tokenizer."""

    def __init__(
        self,
        vocab_size: int = 16384,
        embed_dim: int = 256,
        img_size: int = 256,
        patch_size: int = 16,
    ):
        super().__init__()
        # CNNEncoder / CNNDecoder are assumed to be defined elsewhere in this skill.
        self.encoder = CNNEncoder(out_dim=embed_dim)
        self.decoder = CNNDecoder(in_dim=embed_dim)
        self.codebook = nn.Embedding(vocab_size, embed_dim)
        self.vocab_size = vocab_size

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        """Encode image to discrete tokens."""
        # x: (B, 3, H, W) -> z: (B, D, h, w)
        z = self.encoder(x)
        # Reshape for codebook lookup
        z = rearrange(z, 'b d h w -> b (h w) d')
        # Find nearest codebook entries