Model quantization for efficient inference and training. Covers precision types (FP32, FP16, BF16, INT8, INT4), BitsAndBytes configuration, memory estimation, and performance tradeoffs.
# Model Quantization
## Overview
Quantization reduces model precision to save memory and speed up inference. A 7B-parameter model needs ~28 GB of weights at FP32 but only ~3.5 GB at 4-bit.
## Quick Reference
| Precision | Bits | Bytes/Param | Quality | Speed |
|-----------|------|-------------|---------|-------|
| FP32 | 32 | 4 | Best | Slowest |
| FP16 | 16 | 2 | Excellent | Fast |
| BF16 | 16 | 2 | Excellent | Fast |
| INT8 | 8 | 1 | Good | Faster |
| INT4 | 4 | 0.5 | Acceptable | Fastest |
## Memory Estimation
```python
def estimate_memory(params_billions, precision_bits):
    """Estimate model memory in GB."""
    bytes_per_param = precision_bits / 8
    return params_billions * bytes_per_param

# Example: 7B model
model_size = 7  # billion parameters
print(f"FP32: {estimate_memory(model_size, 32):.1f} GB")  # 28 GB
print(f"FP16: {estimate_memory(model_size, 16):.1f} GB")  # 14 GB
print(f"INT8: {estimate_memory(model_size, 8):.1f} GB")   # 7 GB
print(f"INT4: {estimate_memory(model_size, 4):.1f} GB")   # 3.5 GB
```
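The figures above cover the weights only. Inference also needs headroom for activations and the KV cache; the sketch below assumes a flat ~20% overhead factor, which is a rough rule of thumb rather than a measured value, and `estimate_inference_memory` is a hypothetical helper, not a library function.
```python
def estimate_inference_memory(params_billions, precision_bits, overhead=0.2):
    """Rough VRAM estimate: weights plus an assumed overhead factor
    for activations and KV cache (0.2 is an assumption, not a measurement)."""
    weights_gb = params_billions * precision_bits / 8
    return weights_gb * (1 + overhead)

# Example: 7B model at INT4 with the assumed 20% overhead
print(f"INT4 + overhead: {estimate_inference_memory(7, 4):.1f} GB")  # ~4.2 GB
```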
## Measure Model Size
```python
def get_model_size(model):
    """Get model size in GB including buffers."""
    param_size = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_size = sum(b.numel() * b.element_size() for b in model.buffers())
    total = (param_size + buffer_size) / 1024**3
    return total

print(f"Model size: {get_model_size(model):.2f} GB")
```
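For quantized models it can also help to see which dtypes the parameters actually ended up in, since some layers (norms, embeddings, the LM head) often stay in higher precision. A minimal sketch using standard PyTorch attributes:
```python
from collections import Counter

def count_params_by_dtype(model):
    """Count parameters grouped by dtype; handy after quantization,
    where some layers may remain in FP16/FP32."""
    counts = Counter()
    for p in model.parameters():
        counts[str(p.dtype)] += p.numel()
    return dict(counts)

print(count_params_by_dtype(model))
```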
## Load Model at Different Precisions
### FP32 (Default)
```python
from transformers import AutoModelForCausalLM
model_32bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    device_map="auto"
)
print(f"FP32 size: {get_model_size(model_32bit):.2f} GB")
```
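To confirm which precision the weights actually loaded in, you can inspect a parameter's dtype (a quick sanity check, not part of the original example):
```python
# Expect torch.float32 for the default load
print(next(model_32bit.parameters()).dtype)
```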
### FP16 / BF16
```python
import torch
model_16bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float16,  # or torch.bfloat16
    device_map="auto"
)
print(f"FP16 size: {get_model_size(model_16bit):.2f} GB")
```
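BF16 keeps FP32's dynamic range and is generally preferred on GPUs that support it (Ampere and newer). A small sketch for picking the dtype at runtime with `torch.cuda.is_bf16_supported()`:
```python
import torch

# Prefer bfloat16 when the GPU supports it, otherwise fall back to float16
dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16
print(f"Selected dtype: {dtype}")
```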
### 8-bit Quantization
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=bnb_config,
    device_map="auto"
)
print(f"INT8 size: {get_model_size(model_8bit):.2f} GB")
```