Model quantization for efficient inference and training. Covers precision types (FP32, FP16, BF16, INT8, INT4), BitsAndBytes configuration, memory estimation, and performance tradeoffs.

Plugin: bazzite-ai-jupyter (atrawog/bazzite-ai-plugins)
Path: bazzite-ai-jupyter/skills/quantization/SKILL.md
Last Verified: January 21, 2026

Install with the add-skill CLI:

npx add-skill https://github.com/atrawog/bazzite-ai-plugins/blob/main/bazzite-ai-jupyter/skills/quantization/SKILL.md -a claude-code --skill quantization

Installs to: .claude/skills/quantization/

Instructions

# Model Quantization

## Overview

Quantization reduces model precision to save memory and speed up inference. A 7B-parameter model needs ~28 GB for its weights at FP32, but only ~3.5 GB at 4-bit.

## Quick Reference

| Precision | Bits | Bytes/Param | Quality | Speed |
|-----------|------|-------------|---------|-------|
| FP32 | 32 | 4 | Best | Slowest |
| FP16 | 16 | 2 | Excellent | Fast |
| BF16 | 16 | 2 | Excellent | Fast |
| INT8 | 8 | 1 | Good | Faster |
| INT4 | 4 | 0.5 | Acceptable | Fastest |

## Memory Estimation

```python
def estimate_memory(params_billions, precision_bits):
    """Estimate weight memory in GB (weights only, 1 GB = 1e9 bytes)."""
    bytes_per_param = precision_bits / 8
    return params_billions * bytes_per_param

# Example: 7B model
model_size = 7  # billion parameters

print(f"FP32: {estimate_memory(model_size, 32):.1f} GB")  # 28 GB
print(f"FP16: {estimate_memory(model_size, 16):.1f} GB")  # 14 GB
print(f"INT8: {estimate_memory(model_size, 8):.1f} GB")   # 7 GB
print(f"INT4: {estimate_memory(model_size, 4):.1f} GB")   # 3.5 GB
```
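
As a usage sketch of the helper above, the loop below checks which precisions fit an example 24 GB GPU (the budget is an arbitrary assumption, and the estimate covers weights only; activations and KV cache add overhead):

```python
gpu_budget_gb = 24  # assumed GPU memory budget, weights only

for bits in (32, 16, 8, 4):
    size_gb = estimate_memory(7, bits)
    verdict = "fits" if size_gb <= gpu_budget_gb else "does not fit"
    print(f"{bits:>2}-bit: {size_gb:5.1f} GB -> {verdict}")
```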

## Measure Model Size

```python
def get_model_size(model):
    """Get model size in GB including buffers."""
    param_size = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_size = sum(b.numel() * b.element_size() for b in model.buffers())
    total = (param_size + buffer_size) / 1024**3
    return total

print(f"Model size: {get_model_size(model):.2f} GB")
```
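
Note that `get_model_size` counts parameter and buffer bytes rather than actual GPU usage. A rough cross-check (sketch, assuming a CUDA device is available) is to read PyTorch's allocator statistics:

```python
import torch

if torch.cuda.is_available():
    allocated_gb = torch.cuda.memory_allocated() / 1024**3  # bytes held by live tensors
    reserved_gb = torch.cuda.memory_reserved() / 1024**3    # bytes reserved by the caching allocator
    print(f"Allocated: {allocated_gb:.2f} GB, Reserved: {reserved_gb:.2f} GB")
```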

## Load Model at Different Precisions

### FP32 (Default)

```python
from transformers import AutoModelForCausalLM

model_32bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    device_map="auto"
)

print(f"FP32 size: {get_model_size(model_32bit):.2f} GB")
```
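
Since no `torch_dtype` was passed, the weights load in float32. A quick way to confirm the precision actually loaded (sketch, reusing `model_32bit` from above) is to inspect a parameter's dtype:

```python
# Expect torch.float32 for the default load above
print(next(model_32bit.parameters()).dtype)
```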

### FP16 / BF16

```python
import torch

model_16bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float16,  # or torch.bfloat16
    device_map="auto"
)

print(f"FP16 size: {get_model_size(model_16bit):.2f} GB")
```
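
BF16 keeps FP16's memory footprint but has a wider exponent range, which makes it less prone to overflow. A small sketch for picking between the two based on hardware support (BF16 needs an Ampere-or-newer NVIDIA GPU):

```python
import torch

# Prefer bfloat16 where the GPU supports it, otherwise fall back to float16
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16
print(f"Selected dtype: {dtype}")
```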

### 8-bit Quantization

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit weight quantization via bitsandbytes (requires the bitsandbytes package and a CUDA GPU)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

print(f"INT8 size: {get_model_size(model_8bit):.2f} GB")
```

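### 4-bit Quantization

The skill description also covers INT4; a minimal 4-bit (NF4) loading sketch in the same pattern, assuming the same TinyLlama checkpoint and the bitsandbytes backend (the reported size is approximate because 4-bit weights are stored packed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_4bit_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type for the stored weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls at runtime
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=bnb_4bit_config,
    device_map="auto"
)

print(f"INT4 size: {get_model_size(model_4bit):.2f} GB")
```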