Model quantization for efficient inference and training. Covers precision types (FP32, FP16, BF16, INT8, INT4), BitsAndBytes configuration, memory estimation, and performance tradeoffs.
# Model Quantization
## Overview
Quantization reduces model precision to save memory and speed up inference. A 7B-parameter model needs ~28 GB of weights at FP32 but only ~3.5 GB at 4-bit.
## Quick Reference
| Precision | Bits | Bytes/Param | Quality | Speed |
|-----------|------|-------------|---------|-------|
| FP32 | 32 | 4 | Best | Slowest |
| FP16 | 16 | 2 | Excellent | Fast |
| BF16 | 16 | 2 | Excellent | Fast |
| INT8 | 8 | 1 | Good | Faster |
| INT4 | 4 | 0.5 | Acceptable | Fastest |
## Memory Estimation
```python
def estimate_memory(params_billions, precision_bits):
    """Estimate model memory in GB."""
    bytes_per_param = precision_bits / 8
    return params_billions * bytes_per_param

# Example: 7B model
model_size = 7  # billion parameters
print(f"FP32: {estimate_memory(model_size, 32):.1f} GB")  # 28 GB
print(f"FP16: {estimate_memory(model_size, 16):.1f} GB")  # 14 GB
print(f"INT8: {estimate_memory(model_size, 8):.1f} GB")   # 7 GB
print(f"INT4: {estimate_memory(model_size, 4):.1f} GB")   # 3.5 GB
```
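The figures above cover the weights only. Inference also needs headroom for activations and the KV cache; the sketch below assumes a flat ~20% overhead factor, which is a rough rule of thumb rather than a measured value, and `estimate_inference_memory` is a hypothetical helper, not a library function.
```python
def estimate_inference_memory(params_billions, precision_bits, overhead=0.2):
    """Rough VRAM estimate: weights plus an assumed overhead factor
    for activations and KV cache (0.2 is an assumption, not a measurement)."""
    weights_gb = params_billions * precision_bits / 8
    return weights_gb * (1 + overhead)

# Example: 7B model at INT4 with the assumed 20% overhead
print(f"INT4 + overhead: {estimate_inference_memory(7, 4):.1f} GB")  # ~4.2 GB
```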
## Measure Model Size
```python
def get_model_size(model):
    """Get model size in GB including buffers."""
    param_size = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_size = sum(b.numel() * b.element_size() for b in model.buffers())
    total = (param_size + buffer_size) / 1024**3
    return total

print(f"Model size: {get_model_size(model):.2f} GB")
```
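For quantized models it can also help to see which dtypes the parameters actually ended up in, since some layers (norms, embeddings, the LM head) often stay in higher precision. A minimal sketch using standard PyTorch attributes:
```python
from collections import Counter

def count_params_by_dtype(model):
    """Count parameters grouped by dtype; handy after quantization,
    where some layers may remain in FP16/FP32."""
    counts = Counter()
    for p in model.parameters():
        counts[str(p.dtype)] += p.numel()
    return dict(counts)

print(count_params_by_dtype(model))
```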
## Load Model at Different Precisions
### FP32 (Default)
```python
from transformers import AutoModelForCausalLM
model_32bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    device_map="auto"
)
print(f"FP32 size: {get_model_size(model_32bit):.2f} GB")
```
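To confirm which precision the weights actually loaded in, you can inspect a parameter's dtype (a quick sanity check, not part of the original example):
```python
# Expect torch.float32 for the default load
print(next(model_32bit.parameters()).dtype)
```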
### FP16 / BF16
```python
import torch
model_16bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float16,  # or torch.bfloat16
    device_map="auto"
)
print(f"FP16 size: {get_model_size(model_16bit):.2f} GB")
```
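BF16 keeps FP32's dynamic range and is generally preferred on GPUs that support it (Ampere and newer). A small sketch for picking the dtype at runtime with `torch.cuda.is_bf16_supported()`:
```python
import torch

# Prefer bfloat16 when the GPU supports it, otherwise fall back to float16
dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16
print(f"Selected dtype: {dtype}")
```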
### 8-bit Quantization
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=bnb_config,
    device_map="auto"
)
print(f"INT8 size: {get_model_size(model_8bit):.2f} GB")
```