Fast inference with Unsloth and vLLM backend. Covers model loading, fast_generate(), thinking model output parsing, and memory management for efficient inference.
# Fast Inference
## Overview
Unsloth provides optimized inference through the vLLM backend, enabling 2x faster generation compared to standard HuggingFace inference. This skill covers fast inference setup, thinking model output parsing, and memory management.
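To give a sense of the vLLM-backed path end to end, here is a minimal sketch (the model name matches the examples below; the sampling values and single-prompt batch are illustrative, not the canonical settings for this skill):

```python
from unsloth import FastLanguageModel
from vllm import SamplingParams

# Load with the vLLM backend enabled (fast_inference=True)
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
)

# Build a chat prompt and generate with vLLM sampling controls
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is machine learning?"}],
    tokenize=False,
    add_generation_prompt=True,
)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = model.fast_generate([prompt], sampling_params=sampling_params)

# fast_generate returns vLLM RequestOutput objects
print(outputs[0].outputs[0].text)
```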
## Quick Reference
| Component | Purpose |
|-----------|---------|
| `fast_inference=True` | Enable vLLM backend for 2x speedup |
| `model.fast_generate()` | vLLM-accelerated generation |
| `SamplingParams` | Control generation (temperature, top_p, etc.) |
| `FastLanguageModel.for_inference()` | Switch the model into optimized inference mode |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models (see the parsing sketch below) |
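For the last row, a minimal sketch of how the `</think>` token id can be used to split the reasoning trace from the final answer (the helper name and the fallback when no `</think>` is emitted are illustrative):

```python
THINK_END_ID = 151668  # </think> in the Qwen3-Thinking tokenizer

def split_thinking(output_ids, tokenizer):
    """Split generated token ids into (thinking, answer) at the last </think>."""
    try:
        # Position just after the last </think> token
        boundary = len(output_ids) - output_ids[::-1].index(THINK_END_ID)
    except ValueError:
        boundary = 0  # no </think> emitted: treat everything as the answer
    thinking = tokenizer.decode(output_ids[:boundary], skip_special_tokens=True)
    answer = tokenizer.decode(output_ids[boundary:], skip_special_tokens=True)
    return thinking.strip(), answer.strip()
```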
## Critical Environment Setup
```python
import os
from dotenv import load_dotenv

# Load variables from a local .env file (for example, a HuggingFace token)
load_dotenv()
```
## Critical Import Order
```python
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
import torch
import vllm
from vllm import SamplingParams
```
## Environment Verification
Before inference, verify your environment is correctly configured:
```python
import unsloth
from unsloth import FastLanguageModel
import torch
import vllm
# Check versions
print(f"unsloth: {unsloth.__version__}")
print(f"vLLM: {vllm.__version__}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")
```
## Standard Inference (No vLLM)
### Load Model
```python
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Switch the model into optimized inference mode
FastLanguageModel.for_inference(model)
```
### Generate Response
```python
messages = [{"role": "user", "content": "What is machine learning?"}]
prompt = tokenizer.apply_chat_temp