Fast inference with Unsloth and vLLM backend. Covers model loading, fast_generate(), thinking model output parsing, and memory management for efficient inference.
# Fast Inference
## Overview
Unsloth provides optimized inference through the vLLM backend, enabling 2x faster generation compared to standard HuggingFace inference. This skill covers fast inference setup, thinking model output parsing, and memory management.
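To give a sense of the vLLM-backed path end to end, here is a minimal sketch (the model name matches the examples below; the sampling values and single-prompt batch are illustrative, not the canonical settings for this skill):

```python
from unsloth import FastLanguageModel
from vllm import SamplingParams

# Load with the vLLM backend enabled (fast_inference=True)
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
)

# Build a chat prompt and generate with vLLM sampling controls
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is machine learning?"}],
    tokenize=False,
    add_generation_prompt=True,
)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = model.fast_generate([prompt], sampling_params=sampling_params)

# fast_generate returns vLLM RequestOutput objects
print(outputs[0].outputs[0].text)
```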
## Quick Reference
| Component | Purpose |
|-----------|---------|
| `fast_inference=True` | Enable vLLM backend for 2x speedup |
| `model.fast_generate()` | vLLM-accelerated generation |
| `SamplingParams` | Control generation (temperature, top_p, etc.) |
| `FastLanguageModel.for_inference()` | Switch the model into optimized inference mode |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models (see the parsing sketch below) |
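For the last row, a minimal sketch of how the `</think>` token id can be used to split the reasoning trace from the final answer (the helper name and the fallback when no `</think>` is emitted are illustrative):

```python
THINK_END_ID = 151668  # </think> in the Qwen3-Thinking tokenizer

def split_thinking(output_ids, tokenizer):
    """Split generated token ids into (thinking, answer) at the last </think>."""
    try:
        # Position just after the last </think> token
        boundary = len(output_ids) - output_ids[::-1].index(THINK_END_ID)
    except ValueError:
        boundary = 0  # no </think> emitted: treat everything as the answer
    thinking = tokenizer.decode(output_ids[:boundary], skip_special_tokens=True)
    answer = tokenizer.decode(output_ids[boundary:], skip_special_tokens=True)
    return thinking.strip(), answer.strip()
```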
## Critical Environment Setup
```python
import os
from dotenv import load_dotenv

# Load variables from a local .env file (for example, a HuggingFace token)
load_dotenv()
```
## Critical Import Order
```python
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
import torch
import vllm
from vllm import SamplingParams
```
## Environment Verification
Before inference, verify your environment is correctly configured:
```python
import unsloth
from unsloth import FastLanguageModel
import torch
import vllm
# Check versions
print(f"unsloth: {unsloth.__version__}")
print(f"vLLM: {vllm.__version__}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")
```
## Standard Inference (No vLLM)
### Load Model
```python
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Switch the model into optimized inference mode
FastLanguageModel.for_inference(model)
```
### Generate Response
```python
messages = [{"role": "user", "content": "What is machine learning?"}]
prompt = tokenizer.apply_chat_temp