inference

Fast inference with Unsloth and vLLM backend. Covers model loading, fast_generate(), thinking model output parsing, and memory management for efficient inference.

Marketplace: bazzite-ai-plugins (atrawog/bazzite-ai-plugins)
Plugin: bazzite-ai-jupyter (development)
Repository: atrawog/bazzite-ai-plugins
Path: bazzite-ai-jupyter/skills/inference/SKILL.md
Last Verified: January 21, 2026

Install Skill

npx add-skill https://github.com/atrawog/bazzite-ai-plugins/blob/main/bazzite-ai-jupyter/skills/inference/SKILL.md -a claude-code --skill inference

Installation path (Claude): .claude/skills/inference/

Instructions

# Fast Inference

## Overview

Unsloth provides optimized inference through the vLLM backend, enabling 2x faster generation compared to standard HuggingFace inference. This skill covers fast inference setup, thinking model output parsing, and memory management.

## Quick Reference

| Component | Purpose |
|-----------|---------|
| `fast_inference=True` | Enable vLLM backend for 2x speedup |
| `model.fast_generate()` | vLLM-accelerated generation |
| `SamplingParams` | Control generation (temperature, top_p, etc.) |
| `FastLanguageModel.for_inference()` | Switch the model into optimized inference mode |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models |
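The vLLM-backed pieces in this table (`fast_inference=True`, `SamplingParams`, `model.fast_generate()`) can be combined as in the minimal sketch below. It assumes the same Qwen3-Thinking checkpoint used later in this skill; the `gpu_memory_utilization`, sampling, and token-count values are illustrative assumptions, not prescribed settings.

```python
from unsloth import FastLanguageModel  # import unsloth before vLLM
from vllm import SamplingParams

# Load the model with the vLLM backend enabled
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,          # enable the vLLM backend
    gpu_memory_utilization=0.6,   # illustrative: VRAM fraction reserved for vLLM
)

# Build a chat prompt string
messages = [{"role": "user", "content": "What is machine learning?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Sampling settings for vLLM (values are illustrative)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# fast_generate() takes prompt strings and returns vLLM RequestOutput objects
outputs = model.fast_generate([prompt], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```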

## Critical Environment Setup

```python
import os
from dotenv import load_dotenv
load_dotenv()
```

## Critical Import Order

```python
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

import torch
import vllm
from vllm import SamplingParams
```

## Environment Verification

Before inference, verify your environment is correctly configured:

```python
import unsloth
from unsloth import FastLanguageModel
import torch
import vllm

# Check versions
print(f"unsloth: {unsloth.__version__}")
print(f"vLLM: {vllm.__version__}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")
```

## Standard Inference (No vLLM)

### Load Model

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Prepare the model for inference (enables Unsloth's optimized inference mode)
FastLanguageModel.for_inference(model)
```

### Generate Response

```python
messages = [{"role": "user", "content": "What is machine learning?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
```
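To finish the standard-inference path, the sketch below shows one way generation and thinking-output parsing can proceed, assuming the regular HuggingFace `generate()` call and the `</think>` boundary token (ID 151668) from the Quick Reference; `max_new_tokens` and the sampling values are illustrative assumptions.

```python
# Tokenize the chat-formatted prompt and generate with the standard
# HuggingFace path (no vLLM backend in this example)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,   # illustrative value
    do_sample=True,
    temperature=0.7,      # illustrative value
    top_p=0.9,
)

# Keep only the newly generated tokens (drop the prompt)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:].tolist()

# Qwen3-Thinking models emit reasoning terminated by a </think> token (ID 151668);
# split the output at that boundary
THINK_END_ID = 151668
try:
    boundary = len(new_tokens) - new_tokens[::-1].index(THINK_END_ID)
except ValueError:
    boundary = 0  # no </think> found; treat everything as the final answer

thinking = tokenizer.decode(new_tokens[:boundary], skip_special_tokens=True)
answer = tokenizer.decode(new_tokens[boundary:], skip_special_tokens=True)

print("Thinking:", thinking)
print("Answer:", answer)
```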