High-performance LLM inference with vLLM, quantization (AWQ, GPTQ, FP8), speculative decoding, and edge deployment. Use when optimizing inference latency, throughput, or memory.
# High-Performance Inference
Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.
> **vLLM 0.14.0** (Jan 2026): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API, Python 3.12+ recommended.
## Overview
Use this skill when:
- Deploying LLMs with low latency requirements
- Reducing GPU memory for larger models
- Maximizing throughput for batch inference
- Edge/mobile deployment with constrained resources
- Cost optimization through efficient hardware utilization
## Quick Reference
```bash
# Basic vLLM server
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192

# With quantization + speculative decoding
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --quantization awq \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 5}' \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9
```
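Both commands above start vLLM's OpenAI-compatible API server (port 8000 by default). A minimal client sketch, assuming the first server from the Quick Reference is running locally and the `openai` Python package is installed:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default port 8000).
# api_key is required by the client but ignored unless --api-key was set on the server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```

The `model` field must match the served model name (or an alias set with `--served-model-name`).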
## vLLM 0.14.x Key Features
| Feature | Benefit |
|---------|---------|
| **PagedAttention** | Up to 24x throughput via efficient KV cache |
| **Continuous Batching** | Dynamic request batching for max utilization |
| **CUDA Graphs** | Fast model execution with graph capture |
| **Tensor Parallelism** | Scale across multiple GPUs |
| **Prefix Caching** | Reuse KV cache for shared prefixes |
| **AttentionConfig** | New API replacing the `VLLM_ATTENTION_BACKEND` environment variable |
| **Semantic Router** | vLLM SR v0.1 "Iris" for intelligent LLM routing |
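Prefix caching pays off when many requests share a long common prefix, such as a fixed system prompt. A minimal sketch, assuming the 8B model and the prompts below are stand-ins for your own workload; with `enable_prefix_caching=True`, the KV cache for the shared prefix is computed once and reused across requests:

```python
from vllm import LLM, SamplingParams

# Shared prefix: its KV cache is computed once, then reused for every prompt.
SYSTEM_PREFIX = (
    "You are a support assistant for ExampleCorp. Answer concisely and cite "
    "the relevant policy section when possible.\n\n"
)
questions = [
    "How do I reset my password?",
    "What is the refund window for annual plans?",
]

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)
outputs = llm.generate(
    [SYSTEM_PREFIX + q for q in questions],
    SamplingParams(temperature=0.2, max_tokens=256),
)
for out in outputs:
    print(out.outputs[0].text)
```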
## Python vLLM Integration
```python
from vllm import LLM, SamplingParams

# Initialize with optimization flags
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

# Generate completions for a batch of prompts
prompts = ["Explain PagedAttention in one short paragraph."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
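The `--speculative-config` flag from the Quick Reference has a Python counterpart. A minimal sketch, assuming your vLLM version accepts the same `speculative_config` dict in the `LLM` constructor (the keys below simply mirror the CLI example above); n-gram speculation proposes draft tokens via prompt lookup, so no separate draft model is loaded:

```python
from vllm import LLM, SamplingParams

# N-gram speculative decoding: draft tokens come from prompt lookup,
# so no extra draft-model weights are needed.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    speculative_config={"method": "ngram", "num_speculative_tokens": 5},
    gpu_memory_utilization=0.9,
)

outputs = llm.generate(
    ["Rewrite in formal English: gotta ship the model today."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

Speculative decoding helps most when draft acceptance rates are high (e.g., repetitive or extractive outputs); measure end-to-end latency before and after enabling it.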
## Quantization