High-performance LLM inference with vLLM, quantization (AWQ, GPTQ, FP8), speculative decoding, and edge deployment. Use when optimizing inference latency, throughput, or memory.
# High-Performance Inference
Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.
> **vLLM 0.14.0** (Jan 2026): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API, Python 3.12+ recommended.
## Overview
Use this skill when:
- Deploying LLMs with low latency requirements
- Reducing GPU memory for larger models
- Maximizing throughput for batch inference
- Edge/mobile deployment with constrained resources
- Cost optimization through efficient hardware utilization
## Quick Reference
```bash
# Basic vLLM server
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192

# With quantization + speculative decoding
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --quantization awq \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 5}' \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9
```
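Once the server is up, it exposes an OpenAI-compatible API (port 8000 by default). A minimal client sketch, assuming the default host/port and the official `openai` Python package:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the API key is unused unless --api-key is set
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
    temperature=0.7,
)
print(response.choices[0].message.content)
```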
## vLLM 0.14.x Key Features
| Feature | Benefit |
|---------|---------|
| **PagedAttention** | Up to 24x throughput via efficient KV cache |
| **Continuous Batching** | Dynamic request batching for max utilization |
| **CUDA Graphs** | Fast model execution with graph capture |
| **Tensor Parallelism** | Scale across multiple GPUs |
| **Prefix Caching** | Reuse KV cache for shared prefixes |
| **AttentionConfig** | New config API replacing the VLLM_ATTENTION_BACKEND environment variable |
| **Semantic Router** | vLLM SR v0.1 "Iris" for intelligent LLM routing |
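To pick a tensor-parallel size and quantization scheme from the table above, a rough weight-memory estimate helps. A back-of-the-envelope sketch (weights only, ignoring KV cache and activation overhead; the byte-width table is an illustrative assumption):

```python
# Rough per-GPU weight memory: params * bytes_per_param / tensor_parallel_size.
# KV cache, activations, and CUDA graph overhead come on top of this.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "awq_int4": 0.5, "gptq_int4": 0.5}

def weight_gib_per_gpu(num_params_b: float, dtype: str, tp_size: int) -> float:
    """Approximate weight footprint per GPU in GiB."""
    total_bytes = num_params_b * 1e9 * BYTES_PER_PARAM[dtype]
    return total_bytes / tp_size / (1024 ** 3)

# Llama 3.1 70B split across 4 GPUs
print(f"fp16: {weight_gib_per_gpu(70, 'fp16', 4):.1f} GiB/GPU")      # ~32.6 GiB
print(f"awq : {weight_gib_per_gpu(70, 'awq_int4', 4):.1f} GiB/GPU")  # ~8.1 GiB
```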
## Python vLLM Integration
```python
from vllm import LLM, SamplingParams

# Initialize with optimization flags
# Note: quantization="awq" expects an AWQ-quantized checkpoint
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

# Generate
prompts = ["Explain PagedAttention in two sentences."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
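Offline batch throughput is easy to sanity-check from the `generate()` results. A minimal timing sketch, reusing the `llm` and `sampling_params` objects above (counting generated tokens via `CompletionOutput.token_ids`; the batch size is illustrative):

```python
import time

prompts = ["Write a haiku about GPUs."] * 64  # illustrative batch

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Count generated tokens across the batch to estimate throughput
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} tok/s")
```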
## Quantization