high-performance-inference

High-performance LLM inference with vLLM, quantization (AWQ, GPTQ, FP8), speculative decoding, and edge deployment. Use when optimizing inference latency, throughput, or memory.

Marketplace: orchestkit (yonatangross/skillforge-claude-plugin)
Plugin: ork-llm-advanced
Category: ai
Repository: yonatangross/skillforge-claude-plugin (33 stars)
Path: plugins/ork-llm-advanced/skills/high-performance-inference/SKILL.md
Last Verified: January 25, 2026

Install Skill

npx add-skill https://github.com/yonatangross/skillforge-claude-plugin/blob/main/plugins/ork-llm-advanced/skills/high-performance-inference/SKILL.md -a claude-code --skill high-performance-inference

Installation paths:

Claude: .claude/skills/high-performance-inference/

Instructions

# High-Performance Inference

Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.

> **vLLM 0.14.0** (Jan 2026): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API, Python 3.12+ recommended.

## Overview

Use this skill when:

- Deploying LLMs with low latency requirements
- Reducing GPU memory for larger models
- Maximizing throughput for batch inference
- Edge/mobile deployment with constrained resources
- Cost optimization through efficient hardware utilization

## Quick Reference

```bash
# Basic vLLM server
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192

# With quantization + speculative decoding
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --quantization awq \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 5}' \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9
```
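
Once a server like the one above is running, it exposes an OpenAI-compatible API. A minimal client sketch, assuming the default port 8000, no API key enforcement, and the `openai` Python package installed (none of which are specified in the skill itself):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
    temperature=0.7,
)
print(response.choices[0].message.content)
```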

## vLLM 0.14.x Key Features

| Feature | Benefit |
|---------|---------|
| **PagedAttention** | Up to 24x throughput via efficient KV cache |
| **Continuous Batching** | Dynamic request batching for max utilization |
| **CUDA Graphs** | Fast model execution with graph capture |
| **Tensor Parallelism** | Scale across multiple GPUs |
| **Prefix Caching** | Reuse KV cache for shared prefixes |
| **AttentionConfig** | New API replacing VLLM_ATTENTION_BACKEND env |
| **Semantic Router** | vLLM SR v0.1 "Iris" for intelligent LLM routing |
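
To see the prefix-caching row in practice, here is an illustrative sketch (not part of the original skill): two prompts share a long system prefix, and with `enable_prefix_caching=True` the second request can reuse the KV-cache blocks built for that prefix, so it should complete noticeably faster.

```python
import time

from vllm import LLM, SamplingParams

# Illustrative only: time back-to-back generations that share a long prefix.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)

shared_prefix = "You are a support agent for ACME Corp. Follow policy strictly. " * 100
params = SamplingParams(max_tokens=64)

for question in ["How do I reset my password?", "How do I cancel my plan?"]:
    start = time.perf_counter()
    out = llm.generate([shared_prefix + question], params)
    print(f"{time.perf_counter() - start:.2f}s -> {out[0].outputs[0].text[:80]!r}")
```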

## Python vLLM Integration

```python
from vllm import LLM, SamplingParams

# Initialize with optimization flags
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

# Generate
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
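
Speculative decoding can also be configured from the Python API. A hedged sketch, assuming the `LLM` constructor accepts the same JSON-shaped `speculative_config` used by the `--speculative-config` flag in the Quick Reference (check the exact parameter names against the vLLM 0.14.x docs):

```python
from vllm import LLM, SamplingParams

# Sketch only: ngram-based speculative decoding, mirroring the CLI example above.
# Keys inside speculative_config are assumptions; verify against your vLLM version.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",            # draft tokens come from n-gram lookup, no draft model
        "num_speculative_tokens": 5,  # tokens proposed per verification step
    },
    gpu_memory_utilization=0.9,
)

outputs = llm.generate(["Explain speculative decoding briefly."], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```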

## Quantization 
