model-serving (verified)

LLM and ML model deployment for inference. Use when serving models in production, building AI APIs, or optimizing inference. Covers vLLM (LLM serving), TensorRT-LLM (GPU optimization), Ollama (local), BentoML (ML deployment), Triton (multi-model), LangChain (orchestration), LlamaIndex (RAG), and streaming patterns.

Marketplace: ai-design-components (ancoleman/ai-design-components)
Plugin: backend-ai-skills
Repository: ancoleman/ai-design-components (153 stars)
Path: skills/model-serving/SKILL.md
Last verified: February 1, 2026

Install (via the add-skill CLI):

npx add-skill https://github.com/ancoleman/ai-design-components/blob/main/skills/model-serving/SKILL.md -a claude-code --skill model-serving

Installation path (Claude): .claude/skills/model-serving/

Instructions

# Model Serving

## Purpose

Deploy LLM and ML models for production inference with optimized serving engines, streaming response patterns, and orchestration frameworks. Focuses on self-hosted model serving, GPU optimization, and integration with frontend applications.

## When to Use

- Deploying LLMs for production (self-hosted Llama, Mistral, Qwen)
- Building AI APIs with streaming responses
- Serving traditional ML models (scikit-learn, XGBoost, PyTorch)
- Implementing RAG pipelines with vector databases
- Optimizing inference throughput and latency
- Integrating LLM serving with frontend chat interfaces

## Model Serving Selection

### LLM Serving Engines

**vLLM (Recommended Primary)**
- PagedAttention memory management (up to ~24x higher throughput than naive Hugging Face Transformers serving)
- Continuous batching for dynamic request handling
- OpenAI-compatible API endpoints
- Use for: Most self-hosted LLM deployments
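
As a minimal sketch of the OpenAI-compatible endpoint (assuming a server started with something like `vllm serve meta-llama/Llama-3.1-8B-Instruct` on the default port 8000, and the `openai` Python client installed; model name and URL are illustrative):

```python
# Query a locally running vLLM OpenAI-compatible server and stream tokens.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="not-needed",                 # ignored unless the server sets --api-key
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Explain continuous batching in one paragraph."}],
    stream=True,  # stream tokens as they are generated, as a chat frontend would
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

Because the endpoint speaks the OpenAI protocol, the same client code works against a managed API by changing only `base_url` and the model name.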

**TensorRT-LLM**
- Maximum GPU efficiency (2-8x faster than vLLM)
- Requires model conversion and optimization
- Use for: Production workloads needing absolute maximum throughput

**Ollama**
- Local development without GPUs
- Simple CLI interface
- Use for: Prototyping, laptop development, educational purposes
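
A hedged local-development sketch (assuming Ollama is running on its default port 11434 and a model has already been pulled, e.g. `ollama pull llama3.1`):

```python
# Minimal call against a local Ollama instance; model name is illustrative.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Summarize what PagedAttention does.",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```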

**Decision Framework:**
```
Self-hosted LLM deployment needed?
├─ Yes, need maximum throughput → vLLM
├─ Yes, need absolute max GPU efficiency → TensorRT-LLM
├─ Yes, local development only → Ollama
└─ No, use managed API (OpenAI, Anthropic) → No serving layer needed
```

### ML Model Serving (Non-LLM)

**BentoML (Recommended)**
- Python-native, easy deployment
- Adaptive batching for throughput
- Multi-framework support (scikit-learn, PyTorch, XGBoost)
- Use for: Most traditional ML model deployments
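
A minimal service sketch, assuming BentoML 1.2+ and a scikit-learn model saved to a hypothetical `iris_model.joblib` (class name and path are illustrative, not part of the skill):

```python
# service.py -- illustrative BentoML service (assumes BentoML 1.2+).
import bentoml
import joblib
import numpy as np


@bentoml.service(resources={"cpu": "2"})
class IrisClassifier:
    def __init__(self) -> None:
        # Load the trained scikit-learn model once at startup.
        self.model = joblib.load("iris_model.joblib")

    @bentoml.api
    def predict(self, features: list[list[float]]) -> list[int]:
        # BentoML parses and validates the request body; adaptive batching
        # can be enabled on the decorator for higher throughput.
        return self.model.predict(np.asarray(features)).tolist()
```

Run locally with something like `bentoml serve service:IrisClassifier`, which exposes the method as an HTTP endpoint.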

**Triton Inference Server**
- Multi-model serving on same GPU
- Model ensembles (chain multiple models)
- Use for: NVIDIA GPU optimization, serving 10+ models
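
On the client side, a hedged sketch using the `tritonclient` package (model name, tensor names, and shapes are placeholders and must match the deployed model's `config.pbtxt`):

```python
# Illustrative Triton HTTP inference call; "my_model", "INPUT", and "OUTPUT"
# are placeholders that must match the model's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor: a batch of one with four float32 features.
data = np.array([[5.1, 3.5, 1.4, 0.2]], dtype=np.float32)
infer_input = httpclient.InferInput("INPUT", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

result = client.infer(
    model_name="my_model",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT")],
)
print(result.as_numpy("OUTPUT"))
```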

### LLM Orchestration

**LangChain**
- General-purpose workflows, agents, RAG
- 100+ integrations (LLMs, vector DBs, tools)
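
As an orchestration sketch (assuming `langchain-openai` is installed and a self-hosted server exposes the OpenAI-compatible endpoint from the vLLM example above; the URL and model name are illustrative):

```python
# Point a LangChain chat model at a self-hosted OpenAI-compatible endpoint.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",   # e.g. a local vLLM server
    api_key="not-needed",
    model="meta-llama/Llama-3.1-8B-Instruct",
)

prompt = ChatPromptTemplate.from_template("Answer briefly: {question}")
chain = prompt | llm | StrOutputParser()  # LCEL: prompt -> model -> plain string

print(chain.invoke({"question": "When should I choose Triton over BentoML?"}))
```

Swapping the `base_url` is all that changes between a managed API and a self-hosted serving layer, which keeps orchestration code independent of the deployment choice.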
