LLM and ML model deployment for inference. Use when serving models in production, building AI APIs, or optimizing inference. Covers vLLM (LLM serving), TensorRT-LLM (GPU optimization), Ollama (local), BentoML (ML deployment), Triton (multi-model), LangChain (orchestration), LlamaIndex (RAG), and streaming patterns.
View on GitHub: ancoleman/ai-design-components
backend-ai-skills
February 1, 2026
npx add-skill https://github.com/ancoleman/ai-design-components/blob/main/skills/model-serving/SKILL.md -a claude-code --skill model-serving

Installation paths:
- `.claude/skills/model-serving/`

# Model Serving

## Purpose

Deploy LLM and ML models for production inference with optimized serving engines, streaming response patterns, and orchestration frameworks. Focuses on self-hosted model serving, GPU optimization, and integration with frontend applications.

## When to Use

- Deploying LLMs for production (self-hosted Llama, Mistral, Qwen)
- Building AI APIs with streaming responses
- Serving traditional ML models (scikit-learn, XGBoost, PyTorch)
- Implementing RAG pipelines with vector databases
- Optimizing inference throughput and latency
- Integrating LLM serving with frontend chat interfaces

## Model Serving Selection

### LLM Serving Engines

**vLLM (Recommended Primary)**

- PagedAttention memory management (20-30x throughput improvement)
- Continuous batching for dynamic request handling
- OpenAI-compatible API endpoints (see the client sketch below)
- Use for: Most self-hosted LLM deployments

**TensorRT-LLM**

- Maximum GPU efficiency (2-8x faster than vLLM)
- Requires model conversion and optimization
- Use for: Production workloads needing absolute maximum throughput

**Ollama**

- Local development without GPUs
- Simple CLI interface
- Use for: Prototyping, laptop development, educational purposes (see the sketch below)

**Decision Framework:**

```
Self-hosted LLM deployment needed?
├─ Yes, need maximum throughput → vLLM
├─ Yes, need absolute maximum GPU efficiency → TensorRT-LLM
├─ Yes, local development only → Ollama
└─ No, use a managed API (OpenAI, Anthropic) → No serving layer needed
```

### ML Model Serving (Non-LLM)

**BentoML (Recommended)**

- Python-native, easy deployment
- Adaptive batching for throughput
- Multi-framework support (scikit-learn, PyTorch, XGBoost)
- Use for: Most traditional ML model deployments (see the service sketch below)

**Triton Inference Server**

- Multi-model serving on the same GPU
- Model ensembles (chain multiple models)
- Use for: NVIDIA GPU optimization, serving 10+ models

### LLM Orchestration

**LangChain**

- General-purpose workflows, agents, RAG
- 100+ integrations (LLMs, vector DBs, tools)
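Because vLLM exposes OpenAI-compatible endpoints, a stock OpenAI client can stream tokens from a self-hosted model. A minimal sketch, assuming a local server started with something like `vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000` (the model name and port are illustrative):

```python
# Stream chat completions from a local vLLM server via its
# OpenAI-compatible API. The base_url, port, and model name are
# assumptions for illustration; vLLM ignores the API key by default.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching in two sentences."}],
    stream=True,  # tokens arrive incrementally, suited to chat UIs
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

Since the schema mirrors the OpenAI API, switching between this self-hosted endpoint and a managed API is mostly a matter of changing `base_url`, which keeps the "no serving layer needed" branch of the decision framework cheap to fall back to.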
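For the Ollama local-development path, the sketch below calls its HTTP API with plain `requests`; it assumes `ollama serve` is running on the default port and that a model such as `llama3` has already been pulled (both assumptions):

```python
# Minimal sketch: prompt a local Ollama instance over its HTTP API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Explain PagedAttention in two sentences.",
        "stream": False,  # return one JSON body instead of streamed chunks
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```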
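For traditional ML models, here is a hedged sketch of a BentoML service written against the 1.2+ `@bentoml.service` API; the class name, model file, and resource settings are illustrative assumptions rather than part of this skill:

```python
# Sketch of a BentoML service wrapping a pre-trained scikit-learn model.
# Serve locally with: bentoml serve service:IrisClassifier
import bentoml
import joblib
import numpy as np

@bentoml.service(resources={"cpu": "2"})
class IrisClassifier:
    def __init__(self) -> None:
        # Load the trained model once at startup (path is illustrative).
        self.model = joblib.load("iris_model.pkl")

    @bentoml.api
    def predict(self, features: list[list[float]]) -> list[int]:
        # Accept a batch of feature rows and return predicted class labels.
        return self.model.predict(np.array(features)).tolist()
```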
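To connect orchestration to the serving layer, a small LangChain (LCEL) sketch that points `ChatOpenAI` at the vLLM endpoint from the first example; the model name, URL, and prompt are assumptions:

```python
# Minimal LangChain pipeline backed by a self-hosted vLLM endpoint.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Point the OpenAI-compatible chat model at the local vLLM server.
llm = ChatOpenAI(
    model="meta-llama/Llama-3.1-8B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

prompt = ChatPromptTemplate.from_template("Answer in one paragraph: {question}")

# Compose prompt -> model -> plain-string output with the | operator.
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"question": "When should I prefer vLLM over Ollama?"}))
```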