LLM deployment strategies including vLLM, TGI, and cloud inference endpoints.
Install: `npx add-skill https://github.com/pluginagentmarketplace/custom-plugin-ai-engineer/blob/main/skills/model-deployment/SKILL.md -a claude-code --skill model-deployment`

Installation path: `.claude/skills/model-deployment/`

# Model Deployment
Deploy LLMs to production with optimal performance.
## Quick Start
### vLLM Server
```bash
# Install
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --tensor-parallel-size 1

# Query (OpenAI-compatible)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Hello, how are you?",
        "max_tokens": 100
    }'
```
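Because vLLM exposes an OpenAI-compatible API, the same server can be queried with the official `openai` Python client. A minimal sketch, assuming the `openai` package (v1+) is installed and the server above is running on port 8000:

```python
from openai import OpenAI

# Point the client at the local vLLM server; a key is required by the client but not checked
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="Hello, how are you?",
    max_tokens=100,
)
print(completion.choices[0].text)
```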
### Text Generation Inference (TGI)
```bash
# Docker deployment (mount a local cache dir; docker run needs an absolute host path)
docker run --gpus all -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --quantize bitsandbytes-nf4 \
    --max-input-length 4096 \
    --max-total-tokens 8192

# Query
curl http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 100}}'
```
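The same TGI endpoint can also be called from Python through `huggingface_hub`'s `InferenceClient`, which speaks the `/generate` protocol. A minimal sketch, assuming the `huggingface_hub` package is installed and the container above is listening on port 8080:

```python
from huggingface_hub import InferenceClient

# Point the client at the local TGI container
client = InferenceClient("http://localhost:8080")

# Single response, mirrors the curl request above
print(client.text_generation("What is AI?", max_new_tokens=100))

# Stream tokens as they are generated
for token in client.text_generation("What is AI?", max_new_tokens=100, stream=True):
    print(token, end="", flush=True)
```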
### Ollama (Local Deployment)
```bash
# Install and run
curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama2

# API usage
curl http://localhost:11434/api/generate -d '{
    "model": "llama2",
    "prompt": "Why is the sky blue?"
}'
```
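Ollama's `/api/generate` endpoint streams JSON lines by default. A minimal Python sketch using `requests` that asks for a single non-streamed response (assumes `requests` is installed and Ollama is running on its default port 11434):

```python
import requests

# "stream": False returns one JSON object instead of a stream of JSON lines
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```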
## Deployment Options Comparison
| Platform | Ease of Use | Pricing | Scale | Latency | Best For |
|----------|------|------|-------|---------|----------|
| vLLM | ⭐⭐ | Self-host | High | Low | Production |
| TGI | ⭐⭐ | Self-host | High | Low | HuggingFace ecosystem |
| Ollama | ⭐⭐⭐ | Free | Low | Medium | Local dev |
| OpenAI | ⭐⭐⭐ | Pay-per-token | Very High | Low | Quick start |
| AWS Bedrock | ⭐⭐ | Pay-per-token | Very High | Medium | Enterprise |
| Replicate | ⭐⭐⭐ | Pay-per-second | High | Medium | Prototyping |
## FastAPI Inference Server
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
app = FastAPI()