
model-deployment (verified)

LLM deployment strategies including vLLM, TGI, and cloud inference endpoints.


Marketplace: pluginagentmarketplace-ai-engineer
Plugin: ai-engineer-plugin
Repository: pluginagentmarketplace/custom-plugin-ai-engineer (2 stars)

Path: skills/model-deployment/SKILL.md
Last Verified: January 20, 2026

Install Skill

npx add-skill https://github.com/pluginagentmarketplace/custom-plugin-ai-engineer/blob/main/skills/model-deployment/SKILL.md -a claude-code --skill model-deployment

Installation path (Claude): .claude/skills/model-deployment/

Instructions

# Model Deployment

Deploy LLMs to production with optimal performance.

## Quick Start

### vLLM Server
```bash
# Install
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --tensor-parallel-size 1

# Query (OpenAI-compatible)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Hello, how are you?",
        "max_tokens": 100
    }'
```
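
Because the vLLM server is OpenAI-compatible, the official `openai` Python client works against it as well. A minimal sketch; the `api_key` value is a placeholder, since vLLM ignores it unless the server is started with `--api-key`:
```python
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="Hello, how are you?",
    max_tokens=100,
)
print(completion.choices[0].text)
```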

### Text Generation Inference (TGI)
```bash
# Docker deployment
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --quantize bitsandbytes-nf4 \
    --max-input-length 4096 \
    --max-total-tokens 8192

# Query
curl http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 100}}'
```
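
For programmatic access, the same endpoint can be called from Python. A minimal sketch using `requests`, assuming the container above is listening on port 8080:
```python
import requests

# TGI's native /generate endpoint returns {"generated_text": "..."}
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "What is AI?", "parameters": {"max_new_tokens": 100}},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```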

### Ollama (Local Deployment)
```bash
# Install and run
curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama2

# API usage
curl http://localhost:11434/api/generate -d '{
    "model": "llama2",
    "prompt": "Why is the sky blue?"
}'
```
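
From Python, note that Ollama's `/api/generate` streams newline-delimited JSON by default. A minimal sketch that disables streaming to get a single JSON response:
```python
import requests

# "stream": False returns one JSON object instead of a stream of JSON lines
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```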

## Deployment Options Comparison

| Platform | Ease | Cost | Scale | Latency | Best For |
|----------|------|------|-------|---------|----------|
| vLLM | ⭐⭐ | Self-host | High | Low | Production |
| TGI | ⭐⭐ | Self-host | High | Low | HuggingFace ecosystem |
| Ollama | ⭐⭐⭐ | Free | Low | Medium | Local dev |
| OpenAI | ⭐⭐⭐ | Pay-per-token | Very High | Low | Quick start |
| AWS Bedrock | ⭐⭐ | Pay-per-token | Very High | Medium | Enterprise |
| Replicate | ⭐⭐⭐ | Pay-per-second | High | Medium | Prototyping |

## FastAPI Inference Server

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# Load the model and tokenizer once at startup, not per request
# (model name taken from the examples above; swap in your own)
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 100

@app.post("/generate")
def generate(req: GenerateRequest):
    # Endpoint shape here is illustrative; adapt fields as needed
    if not req.prompt:
        raise HTTPException(status_code=400, detail="prompt must be non-empty")
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"text": tokenizer.decode(output[0], skip_special_tokens=True)}
```