LLM deployment strategies including vLLM, TGI, and cloud inference endpoints.
Install: `npx add-skill https://github.com/pluginagentmarketplace/custom-plugin-ai-engineer/blob/main/skills/model-deployment/SKILL.md -a claude-code --skill model-deployment`

Installation path: `.claude/skills/model-deployment/`

# Model Deployment
Deploy LLMs to production with optimal performance.
## Quick Start
### vLLM Server
```bash
# Install
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --tensor-parallel-size 1

# Query (OpenAI-compatible)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Hello, how are you?",
        "max_tokens": 100
    }'
```
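Because vLLM exposes an OpenAI-compatible API, the same server can be queried with the official `openai` Python client. A minimal sketch, assuming the `openai` package (v1+) is installed and the server above is running on port 8000:

```python
from openai import OpenAI

# Point the client at the local vLLM server; a key is required by the client but not checked
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="Hello, how are you?",
    max_tokens=100,
)
print(completion.choices[0].text)
```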
### Text Generation Inference (TGI)
```bash
# Docker deployment (mount a local cache dir; docker run needs an absolute host path)
docker run --gpus all -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --quantize bitsandbytes-nf4 \
    --max-input-length 4096 \
    --max-total-tokens 8192

# Query
curl http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 100}}'
```
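The same TGI endpoint can also be called from Python through `huggingface_hub`'s `InferenceClient`, which speaks the `/generate` protocol. A minimal sketch, assuming the `huggingface_hub` package is installed and the container above is listening on port 8080:

```python
from huggingface_hub import InferenceClient

# Point the client at the local TGI container
client = InferenceClient("http://localhost:8080")

# Single response, mirrors the curl request above
print(client.text_generation("What is AI?", max_new_tokens=100))

# Stream tokens as they are generated
for token in client.text_generation("What is AI?", max_new_tokens=100, stream=True):
    print(token, end="", flush=True)
```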
### Ollama (Local Deployment)
```bash
# Install and run
curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama2

# API usage
curl http://localhost:11434/api/generate -d '{
    "model": "llama2",
    "prompt": "Why is the sky blue?"
}'
```
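Ollama's `/api/generate` endpoint streams JSON lines by default. A minimal Python sketch using `requests` that asks for a single non-streamed response (assumes `requests` is installed and Ollama is running on its default port 11434):

```python
import requests

# "stream": False returns one JSON object instead of a stream of JSON lines
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```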
## Deployment Options Comparison
| Platform | Ease of Use | Pricing | Scale | Latency | Best For |
|----------|------|------|-------|---------|----------|
| vLLM | ⭐⭐ | Self-host | High | Low | Production |
| TGI | ⭐⭐ | Self-host | High | Low | HuggingFace ecosystem |
| Ollama | ⭐⭐⭐ | Free | Low | Medium | Local dev |
| OpenAI | ⭐⭐⭐ | Pay-per-token | Very High | Low | Quick start |
| AWS Bedrock | ⭐⭐ | Pay-per-token | Very High | Medium | Enterprise |
| Replicate | ⭐⭐⭐ | Pay-per-second | High | Medium | Prototyping |
## FastAPI Inference Server
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
app = FastAPI()