Serving AI Models at Scale

Serving a model to one user is easy. Serving it to 10,000 concurrent users at low latency and reasonable cost is a real engineering challenge. This page covers the tools and patterns that make production LLM serving possible.

📖 Covers: Naive serving problems · vLLM · PagedAttention · Continuous Batching · Triton · FastAPI · Load Testing

The Problem with Naive Serving

🐌 Sequential requests: one request at a time → GPU sits idle between tokens, terrible throughput
💾 KV cache fragmentation: unpredictable output lengths → memory wasted on over-allocation
⏱️ Fixed batch size: short requests wait for long ones to finish → high latency spikes
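
For contrast, here is roughly what the naive pattern looks like in code (an illustrative sketch built on a Hugging Face pipeline; each request fully occupies the GPU before the next one starts):

Python · Naive sequential serving (anti-pattern)
from transformers import pipeline

# Naive serving: one request at a time, no batching, no streaming.
# While one prompt is generating, the GPU does no work for anyone else.
model = pipeline("text-generation", model="gpt2", device=0)

def handle_requests(prompts):
    results = []
    for prompt in prompts:  # strictly sequential: total latency grows linearly
        out = model(prompt, max_new_tokens=100, do_sample=True)
        results.append(out[0]["generated_text"])
    return results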

vLLM — The Industry Standard LLM Server

vLLM is the most widely used open-source LLM serving library. Its two key innovations are PagedAttention and continuous batching.

Terminal · Start vLLM Server
pip install vllm

# Serve Llama 3 8B on a single A100
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --max-model-len 8192 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9

# Now accepts OpenAI-compatible API calls at localhost:8000
Python · Query vLLM with OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain neural networks simply."}],
    max_tokens=256,
    temperature=0.7
)
print(response.choices[0].message.content)

PagedAttention — vLLM's Secret Weapon

The KV cache (which stores attention keys and values for each token) is the main memory consumer during inference. Traditional serving pre-allocates the maximum context size for every request, wasting memory.
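
To see why, a back-of-the-envelope calculation helps. The numbers below assume Llama 3 8B-style dimensions (32 layers, 8 KV heads with grouped-query attention, head dimension 128) in FP16; they are a rough sketch, not exact figures for any particular build:

Python · KV cache back-of-the-envelope
num_layers = 32        # transformer layers
num_kv_heads = 8       # grouped-query attention: KV heads, not query heads
head_dim = 128         # dimension per attention head
bytes_per_value = 2    # FP16

# Keys + values stored for every token, across all layers
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token / 1024)            # ~128 KiB per token
print(kv_bytes_per_token * 8192 / 1024**3)  # ~1 GiB for a full 8192-token context

At roughly 1 GiB per full-length sequence, pre-allocating the maximum context for every request exhausts a GPU's memory after a few dozen concurrent sequences.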

PagedAttention manages KV cache memory like an OS manages virtual memory — using fixed-size "pages" that can be scattered across GPU memory and shared between sequences.

❌ Naive KV Cache: contiguous per-request allocations (used space plus a large unused tail per request) → roughly 50% of memory wasted

✅ PagedAttention: small fixed-size blocks from different requests interleaved across GPU memory → ~96% utilisation

Continuous Batching

Traditional batching waits for an entire batch to finish before accepting new requests. Continuous batching inserts new requests mid-generation, as slots free up. This dramatically improves throughput for workloads with mixed request lengths.

Throughput comparison: naive fixed batching < dynamic batching < vLLM continuous batching (up to ~24× over naive).
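
The idea can be captured in a toy scheduler loop (an illustrative sketch, not vLLM's actual implementation): finished sequences free their slot at every decode step, and waiting requests are admitted immediately instead of waiting for the whole batch to drain.

Python · Toy continuous batching scheduler
import collections

def continuous_batching(requests, max_batch_size):
    # requests: list of (request_id, tokens_to_generate)
    waiting = collections.deque(requests)
    active = {}  # request_id -> tokens still to generate
    steps = 0
    while waiting or active:
        # Admit new requests as soon as slots free up (the key difference
        # from fixed batching, which waits for the entire batch to finish)
        while waiting and len(active) < max_batch_size:
            req_id, length = waiting.popleft()
            active[req_id] = length
        # One decode step: every active sequence produces one token
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] == 0:
                del active[req_id]  # slot freed mid-batch
        steps += 1
    return steps

# Mixed short and long requests share the batch: 50 steps here,
# versus 55 if the short requests had to wait behind a full fixed batch
print(continuous_batching([("a", 5), ("b", 50), ("c", 5), ("d", 5)], max_batch_size=2))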

FastAPI Wrapper (for Custom Models)

Python · Production-ready inference API
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline
import uvicorn

app = FastAPI()

# Load the model once at startup; device=0 assumes a GPU is available
model = pipeline("text-generation", model="gpt2", device=0)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 200
    temperature: float = 0.7

@app.post("/generate")
def generate(req: GenerateRequest):
    # A sync (non-async) endpoint: FastAPI runs it in a threadpool,
    # so the blocking model call doesn't stall the event loop.
    result = model(
        req.prompt,
        max_new_tokens=req.max_tokens,
        temperature=req.temperature,
        do_sample=True,
    )
    return {"text": result[0]["generated_text"]}

@app.get("/health")
def health():
    return {"status": "ok"}

if __name__ == "__main__":
    # One worker per process keeps a single model copy in GPU memory
    uvicorn.run(app, host="0.0.0.0", port=8000, workers=1)
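
Once the server is running, the endpoint can be exercised with a quick client call (the URL and payload match the handler above):

Python · Query the FastAPI endpoint
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Once upon a time", "max_tokens": 50},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text"])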

Production Checklist

Use vLLM for open-source LLMs — up to ~24× throughput over naive serving
Enable streaming — stream tokens as they're generated for better UX (see the sketch after this checklist)
Set max_model_len — cap context to save VRAM for more concurrent requests
Monitor GPU utilisation — should stay 70–90% during peak
Add rate limiting — protect against abuse with per-user token limits
Cache common queries — exact-match or semantic cache (Redis + vector DB)
Set up auto-scaling — scale GPU instances based on queue depth
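
As a minimal sketch of the streaming item above, the OpenAI-compatible vLLM endpoint from earlier can be consumed with stream=True:

Python · Stream tokens from vLLM
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Print tokens as they arrive instead of waiting for the full completion
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain neural networks simply."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)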

Frequently Asked Questions

How many concurrent users can one GPU handle?

It depends heavily on model size, context length, and request rate. A single A100 running Llama 3 8B with vLLM can typically handle 50–200 concurrent streaming sessions. Use load testing (locust or k6) to find your specific saturation point.
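
As a starting point, a minimal Locust script against the OpenAI-compatible endpoint above might look like this (the model name and payload mirror the earlier example; tune max_tokens and user counts to your workload):

Python · Minimal Locust load test
from locust import HttpUser, task, between

class ChatUser(HttpUser):
    wait_time = between(1, 3)  # simulated users pause 1-3 s between requests

    @task
    def chat(self):
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "meta-llama/Meta-Llama-3-8B-Instruct",
                "messages": [{"role": "user", "content": "Explain neural networks simply."}],
                "max_tokens": 64,
            },
        )

Run it with locust -f locustfile.py --host http://localhost:8000 and ramp up users until latency starts to degrade; that knee is your saturation point.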

What is Triton Inference Server?

NVIDIA Triton Inference Server is a more general inference serving system that supports PyTorch, TensorFlow, ONNX, TensorRT, and Python backends. It handles dynamic batching, model ensembles, and concurrent model execution. More complex to set up than vLLM but more flexible for non-LLM models.
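
For a feel of the client side (not LLM-specific), a rough sketch with the tritonclient package looks like this; the model and tensor names (my_model, INPUT0, OUTPUT0) are placeholders that must match your model's configuration:

Python · Query Triton over HTTP
import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's HTTP endpoint (default port 8000)
client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder tensor names and shape; adjust to your model's config
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)
out = httpclient.InferRequestedOutput("OUTPUT0")

result = client.infer(model_name="my_model", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT0").shape)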

How do I reduce time-to-first-token (TTFT)?

TTFT is the latency until the first token appears, and it is dominated by prefill (processing the prompt). Reduce it by: (1) prefix caching — reuse the KV cache for common system prompts, (2) speculative decoding — run a small draft model ahead of the main one, (3) using a smaller model for latency-critical paths, (4) keeping prompts short, since prefill cost grows with prompt length.
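
TTFT is also easy to measure directly by timing the first streamed chunk (a rough sketch against the vLLM endpoint from earlier):

Python · Measure TTFT
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain neural networks simply."}],
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # Time from sending the request until the first generated token arrives
        print(f"TTFT: {time.perf_counter() - start:.2f}s")
        break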
