Serving AI Models at Scale
Serving a model to one user is easy. Serving it to 10,000 concurrent users at low latency and reasonable cost is a real engineering challenge. This page covers the tools and patterns that make production LLM serving possible.
The Problem with Naive Serving
- One request at a time → the GPU sits idle between tokens, terrible throughput
- Unpredictable output lengths → memory wasted on over-allocation
- Short requests wait for long ones to finish → high latency spikes
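To see why single-request decoding is so wasteful, here is a back-of-envelope estimate. All hardware numbers below are rough assumed figures for an A100 and an 8B FP16 model, not measurements:

```python
# Why one-request-at-a-time serving wastes a GPU (illustrative numbers only).

flops_per_token = 2 * 8e9   # ~2 FLOPs per parameter per generated token (8B model)
gpu_flops = 312e12          # assumed A100 peak dense FP16 throughput
weight_bytes = 16e9         # 8B parameters at FP16 (2 bytes each)
mem_bandwidth = 2e12        # assumed ~2 TB/s HBM bandwidth

# Decoding one token for ONE request must stream all weights from HBM,
# so single-request latency is memory-bound, not compute-bound:
t_memory = weight_bytes / mem_bandwidth   # seconds per token (bandwidth limit)
t_compute = flops_per_token / gpu_flops   # seconds per token (compute limit)

print(f"memory-bound: {1 / t_memory:.0f} tok/s")    # ~125 tok/s
print(f"compute-bound: {1 / t_compute:.0f} tok/s")  # ~19500 tok/s
```

The gap between the two limits is roughly two orders of magnitude: batching many requests together amortises the weight reads across all of them, which is exactly what the techniques below exploit.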
vLLM — The Industry Standard LLM Server
vLLM is the most widely used open-source LLM serving library. Its two key innovations are PagedAttention and continuous batching.
```bash
pip install vllm

# Serve Llama 3 8B on a single A100
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --max-model-len 8192 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9
```

The server now accepts OpenAI-compatible API calls at localhost:8000:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain neural networks simply."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```

PagedAttention — vLLM's Secret Weapon
The KV cache (which stores attention keys and values for each token) is the main memory consumer during inference. Traditional serving pre-allocates the maximum context size for every request, wasting memory.
PagedAttention manages KV cache memory like an OS manages virtual memory — using fixed-size "pages" that can be scattered across GPU memory and shared between sequences.
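The idea can be sketched in a few lines. This is a toy block allocator, not vLLM's actual implementation — each sequence keeps a block table mapping logical token positions to physical blocks scattered anywhere in GPU memory, and a new block is allocated only when a sequence crosses a block boundary:

```python
# Toy sketch of PagedAttention-style block allocation (illustrative only).
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block IDs
        self.tables = {}                     # seq_id -> list of physical blocks

    def append_token(self, seq_id, position):
        """Allocate a physical block only when the sequence crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:       # need a fresh block
            table.append(self.free.pop())
        return table[-1], position % BLOCK_SIZE  # (physical block, offset within block)

mgr = BlockManager(num_blocks=1024)
for pos in range(40):            # generate a 40-token sequence
    mgr.append_token("seq-a", pos)

# 40 tokens need ceil(40/16) = 3 blocks; naive pre-allocation for an
# 8192-token context would have reserved 512 blocks up front.
print(len(mgr.tables["seq-a"]))  # 3
```

Because memory is claimed one block at a time, waste is bounded by at most one partially filled block per sequence, which is where the near-total utilisation comes from.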
The difference is stark: a naive KV cache can waste around 50% of memory, while PagedAttention reaches roughly 96% utilisation.
Continuous Batching
Traditional batching waits for an entire batch to finish before accepting new requests. Continuous batching inserts new requests mid-generation, as slots free up. This dramatically improves throughput for workloads with mixed request lengths.
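A toy simulation makes the difference concrete. The request lengths and batch size here are made up for illustration — the point is only the scheduling policy:

```python
# Static vs continuous batching on a mixed workload (illustrative toy model).
# Each request needs `length` decode steps; the server runs up to 4 at once.

def static_batching(lengths, batch_size=4):
    """Whole batch must drain before the next batch is admitted."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])  # batch finishes with its longest member
    return steps

def continuous_batching(lengths, batch_size=4):
    """Freed slots are backfilled with waiting requests every decode step."""
    pending, running, steps = list(lengths), [], 0
    while pending or running:
        while pending and len(running) < batch_size:  # backfill free slots
            running.append(pending.pop(0))
        running = [r - 1 for r in running if r > 1]   # one decode step for everyone
        steps += 1
    return steps

lengths = [100, 5, 5, 5, 100, 5, 5, 5]  # two long requests among short ones
print(static_batching(lengths), continuous_batching(lengths))  # 200 105
```

Under static batching the short requests pay for the long ones twice over; with backfilling, total time collapses toward the cost of the long requests alone.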
FastAPI Wrapper (for Custom Models)
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline
import uvicorn
app = FastAPI()
model = pipeline("text-generation", model="gpt2", device=0)
class Request(BaseModel):
prompt: str
max_tokens: int = 200
temperature: float = 0.7
@app.post("/generate")
async def generate(req: Request):
result = model(
req.prompt,
max_new_tokens=req.max_tokens,
temperature=req.temperature,
do_sample=True
)
return {"text": result[0]["generated_text"]}
@app.get("/health")
async def health(): return {"status": "ok"}
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000, workers=1) Production Checklist
Frequently Asked Questions
How many concurrent users can one GPU handle?
It depends heavily on model size, context length, and request rate. A single A100 running Llama 3 8B with vLLM can typically handle 50–200 concurrent streaming sessions. Use load testing (locust or k6) to find your specific saturation point.
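One quick sanity check before load testing is Little's law (L = λ × W): average in-flight requests equal arrival rate times average time in system. The numbers here are illustrative assumptions, not benchmarks:

```python
# Capacity sanity check via Little's law (illustrative numbers).
arrival_rate = 10   # requests per second hitting the server (assumed)
avg_latency = 8.0   # seconds per request, prefill + streamed decode (assumed)

concurrent = arrival_rate * avg_latency  # average requests in flight
print(concurrent)  # 80.0
```

If that estimate lands well above what your GPU sustains in a load test, you need more replicas, a smaller model, or shorter outputs.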
What is Triton Inference Server?
NVIDIA Triton Inference Server is a more general inference serving system that supports PyTorch, TensorFlow, ONNX, TensorRT, and Python backends. It handles dynamic batching, model ensembles, and concurrent model execution. More complex to set up than vLLM but more flexible for non-LLM models.
How do I reduce time-to-first-token (TTFT)?
TTFT is latency until the first token appears. Reduce it by: (1) prefill caching — cache the KV cache for common system prompts, (2) speculative decoding — run a small draft model ahead, (3) use a smaller model for latency-critical paths, (4) reduce max context length for faster prefill.
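The prefill-caching idea in (1) can be sketched as a simple lookup keyed by the shared prompt prefix. This toy version caches a stand-in value — real servers cache the actual KV tensors at the block level (vLLM exposes this as `--enable-prefix-caching`):

```python
# Toy sketch of prefill caching for a shared system prompt (illustrative only).
import hashlib

class PrefillCache:
    def __init__(self):
        self.cache = {}  # hash of prompt prefix -> precomputed "KV cache" stand-in

    def get_or_compute(self, system_prompt):
        key = hashlib.sha256(system_prompt.encode()).hexdigest()
        if key not in self.cache:
            # Stand-in for the expensive prefill forward pass over the prefix
            self.cache[key] = f"kv-for-{key[:8]}"
        return self.cache[key]

cache = PrefillCache()
kv1 = cache.get_or_compute("You are a helpful assistant.")
kv2 = cache.get_or_compute("You are a helpful assistant.")  # cache hit: no recompute
print(kv1 is kv2)  # True
```

Every request sharing the system prompt then skips that portion of prefill entirely, which directly cuts TTFT for chat-style workloads.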