Phase 6: AI Inference & Optimisation
Training a model is step one. Serving it to millions of users efficiently is the real engineering challenge. AI inference optimisation reduces latency, cuts compute costs, and makes large models deployable on smaller hardware — without sacrificing much accuracy.
Goal: Deploy models faster, smaller, and cheaper
Duration: 4–6 weeks
Tools: ONNX, TensorRT, vLLM, bitsandbytes, llama.cpp
Training vs Inference
🏋️ Training
- Done once (or periodically)
- Uses forward + backward pass
- Needs huge GPU clusters
- Goal: minimise loss
- Batch size: large (efficient)
⚡ Inference
- Done millions of times/day
- Forward pass only (see the sketch after this comparison)
- Must run fast and cheap
- Goal: minimise latency & cost
- Batch size: small (real-time)
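To make the "forward pass only" point concrete, here is a minimal PyTorch sketch (the model and input are placeholders): wrapping the call in torch.inference_mode() skips gradient tracking entirely, which is exactly the saving that inference enjoys over training.

import torch
import torch.nn as nn

# Placeholder model standing in for whatever you trained
model = nn.Linear(128, 10)
model.eval()                      # disable dropout / batch-norm updates

x = torch.randn(1, 128)           # a single real-time request (batch size 1)

with torch.inference_mode():      # no autograd graph, no gradients stored
    logits = model(x)

print(logits.shape)               # torch.Size([1, 10])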
The Inference Stack
Inference Optimisation Topics
Model Quantisation
Reduce precision from FP32 to INT8/INT4: 4–8× smaller and typically 2–4× faster. The #1 inference technique.
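A back-of-the-envelope sketch of where the savings come from (the parameter count is illustrative): FP32 stores each weight in 4 bytes, INT8 in 1 byte, INT4 in half a byte.

# Rough memory footprint of the weights alone (ignores KV cache and activations)
params = 7e9                      # e.g. a 7B-parameter model

bytes_per_weight = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in bytes_per_weight.items():
    print(f"{dtype}: {params * nbytes / 1e9:.1f} GB")

# fp32: 28.0 GB   fp16: 14.0 GB   int8: 7.0 GB   int4: 3.5 GB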
Pruning & Distillation
Remove redundant weights (pruning) or train a small model to mimic a large one (distillation).
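A sketch of the distillation idea (all tensors here are dummy placeholders): the student is trained to match the teacher's softened output distribution with a KL-divergence loss, usually mixed with the ordinary cross-entropy on the hard labels.

import torch
import torch.nn.functional as F

temperature = 2.0                              # softens both distributions
teacher_logits = torch.randn(8, 100)           # dummy teacher outputs: batch of 8, 100 classes
student_logits = torch.randn(8, 100, requires_grad=True)
labels = torch.randint(0, 100, (8,))

# KL divergence between softened teacher and student distributions
soft_loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature**2

hard_loss = F.cross_entropy(student_logits, labels)   # usual supervised loss
loss = 0.5 * soft_loss + 0.5 * hard_loss
loss.backward()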
ONNX & TensorRT
Convert models to portable format (ONNX) and compile for maximum GPU speed (TensorRT).
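A minimal export sketch, assuming a small PyTorch model (the file name and shapes are placeholders); ONNX Runtime or TensorRT can then load the resulting .onnx file.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
dummy_input = torch.randn(1, 128)              # example input that traces the graph

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",                              # portable graph for ONNX Runtime / TensorRT
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},      # allow variable batch size at inference
)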
Serving at Scale
vLLM, PagedAttention, continuous batching, Triton Inference Server. Serve LLMs to thousands of users.
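A minimal offline-generation sketch with vLLM (prompts and sampling values are illustrative); the same engine powers vLLM's OpenAI-compatible server, with continuous batching and PagedAttention handled internally.

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")   # PagedAttention + continuous batching built in
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain quantisation in one sentence.",
    "Why is throughput different from latency?",
]

# Requests are batched together automatically on the GPU
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)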
Quick Wins: Fast Inference Checklist
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load an 8B model in 4-bit: roughly 6 GB of VRAM instead of ~16 GB in FP16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)

print(f"Memory used: {model.get_memory_footprint()/1e9:.2f} GB")

Latency vs Throughput
These are the two key inference metrics — and they're often in tension:
Latency: Time for a single request (ms). Critical for real-time applications like chatbots. Optimise with: smaller batches, faster hardware, caching.
Throughput: Requests processed per second. Critical for batch applications. Optimise with: larger batches, continuous batching, PagedAttention.
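A sketch of how the two metrics are measured and why they pull in opposite directions (generate_batch is a placeholder for whatever your serving stack calls): latency is wall-clock time per request, throughput is completed requests per second, and raising the batch size usually trades the first for the second.

import time

def generate_batch(requests):
    # Placeholder for a real model call: fixed per-batch overhead plus per-request work
    time.sleep(0.05 + 0.01 * len(requests))
    return ["response"] * len(requests)

for batch_size in (1, 8, 32):
    requests = ["prompt"] * batch_size
    start = time.perf_counter()
    generate_batch(requests)
    elapsed = time.perf_counter() - start

    latency_ms = elapsed * 1000                 # how long every request in the batch waited
    throughput = batch_size / elapsed           # requests completed per second
    print(f"batch={batch_size:3d}  latency={latency_ms:6.1f} ms  throughput={throughput:6.1f} req/s")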
Frequently Asked Questions
How much accuracy do I lose with quantisation?
INT8 quantisation typically loses <1% accuracy on most tasks. INT4 loses 1–3% but is often acceptable. GPTQ and AWQ are more advanced quantisation methods that minimise accuracy loss. Always benchmark on your specific task.
What is PagedAttention and why does vLLM use it?
PagedAttention manages GPU memory for the KV cache like virtual memory in an OS — allocating in small pages rather than large contiguous blocks. This reduces memory fragmentation, allowing more parallel requests and dramatically increasing throughput.
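A toy sketch of the paging idea (block size and counts are made up; this is not vLLM's actual code): each sequence's KV cache grows one fixed-size block at a time from a shared free list, so memory is never reserved for a long contiguous maximum length up front.

BLOCK_SIZE = 16                               # tokens per KV-cache block (illustrative)
free_blocks = list(range(256))                # shared pool of physical blocks on the GPU
block_tables = {}                             # sequence id -> list of block ids (its "page table")

def append_token(seq_id, position):
    # Allocate a new block only when the current one is full
    if position % BLOCK_SIZE == 0:
        block_tables.setdefault(seq_id, []).append(free_blocks.pop())

def finish(seq_id):
    # Freed blocks are immediately reusable by other sequences, so nothing fragments
    free_blocks.extend(block_tables.pop(seq_id))

for pos in range(40):                         # a 40-token sequence touches only 3 blocks
    append_token("request-1", pos)
print(block_tables["request-1"])
finish("request-1")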
Can I run inference on CPU only?
Yes! llama.cpp runs quantised LLMs on CPU (even M-series Macs). A 7B model at INT4 runs at ~10–30 tokens/second on an M2 MacBook. For production serving of many users, you'll want GPU — but CPU works for personal use and edge deployment.
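For the CPU route, here is a hedged sketch using the llama-cpp-python bindings (the GGUF file path is a hypothetical placeholder you would download separately; context size and speed depend on your machine):

from llama_cpp import Llama

# Any 4-bit GGUF file works here; this path is just an example placeholder
llm = Llama(model_path="./llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=4096)

result = llm("Q: What is quantisation?\nA:", max_tokens=64)
print(result["choices"][0]["text"])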