Phase 6: AI Inference & Optimisation

Training a model is step one. Serving it to millions of users efficiently is the real engineering challenge. AI inference optimisation reduces latency, cuts compute costs, and makes large models deployable on smaller hardware — without sacrificing much accuracy.

🎯 Goal: Deploy models faster, smaller, and cheaper

⏱️ Time: 4–6 weeks

🛠️ Tools: ONNX, TensorRT, vLLM, bitsandbytes, llama.cpp

Training vs Inference

🏋️ Training

  • Done once (or periodically)
  • Uses forward + backward pass
  • Needs huge GPU clusters
  • Goal: minimise loss
  • Batch size: large (efficient)

⚡ Inference

  • Done millions of times/day
  • Forward pass only
  • Must run fast and cheap
  • Goal: minimise latency & cost
  • Batch size: small (real-time)
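The forward-only distinction is visible in a few lines of PyTorch. A minimal sketch contrasting one training step with one inference call:

Python · Training Step vs Inference Call (sketch)
import torch

model = torch.nn.Linear(128, 10)
x = torch.randn(32, 128)

# Training: forward + backward; activations are kept for the gradient
loss = model(x).sum()
loss.backward()

# Inference: forward only; no autograd graph, so less memory and time
model.eval()
with torch.inference_mode():
    y = model(x)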

The Inference Stack

A production deployment is layered; each level runs on the one below it:

Application / API Layer
Serving Framework (vLLM, TGI, Triton)
Runtime (ONNX Runtime, TensorRT, llama.cpp)
Optimised Model (quantised, pruned, compiled)
Hardware (GPU, CPU, NPU, Edge Device)
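To make the runtime layer concrete, here is a minimal sketch that exports a toy PyTorch module to ONNX and runs it with ONNX Runtime (the module and file name are arbitrary examples, not a production model):

Python · From PyTorch to ONNX Runtime (sketch)
import torch
import onnxruntime as ort

# A toy module standing in for a real model
model = torch.nn.Linear(128, 10).eval()
dummy = torch.randn(1, 128)

# Export once; serve with the lighter ONNX Runtime instead of full PyTorch
torch.onnx.export(model, (dummy,), "model.onnx",
                  input_names=["x"], output_names=["y"])

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
(y,) = sess.run(None, {"x": dummy.numpy()})
print(y.shape)  # (1, 10)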

Quick Wins: Fast Inference Checklist

Use half precision (FP16/BF16) — 2× memory reduction vs FP32, near-identical accuracy on modern GPUs
Quantise to INT8 — 4× smaller than FP32, ~2× faster, minimal accuracy loss
Use batch inference — Process multiple requests together for GPU efficiency
Enable KV cache — Don't recompute the attention keys and values of previous tokens
Use Flash Attention — Memory-efficient attention with mathematically identical results
Compile with torch.compile() — 10–30% speedup from graph optimisation
Use vLLM for LLMs — PagedAttention gives up to 24× higher throughput than naive serving
Python · Quick Quantisation with bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load the 8B model in 4-bit — fits in ~6GB VRAM instead of ~16GB in BF16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto"
)
print(f"Memory used: {model.get_memory_footprint()/1e9:.2f} GB")

Latency vs Throughput

These are the two key inference metrics — and they're often in tension:

Latency

Time for a single request (ms). Critical for real-time applications like chatbots. Optimise with: smaller batches, faster hardware, caching.

Throughput

Requests processed per second. Critical for batch applications. Optimise with: larger batches, continuous batching, PagedAttention.
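Both metrics are easy to measure yourself. A minimal, framework-agnostic sketch; measure and run_batch are hypothetical names, and run_batch stands for whatever function sends one batch of requests to your model:

Python · Measuring Latency and Throughput (sketch)
import time
import statistics

def measure(run_batch, requests, batch_size=1):
    """Time run_batch over a request list; report per-batch latency
    percentiles and end-to-end throughput."""
    latencies = []
    start = time.perf_counter()
    for i in range(0, len(requests), batch_size):
        t0 = time.perf_counter()
        run_batch(requests[i:i + batch_size])
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    return {
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": statistics.quantiles(latencies, n=20)[18],  # ~95th percentile
        "throughput_rps": len(requests) / total,
    }

Rerunning with batch_size=1 and then batch_size=32 makes the tension visible: bigger batches usually raise throughput and per-request latency at the same time.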

Frequently Asked Questions

How much accuracy do I lose with quantisation?

INT8 quantisation typically loses <1% accuracy on most tasks. INT4 loses 1–3% but is often acceptable. GPTQ and AWQ are more advanced quantisation methods that minimise accuracy loss. Always benchmark on your specific task.
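The benchmark can be as simple as a perplexity spot-check on your own texts. A minimal sketch; perplexity is a hypothetical helper that relies on a Hugging Face causal LM returning mean cross-entropy loss when given labels:

Python · Quantisation Accuracy Spot-Check (sketch)
import torch

def perplexity(model, tokenizer, text):
    """Perplexity of `model` on `text`: exp(mean cross-entropy loss)."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

# Load FP16 and 4-bit variants of the same checkpoint (see the
# bitsandbytes example above), then compare on representative text:
# perplexity(model_fp16, tok, sample) vs perplexity(model_4bit, tok, sample)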

What is PagedAttention and why does vLLM use it?

PagedAttention manages GPU memory for the KV cache like virtual memory in an OS — allocating in small pages rather than large contiguous blocks. This reduces memory fragmentation, allowing more parallel requests and dramatically increasing throughput.
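Trying it takes only a few lines with vLLM's offline API (a minimal sketch; assumes a supported GPU and reuses the model ID from the example above):

Python · High-Throughput Serving with vLLM (sketch)
from vllm import LLM, SamplingParams

# PagedAttention and continuous batching are built into the engine
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)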

Can I run inference on CPU only?

Yes! llama.cpp runs quantised LLMs on CPU (even M-series Macs). A 7B model at INT4 runs at ~10–30 tokens/second on an M2 MacBook. For production serving of many users, you'll want GPU — but CPU works for personal use and edge deployment.
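From Python, the llama-cpp-python bindings wrap llama.cpp directly (a sketch; the GGUF file path is a placeholder for a quantised model you have already downloaded):

Python · CPU Inference with llama.cpp (sketch)
from llama_cpp import Llama

# Any INT4-quantised GGUF model works; the path below is a placeholder
llm = Llama(model_path="./llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=2048)

out = llm("Explain KV caching in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])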
