Phase 6: AI Inference & Optimisation

Training a model is step one. Serving it to millions of users efficiently is the real engineering challenge. AI inference optimisation reduces latency, cuts compute costs, and makes large models deployable on smaller hardware — without sacrificing much accuracy.

🎯 Goal: Deploy models faster, smaller, and cheaper

⏱️ Time: 4–6 weeks

🛠️ Tools: ONNX, TensorRT, vLLM, bitsandbytes, llama.cpp

Training vs Inference

🏋️ Training

  • Done once (or periodically)
  • Uses forward + backward pass
  • Needs huge GPU clusters
  • Goal: minimise loss
  • Batch size: large (efficient)

⚡ Inference

  • Done millions of times/day
  • Forward pass only
  • Must run fast and cheap
  • Goal: minimise latency & cost
  • Batch size: small (real-time)
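The forward-only distinction is visible in a few lines of PyTorch. A minimal sketch contrasting one training step with one inference call:

Python · Training Step vs Inference Call (sketch)
import torch

model = torch.nn.Linear(128, 10)
x = torch.randn(32, 128)

# Training: forward + backward; activations are kept for the gradient
loss = model(x).sum()
loss.backward()

# Inference: forward only; no autograd graph, so less memory and time
model.eval()
with torch.inference_mode():
    y = model(x)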

The Inference Stack

A production deployment is layered; each level runs on the one below it:

Application / API Layer
Serving Framework (vLLM, TGI, Triton)
Runtime (ONNX Runtime, TensorRT, llama.cpp)
Optimised Model (quantised, pruned, compiled)
Hardware (GPU, CPU, NPU, Edge Device)
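To make the runtime layer concrete, here is a minimal sketch that exports a toy PyTorch module to ONNX and runs it with ONNX Runtime (the module and file name are arbitrary examples, not a production model):

Python · From PyTorch to ONNX Runtime (sketch)
import torch
import onnxruntime as ort

# A toy module standing in for a real model
model = torch.nn.Linear(128, 10).eval()
dummy = torch.randn(1, 128)

# Export once; serve with the lighter ONNX Runtime instead of full PyTorch
torch.onnx.export(model, (dummy,), "model.onnx",
                  input_names=["x"], output_names=["y"])

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
(y,) = sess.run(None, {"x": dummy.numpy()})
print(y.shape)  # (1, 10)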

Quick Wins: Fast Inference Checklist

Use half precision (FP16/BF16) — 2× memory reduction vs FP32, near-identical accuracy on modern GPUs
Quantise to INT8 — 4× smaller than FP32, ~2× faster, minimal accuracy loss
Use batch inference — Process multiple requests together for GPU efficiency
Enable KV cache — Don't recompute the attention keys and values of previous tokens
Use Flash Attention — Memory-efficient attention with mathematically identical results
Compile with torch.compile() — 10–30% speedup from graph optimisation
Use vLLM for LLMs — PagedAttention gives up to 24× higher throughput than naive serving
Python · Quick Quantisation with bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load the 8B model in 4-bit — fits in ~6GB VRAM instead of ~16GB in BF16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto"
)
print(f"Memory used: {model.get_memory_footprint()/1e9:.2f} GB")

Latency vs Throughput

These are the two key inference metrics — and they're often in tension:

Latency

Time for a single request (ms). Critical for real-time applications like chatbots. Optimise with: smaller batches, faster hardware, caching.

Throughput

Requests processed per second. Critical for batch applications. Optimise with: larger batches, continuous batching, PagedAttention.
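Both metrics are easy to measure yourself. A minimal, framework-agnostic sketch; measure and run_batch are hypothetical names, and run_batch stands for whatever function sends one batch of requests to your model:

Python · Measuring Latency and Throughput (sketch)
import time
import statistics

def measure(run_batch, requests, batch_size=1):
    """Time run_batch over a request list; report per-batch latency
    percentiles and end-to-end throughput."""
    latencies = []
    start = time.perf_counter()
    for i in range(0, len(requests), batch_size):
        t0 = time.perf_counter()
        run_batch(requests[i:i + batch_size])
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    return {
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": statistics.quantiles(latencies, n=20)[18],  # ~95th percentile
        "throughput_rps": len(requests) / total,
    }

Rerunning with batch_size=1 and then batch_size=32 makes the tension visible: bigger batches usually raise throughput and per-request latency at the same time.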

Frequently Asked Questions

How much accuracy do I lose with quantisation?

INT8 quantisation typically loses <1% accuracy on most tasks. INT4 loses 1–3% but is often acceptable. GPTQ and AWQ are more advanced quantisation methods that minimise accuracy loss. Always benchmark on your specific task.
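The benchmark can be as simple as a perplexity spot-check on your own texts. A minimal sketch; perplexity is a hypothetical helper that relies on a Hugging Face causal LM returning mean cross-entropy loss when given labels:

Python · Quantisation Accuracy Spot-Check (sketch)
import torch

def perplexity(model, tokenizer, text):
    """Perplexity of `model` on `text`: exp(mean cross-entropy loss)."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

# Load FP16 and 4-bit variants of the same checkpoint (see the
# bitsandbytes example above), then compare on representative text:
# perplexity(model_fp16, tok, sample) vs perplexity(model_4bit, tok, sample)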

What is PagedAttention and why does vLLM use it?

PagedAttention manages GPU memory for the KV cache like virtual memory in an OS — allocating in small pages rather than large contiguous blocks. This reduces memory fragmentation, allowing more parallel requests and dramatically increasing throughput.
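Trying it takes only a few lines with vLLM's offline API (a minimal sketch; assumes a supported GPU and reuses the model ID from the example above):

Python · High-Throughput Serving with vLLM (sketch)
from vllm import LLM, SamplingParams

# PagedAttention and continuous batching are built into the engine
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)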

Can I run inference on CPU only?

Yes! llama.cpp runs quantised LLMs on CPU (even M-series Macs). A 7B model at INT4 runs at ~10–30 tokens/second on an M2 MacBook. For production serving of many users, you'll want GPU — but CPU works for personal use and edge deployment.
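From Python, the llama-cpp-python bindings wrap llama.cpp directly (a sketch; the GGUF file path is a placeholder for a quantised model you have already downloaded):

Python · CPU Inference with llama.cpp (sketch)
from llama_cpp import Llama

# Any INT4-quantised GGUF model works; the path below is a placeholder
llm = Llama(model_path="./llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=2048)

out = llm("Explain KV caching in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])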
