Model Quantisation
A 70B parameter model at full precision (FP32) needs 280GB of VRAM — impossible on consumer hardware. Quantisation reduces the number of bits used to represent each weight, shrinking models by 4–8× with minimal accuracy loss.
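The headline numbers come straight from bytes-per-weight arithmetic. A quick sketch (weights only; activations and the KV cache need additional memory on top):

```python
# Memory for the weights alone: parameter count × bytes per parameter.
params = 70e9  # 70B parameters

for fmt, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{fmt}: {params * bytes_per_param / 1e9:.0f} GB")
```

FP32 gives the 280GB figure above; INT4 brings the same model down to roughly 35GB, the 8× reduction quoted.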
What is Quantisation?
Neural network weights are normally stored as 32-bit floating point numbers (FP32). Quantisation maps these to lower-precision formats — like 8-bit integers (INT8) or even 4-bit (INT4). The model becomes smaller and faster, since integer arithmetic is cheaper than floating point on most hardware.
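The core idea fits in a few lines: pick a scale that maps the largest absolute weight onto the top of the integer range, then round. This is a minimal symmetric-quantisation sketch with made-up weights; real libraries quantise per channel or per block rather than per tensor.

```python
# Symmetric INT8 quantisation of one toy weight tensor.
weights = [0.91, -0.42, 0.0, 1.30, -0.77]

scale = max(abs(w) for w in weights) / 127       # map the largest |w| to 127
q = [round(w / scale) for w in weights]           # integer codes in [-127, 127]
deq = [qi * scale for qi in q]                    # dequantised approximation

print(q)  # the INT8 codes actually stored
max_err = max(abs(w - d) for w, d in zip(weights, deq))
print(f"worst-case rounding error: {max_err:.4f} (bounded by scale/2 = {scale / 2:.4f})")
```

Each weight now costs 1 byte instead of 4, at the price of a rounding error bounded by half the scale.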
Post-Training Quantisation (PTQ)
The most practical approach — take an already-trained model and quantise it without any retraining. Fast, requires only a small calibration dataset.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# Load in 8-bit — uses bitsandbytes under the hood
model = AutoModelForCausalLM.from_pretrained(
model_id,
load_in_8bit=True,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Memory comparison
print(f"8-bit model: {model.get_memory_footprint()/1e9:.1f} GB")
# vs FP16: 8B × 2 bytes = ~16 GB · vs FP32: 8B × 4 bytes = ~32 GB

For even larger savings, bitsandbytes also supports 4-bit NF4 quantisation via a config object:

from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 — best for LLM weights
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16 for speed
bnb_4bit_use_double_quant=True, # Quantise the quantisation constants too
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto"
)
# An 8B model now fits in ~5GB VRAM!

GPTQ — GPU-Based Weight Quantisation
GPTQ (Generative Pre-Trained Transformer Quantisation) is a more sophisticated PTQ method that quantises weights only, layer-by-layer, minimising quantisation error using second-order (Hessian) information. It achieves better accuracy than naive INT4 and runs fast on GPU. Unlike AWQ, it does not consider activation magnitudes.
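GPTQ's objective can be shown in miniature. The key point is that it minimises the layer *output* error ‖WX − ŴX‖² over calibration activations X, not the weight error ‖W − Ŵ‖². The toy example below (made-up matrices, pure Python) just measures that output error for naive round-to-nearest INT4 — the baseline that GPTQ's Hessian-based updates improve on; it is not the GPTQ algorithm itself.

```python
# Toy layer: W is the weight matrix, X a batch of calibration activations.
W = [[0.12, -0.53, 0.31], [0.84, 0.02, -0.47]]
X = [[1.0, 0.5], [-0.3, 2.0], [0.7, -1.2]]

def quantise_row(row, bits=4):
    qmax = 2 ** (bits - 1) - 1                    # 7 for symmetric INT4
    scale = max(abs(w) for w in row) / qmax
    return [round(w / scale) * scale for w in row]  # quantise, then dequantise

Wq = [quantise_row(r) for r in W]                 # naive round-to-nearest

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

Y, Yq = matmul(W, X), matmul(Wq, X)
err = sum((y - yq) ** 2 for r, rq in zip(Y, Yq) for y, yq in zip(r, rq))
print(f"layer output error ||WX - WqX||^2 = {err:.6f}")
```

GPTQ quantises columns one at a time and adjusts the remaining unquantised weights to compensate, which drives this output error well below the round-to-nearest baseline.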
from auto_gptq import AutoGPTQForCausalLM
# Many GPTQ models are available on Hugging Face (TheBloke's collection)
model = AutoGPTQForCausalLM.from_quantized(
"TechxGenus/Meta-Llama-3-70B-GPTQ",
use_safetensors=True,
device="cuda:0"
)
# 70B model in INT4 GPTQ ≈ 36GB — fits on 2× A100 40GB

AWQ — Activation-Aware Weight Quantisation
AWQ observes that not all weights are equally important. It protects the ~1% of weights with the highest activation magnitudes (the "salient" weights) while quantising the rest more aggressively. It often outperforms GPTQ at the same bit-width.
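The saliency idea can be sketched in a few lines. Note this is a deliberately simplified illustration with made-up numbers: it keeps salient channels in full precision, whereas real AWQ instead *rescales* salient channels so they survive quantisation, keeping the whole tensor in low-bit storage.

```python
# Rank weight channels by average activation magnitude; protect the top ones.
act_mag = [0.02, 3.10, 0.15, 0.08]    # avg |activation| per input channel (made up)
weights = [0.41, -0.72, 0.06, 0.33]   # one weight per channel (made up)

k = 1                                  # protect the top-1 "salient" channel
salient = sorted(range(len(act_mag)), key=lambda i: -act_mag[i])[:k]

def q4(w, scale=0.1):                  # crude 4-bit symmetric quantisation
    return max(-7, min(7, round(w / scale))) * scale

out = [weights[i] if i in salient else q4(weights[i]) for i in range(len(weights))]
print(out)  # channel 1 (largest activations) survives exactly; rest are rounded
```

The intuition: an error in a weight that multiplies large activations is amplified in the output, so those few weights deserve the most precision.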
Quantisation for CPU: llama.cpp
llama.cpp implements highly optimised CPU inference for quantised LLMs. Its GGUF format supports mixed-precision quantisation (different layers at different bits).
# Download a GGUF model (Q4_K_M = 4-bit, K-quant, medium)
# From: huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF
# Run with llama.cpp
./llama-cli -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
-p "Explain quantum computing simply:" \
-n 200 --threads 8
# With Ollama (easier):
ollama run llama3.1:8b

Choosing the Right Quantisation
Maximum quality, ample VRAM: use FP16/BF16. Requires full VRAM.
GPU inference, limited VRAM: use GPTQ or AWQ INT4. Small accuracy loss, 4× smaller.
CPU or laptop inference: use GGUF Q4_K_M with llama.cpp. Runs on a MacBook.
Fine-tuning a large model: use QLoRA (4-bit base + LoRA adapters). Train 70B on 2× A100.
Frequently Asked Questions
Does quantisation affect generation quality?
For INT8: almost unnoticeable. For INT4: slight degradation on complex reasoning tasks, usually acceptable for chat/summarisation. Run your own evaluation on your specific use case — benchmarks vary significantly by task type.
What is the difference between BF16 and FP16?
Both are 16-bit, but BF16 has a wider exponent range (same as FP32), avoiding overflow for large values. FP16 can overflow on large activations. BF16 is preferred for training; FP16 for inference on older GPUs that don't support BF16 natively.
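The overflow difference is easy to demonstrate without a GPU. The sketch below simulates BF16 by truncating an FP32 value to its top 16 bits (real conversions round rather than truncate) and uses Python's struct module, whose "e" format is IEEE 754 half precision and raises OverflowError for values FP16 cannot represent.

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate FP32 to BF16: keep the sign, 8 exponent bits, top 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

x = 70000.0                      # e.g. a large intermediate activation
print(to_bf16(x))                # representable in BF16, with a coarse mantissa

try:
    struct.pack("<e", x)         # "e" = IEEE 754 half precision (FP16)
except OverflowError:
    print("70000.0 overflows FP16 (max finite FP16 value: 65504)")
```

BF16 trades mantissa precision (7 bits vs FP16's 10) for FP32's exponent range, which is exactly the trade training workloads want.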
Can I fine-tune a quantised model?
Not directly — quantised weights can't be updated efficiently. Instead use QLoRA: load the base model in 4-bit (frozen), add trainable LoRA adapters in FP16 on top. The adapters are tiny (0.1–1% of model params) and fit in memory easily.
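The "tiny adapters" claim is just arithmetic. For one frozen d_out×d_in projection, LoRA adds a trainable A (r×d_in) and B (d_out×r); a back-of-envelope sketch with typical (assumed) sizes:

```python
# How small is a LoRA adapter relative to the frozen matrix it sits on?
d_in, d_out, r = 4096, 4096, 16        # typical LLM projection, LoRA rank 16

base_params = d_in * d_out              # frozen, stored in 4-bit
adapter_params = r * d_in + d_out * r   # trainable, stored in FP16

ratio = 100 * adapter_params / base_params
print(f"adapters add {adapter_params:,} params = {ratio:.2f}% of the base matrix")
```

At rank 16 the adapters are well under 1% of the base matrix's parameters, which is why they fit comfortably alongside the 4-bit frozen model.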