Model Quantisation
A 70B parameter model at full precision (FP32) needs 280GB of VRAM — impossible on consumer hardware. Quantisation reduces the number of bits used to represent each weight, shrinking models by 4–8× with minimal accuracy loss.
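The headline numbers come straight from bytes-per-weight arithmetic. A quick sketch (weights only; activations and the KV cache need additional memory on top):

```python
# Memory for the weights alone: parameter count × bytes per parameter.
params = 70e9  # 70B parameters

for fmt, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{fmt}: {params * bytes_per_param / 1e9:.0f} GB")
```

FP32 gives the 280GB figure above; INT4 brings the same model down to roughly 35GB, the 8× reduction quoted.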
What is Quantisation?
Neural network weights are normally stored as 32-bit floating point numbers (FP32). Quantisation maps these to lower-precision formats — like 8-bit integers (INT8) or even 4-bit (INT4). The model becomes smaller and faster, since integer arithmetic is cheaper than floating point on most hardware.
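The core idea fits in a few lines: pick a scale that maps the largest absolute weight onto the top of the integer range, then round. This is a minimal symmetric-quantisation sketch with made-up weights; real libraries quantise per channel or per block rather than per tensor.

```python
# Symmetric INT8 quantisation of one toy weight tensor.
weights = [0.91, -0.42, 0.0, 1.30, -0.77]

scale = max(abs(w) for w in weights) / 127       # map the largest |w| to 127
q = [round(w / scale) for w in weights]           # integer codes in [-127, 127]
deq = [qi * scale for qi in q]                    # dequantised approximation

print(q)  # the INT8 codes actually stored
max_err = max(abs(w - d) for w, d in zip(weights, deq))
print(f"worst-case rounding error: {max_err:.4f} (bounded by scale/2 = {scale / 2:.4f})")
```

Each weight now costs 1 byte instead of 4, at the price of a rounding error bounded by half the scale.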
Post-Training Quantisation (PTQ)
The most practical approach — take an already-trained model and quantise it without any retraining. Fast, requires only a small calibration dataset.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# Load in 8-bit — uses bitsandbytes under the hood
model = AutoModelForCausalLM.from_pretrained(
model_id,
load_in_8bit=True,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Memory comparison
print(f"8-bit model: {model.get_memory_footprint()/1e9:.1f} GB")
# vs FP16: 8B × 2 bytes = ~16 GB · vs FP32: 8B × 4 bytes = ~32 GB

For even larger savings, bitsandbytes also supports 4-bit NF4 quantisation via a config object:

from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 — best for LLM weights
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16 for speed
bnb_4bit_use_double_quant=True, # Quantise the quantisation constants too
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto"
)
# An 8B model now fits in ~5GB VRAM!

GPTQ — GPU-Based Weight Quantisation
GPTQ (Generative Pre-Trained Transformer Quantisation) is a more sophisticated PTQ method that quantises weights only, layer-by-layer, minimising quantisation error using second-order (Hessian) information. It achieves better accuracy than naive INT4 and runs fast on GPU. Unlike AWQ, it does not consider activation magnitudes.
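GPTQ's objective can be shown in miniature. The key point is that it minimises the layer *output* error ‖WX − ŴX‖² over calibration activations X, not the weight error ‖W − Ŵ‖². The toy example below (made-up matrices, pure Python) just measures that output error for naive round-to-nearest INT4 — the baseline that GPTQ's Hessian-based updates improve on; it is not the GPTQ algorithm itself.

```python
# Toy layer: W is the weight matrix, X a batch of calibration activations.
W = [[0.12, -0.53, 0.31], [0.84, 0.02, -0.47]]
X = [[1.0, 0.5], [-0.3, 2.0], [0.7, -1.2]]

def quantise_row(row, bits=4):
    qmax = 2 ** (bits - 1) - 1                    # 7 for symmetric INT4
    scale = max(abs(w) for w in row) / qmax
    return [round(w / scale) * scale for w in row]  # quantise, then dequantise

Wq = [quantise_row(r) for r in W]                 # naive round-to-nearest

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

Y, Yq = matmul(W, X), matmul(Wq, X)
err = sum((y - yq) ** 2 for r, rq in zip(Y, Yq) for y, yq in zip(r, rq))
print(f"layer output error ||WX - WqX||^2 = {err:.6f}")
```

GPTQ quantises columns one at a time and adjusts the remaining unquantised weights to compensate, which drives this output error well below the round-to-nearest baseline.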
from auto_gptq import AutoGPTQForCausalLM
# Many GPTQ models are available on Hugging Face (TheBloke's collection)
model = AutoGPTQForCausalLM.from_quantized(
"TechxGenus/Meta-Llama-3-70B-GPTQ",
use_safetensors=True,
device="cuda:0"
)
# 70B model in INT4 GPTQ ≈ 36GB — fits on 2× A100 40GB

AWQ — Activation-Aware Weight Quantisation
AWQ observes that not all weights are equally important. It protects the ~1% of weights with the highest activation magnitudes (the "salient" weights) while quantising the rest more aggressively. It often outperforms GPTQ at the same bit-width.
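The saliency idea can be sketched in a few lines. Note this is a deliberately simplified illustration with made-up numbers: it keeps salient channels in full precision, whereas real AWQ instead *rescales* salient channels so they survive quantisation, keeping the whole tensor in low-bit storage.

```python
# Rank weight channels by average activation magnitude; protect the top ones.
act_mag = [0.02, 3.10, 0.15, 0.08]    # avg |activation| per input channel (made up)
weights = [0.41, -0.72, 0.06, 0.33]   # one weight per channel (made up)

k = 1                                  # protect the top-1 "salient" channel
salient = sorted(range(len(act_mag)), key=lambda i: -act_mag[i])[:k]

def q4(w, scale=0.1):                  # crude 4-bit symmetric quantisation
    return max(-7, min(7, round(w / scale))) * scale

out = [weights[i] if i in salient else q4(weights[i]) for i in range(len(weights))]
print(out)  # channel 1 (largest activations) survives exactly; rest are rounded
```

The intuition: an error in a weight that multiplies large activations is amplified in the output, so those few weights deserve the most precision.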
Quantisation for CPU: llama.cpp
llama.cpp implements highly optimised CPU inference for quantised LLMs. Its GGUF format supports mixed-precision quantisation (different layers at different bits).
# Download a GGUF model (Q4_K_M = 4-bit, K-quant, medium)
# From: huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF
# Run with llama.cpp
./llama-cli -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
-p "Explain quantum computing simply:" \
-n 200 --threads 8
# With Ollama (easier):
ollama run llama3.1:8b

Choosing the Right Quantisation
Maximum quality, ample VRAM: use FP16/BF16. Requires full VRAM.
GPU inference, limited VRAM: use GPTQ or AWQ INT4. Small accuracy loss, 4× smaller.
CPU or laptop inference: use GGUF Q4_K_M with llama.cpp. Runs on a MacBook.
Fine-tuning a large model: use QLoRA (4-bit base + LoRA adapters). Train 70B on 2× A100.
Frequently Asked Questions
Does quantisation affect generation quality?
For INT8: almost unnoticeable. For INT4: slight degradation on complex reasoning tasks, usually acceptable for chat/summarisation. Run your own evaluation on your specific use case — benchmarks vary significantly by task type.
What is the difference between BF16 and FP16?
Both are 16-bit, but BF16 has a wider exponent range (same as FP32), avoiding overflow for large values. FP16 can overflow on large activations. BF16 is preferred for training; FP16 for inference on older GPUs that don't support BF16 natively.
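The overflow difference is easy to demonstrate without a GPU. The sketch below simulates BF16 by truncating an FP32 value to its top 16 bits (real conversions round rather than truncate) and uses Python's struct module, whose "e" format is IEEE 754 half precision and raises OverflowError for values FP16 cannot represent.

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate FP32 to BF16: keep the sign, 8 exponent bits, top 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

x = 70000.0                      # e.g. a large intermediate activation
print(to_bf16(x))                # representable in BF16, with a coarse mantissa

try:
    struct.pack("<e", x)         # "e" = IEEE 754 half precision (FP16)
except OverflowError:
    print("70000.0 overflows FP16 (max finite FP16 value: 65504)")
```

BF16 trades mantissa precision (7 bits vs FP16's 10) for FP32's exponent range, which is exactly the trade training workloads want.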
Can I fine-tune a quantised model?
Not directly — quantised weights can't be updated efficiently. Instead use QLoRA: load the base model in 4-bit (frozen), add trainable LoRA adapters in FP16 on top. The adapters are tiny (0.1–1% of model params) and fit in memory easily.
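The "tiny adapters" claim is just arithmetic. For one frozen d_out×d_in projection, LoRA adds a trainable A (r×d_in) and B (d_out×r); a back-of-envelope sketch with typical (assumed) sizes:

```python
# How small is a LoRA adapter relative to the frozen matrix it sits on?
d_in, d_out, r = 4096, 4096, 16        # typical LLM projection, LoRA rank 16

base_params = d_in * d_out              # frozen, stored in 4-bit
adapter_params = r * d_in + d_out * r   # trainable, stored in FP16

ratio = 100 * adapter_params / base_params
print(f"adapters add {adapter_params:,} params = {ratio:.2f}% of the base matrix")
```

At rank 16 the adapters are well under 1% of the base matrix's parameters, which is why they fit comfortably alongside the 4-bit frozen model.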