Pruning & Knowledge Distillation
Large models are accurate but expensive to store and serve. Pruning removes weights that contribute little to the output. Knowledge distillation trains a small "student" model to mimic a large "teacher". Together they can produce models roughly 10× smaller at the cost of only a few percentage points of accuracy.
Model Pruning
Neural networks are often over-parameterised — many weights contribute little to the output. Pruning identifies and removes (zeroes out) these redundant weights, making the model smaller and faster.
Unstructured Pruning
Remove individual weights based on magnitude. Results in sparse weight matrices.
Pros: High compression ratio, minimal accuracy loss
Cons: Sparse operations are hard to accelerate on standard GPUs
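To make the storage trade-off concrete, here is a small sketch (the matrix size and 90% sparsity level are illustrative) that converts a pruned weight matrix to PyTorch's CSR sparse format. Storage shrinks with density, but standard dense GEMM kernels cannot exploit the zeros:

```python
import torch

w = torch.randn(1024, 1024)
mask = torch.rand_like(w) > 0.9       # keep ~10% of weights (90% pruned)
w_pruned = w * mask

# CSR storage holds only the non-zero values plus index arrays
w_sparse = w_pruned.to_sparse_csr()
nnz = w_sparse.values().numel()
print(f"Non-zeros: {nnz} of {w.numel()} ({nnz / w.numel():.1%})")
```

In practice, unstructured sparsity only pays off with sparse-aware kernels or hardware support (e.g. 2:4 structured sparsity on recent NVIDIA GPUs).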
Structured Pruning
Remove entire neurons, attention heads, or layers. Resulting model is smaller and dense.
Pros: Real speedup on any hardware
Cons: Coarser granularity, more accuracy loss per parameter removed
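As a minimal sketch of the structured variant (the layer sizes here are illustrative), PyTorch's `prune.ln_structured` zeroes whole rows of a weight matrix — i.e. entire output neurons — ranked by their L2 norm:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)
# Zero out 50% of output neurons (rows of the weight matrix) by L2 norm
prune.ln_structured(layer, name='weight', amount=0.5, n=2, dim=0)

# Whole rows are now zero; these neurons could be physically removed
zeroed_neurons = (layer.weight.abs().sum(dim=1) == 0).sum().item()
print(f"{zeroed_neurons} of 64 neurons zeroed")  # → 32 of 64 neurons zeroed
```

Unlike unstructured pruning, the zeroed rows can then be dropped entirely, shrinking the layer to a smaller dense `Linear` that runs fast on any hardware.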
import torch
import torch.nn.utils.prune as prune

model = MyModel()

# Prune 30% of connections in a layer by L1 magnitude
prune.l1_unstructured(model.fc1, name='weight', amount=0.3)

# Check sparsity
total = model.fc1.weight.nelement()
zeros = torch.sum(model.fc1.weight == 0).item()
print(f"Sparsity: {zeros/total:.1%}")  # → Sparsity: 30.0%

# Make pruning permanent (remove the reparameterisation and mask)
prune.remove(model.fc1, 'weight')
Knowledge Distillation
Instead of compressing the model itself, train a small student model to mimic the behaviour of a large teacher model. The student learns from both the hard labels (correct answers) and the teacher's soft probability distributions (which carry more information).
For example, given the input "This movie is great!", the student learns from the teacher's full probability distribution, not just the hard label. A typical result: ~40% smaller, ~60% faster, with ~97% of the teacher's accuracy.
import torch
import torch.nn.functional as F
def distillation_loss(student_logits, teacher_logits, true_labels, T=4, alpha=0.5):
    # Soft-label loss (KL divergence between temperature-softened distributions)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean'
    ) * (T ** 2)
    # Hard-label loss (standard cross-entropy against the true labels)
    hard_loss = F.cross_entropy(student_logits, true_labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Training loop
teacher.eval()  # Teacher is frozen
for inputs, labels in dataloader:
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Real-World Example: DistilBERT
Hugging Face's DistilBERT is a publicly available distilled version of BERT. Trained with knowledge distillation from BERT-base, it achieves:
40% smaller (66M vs 110M parameters)
60% faster inference
97% of BERT's accuracy retained
from transformers import pipeline
# Drop-in replacement for BERT — much faster!
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("This new framework is fantastic!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]
Choosing Between Techniques
Unstructured pruning offers the highest compression but needs sparse-aware kernels to realise speedups; structured pruning gives real speedups on any hardware at a higher accuracy cost per parameter removed; distillation requires training a student from scratch but yields a dense, fast model. The techniques are complementary and often combined.
Frequently Asked Questions
Can I combine pruning and distillation?
Yes — this is called "progressive compression." First prune the teacher, then distil to a smaller student. Or simultaneously prune while distilling. Hugging Face's DistilBERT + quantisation is a popular stack that gives 5–8× total compression with ~97% accuracy.
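A minimal sketch of one such combination (the models, sizes, and hyper-parameters here are illustrative, not a recipe): prune the teacher first, then run a distillation step against the pruned teacher using the loss defined earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune

torch.manual_seed(0)
teacher = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))
student = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 3))

# Step 1: prune 30% of each Linear layer in the teacher by magnitude
for m in teacher.modules():
    if isinstance(m, nn.Linear):
        prune.l1_unstructured(m, name='weight', amount=0.3)

# Step 2: one distillation step against the pruned teacher
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
x = torch.randn(32, 10)
y = torch.randint(0, 3, (32,))
T, alpha = 4.0, 0.5

teacher.eval()
with torch.no_grad():
    t_logits = teacher(x)
s_logits = student(x)
soft = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                F.softmax(t_logits / T, dim=1),
                reduction='batchmean') * T ** 2
hard = F.cross_entropy(s_logits, y)
loss = alpha * soft + (1 - alpha) * hard
opt.zero_grad()
loss.backward()
opt.step()
print(f"Combined loss: {loss.item():.4f}")
```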
What temperature should I use for distillation?
Temperature T in the range 2–10 is typical. Higher T softens the teacher's distribution, making it more informative for the student (the difference between 0.001% and 0.0001% probabilities carries signal about similarity). T=4 is a common default from the original Hinton et al. paper.
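The softening effect is easy to see numerically (the logits below are made up for illustration):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 2.0, 0.5])
for T in (1, 4, 10):
    probs = F.softmax(logits / T, dim=0)
    print(f"T={T:>2}: {[round(p, 3) for p in probs.tolist()]}")
# Higher T flattens the distribution, exposing the relative
# similarity of the non-top classes to the student
```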