Transformers & Attention
The Transformer, introduced in the 2017 paper "Attention Is All You Need", largely replaced RNNs for sequence tasks and became the foundation for GPT, BERT, Gemini, and virtually every modern large language model.
The paper that started the LLM revolution: it popularised self-attention and introduced the Transformer architecture.
The Problem with RNNs
Before Transformers, sequences were processed with RNNs — one word at a time, left to right. This caused two big problems:
Can't parallelise — word 10 needs word 9's output. Very slow to train on long sequences.
Information from word 1 gets diluted by word 100. Long-range dependencies are hard to learn.
Transformers solve both problems with self-attention: every word looks at every other word simultaneously, in parallel.
Self-Attention: Every Word Talks to Every Word
Self-attention computes a relevance score between each pair of words. For the sentence "The animal didn't cross the street because it was too tired" — the word "it" needs to know it refers to "animal", not "street". Self-attention learns this.
Attention heatmap: each cell shows how much attention one word pays to every other word in the sentence.
Query, Key, Value
Each word creates three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I give?).
Query: what am I looking for? ("What does 'it' refer to?")
Key: what do I contain? ("What info do I hold?")
Value: what do I give? ("My info to share")
Scaled dot-product attention: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Attention scores: how much each token attends to every other token
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)  # Normalise scores to probabilities
    return torch.matmul(weights, V), weights

# In practice, PyTorch has this built in:
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8)
query = key = value = torch.randn(10, 1, 512)  # (seq_len, batch, embed_dim)
output, weights = attn(query, key, value)

Multi-Head Attention
Instead of a single attention mechanism, Transformers run several attention heads in parallel, each learning a different type of relationship (for example, one head may track syntactic structure while another tracks coreference).
All heads' outputs are concatenated and linearly projected back to the original dimension.
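A minimal multi-head attention sketch (dimensions and names here are illustrative, not taken from the paper's reference code):

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)  # project to Q, K, V in one go
        self.out = nn.Linear(embed_dim, embed_dim)      # final linear projection

    def forward(self, x):  # x: (batch, seq_len, embed_dim)
        B, T, E = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, num_heads, seq_len, head_dim)
        q, k, v = (t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / (self.head_dim ** 0.5)
        weights = scores.softmax(dim=-1)
        heads = weights @ v
        # Concatenate heads and project back to the original dimension
        return self.out(heads.transpose(1, 2).reshape(B, T, E))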
Positional Encoding
Self-attention is order-agnostic — "cat sat on mat" and "mat on sat cat" would look the same. Positional encodings add position information to each token's embedding, using sine and cosine functions of different frequencies.
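A minimal sketch of sinusoidal positional encoding (max_len and d_model are illustrative values, not fixed by the architecture):

import torch

def positional_encoding(max_len=100, d_model=512):
    pos = torch.arange(max_len).unsqueeze(1).float()  # token positions 0..max_len-1
    i = torch.arange(0, d_model, 2).float()           # even embedding dimensions
    freq = 1.0 / (10000 ** (i / d_model))             # geometrically decreasing frequencies
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)               # sine on even dimensions
    pe[:, 1::2] = torch.cos(pos * freq)               # cosine on odd dimensions
    return pe                                         # added to the token embeddings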
The Full Transformer Architecture
BERT uses encoder-only (good for understanding tasks: classification, NER). GPT uses decoder-only (good for generation: text completion, chat). T5, BART use encoder-decoder (good for translation, summarisation).
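To make the three families concrete, here is how each loads with the Hugging Face Transformers library (checkpoint names are common public models, chosen only for illustration):

from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only = AutoModel.from_pretrained("bert-base-uncased")        # BERT: understanding tasks
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")          # GPT: text generation
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # T5: translation, summarisation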
Why Transformers Won
Parallelism: all tokens are processed simultaneously, so models train on massive datasets in days instead of months.
Long-range dependencies: direct connections between any two positions, with no degradation over distance.
Scaling: performance keeps improving with more parameters and more data; no hard ceiling has been found yet.
Transfer learning: pre-train once on internet-scale text, then fine-tune cheaply for any downstream task.
Frequently Asked Questions
What is the context window?
The context window is how many tokens a Transformer can "see" at once. Early GPT-2 had 1,024 tokens. GPT-4 Turbo has 128,000 tokens. Each token in the window attends to all others — so quadratic memory cost (O(n²)) makes very long contexts expensive.
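A quick back-of-the-envelope calculation (assuming fp16 scores and ignoring memory-saving techniques such as FlashAttention) shows why:

n = 128_000                 # tokens in the context window
scores = n * n              # one attention score per token pair
bytes_needed = scores * 2   # 2 bytes per fp16 score
print(bytes_needed / 1e9)   # ≈ 32.8 GB for one attention matrix (per head, per layer)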
What is the difference between attention and self-attention?
Regular attention (in seq2seq models) attends across two sequences: the decoder query attends to encoder keys/values. Self-attention attends within a single sequence — every token attends to every other token in the same sequence.
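A quick PyTorch illustration (shapes are arbitrary): self-attention passes the same sequence as query, key, and value, while cross-attention passes a different sequence as the query:

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
src = torch.randn(1, 12, 512)  # "encoder" sequence: (batch, seq_len, embed_dim)
tgt = torch.randn(1, 7, 512)   # "decoder" sequence

self_out, _ = attn(src, src, src)   # self-attention: a sequence attends to itself
cross_out, _ = attn(tgt, src, src)  # cross-attention: decoder queries, encoder keys/values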
Do I need to implement Transformers from scratch?
No. For practical work, use the Hugging Face Transformers library. It gives you pre-trained models (GPT-2, BERT, Llama, etc.) with a single line of code. Understanding the internals helps you debug and tune them effectively.
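For example (a minimal sketch; the prompt and generation settings are illustrative):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The Transformer architecture", max_new_tokens=20))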