Transformers & Attention
The Transformer, introduced in the 2017 paper "Attention Is All You Need", largely replaced RNNs for sequence tasks and became the foundation for GPT, BERT, Gemini, and virtually every modern large language model.
The paper that started the LLM revolution: it popularised self-attention and introduced the Transformer architecture.
The Problem with RNNs
Before Transformers, sequences were processed with RNNs — one word at a time, left to right. This caused two big problems:
Can't parallelise — word 10 needs word 9's output. Very slow to train on long sequences.
Information from word 1 gets diluted by word 100. Long-range dependencies are hard to learn.
Transformers solve both problems with self-attention: every word looks at every other word simultaneously, in parallel.
Self-Attention: Every Word Talks to Every Word
Self-attention computes a relevance score between each pair of words. For the sentence "The animal didn't cross the street because it was too tired" — the word "it" needs to know it refers to "animal", not "street". Self-attention learns this.
Attention heatmap: each cell shows how much attention one word pays to every other word in the sentence.
Query, Key, Value
Each word creates three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I give?).
Query: what am I looking for? ("What does 'it' refer to?")
Key: what do I contain? ("What info do I hold?")
Value: what do I give? ("My info to share")
Scaled dot-product attention: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Attention scores: how much each token attends to every other token
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)  # Normalise scores to probabilities
    return torch.matmul(weights, V), weights

# In practice, PyTorch has this built in:
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8)
query = key = value = torch.randn(10, 1, 512)  # (seq_len, batch, embed_dim)
output, weights = attn(query, key, value)

Multi-Head Attention
Instead of a single attention mechanism, Transformers run several attention heads in parallel, each learning a different type of relationship (for example, one head may track syntactic structure while another tracks coreference).
All heads' outputs are concatenated and linearly projected back to the original dimension.
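A minimal multi-head attention sketch (dimensions and names here are illustrative, not taken from the paper's reference code):

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)  # project to Q, K, V in one go
        self.out = nn.Linear(embed_dim, embed_dim)      # final linear projection

    def forward(self, x):  # x: (batch, seq_len, embed_dim)
        B, T, E = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, num_heads, seq_len, head_dim)
        q, k, v = (t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / (self.head_dim ** 0.5)
        weights = scores.softmax(dim=-1)
        heads = weights @ v
        # Concatenate heads and project back to the original dimension
        return self.out(heads.transpose(1, 2).reshape(B, T, E))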
Positional Encoding
Self-attention is order-agnostic — "cat sat on mat" and "mat on sat cat" would look the same. Positional encodings add position information to each token's embedding, using sine and cosine functions of different frequencies.
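A minimal sketch of sinusoidal positional encoding (max_len and d_model are illustrative values, not fixed by the architecture):

import torch

def positional_encoding(max_len=100, d_model=512):
    pos = torch.arange(max_len).unsqueeze(1).float()  # token positions 0..max_len-1
    i = torch.arange(0, d_model, 2).float()           # even embedding dimensions
    freq = 1.0 / (10000 ** (i / d_model))             # geometrically decreasing frequencies
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)               # sine on even dimensions
    pe[:, 1::2] = torch.cos(pos * freq)               # cosine on odd dimensions
    return pe                                         # added to the token embeddings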
The Full Transformer Architecture
BERT uses encoder-only (good for understanding tasks: classification, NER). GPT uses decoder-only (good for generation: text completion, chat). T5, BART use encoder-decoder (good for translation, summarisation).
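To make the three families concrete, here is how each loads with the Hugging Face Transformers library (checkpoint names are common public models, chosen only for illustration):

from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only = AutoModel.from_pretrained("bert-base-uncased")        # BERT: understanding tasks
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")          # GPT: text generation
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # T5: translation, summarisation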
Why Transformers Won
Parallelism: all tokens are processed simultaneously, so models train on massive datasets in days instead of months.
Long-range dependencies: direct connections between any two positions, with no degradation over distance.
Scaling: performance keeps improving with more parameters and more data; no hard ceiling has been found yet.
Transfer learning: pre-train once on internet-scale text, then fine-tune cheaply for any downstream task.
Frequently Asked Questions
What is the context window?
The context window is how many tokens a Transformer can "see" at once. Early GPT-2 had 1,024 tokens. GPT-4 Turbo has 128,000 tokens. Each token in the window attends to all others — so quadratic memory cost (O(n²)) makes very long contexts expensive.
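A quick back-of-the-envelope calculation (assuming fp16 scores and ignoring memory-saving techniques such as FlashAttention) shows why:

n = 128_000                 # tokens in the context window
scores = n * n              # one attention score per token pair
bytes_needed = scores * 2   # 2 bytes per fp16 score
print(bytes_needed / 1e9)   # ≈ 32.8 GB for one attention matrix (per head, per layer)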
What is the difference between attention and self-attention?
Regular attention (in seq2seq models) attends across two sequences: the decoder query attends to encoder keys/values. Self-attention attends within a single sequence — every token attends to every other token in the same sequence.
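A quick PyTorch illustration (shapes are arbitrary): self-attention passes the same sequence as query, key, and value, while cross-attention passes a different sequence as the query:

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
src = torch.randn(1, 12, 512)  # "encoder" sequence: (batch, seq_len, embed_dim)
tgt = torch.randn(1, 7, 512)   # "decoder" sequence

self_out, _ = attn(src, src, src)   # self-attention: a sequence attends to itself
cross_out, _ = attn(tgt, src, src)  # cross-attention: decoder queries, encoder keys/values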
Do I need to implement Transformers from scratch?
No. For practical work, use the Hugging Face Transformers library. It gives you pre-trained models (GPT-2, BERT, Llama, etc.) with a single line of code. Understanding the internals helps you debug and tune them effectively.
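For example (a minimal sketch; the prompt and generation settings are illustrative):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The Transformer architecture", max_new_tokens=20))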