How LLMs Work
An LLM is a function that takes text in and gives text out. But underneath, it's converting text into numbers, running them through dozens of Transformer layers, and sampling the next token from a probability distribution, performing billions of arithmetic operations for every token it generates.
Step 1: Tokenisation
LLMs don't read characters or words — they read tokens. A token is typically 3–4 characters. Common words are single tokens; rare words are split into multiple tokens.
✂️ Interactive Token Visualizer
Type text to see how it's broken into tokens (approximating GPT tokenisation):
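Outside the browser, you can get a close approximation with the open-source tiktoken library, which implements the BPE encodings used by OpenAI models. A minimal sketch (assumes tiktoken is installed):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # BPE encoding used by GPT-4-era models
tokens = enc.encode("Tokenisation splits text into subword pieces.")
print(tokens)                               # integer token IDs
print([enc.decode([t]) for t in tokens])    # the text piece behind each ID
```

Common words come back as single pieces; rarer ones split into several.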
API pricing is per token: 1,000 tokens ≈ 750 English words, and a page of text is ~500 tokens. A typical ChatGPT conversation might use 2,000–5,000 tokens. At typical frontier-model rates of $2–5 per million input tokens, that's roughly $0.004–$0.025 per conversation; check your provider's current pricing page for exact rates.
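For a back-of-the-envelope check, the arithmetic is one line; the $3 rate below is an assumed mid-range figure, not any provider's actual price:

```python
def conversation_cost_usd(input_tokens, usd_per_million_tokens=3.0):
    # Assumed mid-range rate; real pricing varies by provider and model.
    return input_tokens * usd_per_million_tokens / 1_000_000

print(conversation_cost_usd(2_000))  # 0.006
print(conversation_cost_usd(5_000))  # 0.015
```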
Step 2: Embeddings
Each token ID is converted to a dense vector (an embedding) — a list of 768, 4096, or more floating-point numbers. These embeddings are learned during training. Similar concepts end up with similar vectors:
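To make "similar vectors" concrete, here is a toy cosine-similarity check; the 4-dimensional vectors are invented for illustration, whereas real embeddings are learned and have hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction, 0.0 = unrelated, -1.0 = opposite.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 4-dimensional embeddings for three tokens.
king  = np.array([0.9, 0.8, 0.1, 0.3])
queen = np.array([0.8, 0.9, 0.2, 0.3])
pizza = np.array([0.1, 0.2, 0.9, 0.7])

print(cosine_similarity(king, queen))  # high (~0.99): related concepts
print(cosine_similarity(king, pizza))  # lower (~0.38): unrelated concepts
```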
Step 3: The Transformer Layers
The embeddings flow through N Transformer layers (e.g., GPT-3 has 96 layers). Each layer refines the representation by attending to earlier tokens in the sequence (causal self-attention) and passing through a feed-forward network. After the final layer, the representation is projected to a vocabulary-sized logit vector (one score per possible next token).
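Here is a shape-level NumPy sketch of that flow. It uses a single attention head per layer and omits layer norm, multi-head attention, and positional encodings for brevity; the parameter names are stand-ins, not a real checkpoint:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(h, Wq, Wk, Wv):
    # Each position attends to itself and earlier positions only.
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    scores += np.triu(np.full_like(scores, -1e9), k=1)  # mask future tokens
    return softmax(scores) @ v

def transformer_forward(token_ids, params):
    h = params["embedding"][token_ids]        # (seq_len, d_model)
    for layer in params["layers"]:            # N Transformer layers
        h = h + causal_attention(h, layer["Wq"], layer["Wk"], layer["Wv"])
        h = h + np.maximum(0, h @ layer["W1"]) @ layer["W2"]  # feed-forward
    return h @ params["embedding"].T          # (seq_len, vocab_size) logits
```

The last row of the returned logits is what Step 4 turns into a probability distribution.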
Step 4: Sampling — How Text is Generated
The logits from the final layer are converted (via softmax) into a probability distribution over the entire vocabulary (~50,000+ tokens). The model then samples from this distribution to pick the next token.
🎲 Interactive: Temperature & Sampling
Adjust temperature to see how it affects token selection.
Low temp → distribution sharpens toward the most likely token (deterministic in the limit T = 0). High temp → flatter distribution, more random/creative output.
Greedy: always pick the highest-probability token. Fast but repetitive.
Temperature: divide logits by temperature T before softmax. T < 1 = sharper, T > 1 = flatter.
Top-K: sample only from the top K most likely tokens. Balances quality and diversity.
Top-P (nucleus): sample from the smallest set of tokens whose cumulative probability ≥ P. Most common in practice.
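A minimal NumPy sketch combining all four, applied in the order most samplers use (temperature first, then top-K/top-P filtering, then renormalise and draw):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    rng = rng or np.random.default_rng()
    if temperature == 0:                      # greedy: argmax, no randomness
        return int(np.argmax(logits))
    probs = softmax(logits / temperature)     # temperature scaling
    if top_k is not None:                     # keep only the K most likely tokens
        kth_largest = np.sort(probs)[-top_k]
        probs = np.where(probs >= kth_largest, probs, 0.0)
    if top_p is not None:                     # smallest set with cumulative prob >= P
        order = np.argsort(probs)[::-1]
        keep = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        probs[order[keep:]] = 0.0
    probs = probs / probs.sum()               # renormalise and draw
    return int(rng.choice(len(probs), p=probs))
```

Calling sample_next_token(logits, temperature=0.8, top_p=0.9) is a common pairing of settings.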
The Context Window
The context window is the maximum number of tokens the model can process at once — both input prompt and output response combined. Tokens outside the window are completely invisible to the model.
Example context usage for a Claude 3.5 Sonnet conversation (200K context window)
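Input and output draw from the same budget, so a long prompt shrinks the room left for the response. A trivial sketch of that accounting (200K is Claude 3.5 Sonnet's published window; substitute your model's):

```python
def fits_in_context(prompt_tokens, max_output_tokens, context_window=200_000):
    # Prompt and response share one budget; anything beyond it is invisible.
    return prompt_tokens + max_output_tokens <= context_window

print(fits_in_context(150_000, 60_000))  # False: 210K tokens > 200K window
```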
KV Cache — Why Inference is Fast
During generation, the model computes Key and Value matrices for every token in the context. For each new token generated, recomputing all previous K/V is wasteful. The KV cache saves the computed K/V pairs and reuses them, making token-by-token generation fast.
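A single-layer sketch of one decode step with a cache, assuming plain Python lists hold the cached Keys and Values (production servers use preallocated tensors):

```python
import numpy as np

def decode_step(x, Wq, Wk, Wv, k_cache, v_cache):
    # x is the new token's embedding. Compute its Q/K/V once, append K and V
    # to the cache, and attend over the whole cached history; nothing old
    # is recomputed.
    q = x @ Wq
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = (K @ q) / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over all cached positions
    return weights @ V
```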
The KV cache grows linearly with context length — a key bottleneck for serving very long contexts.
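That growth is easy to quantify with a back-of-the-envelope formula. The dimensions below describe a hypothetical 70B-class model (80 layers, 8 grouped-query KV heads, head size 128, fp16), not any specific release:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor of 2 covers Keys and Values; fp16 uses 2 bytes per element.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128_000)
print(f"{size / 2**30:.0f} GiB")  # ~39 GiB for one 128K-token context
```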
Frequently Asked Questions
Why do LLMs sometimes repeat themselves?
Repetition happens when the model assigns very high probability to recently seen tokens. This is why many APIs include a repetition_penalty or frequency_penalty parameter that reduces the probability of recently used tokens.
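A sketch of how such a penalty can be applied to the logits before sampling; the 0.5 value and the linear scaling by count are illustrative choices, not any API's exact formula:

```python
import numpy as np
from collections import Counter

def apply_frequency_penalty(logits, generated_ids, penalty=0.5):
    # Lower each already-generated token's logit in proportion to how many
    # times it has appeared, making repetition progressively less likely.
    logits = np.asarray(logits, dtype=float).copy()
    for token_id, count in Counter(generated_ids).items():
        logits[token_id] -= penalty * count
    return logits
```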
What's the difference between parameters and context?
Parameters are the model's weights — its permanent "knowledge" baked in during training. The context window is the temporary "working memory" — what the model can see right now. A 70B parameter model can still only "remember" what's in its context window at inference time.
Why does the same prompt give different answers each time?
Because of temperature sampling. Unless you set temperature=0 (or use a fixed seed), each run draws from the probability distribution differently; even then, batched serving can introduce small floating-point nondeterminism. The variation is intentional: it produces more natural, varied responses.