Diffusion Models

Diffusion models are the engine behind Stable Diffusion, DALL-E 3, Midjourney, and Sora. They work by learning to reverse a noise process: starting from pure static, they gradually denoise it into a coherent image. Stable training, high sample diversity, and easy text conditioning made them the dominant paradigm for image and video generation.

Prerequisites: U-Net architecture, attention mechanisms, basic probability. Familiarity with PyTorch and the diffusers library is helpful.

The Core Idea: Denoising as Generation

A diffusion model has two processes:

Forward Process q(xₜ | xₜ₋₁)

Fixed (no parameters). Adds Gaussian noise to an image over T timesteps until it becomes indistinguishable from N(0,I). This is the "destroy data" process.

xₜ = √(αₜ) xₜ₋₁ + √(1−αₜ) ε, where ε ~ N(0,I)
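To make this concrete, here is a minimal PyTorch sketch of the forward process, assuming a simple linear βₜ schedule with αₜ = 1 − βₜ; the schedule values, shapes, and function names are illustrative, not prescribed by the text.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)   # linear noise schedule (illustrative values)
alphas = 1.0 - betas                    # αₜ = 1 − βₜ
alphas_bar = torch.cumprod(alphas, 0)   # ᾱₜ = α₁ · α₂ · … · αₜ

def forward_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """One step of q(xₜ | xₜ₋₁): scale the previous image and add Gaussian noise."""
    eps = torch.randn_like(x_prev)
    return alphas[t].sqrt() * x_prev + (1 - alphas[t]).sqrt() * eps

def forward_jump(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Closed form q(xₜ | x₀): jump straight to timestep t, as used to build training pairs."""
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps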

Reverse Process p_θ(xₜ₋₁ | xₜ)

Learned (neural network). Predicts and removes the noise added at each step. Run T times during inference to produce a clean image from noise.

The model predicts ε_θ(xₜ, t), an estimate of the noise present in xₜ; the reverse update subtracts it to step toward xₜ₋₁.
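For intuition, here is a schematic sketch of the standard DDPM reverse update, assuming `model(x_t, t)` returns the predicted noise and that `alphas`, `alphas_bar`, and `betas` come from the schedule sketched above; variable names are illustrative.

def reverse_step(model, x_t, t):
    """One DDPM reverse step p_θ(xₜ₋₁ | xₜ) using the predicted noise ε_θ(xₜ, t)."""
    eps_theta = model(x_t, t)
    # Mean of p_θ: remove the predicted noise contribution and rescale.
    coef = (1 - alphas[t]) / (1 - alphas_bar[t]).sqrt()
    mean = (x_t - coef * eps_theta) / alphas[t].sqrt()
    if t == 0:
        return mean                      # no extra noise is added at the final step
    z = torch.randn_like(x_t)
    return mean + betas[t].sqrt() * z    # σₜ² = βₜ is the simple DDPM choice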

The U-Net Denoiser

The noise prediction network is a U-Net with attention layers. It takes:

  • The noisy image xₜ at timestep t
  • The timestep embedding (sinusoidal, like positional encoding in Transformers)
  • An optional conditioning signal (text embedding from CLIP)

The U-Net's encoder path downsamples with convolution and attention blocks, the bottleneck captures global context, and the decoder path upsamples with skip connections from the encoder.
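As a sketch of what a single denoising call looks like with the diffusers U-Net; the random tensors below are placeholders standing in for a real noisy latent and real CLIP text embeddings.

import torch
from diffusers import UNet2DConditionModel

# Load just the U-Net from the Stable Diffusion v1.5 repository.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

x_t = torch.randn(1, 4, 64, 64)      # noisy latent at timestep t (placeholder)
t = torch.tensor([500])              # current timestep
text_emb = torch.randn(1, 77, 768)   # CLIP text embeddings (placeholder)

# The U-Net predicts ε_θ(xₜ, t, text); the output has the same shape as x_t.
noise_pred = unet(x_t, t, encoder_hidden_states=text_emb).sample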

Latent Diffusion Models (Stable Diffusion)

Running diffusion in pixel space at 512×512 is expensive — T=1000 forward passes through a large U-Net. Latent Diffusion Models (LDM) solve this by compressing images first:

1. VAE Encoder: compress the 512×512×3 image to a 64×64×4 latent. Each spatial side is 8× smaller, so diffusion runs on roughly 64× fewer spatial positions.

2. Diffusion in Latent Space: run the entire forward/reverse process on the 64×64 latent representation. Text conditioning enters via cross-attention with the CLIP text encoder.

3. VAE Decoder: decode the denoised 64×64×4 latent back to a 512×512×3 pixel image.
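A minimal sketch of this encode → denoise → decode loop using the diffusers VAE; the image tensor is a placeholder, and 0.18215 is the latent scaling constant used by Stable Diffusion v1.x.

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)   # placeholder for a real image scaled to [-1, 1]

# 1. VAE encoder: 512×512×3 pixels -> 64×64×4 latent.
latent = vae.encode(image).latent_dist.sample() * 0.18215

# 2. Run the diffusion forward/reverse process on `latent` here.

# 3. VAE decoder: 64×64×4 latent -> 512×512×3 pixels.
decoded = vae.decode(latent / 0.18215).sample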

Sampling Algorithms

Sampler         | Steps  | Quality   | Notes
DDPM            | 1000   | High      | Original paper; slow
DDIM            | 50–100 | High      | Deterministic; 10× faster
DPM-Solver++    | 20–30  | Very High | Best speed/quality tradeoff
Euler Ancestral | 20–30  | High      | Popular in Stable Diffusion UIs
LCM             | 4–8    | Good      | Latent Consistency Model; near real-time
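Switching samplers in diffusers is a one-line scheduler swap. Here is a sketch using the DPM-Solver++ scheduler; the prompt and step count are illustrative.

import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Replace the default scheduler with DPM-Solver++ and sample in ~25 steps.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("a lighthouse in a storm", num_inference_steps=25).images[0]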

Text Conditioning: Classifier-Free Guidance

Classifier-Free Guidance (CFG) controls how closely the output follows the text prompt. The model is trained both with and without text conditioning (the prompt is randomly dropped during training). At inference:

ε_guided = ε_uncond + w · (ε_cond − ε_uncond)

w is the guidance scale (typically 7–12). Higher values follow the prompt more closely at the cost of diversity.
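A small sketch of how this combination is typically computed inside a sampling loop, assuming both predictions come from one batched U-Net call; tensor and function names are illustrative.

def guided_noise(eps_uncond, eps_cond, w=7.5):
    """Classifier-free guidance: push the prediction toward the text-conditioned direction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# In a typical sampling loop the two predictions share one forward pass:
#   noise_pred = unet(torch.cat([x_t, x_t]), t,
#                     encoder_hidden_states=torch.cat([uncond_emb, text_emb])).sample
#   eps_uncond, eps_cond = noise_pred.chunk(2)
#   eps = guided_noise(eps_uncond, eps_cond, w=7.5)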

Fine-Tuning Techniques

🎭 DreamBooth

Fine-tune Stable Diffusion on 3–20 images of a specific subject. Uses a rare token ("sks") as an identifier. Enables personalised generation of people, pets, objects.

🎨 LoRA for Diffusion

Low-Rank Adaptation for diffusion U-Nets. Fine-tune styles, characters, or concepts in minutes on a single GPU. Widely used in the Stable Diffusion community (Civitai); a loading sketch follows below.

🎛️ ControlNet

Add spatial conditioning: edges, depth maps, poses, sketches. A small trainable copy of the U-Net encoder learns to inject structural guidance into generation.
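As an example of how lightweight these adapters are in practice, here is a sketch of loading a LoRA into a Stable Diffusion pipeline with diffusers; the repo id "some-user/watercolor-lora" and the prompt are placeholders, not real checkpoints.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights for a style or character (placeholder repo id).
pipe.load_lora_weights("some-user/watercolor-lora")

# cross_attention_kwargs={"scale": ...} controls how strongly the LoRA is applied.
image = pipe(
    "a castle in the mountains, watercolor style",
    cross_attention_kwargs={"scale": 0.8},
).images[0]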

5-Line Quick Start

from diffusers import StableDiffusionPipeline
import torch

# Load Stable Diffusion v1.5 in half precision and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Generate an image from a text prompt; .images is a list of PIL images.
image = pipe("a photorealistic mountain at sunset, 4k").images[0]
image.save("output.png")

💡 Key Intuition

The model never "knows" what a real image looks like during inference — it only knows how to remove noise. Generation is purely iterative refinement from chaos. The quality comes entirely from the richness of patterns learned during training on billions of image-text pairs.

Frequently Asked Questions

How is Stable Diffusion XL different from SD 1.5?

SDXL (2023) uses a much larger U-Net (~2.6B parameters vs ~860M), natively generates 1024×1024, uses two CLIP text encoders for better prompt following, and adds a refiner model for final sharpening. It produces significantly better text rendering and composition than SD 1.5.
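For reference, a sketch of generating with SDXL via diffusers, following the same pattern as the quick start but with the base SDXL checkpoint.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# SDXL generates natively at 1024×1024.
image = pipe("a photorealistic mountain at sunset, 4k",
             height=1024, width=1024).images[0]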

What is negative prompting?

In Classifier-Free Guidance, the "unconditional" prediction can be replaced with a prediction conditioned on a negative prompt. Instead of guiding away from the generic unconditional output, the model guides away from the negative prompt's characteristics. Common negatives: "blurry, low quality, extra limbs, watermark". Very effective for removing common artefacts.
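Reusing the `pipe` object from the quick start above, the negative prompt is just an extra argument; the prompts here are illustrative.

image = pipe(
    "a portrait photo of an astronaut",
    negative_prompt="blurry, low quality, extra limbs, watermark",
    guidance_scale=7.5,
).images[0]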

Can diffusion models generate video?

Yes. Video diffusion models (Stable Video Diffusion, AnimateDiff, Sora) extend the U-Net with temporal attention layers to ensure frame-to-frame consistency. Sora uses a Diffusion Transformer (DiT) architecture — replacing the U-Net with a Transformer backbone — enabling much longer and higher-quality video sequences.
