Diffusion Models

Diffusion models are the engine behind Stable Diffusion, DALL-E 3, Midjourney, and Sora. They work by learning to reverse a noise process: starting from pure static, they gradually denoise it into a coherent image. Stable training, high sample diversity, and easy text conditioning made them the dominant paradigm for image and video generation.

Prerequisites: U-Net architecture, attention mechanisms, basic probability. Familiarity with PyTorch and the diffusers library is helpful.

The Core Idea: Denoising as Generation

A diffusion model has two processes:

Forward Process q(xₜ | xₜ₋₁)

Fixed (no parameters). Adds Gaussian noise to an image over T timesteps until it becomes indistinguishable from N(0,I). This is the "destroy data" process.

xₜ = √(αₜ) xₜ₋₁ + √(1−αₜ) ε, where ε ~ N(0,I)
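To make this concrete, here is a minimal PyTorch sketch of the forward process, assuming a simple linear βₜ schedule with αₜ = 1 − βₜ; the schedule values, shapes, and function names are illustrative, not prescribed by the text.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)   # linear noise schedule (illustrative values)
alphas = 1.0 - betas                    # αₜ = 1 − βₜ
alphas_bar = torch.cumprod(alphas, 0)   # ᾱₜ = α₁ · α₂ · … · αₜ

def forward_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """One step of q(xₜ | xₜ₋₁): scale the previous image and add Gaussian noise."""
    eps = torch.randn_like(x_prev)
    return alphas[t].sqrt() * x_prev + (1 - alphas[t]).sqrt() * eps

def forward_jump(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Closed form q(xₜ | x₀): jump straight to timestep t, as used to build training pairs."""
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps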

Reverse Process p_θ(xₜ₋₁ | xₜ)

Learned (neural network). Predicts and removes the noise added at each step. Run T times during inference to produce a clean image from noise.

The model predicts ε_θ(xₜ, t), an estimate of the noise present in xₜ; the reverse update subtracts it to step toward xₜ₋₁.
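For intuition, here is a schematic sketch of the standard DDPM reverse update, assuming `model(x_t, t)` returns the predicted noise and that `alphas`, `alphas_bar`, and `betas` come from the schedule sketched above; variable names are illustrative.

def reverse_step(model, x_t, t):
    """One DDPM reverse step p_θ(xₜ₋₁ | xₜ) using the predicted noise ε_θ(xₜ, t)."""
    eps_theta = model(x_t, t)
    # Mean of p_θ: remove the predicted noise contribution and rescale.
    coef = (1 - alphas[t]) / (1 - alphas_bar[t]).sqrt()
    mean = (x_t - coef * eps_theta) / alphas[t].sqrt()
    if t == 0:
        return mean                      # no extra noise is added at the final step
    z = torch.randn_like(x_t)
    return mean + betas[t].sqrt() * z    # σₜ² = βₜ is the simple DDPM choice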

The U-Net Denoiser

The noise prediction network is a U-Net with attention layers. It takes:

  • The noisy image xₜ at timestep t
  • The timestep embedding (sinusoidal, like positional encoding in Transformers)
  • An optional conditioning signal (text embedding from CLIP)

The U-Net's encoder path downsamples with convolution and attention blocks, the bottleneck captures global context, and the decoder path upsamples with skip connections from the encoder.
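As a sketch of what a single denoising call looks like with the diffusers U-Net; the random tensors below are placeholders standing in for a real noisy latent and real CLIP text embeddings.

import torch
from diffusers import UNet2DConditionModel

# Load just the U-Net from the Stable Diffusion v1.5 repository.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

x_t = torch.randn(1, 4, 64, 64)      # noisy latent at timestep t (placeholder)
t = torch.tensor([500])              # current timestep
text_emb = torch.randn(1, 77, 768)   # CLIP text embeddings (placeholder)

# The U-Net predicts ε_θ(xₜ, t, text); the output has the same shape as x_t.
noise_pred = unet(x_t, t, encoder_hidden_states=text_emb).sample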

Latent Diffusion Models (Stable Diffusion)

Running diffusion in pixel space at 512×512 is expensive — T=1000 forward passes through a large U-Net. Latent Diffusion Models (LDM) solve this by compressing images first:

1. VAE Encoder: compress the 512×512×3 image to a 64×64×4 latent. Each spatial side is 8× smaller, so diffusion runs on roughly 64× fewer spatial positions.

2. Diffusion in Latent Space: run the entire forward/reverse process on the 64×64 latent representation. Text conditioning enters via cross-attention with the CLIP text encoder.

3. VAE Decoder: decode the denoised 64×64×4 latent back to a 512×512×3 pixel image.
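A minimal sketch of this encode → denoise → decode loop using the diffusers VAE; the image tensor is a placeholder, and 0.18215 is the latent scaling constant used by Stable Diffusion v1.x.

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)   # placeholder for a real image scaled to [-1, 1]

# 1. VAE encoder: 512×512×3 pixels -> 64×64×4 latent.
latent = vae.encode(image).latent_dist.sample() * 0.18215

# 2. Run the diffusion forward/reverse process on `latent` here.

# 3. VAE decoder: 64×64×4 latent -> 512×512×3 pixels.
decoded = vae.decode(latent / 0.18215).sample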

Sampling Algorithms

Sampler         | Steps  | Quality   | Notes
DDPM            | 1000   | High      | Original paper; slow
DDIM            | 50–100 | High      | Deterministic; 10× faster
DPM-Solver++    | 20–30  | Very High | Best speed/quality tradeoff
Euler Ancestral | 20–30  | High      | Popular in Stable Diffusion UIs
LCM             | 4–8    | Good      | Latent Consistency Model; near real-time
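Switching samplers in diffusers is a one-line scheduler swap. Here is a sketch using the DPM-Solver++ scheduler; the prompt and step count are illustrative.

import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Replace the default scheduler with DPM-Solver++ and sample in ~25 steps.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("a lighthouse in a storm", num_inference_steps=25).images[0]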

Text Conditioning: Classifier-Free Guidance

Classifier-Free Guidance (CFG) controls how closely the output follows the text prompt. The model is trained both with and without text conditioning (the prompt is randomly dropped during training). At inference:

ε_guided = ε_uncond + w · (ε_cond − ε_uncond)

w is the guidance scale (typically 7–12). Higher values follow the prompt more closely at the cost of diversity.
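A small sketch of how this combination is typically computed inside a sampling loop, assuming both predictions come from one batched U-Net call; tensor and function names are illustrative.

def guided_noise(eps_uncond, eps_cond, w=7.5):
    """Classifier-free guidance: push the prediction toward the text-conditioned direction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# In a typical sampling loop the two predictions share one forward pass:
#   noise_pred = unet(torch.cat([x_t, x_t]), t,
#                     encoder_hidden_states=torch.cat([uncond_emb, text_emb])).sample
#   eps_uncond, eps_cond = noise_pred.chunk(2)
#   eps = guided_noise(eps_uncond, eps_cond, w=7.5)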

Fine-Tuning Techniques

🎭 DreamBooth

Fine-tune Stable Diffusion on 3–20 images of a specific subject. Uses a rare token ("sks") as an identifier. Enables personalised generation of people, pets, objects.

🎨 LoRA for Diffusion

Low-Rank Adaptation for diffusion U-Nets. Fine-tune styles, characters, or concepts in minutes on a single GPU. Widely used in the Stable Diffusion community (Civitai); a loading sketch follows below.

🎛️ ControlNet

Add spatial conditioning: edges, depth maps, poses, sketches. A small trainable copy of the U-Net encoder learns to inject structural guidance into generation.
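As an example of how lightweight these adapters are in practice, here is a sketch of loading a LoRA into a Stable Diffusion pipeline with diffusers; the repo id "some-user/watercolor-lora" and the prompt are placeholders, not real checkpoints.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights for a style or character (placeholder repo id).
pipe.load_lora_weights("some-user/watercolor-lora")

# cross_attention_kwargs={"scale": ...} controls how strongly the LoRA is applied.
image = pipe(
    "a castle in the mountains, watercolor style",
    cross_attention_kwargs={"scale": 0.8},
).images[0]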

5-Line Quick Start

from diffusers import StableDiffusionPipeline
import torch

# Load Stable Diffusion v1.5 in half precision and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Generate an image from a text prompt; .images is a list of PIL images.
image = pipe("a photorealistic mountain at sunset, 4k").images[0]
image.save("output.png")

💡 Key Intuition

The model never "knows" what a real image looks like during inference — it only knows how to remove noise. Generation is purely iterative refinement from chaos. The quality comes entirely from the richness of patterns learned during training on billions of image-text pairs.

Frequently Asked Questions

How is Stable Diffusion XL different from SD 1.5?

SDXL (2023) uses a much larger U-Net (~2.6B parameters vs ~860M), natively generates 1024×1024, uses two CLIP text encoders for better prompt following, and adds a refiner model for final sharpening. It produces significantly better text rendering and composition than SD 1.5.
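For reference, a sketch of generating with SDXL via diffusers, following the same pattern as the quick start but with the base SDXL checkpoint.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# SDXL generates natively at 1024×1024.
image = pipe("a photorealistic mountain at sunset, 4k",
             height=1024, width=1024).images[0]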

What is negative prompting?

In Classifier-Free Guidance, the "unconditional" prediction can be replaced with a prediction conditioned on a negative prompt. Instead of guiding away from the generic unconditional output, the model guides away from the negative prompt's characteristics. Common negatives: "blurry, low quality, extra limbs, watermark". Very effective for removing common artefacts.
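Reusing the `pipe` object from the quick start above, the negative prompt is just an extra argument; the prompts here are illustrative.

image = pipe(
    "a portrait photo of an astronaut",
    negative_prompt="blurry, low quality, extra limbs, watermark",
    guidance_scale=7.5,
).images[0]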

Can diffusion models generate video?

Yes. Video diffusion models (Stable Video Diffusion, AnimateDiff, Sora) extend the U-Net with temporal attention layers to ensure frame-to-frame consistency. Sora uses a Diffusion Transformer (DiT) architecture — replacing the U-Net with a Transformer backbone — enabling much longer and higher-quality video sequences.
