GANs & VAEs

Generative Adversarial Networks (GANs) introduced a revolutionary training paradigm in 2014: pit two neural networks against each other until one becomes a master forger. Variational Autoencoders (VAEs) take a probabilistic approach — learning a compressed latent space from which new data can be sampled. Both remain foundational to understanding generative AI.

Prerequisites: Neural networks, CNNs, backpropagation, PyTorch basics. This page assumes you understand how to train a standard classifier.

How GANs Work

A GAN consists of two networks trained simultaneously with opposing objectives:

Generator G

Takes random noise z ~ N(0,1) as input. Produces fake samples (images, audio, etc.). Trained to fool the Discriminator — maximise the probability that D labels its output as real.

G(z) → fake_image

Discriminator D

Takes a sample (real or fake) as input. Outputs a probability that the input is real. Trained to correctly classify real vs fake samples.

D(x) → P(real) ∈ [0,1]
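
To make the two roles concrete, here is a minimal PyTorch sketch of both networks for 28×28 grayscale images (e.g. MNIST). The layer sizes and latent_dim = 100 are illustrative assumptions; practical GANs usually use convolutional architectures (see DCGAN below):

import torch.nn as nn

latent_dim = 100  # size of the noise vector z (assumed)

# Generator: noise vector → 1×28×28 image with pixel values in [−1, 1]
G = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Tanh(),
    nn.Unflatten(1, (1, 28, 28)),
)

# Discriminator: image → probability that the input is real
D = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)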

The Minimax Objective

The GAN training objective is a minimax game:

min_G max_D V(D, G) = 𝔼_{x~p_data}[log D(x)] + 𝔼_{z~p_z}[log(1 − D(G(z)))]

D wants to maximise V (correctly classify real and fake samples). G wants to minimise it, which in practice means driving D(G(z)) toward 1 via the second term. At the Nash equilibrium, G reproduces the data distribution and D outputs 1/2 everywhere, unable to tell real from fake.

GAN Training in PyTorch

# Training loop (simplified); assumes G, D, dataloader, and latent_dim
# are already defined (e.g. as in the sketch above)
import torch
import torch.nn as nn

criterion = nn.BCELoss()
optimizer_D = torch.optim.Adam(D.parameters(), lr=2e-4)  # illustrative learning rate
optimizer_G = torch.optim.Adam(G.parameters(), lr=2e-4)

for real_images, _ in dataloader:
    batch = real_images.size(0)
    ones = torch.ones(batch, 1)    # target label: "real"
    zeros = torch.zeros(batch, 1)  # target label: "fake"

    # --- Train Discriminator ---
    optimizer_D.zero_grad()
    real_loss = criterion(D(real_images), ones)    # push D(real) toward 1
    fake = G(torch.randn(batch, latent_dim))
    fake_loss = criterion(D(fake.detach()), zeros) # push D(fake) toward 0; detach keeps G out of this graph
    loss_D = (real_loss + fake_loss) / 2
    loss_D.backward()
    optimizer_D.step()

    # --- Train Generator ---
    optimizer_G.zero_grad()
    gen_loss = criterion(D(fake), ones)  # non-saturating loss: reward fooling D
    gen_loss.backward()
    optimizer_G.step()

Common GAN Failure Modes

💥 Mode Collapse

Generator learns to produce a single convincing output that fools D, ignoring the full data distribution. Fix: minibatch discrimination, unrolled GANs, Wasserstein loss.

⚖️ Training Instability

If D becomes too strong, gradients to G vanish; if D is too weak, G receives no meaningful training signal. Balancing the two requires careful learning-rate tuning and architectural choices.

📉 Vanishing Gradients

The original generator loss log(1 − D(G(z))) saturates when D confidently rejects fakes, so G receives almost no gradient early in training. Solution: use the non-saturating loss (minimise −log D(G(z)) instead), or switch to a Wasserstein GAN with gradient penalty (sketched below).
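
The gradient-penalty term itself is short enough to sketch. This is an illustrative implementation, assuming D is a WGAN critic (no sigmoid output) and image batches of shape (N, C, H, W):

import torch

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    # Interpolate randomly between real and fake samples
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_out = D(x_hat)
    # Gradient of the critic output w.r.t. the interpolated input
    grads = torch.autograd.grad(
        outputs=d_out, inputs=x_hat,
        grad_outputs=torch.ones_like(d_out),
        create_graph=True,
    )[0]
    # Penalise deviation of each sample's gradient norm from 1
    grad_norm = grads.reshape(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()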

Notable GAN Architectures

Model     | Year | Key Innovation
----------|------|-------------------------------------------------------------------
DCGAN     | 2015 | CNN-based G and D; batch norm; stable training recipe
WGAN-GP   | 2017 | Wasserstein distance + gradient penalty; fixes vanishing gradients
Pix2Pix   | 2017 | Conditional image-to-image translation (edges → photo)
CycleGAN  | 2017 | Unpaired image translation using cycle-consistency loss (sketched below)
BigGAN    | 2018 | Class-conditional generation at scale; ImageNet quality
StyleGAN2 | 2020 | Disentangled style control; photorealistic human faces
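
Some of these key ideas are compact at the loss level. For example, here is a sketch of CycleGAN's cycle-consistency term, assuming two generators G_xy: X → Y and G_yx: Y → X (names are illustrative):

import torch.nn.functional as F

def cycle_consistency_loss(G_xy, G_yx, x, y, lam=10.0):
    # A sample translated to the other domain and back should reconstruct itself
    loss_x = F.l1_loss(G_yx(G_xy(x)), x)  # x → Y → X
    loss_y = F.l1_loss(G_xy(G_yx(y)), y)  # y → X → Y
    return lam * (loss_x + loss_y)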

Variational Autoencoders (VAEs)

VAEs take a different approach. Instead of adversarial training, they learn a probabilistic latent space where similar data clusters together and new data can be sampled.

Encoder

Maps input x to a distribution in latent space: outputs a mean μ and variance σ² (in practice, log σ² for numerical stability) for each latent dimension.

Reparameterisation Trick

Sample z = μ + σ·ε where ε ~ N(0,1). This makes the sampling step differentiable so gradients can flow back through the encoder.

Decoder

Reconstructs x from z. Loss = Reconstruction loss + KL divergence (keeps latent space regular and close to N(0,1)).
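
Putting the three components together, here is a minimal VAE sketch in PyTorch. The layer sizes, latent_dim = 20, and the binary cross-entropy reconstruction term (which assumes pixel values in [0, 1]) are illustrative choices:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 400), nn.ReLU())
        self.fc_mu = nn.Linear(400, latent_dim)      # μ of q(z|x)
        self.fc_logvar = nn.Linear(400, latent_dim)  # log σ² of q(z|x)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 400), nn.ReLU(),
                                 nn.Linear(400, 28 * 28), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        eps = torch.randn_like(mu)              # ε ~ N(0, 1)
        z = mu + torch.exp(0.5 * logvar) * eps  # reparameterisation trick
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term + KL(q(z|x) || N(0, I)), summed over the batch
    recon_loss = F.binary_cross_entropy(recon, x.view(x.size(0), -1), reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl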

GANs vs VAEs at a Glance

GANs produce sharper, more realistic images but are harder to train and offer less control over the latent space. VAEs are more stable and have a structured latent space (great for interpolation and editing) but outputs tend to be blurrier. Diffusion models have largely superseded both for image generation quality.

Frequently Asked Questions

What does "latent space" mean?

The latent space is a compressed, lower-dimensional representation of the data. In a VAE, images are mapped to points in this space. Points that are nearby in latent space correspond to visually similar images. You can interpolate between two points in latent space to smoothly morph between two images.
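
As a concrete illustration, latent-space interpolation takes only a few lines of PyTorch. This assumes a trained model such as the VAE sketched earlier (vae = VAE()); the names and latent size are illustrative:

import torch

z1 = torch.randn(1, 20)  # two latent codes, e.g. the encoded means of two images
z2 = torch.randn(1, 20)

frames = []
for alpha in torch.linspace(0, 1, steps=8):
    z = (1 - alpha) * z1 + alpha * z2  # blend the two codes
    frames.append(vae.dec(z))          # decode the intermediate point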

Why did GANs fall out of favour?

GANs produce sharp outputs but suffer from mode collapse, training instability, and difficult hyperparameter tuning. Diffusion models (2020+) produce comparable or better image quality with more stable training, better diversity, and easier conditioning on text prompts. For most image generation tasks, diffusion is now preferred.

Is StyleGAN still worth learning?

Yes. StyleGAN2/3 are still the best open-source models for face generation and have unique capabilities like style mixing and fine-grained attribute control. They're widely used in entertainment, avatars, and data augmentation. The architecture concepts (mapping network, adaptive instance normalisation) are also instructive.
