Phase 9: Generative AI
Generative AI creates new data — images, audio, video, 3D objects — that is often indistinguishable from human-made content. GANs pioneered the field; diffusion models took it mainstream; multimodal models fused language and vision. Understanding how these systems work is essential for anyone building the next generation of AI products.
Build and deploy generative image and multimodal systems
6 – 10 weeks
PyTorch, Diffusers, Hugging Face, CLIP, ComfyUI
The Generative AI Landscape
Three distinct paradigms dominate generative AI, each with different training objectives, strengths, and use cases. Modern systems often combine multiple approaches.
GANs
Two networks compete: a Generator creates fake data, a Discriminator judges authenticity. Adversarial training produces sharp, realistic outputs but is notoriously hard to train.
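The adversarial objective can be sketched with plain NumPy. The discriminator scores below are hypothetical stand-ins for real network outputs; the point is how the two losses pull in opposite directions (the generator loss shown is the non-saturating form commonly used in practice).

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy for a single predicted probability p."""
    return -np.log(p) if label == 1 else -np.log(1.0 - p)

# Hypothetical discriminator outputs: probability that the input is real.
d_real = 0.9   # score on a real image
d_fake = 0.2   # score on a generator sample G(z)

# The Discriminator is trained to push real inputs toward 1, fakes toward 0.
d_loss = bce(d_real, 1) + bce(d_fake, 0)

# The Generator is trained to fool the Discriminator: it wants D(G(z)) -> 1.
g_loss = bce(d_fake, 1)

print(f"D loss: {d_loss:.3f}, G loss: {g_loss:.3f}")
```

As the generator improves, `d_fake` rises, shrinking `g_loss` while inflating `d_loss` — the tug-of-war that makes GAN training notoriously unstable.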
Diffusion Models
Learn to reverse a noise-adding process. Gradually denoise random noise into structured output. More stable training than GANs, better diversity, now the dominant paradigm.
Multimodal Models
Bridge vision and language. CLIP aligns text and image embeddings; GPT-4V, LLaVA, and Gemini understand and generate across modalities.
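CLIP's core idea — aligned embeddings scored by cosine similarity — can be illustrated without the actual encoders. The vectors below are synthetic stand-ins: in real CLIP, a text encoder and an image encoder are trained contrastively so that matching pairs land close together; here that alignment is faked by lightly perturbing the image embeddings.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(1)

# Synthetic unit-norm "image" embeddings and lightly perturbed "caption"
# embeddings standing in for trained encoder outputs.
image_embs = normalize(rng.standard_normal((3, 512)))
text_embs = normalize(image_embs + 0.01 * rng.standard_normal((3, 512)))

# Cosine-similarity matrix: entry [i, j] scores image i against caption j.
sims = image_embs @ text_embs.T

# Zero-shot retrieval: each image's best-matching caption is on the diagonal.
best = sims.argmax(axis=1)
print(best)
```

This same similarity matrix, fed through a softmax, is what CLIP uses for zero-shot classification: the class names become captions and the highest-scoring one wins.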
Why Diffusion Won
GANs dominated 2014–2021 but suffered from mode collapse (generating limited variety) and training instability. Diffusion Models solved both problems:
1. Forward process: progressively add Gaussian noise to an image over T steps (e.g., T=1000) until it becomes pure noise. This process is fixed — no learning needed.
2. Training: train a U-Net to predict the noise added at each step. The loss is the difference between the predicted noise and the actual noise.
3. Sampling: start from pure Gaussian noise and iteratively apply the learned denoising function T times to produce a clean image.
Topics in This Phase
GANs & VAEs
Generative Adversarial Networks, Variational Autoencoders, StyleGAN, CycleGAN. The foundations of generative modelling.
Diffusion Models
DDPM, DDIM, Latent Diffusion, Stable Diffusion, ControlNet. How text-to-image generation works under the hood.
Multimodal AI
CLIP, LLaVA, GPT-4V, Gemini. Systems that understand and generate across text, images, audio, and video.
Key Models Timeline
The Hugging Face diffusers library lets you run Stable Diffusion in 5 lines of Python. Start there to build intuition for the pipeline before diving into the math. The DDPM paper is the essential theory reference.
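That minimal pipeline looks roughly like the sketch below. The checkpoint ID and prompt are illustrative, and a CUDA GPU plus the `diffusers`, `transformers`, and `torch` packages are assumed:

```python
# pip install diffusers transformers accelerate torch
import torch
from diffusers import StableDiffusionPipeline

# Checkpoint and prompt are illustrative; any Stable Diffusion checkpoint works.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```

Under the hood, that one `pipe(...)` call runs the text encoder, the iterative U-Net denoising loop, and the VAE decoder — the components worth studying once the black box works.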
Frequently Asked Questions
What's the difference between Stable Diffusion and DALL-E?
Both use diffusion models, but Stable Diffusion operates in latent space (a compressed representation), making it much faster and cheaper. It's also fully open-source. DALL-E 3 (OpenAI) is closed-source and API-only, but integrated with ChatGPT for easier use. Stable Diffusion gives more control; DALL-E is simpler to use.
Are GANs still relevant in 2024?
For image generation, diffusion models have largely replaced GANs due to better diversity and stability. However, GANs are still widely used in video super-resolution, image-to-image translation, and real-time applications where diffusion's iterative inference is too slow. StyleGAN3 is still used for face generation tasks.
What compute do I need to train a diffusion model?
Training Stable Diffusion from scratch requires hundreds of A100-GPU-days (~$100K+). Fine-tuning on custom data with DreamBooth or LoRA can take as little as 30 minutes on a single A100 (roughly $5 of cloud GPU time). For inference, a 4 GB GPU is enough for 512×512 images.