Phase 9: Generative AI
Generative AI creates new data — images, audio, video, 3D objects — that is often indistinguishable from human-made content. GANs pioneered the field; diffusion models took it mainstream; multimodal models fused language and vision. Understanding how these systems work is essential for anyone building the next generation of AI products.
Build and deploy generative image and multimodal systems
6 – 10 weeks
PyTorch, Diffusers, Hugging Face, CLIP, ComfyUI
The Generative AI Landscape
Three distinct paradigms dominate generative AI, each with different training objectives, strengths, and use cases. Modern systems often combine multiple approaches.
GANs
Two networks compete: a Generator creates fake data, a Discriminator judges authenticity. Adversarial training produces sharp, realistic outputs but is notoriously hard to train.
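The adversarial objective can be sketched with plain NumPy. The discriminator scores below are hypothetical stand-ins for real network outputs; the point is how the two losses pull in opposite directions (the generator loss shown is the non-saturating form commonly used in practice).

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy for a single predicted probability p."""
    return -np.log(p) if label == 1 else -np.log(1.0 - p)

# Hypothetical discriminator outputs: probability that the input is real.
d_real = 0.9   # score on a real image
d_fake = 0.2   # score on a generator sample G(z)

# The Discriminator is trained to push real inputs toward 1, fakes toward 0.
d_loss = bce(d_real, 1) + bce(d_fake, 0)

# The Generator is trained to fool the Discriminator: it wants D(G(z)) -> 1.
g_loss = bce(d_fake, 1)

print(f"D loss: {d_loss:.3f}, G loss: {g_loss:.3f}")
```

As the generator improves, `d_fake` rises, shrinking `g_loss` while inflating `d_loss` — the tug-of-war that makes GAN training notoriously unstable.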
Diffusion Models
Learn to reverse a noise-adding process. Gradually denoise random noise into structured output. More stable training than GANs, better diversity, now the dominant paradigm.
Multimodal Models
Bridge vision and language. CLIP aligns text and image embeddings; GPT-4V, LLaVA, and Gemini understand and generate across modalities.
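CLIP's core idea — aligned embeddings scored by cosine similarity — can be illustrated without the actual encoders. The vectors below are synthetic stand-ins: in real CLIP, a text encoder and an image encoder are trained contrastively so that matching pairs land close together; here that alignment is faked by lightly perturbing the image embeddings.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(1)

# Synthetic unit-norm "image" embeddings and lightly perturbed "caption"
# embeddings standing in for trained encoder outputs.
image_embs = normalize(rng.standard_normal((3, 512)))
text_embs = normalize(image_embs + 0.01 * rng.standard_normal((3, 512)))

# Cosine-similarity matrix: entry [i, j] scores image i against caption j.
sims = image_embs @ text_embs.T

# Zero-shot retrieval: each image's best-matching caption is on the diagonal.
best = sims.argmax(axis=1)
print(best)
```

This same similarity matrix, fed through a softmax, is what CLIP uses for zero-shot classification: the class names become captions and the highest-scoring one wins.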
Why Diffusion Won
GANs dominated 2014–2021 but suffered from mode collapse (generating limited variety) and training instability. Diffusion Models solved both problems:
1. Forward process: progressively add Gaussian noise to an image over T steps (e.g., T=1000) until it becomes pure noise. This process is fixed — no learning needed.
2. Training: train a U-Net to predict the noise added at each step. The loss is the difference between the predicted noise and the actual noise.
3. Sampling: start from pure Gaussian noise and iteratively apply the learned denoising function T times to produce a clean image.
Topics in This Phase
GANs & VAEs
Generative Adversarial Networks, Variational Autoencoders, StyleGAN, CycleGAN. The foundations of generative modelling.
Diffusion Models
DDPM, DDIM, Latent Diffusion, Stable Diffusion, ControlNet. How text-to-image generation works under the hood.
Multimodal AI
CLIP, LLaVA, GPT-4V, Gemini. Systems that understand and generate across text, images, audio, and video.
Key Models Timeline
The Hugging Face diffusers library lets you run Stable Diffusion in 5 lines of Python. Start there to build intuition for the pipeline before diving into the math. The DDPM paper is the essential theory reference.
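That minimal pipeline looks roughly like the sketch below. The checkpoint ID and prompt are illustrative, and a CUDA GPU plus the `diffusers`, `transformers`, and `torch` packages are assumed:

```python
# pip install diffusers transformers accelerate torch
import torch
from diffusers import StableDiffusionPipeline

# Checkpoint and prompt are illustrative; any Stable Diffusion checkpoint works.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```

Under the hood, that one `pipe(...)` call runs the text encoder, the iterative U-Net denoising loop, and the VAE decoder — the components worth studying once the black box works.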
Frequently Asked Questions
What's the difference between Stable Diffusion and DALL-E?
Both use diffusion models, but Stable Diffusion operates in latent space (a compressed representation), making it much faster and cheaper. It's also fully open-source. DALL-E 3 (OpenAI) is closed-source and API-only, but integrated with ChatGPT for easier use. Stable Diffusion gives more control; DALL-E is simpler to use.
Are GANs still relevant in 2024?
For image generation, diffusion models have largely replaced GANs due to better diversity and stability. However, GANs are still widely used in video super-resolution, image-to-image translation, and real-time applications where diffusion's iterative inference is too slow. StyleGAN3 is still used for face generation tasks.
What compute do I need to train a diffusion model?
Training Stable Diffusion from scratch requires hundreds of A100-GPU-days (~$100K+). Fine-tuning on custom data with DreamBooth or LoRA can take as little as 30 minutes on a single A100 (roughly $5 of cloud GPU time). For inference, a 4 GB GPU is enough for 512×512 images.