Phase 10: Reinforcement Learning
Reinforcement Learning is how AI systems learn from interaction and feedback rather than labelled data. RL powers AlphaGo, game-playing agents, robotics — and crucially, it's the backbone of RLHF, the technique that turns a base LLM into a helpful AI assistant. Understanding RL is now essential for anyone serious about how frontier AI models are trained.
Goal: Understand RL fundamentals through RLHF and modern alignment techniques
Duration: 6–10 weeks
Tools: Gymnasium, Stable-Baselines3, trl (Hugging Face), PyTorch
The RL Framework
RL is fundamentally different from supervised learning. There are no labelled examples; instead, an agent learns by interacting with an environment: at each timestep t it observes a state sₜ, chooses an action aₜ, and receives a reward rₜ along with the next state sₜ₊₁.
The agent's goal: learn a policy π(a|s) that maximises cumulative discounted reward:
G = r₀ + γr₁ + γ²r₂ + ... = Σ γᵗ rₜ
γ (discount factor) ∈ [0,1] controls how much the agent values future rewards vs immediate rewards.
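The return is easy to sanity-check in code. A minimal sketch in plain Python (the γ values below are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = r0 + γ·r1 + γ²·r2 + ... for one finite episode."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + γ·G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With γ = 0.5, rewards [1, 1, 1] give 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```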
Topics in This Phase
RL Fundamentals
MDPs, Bellman equations, Q-learning, SARSA, temporal difference learning. The mathematical foundations of RL.
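To make the Q-learning update concrete, here is a tabular sketch on Gymnasium's FrozenLake-v1; the hyperparameters are illustrative assumptions, not tuned values:

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1  # assumed learning rate, discount, exploration

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # ε-greedy: explore with probability eps, otherwise act greedily
        if np.random.rand() < eps:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # TD update: nudge Q(s,a) toward r + γ·max_a' Q(s',a')
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
```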
Deep RL
DQN, Policy Gradients, Actor-Critic, PPO, SAC. Scaling RL with neural networks, from discrete games to complex continuous control problems.
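The heart of the policy-gradient family fits in a few lines. A minimal REINFORCE loss in PyTorch, assuming a CartPole-sized task (4 observations, 2 actions) and precomputed episode returns:

```python
import torch
import torch.nn as nn

# A tiny policy network; the sizes are assumptions for a CartPole-like task
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

def reinforce_loss(states, actions, returns):
    """REINFORCE: maximise E[log π(a|s) · G] by minimising its negative.
    states: (T, 4) float tensor; actions: (T,) long tensor; returns: (T,)."""
    log_probs = torch.log_softmax(policy(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log π(a_t|s_t)
    return -(chosen * returns).mean()
```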
RLHF & Alignment
Reinforcement Learning from Human Feedback, reward modelling, PPO for LLMs, DPO, Constitutional AI. How modern LLMs are aligned.
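DPO is distinctive because its loss is simple enough to write out directly. A sketch of the per-pair loss from the DPO paper, assuming sequence-level log-probabilities have already been computed for the chosen and rejected responses:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: -log σ(β·[(log π(y_w|x) − log π_ref(y_w|x))
                     − (log π(y_l|x) − log π_ref(y_l|x))])"""
    chosen_margin = logp_chosen - ref_chosen      # how much π prefers y_w vs reference
    rejected_margin = logp_rejected - ref_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```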
Where RL Is Used in AI
RLHF is how GPT-4, Claude, and Gemini are trained. Reasoning models such as DeepSeek-R1 and OpenAI's o3 are trained with RL (DeepSeek-R1 uses GRPO; process reward models are a related technique) to elicit chain-of-thought reasoning at test time. RL is no longer a niche academic topic: it's at the heart of frontier AI training pipelines.
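GRPO's central trick, group-relative advantages, is compact: rather than learning a value-function baseline as PPO does, each sampled completion's advantage is its reward standardised within the group of completions for the same prompt. A sketch (the epsilon is an assumed numerical-stability constant):

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """rewards: (num_prompts, group_size) tensor, one row per prompt.
    Advantage = (r − group mean) / group std, per the GRPO formulation."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```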
Frequently Asked Questions
Is RL harder to learn than supervised learning?
Yes, significantly. RL has non-stationary training distributions, sparse rewards, high-variance gradient estimates, and notoriously unstable training dynamics. Starting with Gymnasium environments (CartPole, LunarLander) via Stable-Baselines3 builds intuition before you tackle the math, as in the sketch below. OpenAI's Spinning Up in RL and Sutton & Barto's textbook are the canonical learning paths.
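That first intuition takes very little code. A typical Stable-Baselines3 starter; the training budget is an illustrative guess and the hyperparameters are the library defaults:

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)  # default PPO hyperparameters
model.learn(total_timesteps=50_000)       # CartPole typically solves within this

# Roll out the trained policy for one episode
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
```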
What's the difference between model-based and model-free RL?
Model-free RL (Q-learning, PPO) learns directly from experience without building an explicit model of the environment. It's simpler but sample-inefficient. Model-based RL (Dyna, MuZero) builds an internal world model and can plan ahead — much more sample-efficient but harder to implement. AlphaZero/AlphaGo use model-based RL (MCTS + learned value/policy networks).
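The Dyna idea shows the contrast in miniature: after each real step, do the usual model-free update, record the transition in a learned model, then replay a few imagined transitions from it. A tabular sketch assuming a deterministic environment and an arbitrary five planning steps:

```python
import random
import numpy as np

def dyna_q_step(Q, model, s, a, r, s_next, alpha=0.1, gamma=0.99, planning_steps=5):
    """One Dyna-Q iteration. Q: (n_states, n_actions) array;
    model: dict mapping (s, a) -> (r, s') learned from experience."""
    # Direct RL: the standard Q-learning update from the real transition
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    # Model learning: remember what the environment did
    model[(s, a)] = (r, s_next)
    # Planning: replay imagined transitions sampled from the model
    for _ in range(planning_steps):
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps2]) - Q[ps, pa])
```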
Do I need to understand RL math to work with RLHF?
For using RLHF tools (trl library, OpenRLHF) to fine-tune LLMs: no, you can treat it as a pipeline. For building or improving RLHF systems, understanding PPO's clipped surrogate objective and KL divergence penalty is important. For cutting-edge alignment research (DPO, GRPO, reward hacking), deep RL theory knowledge becomes essential.
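Since the clipped surrogate is the piece worth internalising, here it is as a standalone sketch, with the RLHF-style per-token KL penalty folded into the loss; the clip range and KL coefficient are typical values, not canonical ones:

```python
import torch

def ppo_rlhf_loss(logp_new, logp_old, logp_ref, advantages,
                  clip_eps=0.2, kl_coef=0.1):
    """PPO clipped surrogate plus a KL penalty keeping the policy
    close to the frozen reference (pretrained) model."""
    ratio = torch.exp(logp_new - logp_old)                  # π_new / π_old
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    kl_penalty = kl_coef * (logp_new - logp_ref)            # per-token KL estimate
    return -(surrogate - kl_penalty).mean()
```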