Phase 10: Reinforcement Learning

Reinforcement Learning is how AI systems learn from interaction and feedback rather than labelled data. RL powers AlphaGo, game-playing agents, robotics — and crucially, it's the backbone of RLHF, the technique that turns a base LLM into a helpful AI assistant. Understanding RL is now essential for anyone serious about how frontier AI models are trained.

🎯
Goal

Understand RL fundamentals through RLHF and modern alignment techniques

⏱️
Time

6 – 10 weeks

🛠️
Tools

Gymnasium, Stable-Baselines3, trl (Hugging Face), PyTorch

The RL Framework

RL is fundamentally different from supervised learning. There are no labelled examples — instead, an agent learns by taking actions in an environment and receiving rewards:

[Feedback loop: the agent (policy π) sends action aₜ to the environment; the environment returns the next state sₜ and reward rₜ to the agent.]

The agent's goal: learn a policy π(a|s) that maximises cumulative discounted reward:

G = r₀ + γr₁ + γ²r₂ + ... = Σ γᵗ rₜ

γ (the discount factor) ∈ [0, 1] controls how much the agent values future rewards relative to immediate ones: γ near 0 makes the agent myopic, while γ near 1 makes it weigh long-term reward almost as heavily as immediate reward.
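The return can be computed in a few lines; a minimal sketch in plain Python, with an illustrative reward sequence rather than output from any real environment:

```python
# Discounted return: a minimal sketch of G = r₀ + γr₁ + γ²r₂ + ...
# The reward sequence below is illustrative only.

def discounted_return(rewards, gamma):
    """Sum rewards back-to-front so each step folds in the discounted future."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]                 # three steps of reward 1
print(discounted_return(rewards, 0.9))    # 1 + 0.9 + 0.81 ≈ 2.71
print(discounted_return(rewards, 0.0))    # γ = 0: only the immediate reward counts
```

Summing in reverse avoids computing powers of γ explicitly and matches the recursive definition Gₜ = rₜ + γGₜ₊₁.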

Topics in This Phase

Where RL Is Used in AI

| Application | RL Technique | Example |
| --- | --- | --- |
| Game Playing | MCTS + RL | AlphaGo, AlphaStar, OpenAI Five |
| Robotics | SAC, PPO | Dexterous manipulation, locomotion |
| LLM Alignment | PPO (RLHF) | ChatGPT, Claude, Gemini training |
| Reasoning Models | GRPO, REINFORCE | DeepSeek-R1, o1/o3 |
| Recommendation | Contextual Bandits | Netflix, TikTok feed ranking |
| Trading | Deep RL | Portfolio optimisation, market making |
💡 Why Learn RL in 2025?

RLHF is how GPT-4, Claude, and Gemini are aligned after pre-training. DeepSeek-R1 and OpenAI's o3 use RL-based training (GRPO, process reward models) to learn extended chain-of-thought reasoning. RL is no longer a niche academic topic; it's at the heart of frontier AI training pipelines.

Frequently Asked Questions

Is RL harder to learn than supervised learning?

Yes, significantly. RL involves non-stationary training distributions, sparse rewards, high-variance gradient estimates, and notoriously unstable training dynamics. Starting with Gymnasium environments (CartPole, LunarLander) and Stable-Baselines3 builds intuition before tackling the math. OpenAI's Spinning Up in Deep RL and Sutton & Barto's textbook are the canonical learning paths.

What's the difference between model-based and model-free RL?

Model-free RL (Q-learning, PPO) learns directly from experience without building an explicit model of the environment. It's simpler but sample-inefficient. Model-based RL (Dyna, MuZero) builds an internal world model and can plan ahead — much more sample-efficient but harder to implement. AlphaZero/AlphaGo use model-based RL (MCTS + learned value/policy networks).
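A minimal sketch of the model-free side: tabular Q-learning on a hypothetical five-state corridor. The environment, states, and hyperparameters below are all made up for illustration; real tasks would use Gymnasium environments:

```python
import random

# Tabular Q-learning on a hypothetical 1-D corridor (illustration only).
# States 0..4; actions 0 = left, 1 = right; reaching state 4 ends the
# episode with reward 1. Hyperparameters are arbitrary toy values.

N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def step(state, action):
    """Environment dynamics: deterministic moves with walls at both ends."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def greedy(state):
    """Best known action, breaking ties randomly."""
    if Q[state][0] == Q[state][1]:
        return random.randrange(2)
    return max((0, 1), key=lambda a: Q[state][a])

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(500):                                  # episodes
    s, done = 0, False
    while not done:
        # ε-greedy exploration
        a = random.randrange(2) if random.random() < EPSILON else greedy(s)
        s2, r, done = step(s, a)
        # Q-learning update: bootstrap from the best value in the next state
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

print([greedy(s) for s in range(GOAL)])  # learned policy: move right in every state
```

Note that the agent never stores the transition function: it learns Q(s, a) purely from sampled experience, which is exactly what "model-free" means.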

Do I need to understand RL math to work with RLHF?

For using RLHF tools (trl library, OpenRLHF) to fine-tune LLMs: no, you can treat it as a pipeline. For building or improving RLHF systems, understanding PPO's clipped surrogate objective and KL divergence penalty is important. For cutting-edge alignment research (DPO, GRPO, reward hacking), deep RL theory knowledge becomes essential.
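PPO's clipped surrogate objective and the KL-style penalty RLHF adds can be written down directly; a sketch with illustrative numbers, not tied to any particular library's API:

```python
import math

# PPO's clipped surrogate objective for a single action, plus the
# per-token KL-style penalty RLHF uses to keep the policy near the
# reference model. All probabilities below are illustrative.

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """L = min(r * A, clip(r, 1 - eps, 1 + eps) * A), with r the probability ratio."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)

def rlhf_reward(reward_model_score, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus a KL-style penalty toward the reference policy."""
    return reward_model_score - beta * (logp_policy - logp_ref)

# When the new policy overshoots (ratio 1.5 > 1.2) on a positive-advantage
# action, the clip caps the incentive at 1.2 * 2.0 = 2.4:
print(ppo_clipped_objective(math.log(0.6), math.log(0.4), advantage=2.0))
# Inside the clip range the objective is just ratio * advantage ≈ 2.25:
print(ppo_clipped_objective(math.log(0.45), math.log(0.4), advantage=2.0))
```

The clip is what makes PPO's updates conservative, and the β-weighted penalty is what stops an RLHF policy from drifting into reward-hacking text the reference model would never produce.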
