Phase 10: Reinforcement Learning
Reinforcement Learning is how AI systems learn from interaction and feedback rather than labelled data. RL powers AlphaGo, game-playing agents, robotics — and crucially, it's the backbone of RLHF, the technique that turns a base LLM into a helpful AI assistant. Understanding RL is now essential for anyone serious about how frontier AI models are trained.
Goal: Understand RL fundamentals through RLHF and modern alignment techniques
Duration: 6–10 weeks
Tools: Gymnasium, Stable-Baselines3, trl (Hugging Face), PyTorch
The RL Framework
RL is fundamentally different from supervised learning. There are no labelled examples; instead, an agent learns by interacting with an environment: at each timestep t it observes a state sₜ, chooses an action aₜ, and receives a reward rₜ along with the next state sₜ₊₁.
The agent's goal: learn a policy π(a|s) that maximises cumulative discounted reward:
G = r₀ + γr₁ + γ²r₂ + ... = Σ γᵗ rₜ
γ (discount factor) ∈ [0,1] controls how much the agent values future rewards vs immediate rewards.
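The return is easy to sanity-check in code. A minimal sketch in plain Python (the γ values below are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = r0 + γ·r1 + γ²·r2 + ... for one finite episode."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + γ·G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With γ = 0.5, rewards [1, 1, 1] give 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```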
Topics in This Phase
RL Fundamentals
MDPs, Bellman equations, Q-learning, SARSA, temporal difference learning. The mathematical foundations of RL.
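To make the Q-learning update concrete, here is a tabular sketch on Gymnasium's FrozenLake-v1; the hyperparameters are illustrative assumptions, not tuned values:

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1  # assumed learning rate, discount, exploration

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # ε-greedy: explore with probability eps, otherwise act greedily
        if np.random.rand() < eps:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # TD update: nudge Q(s,a) toward r + γ·max_a' Q(s',a')
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
```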
Deep RL
DQN, Policy Gradients, Actor-Critic, PPO, SAC. Scaling RL with neural networks, from discrete games to complex continuous control problems.
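The heart of the policy-gradient family fits in a few lines. A minimal REINFORCE loss in PyTorch, assuming a CartPole-sized task (4 observations, 2 actions) and precomputed episode returns:

```python
import torch
import torch.nn as nn

# A tiny policy network; the sizes are assumptions for a CartPole-like task
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

def reinforce_loss(states, actions, returns):
    """REINFORCE: maximise E[log π(a|s) · G] by minimising its negative.
    states: (T, 4) float tensor; actions: (T,) long tensor; returns: (T,)."""
    log_probs = torch.log_softmax(policy(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log π(a_t|s_t)
    return -(chosen * returns).mean()
```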
RLHF & Alignment
Reinforcement Learning from Human Feedback, reward modelling, PPO for LLMs, DPO, Constitutional AI. How modern LLMs are aligned.
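DPO is distinctive because its loss is simple enough to write out directly. A sketch of the per-pair loss from the DPO paper, assuming sequence-level log-probabilities have already been computed for the chosen and rejected responses:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: -log σ(β·[(log π(y_w|x) − log π_ref(y_w|x))
                     − (log π(y_l|x) − log π_ref(y_l|x))])"""
    chosen_margin = logp_chosen - ref_chosen      # how much π prefers y_w vs reference
    rejected_margin = logp_rejected - ref_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```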
Where RL Is Used in AI
RLHF is how GPT-4, Claude, and Gemini are trained. Reasoning models such as DeepSeek-R1 and OpenAI's o3 are trained with RL (DeepSeek-R1 uses GRPO; process reward models are a related technique) to elicit chain-of-thought reasoning at test time. RL is no longer a niche academic topic: it's at the heart of frontier AI training pipelines.
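GRPO's central trick, group-relative advantages, is compact: rather than learning a value-function baseline as PPO does, each sampled completion's advantage is its reward standardised within the group of completions for the same prompt. A sketch (the epsilon is an assumed numerical-stability constant):

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """rewards: (num_prompts, group_size) tensor, one row per prompt.
    Advantage = (r − group mean) / group std, per the GRPO formulation."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```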
Frequently Asked Questions
Is RL harder to learn than supervised learning?
Yes, significantly. RL has non-stationary training distributions, sparse rewards, high-variance gradient estimates, and notoriously unstable training dynamics. Starting with Gymnasium environments (CartPole, LunarLander) via Stable-Baselines3 builds intuition before you tackle the math, as in the sketch below. OpenAI's Spinning Up in RL and Sutton & Barto's textbook are the canonical learning paths.
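That first intuition takes very little code. A typical Stable-Baselines3 starter; the training budget is an illustrative guess and the hyperparameters are the library defaults:

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)  # default PPO hyperparameters
model.learn(total_timesteps=50_000)       # CartPole typically solves within this

# Roll out the trained policy for one episode
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
```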
What's the difference between model-based and model-free RL?
Model-free RL (Q-learning, PPO) learns directly from experience without building an explicit model of the environment. It's simpler but sample-inefficient. Model-based RL (Dyna, MuZero) builds an internal world model and can plan ahead — much more sample-efficient but harder to implement. AlphaZero/AlphaGo use model-based RL (MCTS + learned value/policy networks).
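The Dyna idea shows the contrast in miniature: after each real step, do the usual model-free update, record the transition in a learned model, then replay a few imagined transitions from it. A tabular sketch assuming a deterministic environment and an arbitrary five planning steps:

```python
import random
import numpy as np

def dyna_q_step(Q, model, s, a, r, s_next, alpha=0.1, gamma=0.99, planning_steps=5):
    """One Dyna-Q iteration. Q: (n_states, n_actions) array;
    model: dict mapping (s, a) -> (r, s') learned from experience."""
    # Direct RL: the standard Q-learning update from the real transition
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    # Model learning: remember what the environment did
    model[(s, a)] = (r, s_next)
    # Planning: replay imagined transitions sampled from the model
    for _ in range(planning_steps):
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps2]) - Q[ps, pa])
```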
Do I need to understand RL math to work with RLHF?
For using RLHF tools (trl library, OpenRLHF) to fine-tune LLMs: no, you can treat it as a pipeline. For building or improving RLHF systems, understanding PPO's clipped surrogate objective and KL divergence penalty is important. For cutting-edge alignment research (DPO, GRPO, reward hacking), deep RL theory knowledge becomes essential.
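Since the clipped surrogate is the piece worth internalising, here it is as a standalone sketch, with the RLHF-style per-token KL penalty folded into the loss; the clip range and KL coefficient are typical values, not canonical ones:

```python
import torch

def ppo_rlhf_loss(logp_new, logp_old, logp_ref, advantages,
                  clip_eps=0.2, kl_coef=0.1):
    """PPO clipped surrogate plus a KL penalty keeping the policy
    close to the frozen reference (pretrained) model."""
    ratio = torch.exp(logp_new - logp_old)                  # π_new / π_old
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    kl_penalty = kl_coef * (logp_new - logp_ref)            # per-token KL estimate
    return -(surrogate - kl_penalty).mean()
```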