Deep Reinforcement Learning
Deep RL combines neural networks with reinforcement learning to handle high-dimensional state spaces — pixel observations, continuous control, complex games. DQN beat human Atari players in 2015. PPO trains robots to walk. These same algorithms (with modifications) train the reasoning capabilities of modern LLMs.
The Three Families of Deep RL
Value-Based
Learn Q(s,a) with a neural network. Derive policy by taking argmax over Q. Examples: DQN, Double DQN, Dueling DQN, Rainbow.
Policy-Based
Directly parameterise the policy π_θ(a|s) and optimise it with gradient ascent on expected return. Examples: REINFORCE, TRPO, PPO.
Actor-Critic
Combine both: Actor (policy) selects actions; Critic (value function) evaluates them. Reduces variance. Examples: A2C, PPO, SAC, TD3.
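As a concrete sketch of the actor-critic split, here is a minimal two-headed network in PyTorch; the class name, layer sizes, and activation are illustrative assumptions, not any library's implementation:

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        # Shared feature extractor used by both heads
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        # Actor head: logits over discrete actions, i.e. the policy π_θ(a|s)
        self.policy_head = nn.Linear(hidden, n_actions)
        # Critic head: scalar state value V(s) used to evaluate the actor
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features)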
DQN: Deep Q-Networks
DQN (DeepMind, 2015) solved Atari games from raw pixels using two key innovations:
Experience Replay
Store all transitions (s,a,r,s') in a replay buffer. Sample random minibatches for training. Breaks temporal correlations and enables data reuse.
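A replay buffer is essentially a bounded queue of transitions with uniform random sampling. A minimal sketch (a hypothetical buffer for illustration, not SB3's implementation):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # deque with maxlen evicts the oldest transitions automatically
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlations between consecutive steps
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones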
Target Network
A separate copy of the Q-network with frozen weights, used for computing TD targets. Updated slowly (every N steps) to stabilise training.
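A sketch of how the frozen target network enters the TD target (PyTorch; function and variable names are illustrative assumptions):

import torch

def td_targets(target_net, rewards, next_states, dones, gamma=0.99):
    # target_net is a periodically synced copy of the online Q-network,
    # e.g. created with copy.deepcopy(q_net) and refreshed every N steps via
    # target_net.load_state_dict(q_net.state_dict())
    with torch.no_grad():  # no gradient flows through the frozen target
        next_q = target_net(next_states).max(dim=1).values
    # Target: r + γ · max_a' Q_target(s', a'), with the bootstrap zeroed at terminal states
    return rewards + gamma * (1.0 - dones) * next_q

In Stable-Baselines3, both tricks are built into the DQN class; a minimal CartPole example: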
from stable_baselines3 import DQN
import gymnasium as gym

env = gym.make("CartPole-v1")
model = DQN("MlpPolicy", env, verbose=1,
            learning_rate=1e-4,           # conservative LR keeps Q-learning stable
            buffer_size=100_000,          # replay buffer capacity
            exploration_fraction=0.1)     # anneal ε over the first 10% of training
model.learn(total_timesteps=100_000)
model.save("dqn_cartpole")

Policy Gradient: REINFORCE
Policy gradient methods optimise the policy directly. The Policy Gradient Theorem gives us the gradient:
∇_θ J(θ) = 𝔼_π [ ∇_θ log π_θ(aₜ|sₜ) · Gₜ ]
Gₜ = cumulative return from step t. Increase probability of actions that led to high returns.
Problem: REINFORCE has extremely high variance. The solution is to subtract a baseline (usually the value function) — this is where Actor-Critic methods come from.
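A minimal sketch of the REINFORCE loss with an optional value baseline (PyTorch; log_probs, returns, and values are assumed to come from a collected rollout):

import torch

def reinforce_loss(log_probs, returns, values=None):
    # log_probs: log π_θ(aₜ|sₜ) per step; returns: Gₜ per step
    if values is not None:
        # Subtracting a baseline reduces variance without biasing the gradient;
        # detach so the policy loss does not backprop into the critic
        advantages = returns - values.detach()
    else:
        advantages = returns
    # Gradient ascent on expected return = descent on the negated objective
    return -(log_probs * advantages).mean()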
PPO: Proximal Policy Optimisation
PPO (OpenAI, 2017) is the workhorse of modern RL — simple, robust, and works across a huge range of tasks. It solves a key problem: policy gradient steps that are too large can catastrophically destroy the policy. PPO constrains update size via a clipped objective:
L^CLIP(θ) = 𝔼ₜ [min(rₜ(θ) Âₜ, clip(rₜ(θ), 1−ε, 1+ε) Âₜ)]
rₜ(θ) = π_θ(aₜ|sₜ) / π_θold(aₜ|sₜ), the ratio of new to old action probabilities. ε = 0.2 is typical. Âₜ = advantage estimate.
The clip prevents any single update from changing the policy too drastically. PPO is also used as the RL step in RLHF for LLM alignment.
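In code, the clipped objective is only a few lines; a sketch with illustrative names:

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)  # rₜ(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (min) bound: the policy gets no extra credit for moving
    # the ratio outside [1−ε, 1+ε], so no single update can change it drastically
    return -torch.min(unclipped, clipped).mean()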
Algorithm Comparison

Algorithm    Family         Action space             Data regime
DQN          Value-based    Discrete                 Off-policy (replay buffer)
REINFORCE    Policy-based   Discrete or continuous   On-policy
A2C          Actor-Critic   Discrete or continuous   On-policy
PPO          Actor-Critic   Discrete or continuous   On-policy
SAC          Actor-Critic   Continuous               Off-policy (replay buffer)
TD3          Actor-Critic   Continuous               Off-policy (replay buffer)
Training Tips
Reward Shaping
Sparse rewards (only +1 at success) are hard. Add intermediate rewards to guide learning. Be careful: poorly shaped rewards cause reward hacking.
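One common pattern is a wrapper that adds a small dense bonus on top of the environment's sparse reward. A sketch using Gymnasium's wrapper API; the bonus term and its 0.01 weight are illustrative assumptions to be tuned per task:

import gymnasium as gym

class ShapedReward(gym.Wrapper):
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Illustrative bonus: reward staying near x = 0 on a CartPole-like state.
        # Keep shaping small relative to the true reward to limit reward hacking.
        shaped_bonus = -0.01 * abs(obs[0])
        return obs, reward + shaped_bonus, terminated, truncated, info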
Normalise Observations
RL is sensitive to input scale. Normalise states to zero mean / unit variance (use VecNormalize in Stable-Baselines3); on some tasks this alone speeds up training by an order of magnitude.
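Wiring VecNormalize into an SB3 training run looks like this (CartPole as a stand-in task):

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
# Maintains running mean/std of observations (and optionally rewards)
env = VecNormalize(env, norm_obs=True, norm_reward=True)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)
# Save the normalisation statistics so evaluation uses the same scaling
env.save("vecnormalize.pkl")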
Parallelise Environments
Run N environments in parallel (VecEnv) to collect diverse experience. PPO especially benefits — 8–64 parallel envs is standard for fast convergence.
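In SB3, make_vec_env spins up N copies of an environment behind a single VecEnv interface; a minimal sketch:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# 8 CartPole instances stepped in parallel; each update batch mixes
# experience from all of them, which decorrelates the training data
vec_env = make_vec_env("CartPole-v1", n_envs=8)
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=100_000)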
Implementing PPO or SAC from scratch is instructive but time-consuming and error-prone. Use stable-baselines3 (SB3) for reliable reference implementations. Once you understand the algorithms conceptually, read SB3's source code — it's clean, well-documented, and production-quality.
Frequently Asked Questions
Why is PPO used for RLHF instead of SAC?
PPO works in discrete token spaces (LLMs sample tokens), while SAC is designed for continuous action spaces. PPO's clipped objective also provides a natural constraint on how much the LLM policy changes per update — analogous to the KL penalty added in RLHF to prevent the model from drifting too far from the reference SFT model. SAC variants exist for language, but PPO remains the dominant RLHF algorithm.
What is the Advantage function?
The Advantage A(s,a) = Q(s,a) − V(s) measures how much better action a is compared to the average action in state s. Using advantage instead of raw returns reduces variance in policy gradient updates while keeping the signal. If A > 0, the action was better than average (increase its probability). If A < 0, it was worse (decrease its probability).
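In the simplest Monte Carlo case this is one line of code (illustrative tensors; practical implementations typically use a smoother estimator such as GAE):

import torch

returns = torch.tensor([1.0, 0.5, 2.0])  # sampled estimates of Q(s,a), e.g. Monte Carlo returns
values = torch.tensor([0.8, 0.9, 1.5])   # critic's V(s) predictions for the same states
advantages = returns - values            # A > 0: better than average; A < 0: worse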
How long does it take to train a Deep RL agent?
Gymnasium CartPole: minutes on CPU. Atari games: hours to days on a single GPU. MuJoCo locomotion: 1–4 hours on a modern GPU. Dota 2 (OpenAI Five): the equivalent of roughly 45,000 years of self-play. For RLHF on a 7B LLM: hours to days on 8×A100s. Sample efficiency is the main bottleneck — model-based RL and offline RL are active areas of research addressing this.