Deep Reinforcement Learning
Deep RL combines neural networks with reinforcement learning to handle high-dimensional state spaces — pixel observations, continuous control, complex games. DQN beat human Atari players in 2015. PPO trains robots to walk. These same algorithms (with modifications) train the reasoning capabilities of modern LLMs.
The Three Families of Deep RL
Value-Based
Learn Q(s,a) with a neural network. Derive policy by taking argmax over Q. Examples: DQN, Double DQN, Dueling DQN, Rainbow.
Policy-Based
Directly parameterise the policy π_θ(a|s) and optimise it with gradient ascent on expected return. Examples: REINFORCE, TRPO, PPO.
Actor-Critic
Combine both: Actor (policy) selects actions; Critic (value function) evaluates them. Reduces variance. Examples: A2C, PPO, SAC, TD3.
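As a concrete sketch of the actor-critic split, here is a minimal two-headed network in PyTorch; the class name, layer sizes, and activation are illustrative assumptions, not any library's implementation:

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        # Shared feature extractor used by both heads
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        # Actor head: logits over discrete actions, i.e. the policy π_θ(a|s)
        self.policy_head = nn.Linear(hidden, n_actions)
        # Critic head: scalar state value V(s) used to evaluate the actor
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features)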
DQN: Deep Q-Networks
DQN (DeepMind, 2015) solved Atari games from raw pixels using two key innovations:
Experience Replay
Store all transitions (s,a,r,s') in a replay buffer. Sample random minibatches for training. Breaks temporal correlations and enables data reuse.
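A replay buffer is essentially a bounded queue of transitions with uniform random sampling. A minimal sketch (a hypothetical buffer for illustration, not SB3's implementation):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # deque with maxlen evicts the oldest transitions automatically
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlations between consecutive steps
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones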
Target Network
A separate copy of the Q-network with frozen weights, used for computing TD targets. Updated slowly (every N steps) to stabilise training.
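A sketch of how the frozen target network enters the TD target (PyTorch; function and variable names are illustrative assumptions):

import torch

def td_targets(target_net, rewards, next_states, dones, gamma=0.99):
    # target_net is a periodically synced copy of the online Q-network,
    # e.g. created with copy.deepcopy(q_net) and refreshed every N steps via
    # target_net.load_state_dict(q_net.state_dict())
    with torch.no_grad():  # no gradient flows through the frozen target
        next_q = target_net(next_states).max(dim=1).values
    # Target: r + γ · max_a' Q_target(s', a'), with the bootstrap zeroed at terminal states
    return rewards + gamma * (1.0 - dones) * next_q

In Stable-Baselines3, both tricks are built into the DQN class; a minimal CartPole example: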
from stable_baselines3 import DQN
import gymnasium as gym

env = gym.make("CartPole-v1")
model = DQN("MlpPolicy", env, verbose=1,
            learning_rate=1e-4,           # conservative LR keeps Q-learning stable
            buffer_size=100_000,          # replay buffer capacity
            exploration_fraction=0.1)     # anneal ε over the first 10% of training
model.learn(total_timesteps=100_000)
model.save("dqn_cartpole")

Policy Gradient: REINFORCE
Policy gradient methods optimise the policy directly. The Policy Gradient Theorem gives us the gradient:
∇_θ J(θ) = 𝔼_π [ ∇_θ log π_θ(aₜ|sₜ) · Gₜ ]
Gₜ = cumulative return from step t. Increase probability of actions that led to high returns.
Problem: REINFORCE has extremely high variance. The solution is to subtract a baseline (usually the value function) — this is where Actor-Critic methods come from.
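A minimal sketch of the REINFORCE loss with an optional value baseline (PyTorch; log_probs, returns, and values are assumed to come from a collected rollout):

import torch

def reinforce_loss(log_probs, returns, values=None):
    # log_probs: log π_θ(aₜ|sₜ) per step; returns: Gₜ per step
    if values is not None:
        # Subtracting a baseline reduces variance without biasing the gradient;
        # detach so the policy loss does not backprop into the critic
        advantages = returns - values.detach()
    else:
        advantages = returns
    # Gradient ascent on expected return = descent on the negated objective
    return -(log_probs * advantages).mean()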
PPO: Proximal Policy Optimisation
PPO (OpenAI, 2017) is the workhorse of modern RL — simple, robust, and works across a huge range of tasks. It solves a key problem: policy gradient steps that are too large can catastrophically destroy the policy. PPO constrains update size via a clipped objective:
L^CLIP(θ) = 𝔼ₜ [min(rₜ(θ) Âₜ, clip(rₜ(θ), 1−ε, 1+ε) Âₜ)]
rₜ(θ) = π_θ(aₜ|sₜ) / π_θold(aₜ|sₜ), the ratio of new to old action probabilities. ε = 0.2 is typical. Âₜ = advantage estimate.
The clip prevents any single update from changing the policy too drastically. PPO is also used as the RL step in RLHF for LLM alignment.
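In code, the clipped objective is only a few lines; a sketch with illustrative names:

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)  # rₜ(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (min) bound: the policy gets no extra credit for moving
    # the ratio outside [1−ε, 1+ε], so no single update can change it drastically
    return -torch.min(unclipped, clipped).mean()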
Algorithm Comparison

Algorithm    Family         Action space             Data regime
DQN          Value-based    Discrete                 Off-policy (replay buffer)
REINFORCE    Policy-based   Discrete or continuous   On-policy
A2C          Actor-Critic   Discrete or continuous   On-policy
PPO          Actor-Critic   Discrete or continuous   On-policy
SAC          Actor-Critic   Continuous               Off-policy (replay buffer)
TD3          Actor-Critic   Continuous               Off-policy (replay buffer)
Training Tips
Reward Shaping
Sparse rewards (only +1 at success) are hard. Add intermediate rewards to guide learning. Be careful: poorly shaped rewards cause reward hacking.
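One common pattern is a wrapper that adds a small dense bonus on top of the environment's sparse reward. A sketch using Gymnasium's wrapper API; the bonus term and its 0.01 weight are illustrative assumptions to be tuned per task:

import gymnasium as gym

class ShapedReward(gym.Wrapper):
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Illustrative bonus: reward staying near x = 0 on a CartPole-like state.
        # Keep shaping small relative to the true reward to limit reward hacking.
        shaped_bonus = -0.01 * abs(obs[0])
        return obs, reward + shaped_bonus, terminated, truncated, info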
Normalise Observations
RL is sensitive to input scale. Normalise states to zero mean / unit variance (use VecNormalize in Stable-Baselines3); on some tasks this alone speeds up training by an order of magnitude.
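Wiring VecNormalize into an SB3 training run looks like this (CartPole as a stand-in task):

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
# Maintains running mean/std of observations (and optionally rewards)
env = VecNormalize(env, norm_obs=True, norm_reward=True)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)
# Save the normalisation statistics so evaluation uses the same scaling
env.save("vecnormalize.pkl")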
Parallelise Environments
Run N environments in parallel (VecEnv) to collect diverse experience. PPO especially benefits — 8–64 parallel envs is standard for fast convergence.
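In SB3, make_vec_env spins up N copies of an environment behind a single VecEnv interface; a minimal sketch:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# 8 CartPole instances stepped in parallel; each update batch mixes
# experience from all of them, which decorrelates the training data
vec_env = make_vec_env("CartPole-v1", n_envs=8)
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=100_000)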
Implementing PPO or SAC from scratch is instructive but time-consuming and error-prone. Use stable-baselines3 (SB3) for reliable reference implementations. Once you understand the algorithms conceptually, read SB3's source code — it's clean, well-documented, and production-quality.
Frequently Asked Questions
Why is PPO used for RLHF instead of SAC?
PPO works in discrete token spaces (LLMs sample tokens), while SAC is designed for continuous action spaces. PPO's clipped objective also provides a natural constraint on how much the LLM policy changes per update — analogous to the KL penalty added in RLHF to prevent the model from drifting too far from the reference SFT model. SAC variants exist for language, but PPO remains the dominant RLHF algorithm.
What is the Advantage function?
The Advantage A(s,a) = Q(s,a) − V(s) measures how much better action a is compared to the average action in state s. Using advantage instead of raw returns reduces variance in policy gradient updates while keeping the signal. If A > 0, the action was better than average (increase its probability). If A < 0, it was worse (decrease its probability).
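In the simplest Monte Carlo case this is one line of code (illustrative tensors; practical implementations typically use a smoother estimator such as GAE):

import torch

returns = torch.tensor([1.0, 0.5, 2.0])  # sampled estimates of Q(s,a), e.g. Monte Carlo returns
values = torch.tensor([0.8, 0.9, 1.5])   # critic's V(s) predictions for the same states
advantages = returns - values            # A > 0: better than average; A < 0: worse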
How long does it take to train a Deep RL agent?
Gymnasium CartPole: minutes on CPU. Atari games: hours to days on a single GPU. MuJoCo locomotion: 1–4 hours on a modern GPU. Dota 2 (OpenAI Five): the equivalent of roughly 45,000 years of self-play. For RLHF on a 7B LLM: hours to days on 8×A100s. Sample efficiency is the main bottleneck — model-based RL and offline RL are active areas of research addressing this.