RL Fundamentals
Before diving into deep neural networks and RLHF, you need to understand the mathematical foundation of reinforcement learning: Markov Decision Processes, value functions, and the Bellman equations. These concepts underpin every RL algorithm from Q-learning to PPO.
Markov Decision Processes (MDPs)
Every RL problem is formally described as an MDP, defined by the tuple (S, A, P, R, γ): S is the set of states, A the set of actions, P(s'|s,a) the state-transition probabilities, R(s,a,s') the reward function, and γ ∈ [0, 1] the discount factor that weights future rewards.
The Markov property: the future depends only on the current state, not the history. P(sₜ₊₁ | sₜ, aₜ) = P(sₜ₊₁ | s₀, a₀, ..., sₜ, aₜ). This assumption makes the math tractable.
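To make the tuple concrete, here is a minimal sketch of a tiny tabular MDP as plain NumPy arrays; the two states, transition probabilities, and rewards are made up purely for illustration:

import numpy as np

n_states, n_actions = 2, 2

# P[s, a, s'] = probability of landing in s' after taking action a in state s.
# Each row over s' sums to 1 and depends only on (s, a) -- the Markov property.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],  # transitions from state 1
])

# R[s, a, s'] = reward received on the transition (s, a) -> s'
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0  # reaching state 1 pays +1

gamma = 0.99  # discount factor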
Value Functions
Value functions estimate how good it is to be in a state (or take an action in a state) when following a particular policy π:
State Value V^π(s)
Expected cumulative return starting from state s and following policy π:
V^π(s) = E_π[Σₜ γᵗ rₜ | s₀ = s]
Action Value Q^π(s,a)
Expected cumulative return starting from state s, taking action a, then following π:
Q^π(s,a) = E_π[Σₜ γᵗ rₜ | s₀ = s, a₀ = a]
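The "cumulative return" in both definitions is the discounted sum of rewards collected along a trajectory. A minimal sketch of computing it from a recorded reward sequence (the rewards here are illustrative):

def discounted_return(rewards, gamma):
    # G = r_0 + gamma*r_1 + gamma^2*r_2 + ..., accumulated right-to-left
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

discounted_return([0.0, 0.0, 1.0], gamma=0.99)  # 1.0 * 0.99**2 = 0.9801

V^π(s) is then the average of this quantity over many trajectories that start in s and follow π; Q^π(s,a) additionally fixes the first action.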
The Bellman Equations
The Bellman equations express value functions recursively — the value of a state equals immediate reward plus discounted value of next states. This recursive structure is the foundation of virtually every RL algorithm.
Bellman Expectation Equation:
V^π(s) = Σₐ π(a|s) Σₛ' P(s'|s,a) [R(s,a,s') + γ V^π(s')]
Bellman Optimality Equation:
V*(s) = max_a Σₛ' P(s'|s,a) [R(s,a,s') + γ V*(s')]
The optimal policy can be derived from V*: at each state, pick the action that maximises expected return. The challenge: we rarely know P(s'|s,a) in real problems — we must estimate V* from experience.
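When P and R are known, as in the NumPy sketch above, the Bellman optimality equation turns directly into an algorithm: value iteration repeatedly applies its right-hand side as an update until V stops changing, then reads off the greedy policy. A minimal sketch (value_iteration is an illustrative helper, not a library function):

import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)  # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # V* and the greedy policy
        V = V_new

Once we cannot enumerate P like this, we are forced into the sample-based methods below.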
Temporal Difference Learning
TD learning estimates value functions by bootstrapping — using the current value estimate to update itself:
# TD(0) update: estimate V(s) from experience
# After observing transition (s, a, r, s'):
td_target = r + gamma * V[s_next]  # observed reward plus bootstrapped estimate
td_error = td_target - V[s]        # how wrong our current estimate was
V[s] += alpha * td_error           # move the estimate towards the target
# alpha = learning rate (e.g. 0.01)
# gamma = discount factor (e.g. 0.99)
Q-Learning
Q-learning directly learns the optimal action-value function Q*(s,a) without needing a model of the environment. It's off-policy — it learns from any experience regardless of how actions were chosen:
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
n_states = env.observation_space.n
n_actions = env.action_space.n
alpha, gamma, epsilon = 0.1, 0.99, 0.1
n_episodes = 10_000

# Q-table for discrete state/action spaces
Q = np.zeros((n_states, n_actions))

for episode in range(n_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy exploration
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # explore
        else:
            action = np.argmax(Q[state])  # exploit
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update: bootstrap from the best next action
        td_target = reward + gamma * np.max(Q[next_state])
        Q[state, action] += alpha * (td_target - Q[state, action])
        state = next_state

Exploration vs Exploitation
ε-Greedy
With probability ε, take a random action (explore). Otherwise, take the best known action (exploit). Decay ε over training as knowledge grows.
UCB (Upper Confidence Bound)
Choose actions with uncertainty bonuses: prefer actions with high estimated value OR high uncertainty. More principled than ε-greedy.
Boltzmann / Softmax
Sample actions proportionally to exp(Q/temperature). High temperature = more random. Used in LLM sampling (temperature parameter).
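A minimal sketch of all three action-selection strategies side by side, assuming a Q-table row q = Q[state] and, for UCB, per-action visit counts; the function names and hyperparameter values are illustrative:

import numpy as np

def epsilon_greedy(q, epsilon=0.1):
    if np.random.random() < epsilon:
        return np.random.randint(len(q))  # explore
    return int(np.argmax(q))              # exploit

def ucb(q, counts, t, c=2.0):
    # Uncertainty bonus shrinks as an action is tried more often
    bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-8))
    return int(np.argmax(q + bonus))

def softmax_sample(q, temperature=1.0):
    # Subtract the max for numerical stability before exponentiating
    logits = (q - q.max()) / temperature
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q), p=probs))

Note that softmax with a low temperature approaches greedy selection, which is exactly how the temperature parameter behaves in LLM decoding.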
Implement Q-learning on FrozenLake-v1 (discrete, small state space) then Taxi-v3. These environments are simple enough to visualise and debug, but rich enough to teach all fundamental RL concepts. Once Q-tables solve them, you'll understand why deep RL (DQN) is needed for larger spaces.
Frequently Asked Questions
What is the difference between on-policy and off-policy learning?
On-policy methods (SARSA, PPO) learn the value of the policy currently being used for action selection — they can only learn from experience generated by their own policy. Off-policy methods (Q-learning, DQN) can learn from experience generated by any policy (e.g., stored in a replay buffer from an old policy). Off-policy is more sample-efficient but can be less stable.
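The distinction is easiest to see as a one-line difference between update targets. A sketch, reusing the tabular Q, gamma, and transition variables from the Q-learning loop above; next_action is the action the current policy actually picks in next_state:

# On-policy (SARSA): bootstrap from the action the policy actually takes next
sarsa_target = reward + gamma * Q[next_state, next_action]
# Off-policy (Q-learning): bootstrap from the best action, regardless of behaviour
q_learning_target = reward + gamma * np.max(Q[next_state])

Because the Q-learning target never depends on how the behaviour policy chose its actions, transitions stored long ago in a replay buffer remain valid training data.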
When does Q-learning fail?
Q-learning uses a table indexed by (state, action) pairs — it only works when the state space is small and discrete. For continuous states (robot joint angles, pixel observations), you need function approximation. That's where Deep Q-Networks (DQN) come in: replace the Q-table with a neural network. However, this introduces training instability, addressed by experience replay and a target network.
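A minimal sketch of that replacement, assuming PyTorch; state_dim, n_actions, and the network shape are illustrative placeholders, and the target network shown is the stabilisation trick mentioned above:

import copy
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2  # placeholders for the environment's dimensions

q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)  # frozen copy, re-synced every few thousand steps

def dqn_loss(states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a) for the actions that were actually taken
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # targets come from the frozen network
        q_next = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1 - dones) * q_next  # no bootstrap past terminal
    return nn.functional.mse_loss(q, targets)

Experience replay then amounts to drawing the (states, actions, rewards, next_states, dones) batches from a buffer of past transitions rather than from the most recent step only.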
What does γ = 0.99 mean in practice?
With γ=0.99, a reward 100 steps in the future is worth 0.99^100 ≈ 0.37 of its face value. So the agent cares about the future but discounts it. γ=1.0 gives equal weight to all future rewards (only valid for finite-horizon problems). γ=0.0 is purely myopic (only care about immediate reward). For most problems, γ ∈ [0.95, 0.999] balances short- and long-term optimisation.
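A quick numeric check of those claims:

gamma = 0.99
print(gamma ** 100)     # ~0.366: a reward 100 steps away keeps about 37% of its value
print(1 / (1 - gamma))  # 100.0: a common rule of thumb for the effective planning horizon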