RL Fundamentals

Before diving into deep neural networks and RLHF, you need to understand the mathematical foundation of reinforcement learning: Markov Decision Processes, value functions, and the Bellman equations. These concepts underpin every RL algorithm from Q-learning to PPO.

Prerequisites: Linear algebra, probability theory, Python. Deep learning knowledge is NOT required for this page — classical RL is math + Python, no GPUs needed.

Markov Decision Processes (MDPs)

Every RL problem is formally described as an MDP, defined by the tuple (S, A, P, R, γ):

Symbol       Name                  Meaning
S            State space           All possible situations the agent can be in
A            Action space          All possible actions the agent can take
P(s'|s,a)    Transition function   Probability of reaching state s' from state s by taking action a
R(s,a,s')    Reward function       Immediate reward for taking action a in state s and transitioning to s'
γ            Discount factor       How much to value future rewards (0 = myopic, 1 = far-sighted)

The Markov property: the future depends only on the current state, not the history. P(sₜ₊₁ | sₜ, aₜ) = P(sₜ₊₁ | s₀, a₀, ..., sₜ, aₜ). This assumption makes the math tractable.
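To make the tuple concrete, here is a toy two-state weather MDP sketched as plain Python dicts. All states, actions, probabilities, and rewards are invented for illustration:

```python
import random

# Hypothetical model: P[s][a] -> list of (probability, next_state, reward)
P = {
    "sunny": {"walk": [(0.8, "sunny", 1.0), (0.2, "rainy", 0.0)]},
    "rainy": {"walk": [(0.6, "rainy", -1.0), (0.4, "sunny", 0.5)]},
}

def step(state, action):
    """Sample a transition from P(s'|s,a); return (next_state, reward)."""
    u = random.random()
    cumulative = 0.0
    for prob, next_state, reward in P[state][action]:
        cumulative += prob
        if u < cumulative:
            return next_state, reward
    return next_state, reward  # guard against floating-point rounding

next_state, reward = step("sunny", "walk")
```

Note that `step` only needs the current state, never the history — that is the Markov property in code.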

Value Functions

Value functions estimate how good it is to be in a state (or take an action in a state) when following a particular policy π:

State Value V^π(s)

Expected cumulative return starting from state s and following policy π:

V^π(s) = 𝔼_π [Σ γᵗ rₜ | s₀ = s]

Action Value Q^π(s,a)

Expected cumulative return starting from state s, taking action a, then following π:

Q^π(s,a) = 𝔼_π [Σ γᵗ rₜ | s₀=s, a₀=a]
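Both definitions are expectations of the same quantity, the discounted return Σ γᵗ rₜ. A minimal sketch of computing that return for one observed reward trajectory (the rewards are made up):

```python
def discounted_return(rewards, gamma):
    """G_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

discounted_return([1.0, 0.0, 2.0], 0.5)  # 1 + 0.5*0 + 0.25*2 = 1.5
```

V^π(s) is then the average of this quantity over many trajectories started from s under π.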

The Bellman Equations

The Bellman equations express value functions recursively — the value of a state equals immediate reward plus discounted value of next states. This recursive structure is the foundation of virtually every RL algorithm.

Bellman Expectation Equation:

V^π(s) = Σₐ π(a|s) Σₛ' P(s'|s,a) [R(s,a,s') + γ V^π(s')]

Bellman Optimality Equation:

V*(s) = max_a Σₛ' P(s'|s,a) [R(s,a,s') + γ V*(s')]

The optimal policy can be derived from V*: at each state, pick the action that maximises expected return. The challenge: we rarely know P(s'|s,a) in real problems — we must estimate V* from experience.
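When P and R *are* known, the optimality equation can be applied directly as value iteration: sweep over all states with the max-backup until V stops changing. A sketch on a randomly generated tabular MDP (all numbers here are arbitrary, just to make the code runnable):

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Hypothetical known model: P[s, a, s'] and R[s, a, s']
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # make each P[s, a, :] a distribution
R = rng.random((n_states, n_actions, n_states))
gamma = 0.9

V = np.zeros(n_states)
for _ in range(1000):
    # Q[s, a] = sum_s' P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
    Q = (P * (R + gamma * V)).sum(axis=2)
    V_new = Q.max(axis=1)                  # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:   # converged to a fixed point
        break
    V = V_new

policy = Q.argmax(axis=1)                  # greedy policy derived from V*
```

Because the backup is a γ-contraction, the loop is guaranteed to converge; the greedy `policy` at the fixed point is optimal.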

Temporal Difference Learning

TD learning estimates value functions by bootstrapping — using the current value estimate to update itself:

# TD(0) update: estimate V(s) from experience
# After observing transition (s, a, r, s'):

td_target = r + gamma * V[s_next]   # what we observed
td_error  = td_target - V[s]        # how wrong our estimate was
V[s]     += alpha * td_error        # update towards target

# alpha = learning rate (e.g. 0.01)
# gamma = discount factor (e.g. 0.99)

Q-Learning

Q-learning directly learns the optimal action-value function Q*(s,a) without needing a model of the environment. It's off-policy — it learns from any experience regardless of how actions were chosen:

import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
n_states = env.observation_space.n
n_actions = env.action_space.n
alpha, gamma, epsilon = 0.1, 0.99, 0.1
n_episodes = 10_000

# Q-table for discrete state/action spaces
Q = np.zeros((n_states, n_actions))

for episode in range(n_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy exploration
        if np.random.random() < epsilon:
            action = env.action_space.sample()   # explore
        else:
            action = int(np.argmax(Q[state]))    # exploit

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update: bootstrap from the greedy next action,
        # but not past a terminal state
        td_target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (td_target - Q[state, action])

        state = next_state

Exploration vs Exploitation

🎲 ε-Greedy

With probability ε, take a random action (explore). Otherwise, take the best known action (exploit). Decay ε over training as knowledge grows.

🔢 UCB (Upper Confidence Bound)

Choose actions with uncertainty bonuses: prefer actions with high estimated value OR high uncertainty. More principled than ε-greedy.

🌡️ Boltzmann / Softmax

Sample actions proportionally to exp(Q/temperature). High temperature = more random. Used in LLM sampling (temperature parameter).
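The Boltzmann rule can be sketched in a few lines. This is a minimal, self-contained version; the example Q-values are made up:

```python
import numpy as np

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                       # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(probs), p=probs))

softmax_action([1.0, 2.0, 0.5], temperature=0.5)
```

As temperature → 0 this approaches greedy argmax; as temperature → ∞ it approaches uniform random — exactly the knob exposed by LLM sampling.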

💡 Start with Gymnasium

Implement Q-learning on FrozenLake-v1 (discrete, small state space) then Taxi-v3. These environments are simple enough to visualise and debug, but rich enough to teach all fundamental RL concepts. Once Q-tables solve them, you'll understand why deep RL (DQN) is needed for larger spaces.

Frequently Asked Questions

What is the difference between on-policy and off-policy learning?

On-policy methods (SARSA, PPO) learn the value of the policy currently being used for action selection — they can only learn from experience generated by their own policy. Off-policy methods (Q-learning, DQN) can learn from experience generated by any policy (e.g., stored in a replay buffer from an old policy). Off-policy is more sample-efficient but can be less stable.
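The distinction shows up in a single line of the update rule: SARSA bootstraps from the action the behaviour policy actually takes next, while Q-learning bootstraps from the greedy action regardless of what was taken. A side-by-side sketch with made-up table sizes and transition values:

```python
import numpy as np

Q = np.zeros((4, 2))                  # hypothetical 4-state, 2-action table
gamma, alpha = 0.99, 0.1

# One observed transition, plus the action the policy actually chose next
state, action, reward, next_state, next_action = 0, 1, 1.0, 2, 0

# SARSA (on-policy): target uses the action really taken at next_state
sarsa_target = reward + gamma * Q[next_state, next_action]

# Q-learning (off-policy): target uses the best action, whoever acted
q_target = reward + gamma * np.max(Q[next_state])

Q[state, action] += alpha * (sarsa_target - Q[state, action])  # SARSA update
```

Because the Q-learning target never references `next_action`, the transition could have come from an old policy or a replay buffer — that is what makes it off-policy.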

When does Q-learning fail?

Q-learning uses a table indexed by (state, action) pairs — it only works when the state space is small and discrete. For continuous states (robot joint angles, pixel observations), you need function approximation. That's where Deep Q-Networks (DQN) come in: replace the Q-table with a neural network. However, this introduces training instability, addressed by experience replay and a target network.

What does γ = 0.99 mean in practice?

With γ=0.99, a reward 100 steps in the future is worth 0.99^100 ≈ 0.37 of its face value. So the agent cares about the future but discounts it. γ=1.0 gives equal weight to all future rewards (only valid for finite-horizon problems). γ=0.0 is purely myopic (only care about immediate reward). For most problems, γ ∈ [0.95, 0.999] balances short- and long-term optimisation.
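The arithmetic above is easy to verify, and 1/(1−γ) gives a handy rule of thumb for the "effective horizon" the agent plans over:

```python
gamma = 0.99
horizon_100 = gamma ** 100     # ≈ 0.366: a reward 100 steps ahead keeps ~37% of its value
effective_horizon = 1 / (1 - gamma)   # ≈ 100 steps: rough planning horizon for gamma=0.99
```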
