RL Fundamentals

Before diving into deep neural networks and RLHF, you need to understand the mathematical foundation of reinforcement learning: Markov Decision Processes, value functions, and the Bellman equations. These concepts underpin every RL algorithm from Q-learning to PPO.

Prerequisites: Linear algebra, probability theory, Python. Deep learning knowledge is NOT required for this page — classical RL is math + Python, no GPUs needed.

Markov Decision Processes (MDPs)

Every RL problem is formally described as an MDP, defined by the tuple (S, A, P, R, γ):

Symbol       Name                  Meaning
S            State space           All possible situations the agent can be in
A            Action space          All possible actions the agent can take
P(s'|s,a)    Transition function   Probability of reaching state s' from state s by taking action a
R(s,a,s')    Reward function       Immediate reward for taking action a in state s and transitioning to s'
γ            Discount factor       How much to value future rewards (0 = myopic, 1 = far-sighted)

The Markov property: the future depends only on the current state, not the history. P(sₜ₊₁ | sₜ, aₜ) = P(sₜ₊₁ | s₀, a₀, ..., sₜ, aₜ). This assumption makes the math tractable.
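To make the tuple concrete, here is a toy two-state weather MDP sketched as plain Python dicts. All states, actions, probabilities, and rewards are invented for illustration:

```python
import random

# Hypothetical model: P[s][a] -> list of (probability, next_state, reward)
P = {
    "sunny": {"walk": [(0.8, "sunny", 1.0), (0.2, "rainy", 0.0)]},
    "rainy": {"walk": [(0.6, "rainy", -1.0), (0.4, "sunny", 0.5)]},
}

def step(state, action):
    """Sample a transition from P(s'|s,a); return (next_state, reward)."""
    u = random.random()
    cumulative = 0.0
    for prob, next_state, reward in P[state][action]:
        cumulative += prob
        if u < cumulative:
            return next_state, reward
    return next_state, reward  # guard against floating-point rounding

next_state, reward = step("sunny", "walk")
```

Note that `step` only needs the current state, never the history — that is the Markov property in code.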

Value Functions

Value functions estimate how good it is to be in a state (or take an action in a state) when following a particular policy π:

State Value V^π(s)

Expected cumulative return starting from state s and following policy π:

V^π(s) = 𝔼_π [Σ γᵗ rₜ | s₀ = s]

Action Value Q^π(s,a)

Expected cumulative return starting from state s, taking action a, then following π:

Q^π(s,a) = 𝔼_π [Σ γᵗ rₜ | s₀=s, a₀=a]
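Both definitions are expectations of the same quantity, the discounted return Σ γᵗ rₜ. A minimal sketch of computing that return for one observed reward trajectory (the rewards are made up):

```python
def discounted_return(rewards, gamma):
    """G_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

discounted_return([1.0, 0.0, 2.0], 0.5)  # 1 + 0.5*0 + 0.25*2 = 1.5
```

V^π(s) is then the average of this quantity over many trajectories started from s under π.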

The Bellman Equations

The Bellman equations express value functions recursively — the value of a state equals immediate reward plus discounted value of next states. This recursive structure is the foundation of virtually every RL algorithm.

Bellman Expectation Equation:

V^π(s) = Σₐ π(a|s) Σₛ' P(s'|s,a) [R(s,a,s') + γ V^π(s')]

Bellman Optimality Equation:

V*(s) = max_a Σₛ' P(s'|s,a) [R(s,a,s') + γ V*(s')]

The optimal policy can be derived from V*: at each state, pick the action that maximises expected return. The challenge: we rarely know P(s'|s,a) in real problems — we must estimate V* from experience.
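When P and R *are* known, the optimality equation can be applied directly as value iteration: sweep over all states with the max-backup until V stops changing. A sketch on a randomly generated tabular MDP (all numbers here are arbitrary, just to make the code runnable):

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Hypothetical known model: P[s, a, s'] and R[s, a, s']
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # make each P[s, a, :] a distribution
R = rng.random((n_states, n_actions, n_states))
gamma = 0.9

V = np.zeros(n_states)
for _ in range(1000):
    # Q[s, a] = sum_s' P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
    Q = (P * (R + gamma * V)).sum(axis=2)
    V_new = Q.max(axis=1)                  # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:   # converged to a fixed point
        break
    V = V_new

policy = Q.argmax(axis=1)                  # greedy policy derived from V*
```

Because the backup is a γ-contraction, the loop is guaranteed to converge; the greedy `policy` at the fixed point is optimal.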

Temporal Difference Learning

TD learning estimates value functions by bootstrapping — using the current value estimate to update itself:

# TD(0) update: estimate V(s) from experience
# After observing transition (s, a, r, s'):

td_target = r + gamma * V[s_next]   # what we observed
td_error  = td_target - V[s]        # how wrong our estimate was
V[s]     += alpha * td_error        # update towards target

# alpha = learning rate (e.g. 0.01)
# gamma = discount factor (e.g. 0.99)

Q-Learning

Q-learning directly learns the optimal action-value function Q*(s,a) without needing a model of the environment. It's off-policy — it learns from any experience regardless of how actions were chosen:

import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
n_states = env.observation_space.n
n_actions = env.action_space.n
alpha, gamma, epsilon = 0.1, 0.99, 0.1
n_episodes = 10_000

# Q-table for discrete state/action spaces
Q = np.zeros((n_states, n_actions))

for episode in range(n_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy exploration
        if np.random.random() < epsilon:
            action = env.action_space.sample()   # explore
        else:
            action = int(np.argmax(Q[state]))    # exploit

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update: bootstrap from the greedy next action,
        # but not past a terminal state
        td_target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (td_target - Q[state, action])

        state = next_state

Exploration vs Exploitation

🎲 ε-Greedy

With probability ε, take a random action (explore). Otherwise, take the best known action (exploit). Decay ε over training as knowledge grows.

🔢 UCB (Upper Confidence Bound)

Choose actions with uncertainty bonuses: prefer actions with high estimated value OR high uncertainty. More principled than ε-greedy.

🌡️ Boltzmann / Softmax

Sample actions proportionally to exp(Q/temperature). High temperature = more random. Used in LLM sampling (temperature parameter).
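The Boltzmann rule can be sketched in a few lines. This is a minimal, self-contained version; the example Q-values are made up:

```python
import numpy as np

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                       # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(probs), p=probs))

softmax_action([1.0, 2.0, 0.5], temperature=0.5)
```

As temperature → 0 this approaches greedy argmax; as temperature → ∞ it approaches uniform random — exactly the knob exposed by LLM sampling.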

💡 Start with Gymnasium

Implement Q-learning on FrozenLake-v1 (discrete, small state space) then Taxi-v3. These environments are simple enough to visualise and debug, but rich enough to teach all fundamental RL concepts. Once Q-tables solve them, you'll understand why deep RL (DQN) is needed for larger spaces.

Frequently Asked Questions

What is the difference between on-policy and off-policy learning?

On-policy methods (SARSA, PPO) learn the value of the policy currently being used for action selection — they can only learn from experience generated by their own policy. Off-policy methods (Q-learning, DQN) can learn from experience generated by any policy (e.g., stored in a replay buffer from an old policy). Off-policy is more sample-efficient but can be less stable.
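The distinction shows up in a single line of the update rule: SARSA bootstraps from the action the behaviour policy actually takes next, while Q-learning bootstraps from the greedy action regardless of what was taken. A side-by-side sketch with made-up table sizes and transition values:

```python
import numpy as np

Q = np.zeros((4, 2))                  # hypothetical 4-state, 2-action table
gamma, alpha = 0.99, 0.1

# One observed transition, plus the action the policy actually chose next
state, action, reward, next_state, next_action = 0, 1, 1.0, 2, 0

# SARSA (on-policy): target uses the action really taken at next_state
sarsa_target = reward + gamma * Q[next_state, next_action]

# Q-learning (off-policy): target uses the best action, whoever acted
q_target = reward + gamma * np.max(Q[next_state])

Q[state, action] += alpha * (sarsa_target - Q[state, action])  # SARSA update
```

Because the Q-learning target never references `next_action`, the transition could have come from an old policy or a replay buffer — that is what makes it off-policy.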

When does Q-learning fail?

Q-learning uses a table indexed by (state, action) pairs — it only works when the state space is small and discrete. For continuous states (robot joint angles, pixel observations), you need function approximation. That's where Deep Q-Networks (DQN) come in: replace the Q-table with a neural network. However, this introduces training instability, addressed by experience replay and a target network.

What does γ = 0.99 mean in practice?

With γ=0.99, a reward 100 steps in the future is worth 0.99^100 ≈ 0.37 of its face value. So the agent cares about the future but discounts it. γ=1.0 gives equal weight to all future rewards (only valid for finite-horizon problems). γ=0.0 is purely myopic (only care about immediate reward). For most problems, γ ∈ [0.95, 0.999] balances short- and long-term optimisation.
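The arithmetic above is easy to verify, and 1/(1−γ) gives a handy rule of thumb for the "effective horizon" the agent plans over:

```python
gamma = 0.99
horizon_100 = gamma ** 100     # ≈ 0.366: a reward 100 steps ahead keeps ~37% of its value
effective_horizon = 1 / (1 - gamma)   # ≈ 100 steps: rough planning horizon for gamma=0.99
```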
