RLHF & LLM Alignment
RLHF — Reinforcement Learning from Human Feedback — is the technique that transforms a raw pre-trained LLM into a helpful, harmless AI assistant. It's why ChatGPT feels different from GPT-3, and why modern models follow instructions rather than just predicting next tokens. Understanding RLHF is essential for anyone working on AI alignment or frontier model training.
Why Pre-trained LLMs Need Alignment
A base LLM (pre-trained on next-token prediction) will helpfully complete "How do I make a bomb? Step 1:" — because that's a plausible text completion. It doesn't inherently understand that some outputs are helpful and others are harmful. RLHF injects human values into the model by training it to produce outputs that humans rate highly.
The RLHF Pipeline
1. Supervised fine-tuning (SFT). Fine-tune the base LLM on a curated dataset of high-quality instruction-response pairs. This produces the SFT model, the starting point for RLHF. Cost: roughly 1–3 days on 8×A100s for a 7B model.
2. Reward model training. Human annotators rank multiple model responses to the same prompt; a reward model (RM) is trained to predict human preference scores from response text (the standard pairwise loss is sketched after this list). The RM becomes an automated judge of human preference.
3. RL fine-tuning with PPO. Use PPO to optimise the SFT model to generate responses that maximise the reward model's score, with a KL divergence penalty that keeps the policy from drifting too far from the SFT model (this helps prevent reward hacking).
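To make step 2 concrete, the sketch below shows the standard pairwise (Bradley–Terry) loss used to train reward models from ranked responses. It is a minimal illustration; the function and tensor names are mine, not from any particular library.

import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    # chosen_scores / rejected_scores: (batch,) scalar RM scores for the
    # human-preferred and human-rejected response to the same prompt.
    # Push the RM to score the preferred response higher than the rejected one.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy batch of three comparisons.
print(reward_model_loss(torch.tensor([1.0, 0.2, 2.1]),
                        torch.tensor([0.5, 0.4, 1.0])))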
The PPO Objective for LLMs
L(θ) = 𝔼_{x∼D, y∼π_θ(·|x)} [ r_RM(x, y) − β · KL(π_θ(·|x) ‖ π_SFT(·|x)) ]
r_RM = reward model score. β controls the KL penalty strength. Without KL, the model learns to output gibberish that tricks the reward model.
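As a rough illustration of how this objective is computed on sampled responses, here is a minimal sketch. It assumes you already have the reward model score and per-token log-probabilities from the policy and the frozen SFT model; all names are illustrative.

import torch

def rlhf_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.1):
    # rm_score:        (batch,)          reward model score per response
    # policy_logprobs: (batch, seq_len)  log pi_theta(token | context) on sampled tokens
    # sft_logprobs:    (batch, seq_len)  log pi_SFT(token | context) on the same tokens
    # Single-sample KL estimate, summed over the response tokens.
    kl = (policy_logprobs - sft_logprobs).sum(dim=-1)
    return rm_score - beta * kl

# Toy shapes: two responses of five tokens each.
rm_score = torch.tensor([1.2, -0.3])
print(rlhf_reward(rm_score, torch.randn(2, 5), torch.randn(2, 5)))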
Modern Alternatives to RLHF
DPO (Direct Preference Optimisation)
Eliminates the separate reward model and RL training loop. Directly optimises on preference pairs (chosen vs rejected responses) using a closed-form loss. Simpler, more stable, widely adopted in open-source fine-tuning.
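For concreteness, the DPO loss fits in a few lines. This is a simplified sketch using summed sequence log-probabilities, not the trl implementation.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Each argument: (batch,) summed log-probability of a response under
    # the policy being trained or the frozen reference (SFT) model.
    chosen_margin = policy_chosen_lp - ref_chosen_lp        # implicit reward of chosen
    rejected_margin = policy_rejected_lp - ref_rejected_lp  # implicit reward of rejected
    # Maximise the log-sigmoid of the scaled gap between chosen and rejected.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of four preference pairs with random log-probs.
print(dpo_loss(*[torch.randn(4) for _ in range(4)]))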
Constitutional AI
Anthropic's technique. The model critiques and revises its own outputs against a set of principles ("constitution"). Reduces reliance on human labellers for harm avoidance. Powers Claude's safety properties.
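A minimal sketch of the critique-and-revise loop is below. It assumes a hypothetical generate(prompt) helper that calls the model, and the principles shown are illustrative, not Anthropic's actual constitution.

# Assumes a hypothetical generate(prompt: str) -> str helper that calls the model.
PRINCIPLES = [
    "Choose the response least likely to assist with harmful activities.",
    "Choose the response that is most honest about its own uncertainty.",
]

def critique_and_revise(prompt, response, generate):
    # One critique-revision pass per principle; the revised responses become
    # training data (SFT targets or preference pairs) for the next model.
    for principle in PRINCIPLES:
        critique = generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            "Point out any way the response violates the principle."
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response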
GRPO (Group Relative Policy Optimisation)
Used by DeepSeek-R1 for reasoning. Generates multiple responses per prompt, uses their relative scores as the reward signal (no separate reward model needed). Efficient for training chain-of-thought reasoning.
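The core of GRPO is normalising rewards within each group of responses sampled from the same prompt. A minimal sketch of that advantage computation, with illustrative names:

import torch

def group_relative_advantages(rewards, eps=1e-6):
    # rewards: (num_prompts, group_size), e.g. rule-based correctness scores
    # for several responses sampled from the same prompt.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Responses that beat their group's average get a positive advantage.
    return (rewards - mean) / (std + eps)

# Two prompts, four sampled responses each, scored 0/1 for correctness.
print(group_relative_advantages(torch.tensor([[1., 0., 0., 1.],
                                              [0., 0., 1., 0.]])))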
RLAIF (AI Feedback)
Replace human raters with an AI (often a more capable model) to generate preference data. Scales cheaply; used by Google Gemini and others. Quality depends on the AI judge's capabilities.
RLHF vs DPO: When to Use Each
RLHF (PPO)
- Full online RL — model improves in real time
- Can use continuous reward signals (not just pairs)
- Better ceiling for optimisation
- Complex to implement (RL instabilities)
- Requires reward model + PPO training
- Used by OpenAI, Anthropic for frontier models
DPO
- Offline optimisation — simpler pipeline
- Only needs preference pairs (chosen/rejected)
- No reward model needed
- Stable training, easy to implement
- Slightly lower ceiling than RLHF
- Default choice for open-source fine-tuning (trl library)
Reward Hacking
A critical failure mode: the model learns to maximise reward model score in ways that don't actually reflect human preferences. Examples:
- Generating very long responses (if raters prefer detail)
- Excessive sycophancy ("Great question! You're absolutely right...")
- Refusing every borderline request (over-refusal)
- Producing gibberish that exploits reward model blind spots
Mitigations: KL penalty, reward model ensembles, iterative RLHF with updated human feedback, process reward models (PRMs) that score reasoning steps rather than just final answers.
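As one concrete example of a mitigation, a reward model ensemble can score responses conservatively so that no single model's blind spot dominates. This is a simplified sketch, not any specific lab's implementation.

import torch

def ensemble_reward(scores: torch.Tensor) -> torch.Tensor:
    # scores: (num_reward_models, batch) of per-RM scores for each response.
    # Taking the minimum makes it harder for the policy to exploit any single
    # reward model's blind spots; mean-minus-variance is another common choice.
    return scores.min(dim=0).values

# Three reward models scoring two responses; the second response fools RM #0.
scores = torch.tensor([[0.9, 3.5],
                       [0.8, 0.1],
                       [1.0, 0.2]])
print(ensemble_reward(scores))  # tensor([0.8000, 0.1000])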
Practical RLHF with trl
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Any preference dataset with {"prompt": str, "chosen": str, "rejected": str} rows works;
# swap in your own dataset name here.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    beta=0.1,  # KL penalty weight
    max_length=1024,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    output_dir="./dpo-output",
)
trainer = DPOTrainer(model=model, args=training_args,
                     train_dataset=dataset, tokenizer=tokenizer)
trainer.train()

RLHF, DPO, and Constitutional AI are 2022–2023 techniques. The field moves fast: GRPO (2024), process reward models, scalable oversight, and debate-based alignment are active research directions. Follow Anthropic's research blog, DeepMind's alignment team, and the trl library changelogs to stay current.
Frequently Asked Questions
How many human preference labels does RLHF need?
InstructGPT (the original paper) used ~13,000 demonstrations and ~33,000 preference comparisons. Modern pipelines use hundreds of thousands to millions of comparisons. Quality matters more than quantity — expert annotators with detailed guidelines produce better reward models than crowdsourced data. Companies spend millions on human feedback data collection.
What is a Process Reward Model (PRM)?
Standard reward models score only the final output. PRMs score each step in a chain-of-thought. This catches errors early (step 3 was wrong even if the final answer happened to be correct) and provides a denser learning signal. OpenAI's "Let's Verify Step by Step" work showed PRMs outperforming outcome-only reward models on maths reasoning; DeepSeek-R1, by contrast, relied mainly on rule-based outcome rewards with GRPO rather than a learned PRM. Training PRMs requires annotating correctness at every reasoning step, which is expensive but effective.
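A toy sketch of how per-step PRM scores might be aggregated into a single reward follows. Taking the minimum over steps is one common choice (the product or mean are others); the numbers are illustrative.

import torch

def aggregate_step_scores(step_scores: torch.Tensor) -> torch.Tensor:
    # step_scores: (num_steps,) probability that each reasoning step is correct.
    # Taking the minimum penalises a chain with even one bad step.
    return step_scores.min()

# A four-step chain whose third step the PRM flags as likely wrong.
print(aggregate_step_scores(torch.tensor([0.95, 0.90, 0.20, 0.99])))  # tensor(0.2000)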
Can I do RLHF on my laptop?
DPO (the simpler alternative) on a 1B–3B model is feasible with 16 GB of RAM using 4-bit quantisation and LoRA. Full PPO-based RLHF on anything larger than about 1B parameters needs at least one A10G/A100-class GPU. The trl library's DPO implementation supports QLoRA, making preference fine-tuning accessible on consumer hardware; a sketch of that setup follows. Expect 2–4 hours for a LoRA DPO run on a 7B model on a single A100.
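As a rough illustration of the QLoRA setup, here is a sketch of loading a model in 4-bit and defining LoRA adapters to pass to DPOTrainer via its peft_config argument. Hyperparameters are illustrative, not a recommendation.

import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit so it fits in consumer-grade memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)

# Train only small LoRA adapters instead of the full weight matrices.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# Pass peft_config=peft_config to DPOTrainer (as in the example above) to run QLoRA DPO.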