RLHF & LLM Alignment
RLHF — Reinforcement Learning from Human Feedback — is the technique that transforms a raw pre-trained LLM into a helpful, harmless AI assistant. It's why ChatGPT feels different from GPT-3, and why modern models follow instructions rather than just predicting next tokens. Understanding RLHF is essential for anyone working on AI alignment or frontier model training.
Why Pre-trained LLMs Need Alignment
A base LLM (pre-trained on next-token prediction) will helpfully complete "How do I make a bomb? Step 1:" — because that's a plausible text completion. It doesn't inherently understand that some outputs are helpful and others are harmful. RLHF injects human values into the model by training it to produce outputs that humans rate highly.
The RLHF Pipeline
1. Supervised fine-tuning (SFT). Fine-tune the base LLM on a curated dataset of high-quality instruction-response pairs. This produces the SFT model, the starting point for RLHF. Cost: roughly 1–3 days on 8×A100s for a 7B model.
2. Reward model training. Human annotators rank multiple model responses to the same prompt; a reward model (RM) is trained to predict human preference scores from response text (the standard pairwise loss is sketched after this list). The RM becomes an automated judge of human preference.
3. RL fine-tuning with PPO. Use PPO to optimise the SFT model to generate responses that maximise the reward model's score, with a KL divergence penalty that keeps the policy from drifting too far from the SFT model (this helps prevent reward hacking).
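To make step 2 concrete, the sketch below shows the standard pairwise (Bradley–Terry) loss used to train reward models from ranked responses. It is a minimal illustration; the function and tensor names are mine, not from any particular library.

import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    # chosen_scores / rejected_scores: (batch,) scalar RM scores for the
    # human-preferred and human-rejected response to the same prompt.
    # Push the RM to score the preferred response higher than the rejected one.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy batch of three comparisons.
print(reward_model_loss(torch.tensor([1.0, 0.2, 2.1]),
                        torch.tensor([0.5, 0.4, 1.0])))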
The PPO Objective for LLMs
L(θ) = 𝔼_{x∼D, y∼π_θ(·|x)} [ r_RM(x, y) − β · KL(π_θ(·|x) ‖ π_SFT(·|x)) ]
r_RM = reward model score. β controls the KL penalty strength. Without KL, the model learns to output gibberish that tricks the reward model.
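As a rough illustration of how this objective is computed on sampled responses, here is a minimal sketch. It assumes you already have the reward model score and per-token log-probabilities from the policy and the frozen SFT model; all names are illustrative.

import torch

def rlhf_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.1):
    # rm_score:        (batch,)          reward model score per response
    # policy_logprobs: (batch, seq_len)  log pi_theta(token | context) on sampled tokens
    # sft_logprobs:    (batch, seq_len)  log pi_SFT(token | context) on the same tokens
    # Single-sample KL estimate, summed over the response tokens.
    kl = (policy_logprobs - sft_logprobs).sum(dim=-1)
    return rm_score - beta * kl

# Toy shapes: two responses of five tokens each.
rm_score = torch.tensor([1.2, -0.3])
print(rlhf_reward(rm_score, torch.randn(2, 5), torch.randn(2, 5)))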
Modern Alternatives to RLHF
DPO (Direct Preference Optimisation)
Eliminates the separate reward model and RL training loop. Directly optimises on preference pairs (chosen vs rejected responses) using a closed-form loss. Simpler, more stable, widely adopted in open-source fine-tuning.
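For concreteness, the DPO loss fits in a few lines. This is a simplified sketch using summed sequence log-probabilities, not the trl implementation.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Each argument: (batch,) summed log-probability of a response under
    # the policy being trained or the frozen reference (SFT) model.
    chosen_margin = policy_chosen_lp - ref_chosen_lp        # implicit reward of chosen
    rejected_margin = policy_rejected_lp - ref_rejected_lp  # implicit reward of rejected
    # Maximise the log-sigmoid of the scaled gap between chosen and rejected.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of four preference pairs with random log-probs.
print(dpo_loss(*[torch.randn(4) for _ in range(4)]))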
Constitutional AI
Anthropic's technique. The model critiques and revises its own outputs against a set of principles ("constitution"). Reduces reliance on human labellers for harm avoidance. Powers Claude's safety properties.
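A minimal sketch of the critique-and-revise loop is below. It assumes a hypothetical generate(prompt) helper that calls the model, and the principles shown are illustrative, not Anthropic's actual constitution.

# Assumes a hypothetical generate(prompt: str) -> str helper that calls the model.
PRINCIPLES = [
    "Choose the response least likely to assist with harmful activities.",
    "Choose the response that is most honest about its own uncertainty.",
]

def critique_and_revise(prompt, response, generate):
    # One critique-revision pass per principle; the revised responses become
    # training data (SFT targets or preference pairs) for the next model.
    for principle in PRINCIPLES:
        critique = generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            "Point out any way the response violates the principle."
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response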
GRPO (Group Relative Policy Optimisation)
Used by DeepSeek-R1 for reasoning. Generates multiple responses per prompt, uses their relative scores as the reward signal (no separate reward model needed). Efficient for training chain-of-thought reasoning.
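The core of GRPO is normalising rewards within each group of responses sampled from the same prompt. A minimal sketch of that advantage computation, with illustrative names:

import torch

def group_relative_advantages(rewards, eps=1e-6):
    # rewards: (num_prompts, group_size), e.g. rule-based correctness scores
    # for several responses sampled from the same prompt.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Responses that beat their group's average get a positive advantage.
    return (rewards - mean) / (std + eps)

# Two prompts, four sampled responses each, scored 0/1 for correctness.
print(group_relative_advantages(torch.tensor([[1., 0., 0., 1.],
                                              [0., 0., 1., 0.]])))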
RLAIF (AI Feedback)
Replace human raters with an AI (often a more capable model) to generate preference data. Scales cheaply; used by Google Gemini and others. Quality depends on the AI judge's capabilities.
RLHF vs DPO: When to Use Each
RLHF (PPO)
- Full online RL — model improves in real time
- Can use continuous reward signals (not just pairs)
- Better ceiling for optimisation
- Complex to implement (RL instabilities)
- Requires reward model + PPO training
- Used by OpenAI, Anthropic for frontier models
DPO
- Offline optimisation — simpler pipeline
- Only needs preference pairs (chosen/rejected)
- No reward model needed
- Stable training, easy to implement
- Slightly lower ceiling than RLHF
- Default choice for open-source fine-tuning (trl library)
Reward Hacking
A critical failure mode: the model learns to maximise reward model score in ways that don't actually reflect human preferences. Examples:
- Generating very long responses (if raters prefer detail)
- Excessive sycophancy ("Great question! You're absolutely right...")
- Refusing every borderline request (over-refusal)
- Producing gibberish that exploits reward model blind spots
Mitigations: KL penalty, reward model ensembles, iterative RLHF with updated human feedback, process reward models (PRMs) that score reasoning steps rather than just final answers.
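As one concrete example of a mitigation, a reward model ensemble can score responses conservatively so that no single model's blind spot dominates. This is a simplified sketch, not any specific lab's implementation.

import torch

def ensemble_reward(scores: torch.Tensor) -> torch.Tensor:
    # scores: (num_reward_models, batch) of per-RM scores for each response.
    # Taking the minimum makes it harder for the policy to exploit any single
    # reward model's blind spots; mean-minus-variance is another common choice.
    return scores.min(dim=0).values

# Three reward models scoring two responses; the second response fools RM #0.
scores = torch.tensor([[0.9, 3.5],
                       [0.8, 0.1],
                       [1.0, 0.2]])
print(ensemble_reward(scores))  # tensor([0.8000, 0.1000])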
Practical RLHF with trl
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Any preference dataset with {"prompt": str, "chosen": str, "rejected": str} rows works;
# swap in your own dataset name here.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    beta=0.1,  # KL penalty weight
    max_length=1024,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    output_dir="./dpo-output",
)
trainer = DPOTrainer(model=model, args=training_args,
                     train_dataset=dataset, tokenizer=tokenizer)
trainer.train()

RLHF, DPO, and Constitutional AI are 2022–2023 techniques. The field moves fast: GRPO (2024), process reward models, scalable oversight, and debate-based alignment are active research directions. Follow Anthropic's research blog, DeepMind's alignment team, and the trl library changelogs to stay current.
Frequently Asked Questions
How many human preference labels does RLHF need?
InstructGPT (the original paper) used ~13,000 demonstrations and ~33,000 preference comparisons. Modern pipelines use hundreds of thousands to millions of comparisons. Quality matters more than quantity — expert annotators with detailed guidelines produce better reward models than crowdsourced data. Companies spend millions on human feedback data collection.
What is a Process Reward Model (PRM)?
Standard reward models score only the final output. PRMs score each step in a chain-of-thought. This catches errors early (step 3 was wrong even if the final answer happened to be correct) and provides a denser learning signal. OpenAI's "Let's Verify Step by Step" work showed PRMs outperforming outcome-only reward models on maths reasoning; DeepSeek-R1, by contrast, relied mainly on rule-based outcome rewards with GRPO rather than a learned PRM. Training PRMs requires annotating correctness at every reasoning step, which is expensive but effective.
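A toy sketch of how per-step PRM scores might be aggregated into a single reward follows. Taking the minimum over steps is one common choice (the product or mean are others); the numbers are illustrative.

import torch

def aggregate_step_scores(step_scores: torch.Tensor) -> torch.Tensor:
    # step_scores: (num_steps,) probability that each reasoning step is correct.
    # Taking the minimum penalises a chain with even one bad step.
    return step_scores.min()

# A four-step chain whose third step the PRM flags as likely wrong.
print(aggregate_step_scores(torch.tensor([0.95, 0.90, 0.20, 0.99])))  # tensor(0.2000)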
Can I do RLHF on my laptop?
DPO (the simpler alternative) on a 1B–3B model is feasible with 16 GB of RAM using 4-bit quantisation and LoRA. Full PPO-based RLHF on anything larger than about 1B parameters needs at least one A10G/A100-class GPU. The trl library's DPO implementation supports QLoRA, making preference fine-tuning accessible on consumer hardware; a sketch of that setup follows. Expect 2–4 hours for a LoRA DPO run on a 7B model on a single A100.
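As a rough illustration of the QLoRA setup, here is a sketch of loading a model in 4-bit and defining LoRA adapters to pass to DPOTrainer via its peft_config argument. Hyperparameters are illustrative, not a recommendation.

import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit so it fits in consumer-grade memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)

# Train only small LoRA adapters instead of the full weight matrices.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# Pass peft_config=peft_config to DPOTrainer (as in the example above) to run QLoRA DPO.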