Fine-Tuning & LoRA — Adapt LLMs to Your Domain
Pre-trained LLMs are generalists. Fine-tuning makes them specialists — teaching a model your company's tone, a medical domain's vocabulary, or a coding assistant's patterns. With LoRA, you can do this on a single consumer GPU.
When Should You Fine-Tune?
✅ Fine-Tune When…
- You need a specific output format (JSON schema, code style)
- The model must learn domain-specific vocabulary (medical, legal)
- You want a consistent persona or tone across all responses
- Prompt engineering isn't enough to get reliable behaviour
- Latency matters and you want a smaller, faster specialised model
⚠️ Consider RAG Instead When…
- Your data changes frequently (daily news, product catalogue)
- You need the model to cite sources
- Your knowledge base is large (millions of documents or more)
- You don't have enough labelled training examples (<100 pairs)
Full Fine-Tuning vs LoRA vs QLoRA
| Method | Trainable Params | VRAM (7B model) | Quality | Best For |
|---|---|---|---|---|
| Full Fine-Tuning | 100% (~7B) | ~80 GB | ⭐⭐⭐⭐⭐ | Maximum quality, enterprise |
| LoRA | 0.1–1% | ~16 GB | ⭐⭐⭐⭐ | Research, most use cases |
| QLoRA | 0.1–1% | ~6 GB | ⭐⭐⭐½ | Consumer GPU, prototyping |
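The full fine-tuning figure is easy to sanity-check with back-of-envelope arithmetic. A rough sketch, assuming fp16 weights and gradients plus fp32 Adam moments, and ignoring activations:

```python
# Rough VRAM estimate for full fine-tuning a 7B model with AdamW
# (assumes fp16 weights/gradients, fp32 optimiser moments; activations excluded)
params = 7e9
weights = params * 2           # fp16 weights:   ~14 GB
gradients = params * 2         # fp16 gradients: ~14 GB
adam_moments = params * 4 * 2  # fp32 first + second moments: ~56 GB
print(f"~{(weights + gradients + adam_moments) / 1e9:.0f} GB")  # ~84 GB
```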
How LoRA Works
LoRA (Low-Rank Adaptation) is elegant: instead of updating a weight matrix W directly, it freezes W and learns a low-rank update, so the effective weight becomes W + A·B, where A and B are small matrices. The update to a 4096×4096 weight matrix (~16.8M params) can be represented with A (4096×8) and B (8×4096) — just 65K trainable params. During inference, you merge A·B back into W for zero overhead.
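A minimal sketch of that decomposition with plain tensors (shapes from the example above; real LoRA additionally scales the update by alpha/r):

```python
import torch

d, r = 4096, 8
W = torch.randn(d, d)         # frozen base weight: 16,777,216 params
A = torch.randn(d, r) * 0.01  # trainable down-projection
B = torch.zeros(r, d)         # trainable up-projection (zero-init: update starts at 0)

delta_W = A @ B                    # low-rank update, same shape as W
trainable = A.numel() + B.numel()  # 65,536 vs 16,777,216 (a 256x reduction)
merged = W + delta_W               # merge once for zero-overhead inference
```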
Setting Up QLoRA with PEFT
pip install transformers peft bitsandbytes datasets accelerate trl

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, TaskType
import torch

# 4-bit quantisation config (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # Double quantisation
)

model_id = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                   # Rank — higher = more capacity
    lora_alpha=32,                          # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,553,600 || all params: 8,036,352,000 || trainable%: 0.082

Preparing Custom Dataset
Fine-tuning data must be in instruction format. The example below uses the Alpaca-style prompt (instruction, optional input, response); chat models instead use a list of system, user, and assistant turns.
from datasets import Dataset

# Your raw training examples
raw_data = [
    {
        "instruction": "Summarise the following legal clause in plain English:",
        "input": "The indemnifying party shall defend, indemnify and hold harmless...",
        "output": "The party agrees to protect and compensate the other side for any losses."
    },
    # ... more examples
]

def format_example(example):
    # Append the tokenizer's EOS token so the model learns where to stop
    # (Llama 3 does not use </s> as its EOS token)
    prompt = f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}{tokenizer.eos_token}"""
    return {"text": prompt}

dataset = Dataset.from_list(raw_data)
dataset = dataset.map(format_example)

# Train/eval split
dataset = dataset.train_test_split(test_size=0.1)

Training with Hugging Face SFTTrainer
from trl import SFTTrainer
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./llama3-legalese",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # Effective batch = 8
    warmup_steps=100,
    learning_rate=2e-4,
    bf16=True,                      # Match the bfloat16 compute dtype in bnb_config
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    optim="paged_adamw_8bit",       # Memory-efficient optimiser
    report_to="wandb",              # Optional: track in W&B
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,  # Pack multiple examples per sequence
)
trainer.train()
# Save the LoRA adapter (small — a few MB)
model.save_pretrained("./llama3-legalese-adapter")
tokenizer.save_pretrained("./llama3-legalese-adapter")

Merging LoRA Adapters for Deployment
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model in full precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.float16,
    device_map="cpu",
)
# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, "./llama3-legalese-adapter")
merged_model = model.merge_and_unload() # Fuse A·B back into W
# Save full merged model for deployment
merged_model.save_pretrained("./llama3-legalese-merged")
print("Merged model saved — ready for vLLM or ONNX export") LoRA Hyperparameter Guide
r — Rank
Controls capacity. Start with 8–16. Higher rank = more parameters but risk of overfitting. Use 64+ only for complex tasks.
lora_alpha — Scale
Usually set to 2× rank. Acts as a learning rate scaler. Higher alpha = stronger adapter signal relative to base model.
target_modules — Layers
At minimum, target q_proj and v_proj. Adding k_proj, o_proj, and MLP layers improves quality at higher VRAM cost.
lora_dropout — Regularisation
0.05–0.1 for small datasets, 0 for large ones. Prevents overfitting to training examples.
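Putting the guide together, two illustrative configs (starting points, not tuned values; the MLP module names assume a Llama-style architecture):

```python
from peft import LoraConfig, TaskType

# Lightweight: small dataset, minimal VRAM
light = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
)

# Heavier: complex task, more VRAM available
heavy = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64, lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # attention + MLP
    lora_dropout=0.0,
)
```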
Frequently Asked Questions
How much training data do I need?
For instruction tuning: 500–2000 high-quality examples usually beats 10,000 low-quality ones. For domain adaptation (learning style/vocabulary), 1000+ examples. For behaviour cloning (copying a specific output pattern), as few as 50–100 very consistent examples can work. Quality beats quantity every time.
Can I fine-tune GPT-4 or Claude?
Yes. OpenAI supports fine-tuning for GPT-3.5-turbo, GPT-4o, and GPT-4o-mini via its API. Anthropic offers Claude fine-tuning for select models through Amazon Bedrock and the Claude API (availability varies by model — check the current docs). For full control and no per-token training fees, use open-source models like Llama 3, Mistral, Qwen, or Gemma with LoRA/QLoRA.
What's the difference between SFT and RLHF?
Supervised Fine-Tuning (SFT) trains on input→output pairs — the model learns to imitate. RLHF (Reinforcement Learning from Human Feedback) uses a reward model trained on human preferences to further align the model's outputs. Most production pipelines do SFT first, then optionally RLHF or DPO (Direct Preference Optimisation) for alignment.
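To make the distinction concrete, here is the shape of a preference dataset for DPO. The examples are hypothetical; the prompt/chosen/rejected column names follow trl's DPO convention:

```python
from datasets import Dataset

# Preference pairs: the same prompt with a preferred and a rejected completion
preference_data = [
    {
        "prompt": "Summarise this indemnification clause in plain English: ...",
        "chosen": "One party agrees to cover the other's losses.",
        "rejected": "The clause is about various legal things.",
    },
]
dpo_dataset = Dataset.from_list(preference_data)
# After SFT, pass dpo_dataset to trl's DPOTrainer instead of running a full RLHF loop
```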
Why does my fine-tuned model forget things (catastrophic forgetting)?
When fine-tuning on a narrow domain, the model can overwrite general knowledge. Mitigations: use LoRA (which leaves base weights intact), mix in 10–20% general instruction data alongside domain data, use a small learning rate, and don't train too many epochs. EWC (Elastic Weight Consolidation) is a research technique that also helps.
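A minimal sketch of the data-mixing mitigation, assuming `domain_dataset` and `general_dataset` (hypothetical names) are already formatted identically:

```python
from datasets import interleave_datasets

# Sample ~15% general instruction data alongside the domain data
mixed = interleave_datasets(
    [domain_dataset, general_dataset],
    probabilities=[0.85, 0.15],
    seed=42,
)
```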