Fine-Tuning & LoRA — Adapt LLMs to Your Domain

Pre-trained LLMs are generalists. Fine-tuning makes them specialists — teaching a model your company's tone, a medical domain's vocabulary, or a coding assistant's patterns. With LoRA, you can do this on a single consumer GPU.

🎯 Covers: Full Fine-Tuning · LoRA · QLoRA · PEFT Library · Hugging Face Trainer · When to Fine-Tune · Custom Datasets

When Should You Fine-Tune?

✅ Fine-Tune When…

  • You need a specific output format (JSON schema, code style)
  • The model must learn domain-specific vocabulary (medical, legal)
  • You want a consistent persona or tone across all responses
  • Prompt engineering isn't enough to get reliable behaviour
  • Latency matters and you want a smaller, faster specialised model

⚠️ Consider RAG Instead When…

  • Your data changes frequently (daily news, product catalogue)
  • You need the model to cite sources
  • Your knowledge base is large (millions of documents)
  • You don't have enough labelled training examples (<100 pairs)

Full Fine-Tuning vs LoRA vs QLoRA

| Method | Trainable Params | VRAM (7B model) | Quality | Best For |
| --- | --- | --- | --- | --- |
| Full Fine-Tuning | 100% (~7B) | ~80 GB | ⭐⭐⭐⭐⭐ | Maximum quality, enterprise |
| LoRA | 0.1–1% | ~16 GB | ⭐⭐⭐⭐ | Research, most use cases |
| QLoRA | 0.1–1% | ~6 GB | ⭐⭐⭐½ | Consumer GPU, prototyping |
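The VRAM column for full fine-tuning can be sanity-checked with a back-of-envelope rule: AdamW training in bf16 needs roughly 12 bytes per parameter (2 B weights + 2 B gradients + 8 B fp32 optimizer moments), with activations on top. A rough sketch; the bytes-per-param figure is a rule of thumb, not a measurement:

```python
# Rough full-fine-tuning VRAM estimate: ~12 bytes/param with AdamW in bf16
# (2 B weights + 2 B grads + 8 B fp32 optimizer moments); activations extra.
def full_ft_vram_gb(n_params: float, bytes_per_param: int = 12) -> float:
    return n_params * bytes_per_param / 1024**3

print(f"7B full fine-tune: ~{full_ft_vram_gb(7e9):.0f} GB")  # → ~78 GB
```

This lines up with the ~80 GB in the table; LoRA and QLoRA shrink the gradient and optimizer terms to the adapter's few million parameters.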

How LoRA Works

LoRA (Low-Rank Adaptation) is elegant: instead of updating a weight matrix W directly, it learns a low-rank update, so the effective weight becomes W + A·B. The base matrix W stays frozen; only the small factors A and B are trained. A 4096×4096 weight matrix (16.8M params) might be adapted with A (4096×8) and B (8×4096): just 65K trainable params. At deployment you can merge A·B back into W, so inference has zero overhead.

🔢 LoRA Matrix Decomposition

Example: with r=4, A is 4096×4 and B is 4×4096 → 32,768 trainable params vs 16,777,216 for the full update (≈0.2%)
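A quick numeric check of the decomposition: the product A·B can never exceed rank r, and the parameter count for the 4096-dim, r=8 example matches the text. A minimal sketch (the rank check uses shrunken dimensions for speed):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                      # small dims just for the rank check
A = rng.standard_normal((d, r))
B = np.zeros((r, d))              # zero-init factor: the update starts at exactly 0
assert np.linalg.matrix_rank(A @ B) == 0
B = rng.standard_normal((r, d))
assert np.linalg.matrix_rank(A @ B) <= r   # update is low-rank by construction

# Parameter count for the 4096×4096, r=8 example from the text
full, lora = 4096 * 4096, 4096 * 8 * 2
print(f"{lora:,} LoRA params vs {full:,} full ({lora / full:.2%})")
# → 65,536 LoRA params vs 16,777,216 full (0.39%)
```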

Setting Up QLoRA with PEFT

Python · Install Dependencies
pip install transformers peft bitsandbytes datasets accelerate trl
Python · Load Model in 4-bit + Configure LoRA
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, TaskType
import torch

# 4-bit quantisation config (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,   # Double quantisation
)

model_id = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                  # Rank — higher = more capacity
    lora_alpha=32,         # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints trainable vs total parameter counts — typically well under 1% trainable
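You can sanity-check the printed count by hand: each adapted module adds r·(d_in + d_out) parameters. A sketch assuming Llama-3-8B-style shapes (32 layers; with grouped-query attention, q_proj/o_proj are 4096×4096 while k_proj/v_proj project to 1024); the exact figure will differ if your model's shapes do:

```python
# Per-module LoRA cost is r * (d_in + d_out); sum over target modules and layers.
# Shapes below are assumptions for a Llama-3-8B-style model with GQA.
def lora_params(shapes, r, n_layers=32):
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

shapes = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]  # q, k, v, o
print(f"{lora_params(shapes, r=16):,} trainable params")  # → 13,631,488 trainable params
```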

Preparing Custom Dataset

Fine-tuning data must be in a consistent prompt format. Two common layouts are the Alpaca-style instruction format (instruction/input/output fields rendered into a single prompt, used below) and the chat format (a list of system, user, and assistant turns).

Python · Format Dataset for Instruction Tuning
from datasets import Dataset

# Your raw training examples
raw_data = [
    {
        "instruction": "Summarise the following legal clause in plain English:",
        "input": "The indemnifying party shall defend, indemnify and hold harmless...",
        "output": "The party agrees to protect and compensate the other side for any losses."
    },
    # ... more examples
]

def format_example(example):
    prompt = f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}{tokenizer.eos_token}"""  # EOS from the tokenizer loaded earlier; "</s>" is Llama 2's token, not Llama 3's
    return {"text": prompt}

dataset = Dataset.from_list(raw_data)
dataset = dataset.map(format_example)

# Train/eval split
dataset = dataset.train_test_split(test_size=0.1)
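Bad records hurt more than extra records help, so a cheap validation pass before formatting pays off. A minimal sketch with toy data; field names match the example above:

```python
# Drop examples with empty or whitespace-only fields before formatting.
raw = [
    {"instruction": "Summarise:", "input": "Clause text...", "output": "Plain English."},
    {"instruction": "", "input": "x", "output": "y"},   # empty instruction: dropped
]

clean = [ex for ex in raw
         if all(ex.get(k, "").strip() for k in ("instruction", "input", "output"))]
print(f"kept {len(clean)}/{len(raw)} examples")  # → kept 1/2 examples
```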

Training with Hugging Face SFTTrainer

Python · Launch Fine-Tuning with TRL SFTTrainer
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama3-legalese",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,      # Effective batch = 8
    warmup_steps=100,
    learning_rate=2e-4,
    bf16=True,                          # matches bnb_4bit_compute_dtype (use fp16 on pre-Ampere GPUs)
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    optim="paged_adamw_8bit",           # Memory-efficient optimiser
    report_to="wandb",                  # Optional: track in W&B
)

# Note: in newer TRL releases, dataset_text_field, max_seq_length, and packing
# are set on trl.SFTConfig rather than passed to SFTTrainer; adjust to your version.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,                       # Pack multiple examples per sequence
)

trainer.train()

# Save the LoRA adapter (small — a few MB)
model.save_pretrained("./llama3-legalese-adapter")
tokenizer.save_pretrained("./llama3-legalese-adapter")

Merging LoRA Adapters for Deployment

Python · Merge Adapter into Base Model
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model in full precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.float16,
    device_map="cpu",
)

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, "./llama3-legalese-adapter")
merged_model = model.merge_and_unload()   # Fuse A·B back into W

# Save full merged model for deployment
merged_model.save_pretrained("./llama3-legalese-merged")
print("Merged model saved — ready for vLLM or ONNX export")

LoRA Hyperparameter Guide

r — Rank

Controls capacity. Start with 8–16. Higher rank = more parameters but risk of overfitting. Use 64+ only for complex tasks.

lora_alpha — Scale

Usually set to 2× rank. Acts as a learning rate scaler. Higher alpha = stronger adapter signal relative to base model.

target_modules — Layers

At minimum, target q_proj and v_proj. Adding k_proj, o_proj, and MLP layers improves quality at higher VRAM cost.

lora_dropout — Regularisation

0.05–0.1 for small datasets, 0 for large ones. Prevents overfitting to training examples.
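One way to pull these rules of thumb together is a small helper keyed on dataset size. The thresholds and values below are illustrative starting points, not tuned optima:

```python
# Rule-of-thumb LoRA hyperparameters by dataset size (illustrative defaults).
def suggest_lora_hparams(n_examples: int) -> dict:
    small = n_examples < 1000
    return {
        "r": 8 if small else 16,
        "lora_alpha": 16 if small else 32,     # keep alpha ≈ 2 × r
        "lora_dropout": 0.1 if small else 0.0,
        "target_modules": ["q_proj", "v_proj"] if small
        else ["q_proj", "k_proj", "v_proj", "o_proj"],
    }

print(suggest_lora_hparams(500))
```

Feed the resulting dict into LoraConfig as keyword arguments once you have settled on values.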

Frequently Asked Questions

How much training data do I need?

For instruction tuning: 500–2000 high-quality examples usually beats 10,000 low-quality ones. For domain adaptation (learning style/vocabulary), 1000+ examples. For behaviour cloning (copying a specific output pattern), as few as 50–100 very consistent examples can work. Quality beats quantity every time.

Can I fine-tune GPT-4 or Claude?

Yes, with caveats. OpenAI supports fine-tuning for GPT-3.5-turbo, GPT-4o, and GPT-4o-mini via its API. Anthropic offers fine-tuning for select Claude models through Amazon Bedrock (availability varies by model; check the current docs). For full control and no per-token training fees, use open-source models like Llama 3, Mistral, Qwen, or Gemma with LoRA/QLoRA.

What's the difference between SFT and RLHF?

Supervised Fine-Tuning (SFT) trains on input→output pairs — the model learns to imitate. RLHF (Reinforcement Learning from Human Feedback) uses a reward model trained on human preferences to further align the model's outputs. Most production pipelines do SFT first, then optionally RLHF or DPO (Direct Preference Optimisation) for alignment.

Why does my fine-tuned model forget things (catastrophic forgetting)?

When fine-tuning on a narrow domain, the model can overwrite general knowledge. Mitigations: use LoRA (which leaves base weights intact), mix in 10–20% general instruction data alongside domain data, use a small learning rate, and don't train too many epochs. EWC (Elastic Weight Consolidation) is a research technique that also helps.
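The mixing step can be sketched in plain Python with toy records; swap in your real domain and general-instruction datasets:

```python
import random

# Blend ~15% general instruction data into the domain set to limit forgetting.
def mix(domain, general, general_frac=0.15, seed=0):
    rng = random.Random(seed)
    n_general = int(len(domain) * general_frac / (1 - general_frac))
    blended = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(blended)
    return blended

domain = [{"text": f"domain-{i}"} for i in range(850)]
general = [{"text": f"general-{i}"} for i in range(5000)]
mixed = mix(domain, general)
print(len(mixed))  # → 1000 (850 domain + 150 general)
```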
