Fine-Tuning & LoRA — Adapt LLMs to Your Domain
Pre-trained LLMs are generalists. Fine-tuning makes them specialists — teaching a model your company's tone, a medical domain's vocabulary, or a coding assistant's patterns. With LoRA, you can do this on a single consumer GPU.
When Should You Fine-Tune?
✅ Fine-Tune When…
- You need a specific output format (JSON schema, code style)
- The model must learn domain-specific vocabulary (medical, legal)
- You want a consistent persona or tone across all responses
- Prompt engineering isn't enough to get reliable behaviour
- Latency matters and you want a smaller, faster specialised model
⚠️ Consider RAG Instead When…
- Your data changes frequently (daily news, product catalogue)
- You need the model to cite sources
- Your knowledge base is large (millions of documents or more)
- You don't have enough labelled training examples (<100 pairs)
Full Fine-Tuning vs LoRA vs QLoRA
| Method | Trainable Params | VRAM (7B model) | Quality | Best For |
|---|---|---|---|---|
| Full Fine-Tuning | 100% (~7B) | ~80 GB | ⭐⭐⭐⭐⭐ | Maximum quality, enterprise |
| LoRA | 0.1–1% | ~16 GB | ⭐⭐⭐⭐ | Research, most use cases |
| QLoRA | 0.1–1% | ~6 GB | ⭐⭐⭐½ | Consumer GPU, prototyping |
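The full fine-tuning figure is easy to sanity-check with back-of-envelope arithmetic. A rough sketch, assuming fp16 weights and gradients plus fp32 Adam moments, and ignoring activations:

```python
# Rough VRAM estimate for full fine-tuning a 7B model with AdamW
# (assumes fp16 weights/gradients, fp32 optimiser moments; activations excluded)
params = 7e9
weights = params * 2           # fp16 weights:   ~14 GB
gradients = params * 2         # fp16 gradients: ~14 GB
adam_moments = params * 4 * 2  # fp32 first + second moments: ~56 GB
print(f"~{(weights + gradients + adam_moments) / 1e9:.0f} GB")  # ~84 GB
```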
How LoRA Works
LoRA (Low-Rank Adaptation) is elegant: instead of updating a weight matrix W directly, it freezes W and learns a low-rank update, so the effective weight becomes W + A·B, where A and B are small matrices. The update to a 4096×4096 weight matrix (~16.8M params) can be represented with A (4096×8) and B (8×4096) — just 65K trainable params. During inference, you merge A·B back into W for zero overhead.
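A minimal sketch of that decomposition with plain tensors (shapes from the example above; real LoRA additionally scales the update by alpha/r):

```python
import torch

d, r = 4096, 8
W = torch.randn(d, d)         # frozen base weight: 16,777,216 params
A = torch.randn(d, r) * 0.01  # trainable down-projection
B = torch.zeros(r, d)         # trainable up-projection (zero-init: update starts at 0)

delta_W = A @ B                    # low-rank update, same shape as W
trainable = A.numel() + B.numel()  # 65,536 vs 16,777,216 (a 256x reduction)
merged = W + delta_W               # merge once for zero-overhead inference
```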
Setting Up QLoRA with PEFT
pip install transformers peft bitsandbytes datasets accelerate trl

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, TaskType
import torch

# 4-bit quantisation config (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # Double quantisation
)

model_id = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                   # Rank — higher = more capacity
    lora_alpha=32,                          # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,553,600 || all params: 8,036,352,000 || trainable%: 0.082

Preparing Custom Dataset
Fine-tuning data must be in instruction format. The example below uses the Alpaca-style prompt (instruction, optional input, response); chat models instead use a list of system, user, and assistant turns.
from datasets import Dataset

# Your raw training examples
raw_data = [
    {
        "instruction": "Summarise the following legal clause in plain English:",
        "input": "The indemnifying party shall defend, indemnify and hold harmless...",
        "output": "The party agrees to protect and compensate the other side for any losses."
    },
    # ... more examples
]

def format_example(example):
    # Append the tokenizer's EOS token so the model learns where to stop
    # (Llama 3 does not use </s> as its EOS token)
    prompt = f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}{tokenizer.eos_token}"""
    return {"text": prompt}

dataset = Dataset.from_list(raw_data)
dataset = dataset.map(format_example)

# Train/eval split
dataset = dataset.train_test_split(test_size=0.1)

Training with Hugging Face SFTTrainer
from trl import SFTTrainer
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./llama3-legalese",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # Effective batch = 8
    warmup_steps=100,
    learning_rate=2e-4,
    bf16=True,                      # Match the bfloat16 compute dtype in bnb_config
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    optim="paged_adamw_8bit",       # Memory-efficient optimiser
    report_to="wandb",              # Optional: track in W&B
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,  # Pack multiple examples per sequence
)
trainer.train()
# Save the LoRA adapter (small — a few MB)
model.save_pretrained("./llama3-legalese-adapter")
tokenizer.save_pretrained("./llama3-legalese-adapter")

Merging LoRA Adapters for Deployment
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model in full precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.float16,
    device_map="cpu",
)
# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, "./llama3-legalese-adapter")
merged_model = model.merge_and_unload() # Fuse A·B back into W
# Save full merged model for deployment
merged_model.save_pretrained("./llama3-legalese-merged")
print("Merged model saved — ready for vLLM or ONNX export") LoRA Hyperparameter Guide
r — Rank
Controls capacity. Start with 8–16. Higher rank = more parameters but risk of overfitting. Use 64+ only for complex tasks.
lora_alpha — Scale
Usually set to 2× rank. Acts as a learning rate scaler. Higher alpha = stronger adapter signal relative to base model.
target_modules — Layers
At minimum, target q_proj and v_proj. Adding k_proj, o_proj, and MLP layers improves quality at higher VRAM cost.
lora_dropout — Regularisation
0.05–0.1 for small datasets, 0 for large ones. Prevents overfitting to training examples.
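Putting the guide together, two illustrative configs (starting points, not tuned values; the MLP module names assume a Llama-style architecture):

```python
from peft import LoraConfig, TaskType

# Lightweight: small dataset, minimal VRAM
light = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
)

# Heavier: complex task, more VRAM available
heavy = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64, lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # attention + MLP
    lora_dropout=0.0,
)
```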
Frequently Asked Questions
How much training data do I need?
For instruction tuning: 500–2000 high-quality examples usually beats 10,000 low-quality ones. For domain adaptation (learning style/vocabulary), 1000+ examples. For behaviour cloning (copying a specific output pattern), as few as 50–100 very consistent examples can work. Quality beats quantity every time.
Can I fine-tune GPT-4 or Claude?
Yes. OpenAI supports fine-tuning for GPT-3.5-turbo, GPT-4o, and GPT-4o-mini via its API. Anthropic offers Claude fine-tuning for select models through Amazon Bedrock and the Claude API (availability varies by model — check the current docs). For full control and no per-token training fees, use open-source models like Llama 3, Mistral, Qwen, or Gemma with LoRA/QLoRA.
What's the difference between SFT and RLHF?
Supervised Fine-Tuning (SFT) trains on input→output pairs — the model learns to imitate. RLHF (Reinforcement Learning from Human Feedback) uses a reward model trained on human preferences to further align the model's outputs. Most production pipelines do SFT first, then optionally RLHF or DPO (Direct Preference Optimisation) for alignment.
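To make the distinction concrete, here is the shape of a preference dataset for DPO. The examples are hypothetical; the prompt/chosen/rejected column names follow trl's DPO convention:

```python
from datasets import Dataset

# Preference pairs: the same prompt with a preferred and a rejected completion
preference_data = [
    {
        "prompt": "Summarise this indemnification clause in plain English: ...",
        "chosen": "One party agrees to cover the other's losses.",
        "rejected": "The clause is about various legal things.",
    },
]
dpo_dataset = Dataset.from_list(preference_data)
# After SFT, pass dpo_dataset to trl's DPOTrainer instead of running a full RLHF loop
```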
Why does my fine-tuned model forget things (catastrophic forgetting)?
When fine-tuning on a narrow domain, the model can overwrite general knowledge. Mitigations: use LoRA (which leaves base weights intact), mix in 10–20% general instruction data alongside domain data, use a small learning rate, and don't train too many epochs. EWC (Elastic Weight Consolidation) is a research technique that also helps.
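A minimal sketch of the data-mixing mitigation, assuming `domain_dataset` and `general_dataset` (hypothetical names) are already formatted identically:

```python
from datasets import interleave_datasets

# Sample ~15% general instruction data alongside the domain data
mixed = interleave_datasets(
    [domain_dataset, general_dataset],
    probabilities=[0.85, 0.15],
    seed=42,
)
```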