RAG — Retrieval-Augmented Generation

LLMs are powerful but frozen in time — they don't know about your documents, your internal database, or yesterday's news. RAG solves this by giving the LLM a way to look things up before answering, dramatically reducing hallucinations.

📖 Covers: What is RAG · Chunking · Embeddings · Vector Search · Retrieval · Generation · Evaluation

The Problem RAG Solves

❌ LLM Without RAG

Q: "What was our Q3 revenue?"

A: "I don't have access to your company's financial data, but typically Q3 revenue figures are..." [hallucination risk]

✅ LLM With RAG

Q: "What was our Q3 revenue?"

A: [Retrieves Q3_Report.pdf] "According to the Q3 2024 report, revenue was $4.2M, up 23% year-over-year."

How RAG Works — The Pipeline


Phase 1: Indexing (Do Once)

1. Load Documents

PDF, Word docs, web pages, databases, Confluence, Notion — any text source.

2. Chunk

Split documents into smaller pieces (200–1000 tokens each). Strategy matters: sentence splitting, semantic chunking, or fixed-size with overlap.

3. Embed

Convert each chunk to a vector using an embedding model (e.g., text-embedding-3-small, bge-large, Cohere Embed v3).

4. Store in Vector DB

Save vectors (and the original text) in a vector database for fast similarity search.
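The fixed-size-with-overlap strategy from step 2 can be sketched in a few lines. This is a minimal illustration, using word counts as a stand-in for tokens (the function name and sizes are illustrative, not a library API):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks (word counts stand in for tokens)."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = chunk_text("word " * 500, chunk_size=200, overlap=50)
# 500 words with a step of 150 yields chunks starting at words 0, 150, and 300
```

Each chunk repeats the last 50 words of its predecessor, so a sentence falling on a boundary still appears whole in at least one chunk. Production systems usually count real tokens (e.g. with a tokenizer) rather than words.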

Phase 2: Retrieval + Generation (Per Query)

5. Embed the Query

Convert the user's question to a vector using the same embedding model.

6. Similarity Search

Find the top-K most similar chunks using cosine similarity or ANN (approximate nearest neighbour) search.

7. Augment Prompt

Inject the retrieved chunks into the LLM prompt as context: "Answer based only on the following documents: ..."

8. Generate Answer

LLM generates a grounded answer using only the retrieved context. Much lower hallucination rate.
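Steps 5–6 boil down to one computation: cosine similarity between the query vector and every chunk vector. A toy sketch with hand-made 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and real stores use ANN indexes rather than this brute-force scan):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunks most similar to the query."""
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scores.sort(key=lambda s: s[1], reverse=True)
    return [i for i, _ in scores[:k]]

chunk_vecs = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 0.0, 1.0)]
query = (1.0, 0.05, 0.0)
print(top_k(query, chunk_vecs))  # the two vectors closest in direction to the query
```

A brute-force scan like this is O(N) per query; vector databases use ANN structures (e.g. HNSW graphs, as in FAISS) to make the search sub-linear at scale.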

Building a RAG System with LangChain

Python · Simple RAG Pipeline
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# 1. Load & chunk documents
loader = PyPDFLoader("company_report.pdf")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 2. Embed & store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

# 3. Create RAG chain
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    retriever=retriever,
)

# 4. Query
result = qa_chain.invoke({"query": "What was Q3 revenue?"})
print(result["result"])

Advanced RAG Techniques

Hybrid Search

Combine dense (embedding) search with sparse (BM25/TF-IDF) keyword search. Catches exact matches that semantic search misses.
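One common way to combine the two score lists is to min-max normalise each, then take a weighted sum. A minimal sketch, assuming each retriever returns a dict of chunk id to raw score (the `alpha` weight and chunk ids are illustrative):

```python
def hybrid_scores(dense: dict, sparse: dict, alpha: float = 0.5) -> dict:
    """Fuse dense and sparse retrieval scores after min-max normalisation.
    `dense`/`sparse` map chunk id -> raw score; `alpha` weights the dense side."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in scores.items()}
    d, s = norm(dense), norm(sparse)
    ids = set(d) | set(s)
    return {i: alpha * d.get(i, 0.0) + (1 - alpha) * s.get(i, 0.0) for i in ids}

dense = {"c1": 0.92, "c2": 0.80, "c3": 0.10}   # cosine similarities
sparse = {"c2": 12.0, "c4": 9.0, "c3": 1.0}    # BM25-style raw scores
fused = hybrid_scores(dense, sparse, alpha=0.5)
print(max(fused, key=fused.get))  # c2 ranks first: strong in both searches
```

Normalisation matters because cosine similarities (roughly 0–1) and BM25 scores (unbounded) live on different scales; without it, one retriever silently dominates.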

Re-ranking

After retrieving top-50 chunks, use a cross-encoder re-ranker to pick the best 5. Dramatically improves relevance.
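The two-stage pattern itself is simple. In the sketch below, `rerank_score` is a hypothetical stand-in for a real cross-encoder (which scores each query–chunk pair jointly); here a word-overlap count plays that role so the example is self-contained:

```python
def two_stage_retrieve(query, first_stage, rerank_score, n=50, k=5):
    """Cheap first-stage retrieval of n candidates, then precise re-ranking to k.
    `first_stage(query, n)` returns candidate chunks; `rerank_score(query, chunk)`
    returns a relevance score (a cross-encoder in production)."""
    candidates = first_stage(query, n)
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:k]

# Toy stand-ins: first stage returns everything; "re-ranker" counts shared words.
corpus = ["Q3 revenue was $4.2M", "office party planning", "revenue grew 23% in Q3"]
first_stage = lambda q, n: corpus[:n]
overlap = lambda q, c: len(set(q.lower().split()) & set(c.lower().split()))
top = two_stage_retrieve("Q3 revenue", first_stage, overlap, n=3, k=2)
print(top)
```

The design point: the first stage is fast but approximate, so it can afford a wide net (top-50); the re-ranker is slow but precise, so it only sees those 50 candidates instead of the whole corpus.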

HyDE

Hypothetical Document Embeddings: generate a hypothetical answer first, embed that, then search. Improves recall.

Query Expansion

Use the LLM to rewrite or expand the query in multiple ways, then search with all variants and merge results.
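The merge step is often done with Reciprocal Rank Fusion (RRF). A minimal sketch, assuming each query variant's search returns a ranked list of chunk ids (the ids below are made up):

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank).
    Documents ranked highly by several lists float to the top."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Ranked results for three LLM-generated rewrites of the same question
runs = [["c1", "c2", "c3"], ["c2", "c1", "c4"], ["c2", "c5", "c1"]]
merged = rrf_merge(runs)
print(merged)  # c2 first: it ranked highly in all three variant searches
```

RRF needs only ranks, not raw scores, so it merges results from retrievers whose scores aren't comparable (the constant k=60 is the value commonly used in the RRF literature).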

Contextual Compression

After retrieval, use an LLM to extract only the relevant sentences from each chunk before injecting into context.

GraphRAG

Build a knowledge graph from documents. Enables multi-hop reasoning across entities and relationships.

Evaluating RAG Quality

Faithfulness

Is the answer supported by retrieved context? (No hallucination)

Answer Relevance

Does the answer actually address the question asked?

Context Precision

Are the retrieved chunks actually relevant?

Context Recall

Did retrieval find all the relevant information?

Use RAGAS (ragas.io) to automatically evaluate all four metrics using LLM judges.
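At their core, the two retrieval metrics are set precision and recall over relevant chunks. A sketch with hand-labelled relevance (RAGAS automates exactly this labelling with an LLM judge; the chunk ids are illustrative):

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant."""
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of the relevant chunks that retrieval actually found."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

retrieved = ["c1", "c2", "c3", "c4"]   # what the retriever returned
relevant = ["c1", "c3", "c7"]          # hand-labelled ground truth
print(context_precision(retrieved, relevant))  # 2/4 = 0.5
print(context_recall(retrieved, relevant))     # 2/3 ≈ 0.67
```

Low precision means the context window is wasted on noise; low recall means the answer can't be grounded no matter how good the LLM is, so the two metrics diagnose different failure modes.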

Frequently Asked Questions

What chunk size should I use?

200–500 tokens for precise fact retrieval. 500–1000 tokens for complex reasoning. Always add 20% overlap between chunks so context isn't cut off at boundaries. Test different sizes — the optimal depends on your documents and queries.

RAG vs fine-tuning: which should I use?

RAG: when data changes frequently, you need source citations, or data is too large for context. Fine-tuning: when you need to change the model's behaviour or style, or your data is relatively static. Most production systems use both: fine-tune for style, RAG for knowledge.

How do I handle multi-document questions?

Multi-hop RAG: first retrieve relevant documents, then re-query with that context to find additional relevant documents. Or use structured knowledge graphs (GraphRAG) for complex entity relationships. LangGraph's iterative retrieval helps with multi-step reasoning.
