RAG — Retrieval-Augmented Generation
LLMs are powerful but frozen in time — they don't know about your documents, your internal database, or yesterday's news. RAG solves this by giving the LLM a way to look things up before answering, dramatically reducing hallucinations.
The Problem RAG Solves
Q: "What was our Q3 revenue?"
A: "I don't have access to your company's financial data, but typically Q3 revenue figures are..." [hallucination risk]
Q: "What was our Q3 revenue?"
A: [Retrieves Q3_Report.pdf] "According to the Q3 2024 report, revenue was $4.2M, up 23% year-over-year."
How RAG Works — The Pipeline
Phase 1: Indexing (Do Once)
1. Load: PDF, Word docs, web pages, databases, Confluence, Notion — any text source.
2. Chunk: Split documents into smaller pieces (200–1000 tokens each). Strategy matters: sentence splitting, semantic chunking, or fixed-size with overlap.
3. Embed: Convert each chunk to a vector using an embedding model (e.g., text-embedding-3-small, bge-large, Cohere Embed v3).
4. Store: Save vectors (and the original text) in a vector database for fast similarity search. (A from-scratch sketch of this phase follows these steps.)
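To make the indexing phase concrete, here is a minimal from-scratch sketch. It assumes the sentence-transformers package with the all-MiniLM-L6-v2 model (any embedding model from step 3 could be substituted), a hypothetical plain-text file company_report.txt as the corpus, and the fixed-size-with-overlap chunking strategy from step 2:

from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
import numpy as np

def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-size chunking with overlap so facts aren't cut off at boundaries."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

# Assumed embedding model; swap in text-embedding-3-small, bge-large, etc.
model = SentenceTransformer("all-MiniLM-L6-v2")

document = open("company_report.txt").read()  # hypothetical plain-text source
chunks = chunk_text(document)

# One vector per chunk, normalised so a dot product equals cosine similarity
index = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, dim)
np.save("index.npy", index)  # a real system would use a vector database instead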
Phase 2: Retrieval + Generation (Per Query)
1. Embed the query: Convert the user's question to a vector using the same embedding model.
2. Retrieve: Find the top-K most similar chunks using cosine similarity or approximate nearest neighbour (ANN) search.
3. Augment: Inject the retrieved chunks into the LLM prompt as context: "Answer based only on the following documents: ..."
4. Generate: The LLM produces a grounded answer using only the retrieved context, which sharply lowers the hallucination rate. (A matching sketch follows these steps.)
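A matching sketch of the per-query phase, reusing the model, chunks, and index built above; the top-K of 3 and the prompt wording are illustrative:

def retrieve(query: str, chunks: list[str], index, model, k: int = 3) -> list[str]:
    """Embed the query with the SAME model, then take the top-K by cosine similarity."""
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q_vec                # cosine similarities (vectors are normalised)
    top_k = scores.argsort()[::-1][:k]    # indices of the K most similar chunks
    return [chunks[i] for i in top_k]

query = "What was Q3 revenue?"
context = "\n---\n".join(retrieve(query, chunks, index, model))

# Inject the retrieved chunks into the prompt as grounding context
prompt = (
    "Answer based only on the following documents:\n"
    f"{context}\n\n"
    f"Question: {query}"
)
# `prompt` then goes to any LLM; the LangChain example below wires this up end to end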
Building a RAG System with LangChain
# Requires: pip install langchain langchain-openai langchain-community faiss-cpu pypdf
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# 1. Load & chunk documents
loader = PyPDFLoader("company_report.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 2. Embed & store in a FAISS index
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

# 3. Create RAG chain: the retriever fetches the top-4 chunks per query
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    retriever=retriever,
)

# 4. Query
result = qa_chain.invoke({"query": "What was Q3 revenue?"})
print(result["result"])
Advanced RAG Techniques
Hybrid Search
Combine dense (embedding) search with sparse (BM25/TF-IDF) keyword search. Catches exact matches that semantic search misses.
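As a hedged sketch of one common fusion scheme, the snippet below combines BM25 scores (via the rank-bm25 package, an assumption; any BM25 implementation works) with an existing dense ranking using reciprocal rank fusion; the constant 60 is the conventional RRF default:

from rank_bm25 import BM25Okapi  # pip install rank-bm25

def hybrid_search(query: str, chunks: list[str], dense_ranking: list[int],
                  k: int = 5, rrf_c: int = 60) -> list[int]:
    """Fuse sparse (BM25) and dense rankings with reciprocal rank fusion (RRF)."""
    bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])
    scores = bm25.get_scores(query.lower().split())
    sparse_ranking = sorted(range(len(chunks)), key=lambda i: -scores[i])

    fused: dict[int, float] = {}
    for ranking in (sparse_ranking, dense_ranking):
        for rank, idx in enumerate(ranking):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (rrf_c + rank)
    # Chunks ranked highly by EITHER method float to the top
    return sorted(fused, key=fused.get, reverse=True)[:k]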
Re-ranking
After retrieving top-50 chunks, use a cross-encoder re-ranker to pick the best 5. Dramatically improves relevance.
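A minimal sketch using the sentence-transformers CrossEncoder class; the ms-marco-MiniLM-L-6-v2 checkpoint is one commonly used re-ranker, not the only choice:

from sentence_transformers import CrossEncoder

# Assumed checkpoint; any cross-encoder re-ranker can be substituted
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Score each (query, chunk) pair jointly, then keep only the best top_n."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: -pair[0])
    return [chunk for _, chunk in ranked[:top_n]]

# first_stage_top_50 is hypothetical: the candidates from initial retrieval
# best_chunks = rerank("What was Q3 revenue?", first_stage_top_50)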
HyDE
Hypothetical Document Embeddings: generate a hypothetical answer first, embed that, then search. Improves recall.
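A sketch of HyDE using the OpenAI SDK (the gpt-4o-mini model name is an assumption) and the retrieve() helper from the pipeline sketch above:

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def hyde_search(question: str, chunks, index, embed_model, k: int = 3):
    """Embed a hypothetical ANSWER rather than the question, then search as usual."""
    hypothetical = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {question}"}],
    ).choices[0].message.content
    # A plausible-sounding answer sits closer to real answer chunks in
    # embedding space than a short question does, improving recall.
    return retrieve(hypothetical, chunks, index, embed_model, k=k)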
Query Expansion
Use the LLM to rewrite or expand the query in multiple ways, then search with all variants and merge results.
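A sketch reusing the same client and retrieve() helper from above; the variant count and merge-by-deduplication strategy are illustrative choices:

def expand_and_search(question: str, chunks, index, model,
                      n_variants: int = 3, k: int = 3) -> list[str]:
    """Ask the LLM for paraphrases, search with every variant, merge unique hits."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Rewrite this question {n_variants} different ways, "
                   f"one per line:\n{question}"}],
    ).choices[0].message.content
    variants = [question] + [line.strip() for line in reply.splitlines() if line.strip()]

    merged: list[str] = []
    for variant in variants:
        for chunk in retrieve(variant, chunks, index, model, k=k):
            if chunk not in merged:  # de-duplicate across variants
                merged.append(chunk)
    return merged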
Contextual Compression
After retrieval, use an LLM to extract only the relevant sentences from each chunk before injecting into context.
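LangChain ships this pattern as ContextualCompressionRetriever; a sketch wiring it to the retriever from the example above (the query is illustrative):

from langchain_openai import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# LLMChainExtractor prompts an LLM to keep only the passages relevant to the query
compressor = LLMChainExtractor.from_llm(ChatOpenAI(temperature=0))
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,  # the FAISS retriever from the LangChain example above
)
docs = compression_retriever.invoke("What was Q3 revenue?")  # trimmed-down documents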
GraphRAG
Build a knowledge graph from documents. Enables multi-hop reasoning across entities and relationships.
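A full GraphRAG build extracts entities and relations with an LLM, which is beyond a snippet, but the multi-hop idea can be sketched over a hand-built toy graph with networkx:

import networkx as nx

# Toy graph; in GraphRAG these triples are extracted from documents by an LLM
g = nx.DiGraph()
g.add_edge("Acme Corp", "WidgetCo", relation="acquired")
g.add_edge("WidgetCo", "Jane Doe", relation="founded_by")

# Multi-hop question: "Who founded the company that Acme Corp acquired?"
acquired = [t for _, t, d in g.out_edges("Acme Corp", data=True)
            if d["relation"] == "acquired"]
founders = [t for a in acquired
            for _, t, d in g.out_edges(a, data=True)
            if d["relation"] == "founded_by"]
print(founders)  # ['Jane Doe']: two hops that pure similarity search can miss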
Evaluating RAG Quality
Faithfulness: Is the answer supported by the retrieved context? (No hallucination.)
Answer relevancy: Does the answer actually address the question asked?
Context precision: Are the retrieved chunks actually relevant?
Context recall: Did retrieval find all the relevant information?
Use RAGAS (ragas.io) to automatically evaluate all four metrics using LLM judges.
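A minimal sketch of such an evaluation, assuming ragas's Dataset-based evaluate API and its expected column names (question, answer, contexts, ground_truth); the row shown is illustrative:

from datasets import Dataset  # pip install ragas datasets
from ragas import evaluate
from ragas.metrics import (faithfulness, answer_relevancy,
                           context_precision, context_recall)

# One row per (question, generated answer, retrieved chunks, reference answer)
data = Dataset.from_dict({
    "question": ["What was Q3 revenue?"],
    "answer": ["Revenue was $4.2M, up 23% year-over-year."],
    "contexts": [["According to the Q3 2024 report, revenue was $4.2M..."]],
    "ground_truth": ["Q3 2024 revenue was $4.2M."],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])
print(result)  # per-metric scores between 0 and 1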
Frequently Asked Questions
What chunk size should I use?
200–500 tokens for precise fact retrieval. 500–1000 tokens for complex reasoning. Always add 20% overlap between chunks so context isn't cut off at boundaries. Test different sizes — the optimal depends on your documents and queries.
RAG vs fine-tuning: which should I use?
RAG: when data changes frequently, you need source citations, or data is too large for context. Fine-tuning: when you need to change the model's behaviour or style, or your data is relatively static. Most production systems use both: fine-tune for style, RAG for knowledge.
How do I handle multi-document questions?
Multi-hop RAG: first retrieve relevant documents, then re-query with that context to find additional relevant documents. Or use structured knowledge graphs (GraphRAG) for complex entity relationships. LangGraph's iterative retrieval helps with multi-step reasoning.
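A sketch of the iterative pattern, reusing the retrieve() helper and OpenAI client from the sketches above; the two-hop limit and stopping condition are illustrative:

def multi_hop_answer(question: str, chunks, index, model,
                     hops: int = 2, k: int = 3) -> list[str]:
    """Retrieve, ask the LLM what is still missing, then retrieve again."""
    context: list[str] = []
    query = question
    for _ in range(hops):
        context += retrieve(query, chunks, index, model, k=k)
        followup = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content":
                       f"Question: {question}\n"
                       f"Known so far: {' '.join(context)}\n"
                       "What single missing fact should be looked up next? "
                       "Reply NONE if the question is already answerable."}],
        ).choices[0].message.content
        if followup.strip().upper().startswith("NONE"):
            break
        query = followup  # the follow-up becomes the next search string
    return context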