You build a chatbot. You ask it about your company's internal documentation. It gives you a confident, detailed, completely fabricated answer. Welcome to the two biggest problems with Large Language Models: knowledge cutoffs and hallucinations. RAG (Retrieval-Augmented Generation) is the most widely adopted way to address both, and in 2025 it's one of the most in-demand skills an AI developer can have.
The Problem With LLMs: Knowledge Cutoff and Hallucinations
Every LLM is trained on a snapshot of the internet (or other text data) up to a certain date. The exact date varies by model and version: the original GPT-4 had a cutoff of September 2021, and Claude 3.5 Sonnet's is April 2024. Models genuinely don't know about anything that happened after their cutoff, and they never saw your private data at all: recent earnings calls, new regulations, your company's Q4 strategy, or the bug fix you shipped last Tuesday.
More dangerously, LLMs hallucinate. When asked a question they don't know the answer to, they don't say "I don't know" — they generate plausible-sounding text that may be entirely wrong. This is a fundamental property of how transformer models generate tokens, not a bug that will be patched out. Confidently wrong answers are worse than no answer in most business contexts.
What RAG Is and How It Solves This
RAG stands for Retrieval-Augmented Generation. The idea is simple: before the LLM answers a question, you retrieve relevant context from your own knowledge base and include it in the prompt. The LLM then generates an answer grounded in that retrieved context rather than relying solely on its training data.
Think of it like an open-book exam. Instead of forcing the model to answer from memory (which leads to hallucinations), you hand it the relevant pages from the textbook and ask it to answer based on those. The model's job shifts from "recall" to "comprehension and synthesis" — something it's much better at.
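A minimal sketch of the open-book idea, in plain Python with no LLM call. The three knowledge-base entries, the word-overlap scoring, and the `retrieve` and `build_prompt` helpers are all hypothetical stand-ins; a real pipeline would use embeddings and a vector store, as in the steps that follow.

```python
import re

# Hypothetical three-entry knowledge base; real systems hold thousands of chunks.
KNOWLEDGE_BASE = [
    "Refunds are available within 30 days of purchase.",
    "Employees may work remotely up to three days per week.",
    "The office is closed on public holidays.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k entries that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, k: int = 1) -> str:
    """Place the retrieved text ahead of the question, open-book style."""
    context = "\n".join(retrieve(question, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


print(build_prompt("How many days can employees work remotely?"))
```

The prompt that comes out already contains the remote-work sentence, so the model only has to read and rephrase it rather than recall it.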
The 6-Step RAG Pipeline With Code
Here's a full RAG pipeline implementation using LangChain and ChromaDB:
Step 1: Document Loading
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_community.document_loaders import TextLoader
# Load a single PDF
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
# Or load an entire directory of text files
loader = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()
print(f"Loaded {len(documents)} documents")
Step 2: Chunking
# Splitters now live in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # characters per chunk
    chunk_overlap=200,   # overlap keeps context at boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
Step 3: Embedding
from langchain_openai import OpenAIEmbeddings
# Or use a free local model: HuggingFaceEmbeddings
# from langchain_community.embeddings import HuggingFaceEmbeddings
# embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Test: convert a string to a vector
vector = embeddings.embed_query("What is the refund policy?")
print(f"Embedding dimension: {len(vector)}") # 1536 for text-embedding-3-small
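Once text is a vector, "similar meaning" becomes "small angle between vectors". The sketch below computes cosine similarity by hand; the three-dimensional vectors are made up for illustration (real embeddings from text-embedding-3-small have 1536 dimensions), and `cosine_similarity` is a helper written for this example, not a LangChain function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for real embeddings.
refund_query = [0.9, 0.1, 0.0]
refund_chunk = [0.8, 0.2, 0.1]
holiday_chunk = [0.0, 0.2, 0.9]

print(cosine_similarity(refund_query, refund_chunk))   # close to 1: related
print(cosine_similarity(refund_query, holiday_chunk))  # close to 0: unrelated
```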
Step 4: Vector Storage
from langchain_community.vectorstores import Chroma
# Create vector store and persist to disk
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
# Later: load existing vector store
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
)
Step 5: Retrieval
# Create a retriever that fetches top 4 most similar chunks
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)
# Test retrieval
query = "What is the company's remote work policy?"
relevant_docs = retriever.invoke(query)
for doc in relevant_docs:
    print(doc.page_content[:200])  # preview the first 200 characters of each chunk
    print("---")
Step 6: Generation
from langchain_openai import ChatOpenAI
# RetrievalQA is a legacy chain; newer LangChain releases recommend create_retrieval_chain
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt_template = """Use the following context to answer the question.
If the answer is not in the context, say "I don't have that information."
Never make up an answer.
Context:
{context}
Question: {question}
Answer:"""
prompt = PromptTemplate.from_template(prompt_template)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)
result = qa_chain.invoke({"query": "What is the remote work policy?"})
print(result["result"])
print("\nSources:", [doc.metadata for doc in result["source_documents"]])
Chunking Strategies: Fixed Size vs Recursive
Chunking is more important than most people realise. Poor chunking is the most common reason a RAG system gives bad answers even when the answer is in the documents.
- Fixed-size chunking: Split every N characters with M overlap. Simple to implement, works acceptably for uniform text. Problem: can cut mid-sentence or mid-paragraph, breaking semantic units.
- Recursive character splitting: Tries to split on paragraph breaks first, then line breaks, then sentence breaks, then word breaks, then characters. This is LangChain's recommended splitter for generic text and the right choice for most use cases.
- Semantic chunking: Uses embeddings to find natural break points in meaning. More compute-intensive but produces better chunks for documents with varying structure.
- Document-specific splitters: LangChain provides specialised splitters for Markdown (splits on headers), Python code (splits on functions), and HTML (splits on tags). Use these when your content has clear structure.
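To make the difference between the first two strategies concrete, here is a simplified sketch in plain Python. Both functions are written for this example and are not LangChain's implementation: `recursive_chunks` has no overlap and drops the separator at chunk boundaries, which the real RecursiveCharacterTextSplitter handles more carefully. The sample text is invented.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut every (chunk_size - overlap) characters, ignoring sentence boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def recursive_chunks(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate            # still fits: keep packing pieces together
            continue
        if current:
            chunks.append(current)         # flush the chunk built so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_chunks(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "Remote work is allowed.\n\nRefunds take 30 days. Contact support for help."
fixed = fixed_size_chunks(text, chunk_size=30, overlap=5)
recursive = recursive_chunks(text, chunk_size=30)

print(fixed[0])    # ends mid-word, in "Refun"
print(recursive)   # each chunk is a whole sentence
```

The fixed-size splitter cuts "Refunds" in half, while the recursive one keeps each sentence intact because it only falls back to finer separators when a piece is still too long.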
Choosing a Vector Database: Chroma vs Pinecone
Your choice of vector database depends on scale and deployment context:
- Chroma (local/free): Runs in-process or as a local server. Zero infrastructure required: just pip install chromadb. Perfect for prototypes, personal projects, and small-scale apps under ~100k documents. The persist_directory parameter saves your vectors to disk between runs.
- Pinecone (managed cloud): Fully managed, horizontally scalable to billions of vectors, with built-in metadata filtering, namespaces, and monitoring. Production choice for large-scale apps. Has a free tier sufficient for testing. Pay per vector stored and queries per second.
- Qdrant: Open-source, can be self-hosted or used as a managed service. Excellent filtering capabilities and very fast. Good middle ground between Chroma (too simple) and Pinecone (too expensive for small teams).
- pgvector: If you're already using PostgreSQL, pgvector adds vector similarity search as an extension. Fewer moving parts, unified data model, transaction support. Excellent for apps under ~1M vectors.
Advanced Retrieval: Hybrid Search, MMR, Re-ranking
Basic similarity search works but has limitations. Advanced retrieval techniques can significantly improve result quality:
- Hybrid search: Combines dense retrieval (embeddings) with sparse retrieval (BM25 keyword search). Dense retrieval is great for semantic similarity; sparse is better for exact keyword matches. Combining them with Reciprocal Rank Fusion (RRF) produces better results than either alone.
- MMR (Maximal Marginal Relevance): Instead of returning the top-4 most similar chunks (which might all be nearly identical), MMR balances relevance with diversity — ensuring the retrieved chunks cover different aspects of the topic.
- Re-ranking: After retrieving top-20 candidates with cheap similarity search, use a cross-encoder model (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) to re-rank them and return the best 4. This two-stage approach is more accurate than similarity search alone and far cheaper than scoring every chunk with the cross-encoder.
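Two of these techniques are small enough to sketch in plain Python. The code below is an illustration under stated assumptions, not a library API: the document ids, rankings, and toy vectors are invented, `k=60` is the constant conventionally used with RRF, and `lambda_mult=0.5` weights relevance and diversity equally.

```python
import math
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs: dict, k: int = 2, lambda_mult: float = 0.5) -> list[str]:
    """Greedily pick documents that are relevant to the query but unlike earlier picks."""
    selected = []
    candidates = list(doc_vecs)
    while candidates and len(selected) < k:
        def mmr_score(doc_id):
            relevance = cosine(query_vec, doc_vecs[doc_id])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hybrid search: fuse a dense (embedding) ranking with a sparse (keyword) ranking.
dense_ranking = ["a", "b", "c"]
keyword_ranking = ["c", "a", "d"]
print(reciprocal_rank_fusion([dense_ranking, keyword_ranking]))  # ['a', 'c', 'b', 'd']

# MMR: "policy_copy" is a near-duplicate of "policy", so "faq" is picked second.
query = [1.0, 0.0, 0.0]
docs = {
    "policy": [0.9, 0.3, 0.0],
    "policy_copy": [0.9, 0.31, 0.0],
    "faq": [0.7, -0.3, 0.6],
}
print(mmr(query, docs, k=2))  # ['policy', 'faq']
```

Plain top-2 similarity would have returned "policy" and "policy_copy". MMR trades a little relevance for a second chunk that adds new information.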
Common Mistakes and How to Fix Them
- Chunks too large: Large chunks overwhelm the model with noise. If your retrieved chunks are 2,000+ characters, try reducing to 500–800 and see if answers improve.
- Not enough overlap: If answers are getting cut off at chunk boundaries, increase chunk_overlap from 200 to 300–400.
- Retrieving too many or too few chunks: Start with k=4. If answers miss context, try k=6–8. If answers are unfocused, try k=2–3.
- Not filtering on metadata: If your knowledge base covers multiple topics or time periods, add metadata filters so the retriever only searches relevant subsets.
- No "I don't know" fallback: Always include an explicit instruction in your prompt that if the answer isn't in the retrieved context, the model should say so rather than making something up.
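The last two fixes can be combined in code as well as in the prompt. The sketch below is a hypothetical illustration in plain Python: the `Chunk` class, the `where` filter format, the sample chunks, and their scores are all invented for this example. It filters by metadata before ranking and returns the fallback message directly when nothing survives the filter.

```python
from dataclasses import dataclass, field

FALLBACK = "I don't have that information."


@dataclass
class Chunk:
    text: str
    score: float                              # similarity to the query, higher is better
    metadata: dict = field(default_factory=dict)


def select_context(chunks: list[Chunk], where: dict, k: int = 4) -> list[Chunk]:
    """Keep only chunks whose metadata matches every key in `where`, then take the top k."""
    matching = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in where.items())
    ]
    return sorted(matching, key=lambda c: c.score, reverse=True)[:k]


def context_or_fallback(chunks: list[Chunk], where: dict) -> str:
    """Return the context to send to the LLM, or the fallback if nothing matched."""
    context = select_context(chunks, where)
    if not context:
        return FALLBACK                       # no grounded context, so skip the LLM call
    return "\n".join(c.text for c in context)


# Invented sample data: the outdated 2023 chunk scores slightly higher.
chunks = [
    Chunk("2023 remote policy: two days per week.", 0.82, {"year": 2023}),
    Chunk("2024 remote policy: three days per week.", 0.80, {"year": 2024}),
]

print(context_or_fallback(chunks, {"year": 2024}))
print(context_or_fallback(chunks, {"year": 2022}))
```

Without the filter, the outdated 2023 chunk would rank first on similarity alone.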