RAG Explained: Real Knowledge for Your AI App

You build a chatbot. You ask it about your company's internal documentation. It gives you a confident, detailed, completely fabricated answer. Welcome to the two biggest problems with Large Language Models: knowledge cutoffs and hallucinations. RAG (Retrieval-Augmented Generation) is the most widely used technique for mitigating both, and in 2025 it's one of the most in-demand skills an AI developer can have.

The Problem With LLMs: Knowledge Cutoff and Hallucinations

Every LLM is trained on a snapshot of the internet (or other text data) up to a certain date, its knowledge cutoff. The original GPT-4, for example, was trained on data up to September 2021, and Claude 3.5 Sonnet's cutoff is April 2024. A model genuinely doesn't know about anything that happened after its cutoff or that was never public in the first place: recent earnings calls, new regulations, your company's Q4 strategy, or the bug fix you shipped last Tuesday.

More dangerously, LLMs hallucinate. When asked a question they can't answer, they often don't say "I don't know"; they generate plausible-sounding text that may be entirely wrong. This follows from how these models work: they predict likely next tokens rather than look up verified facts, so it is not a bug that will simply be patched out. In most business contexts, a confidently wrong answer is worse than no answer.

The core problem: LLMs know a lot about the world in general but nothing about your specific data, your documents, or recent events — and they'll make things up rather than admit ignorance.

What RAG Is and How It Solves This

RAG stands for Retrieval-Augmented Generation. The idea is simple: before the LLM answers a question, you retrieve relevant context from your own knowledge base and include it in the prompt. The LLM then generates an answer grounded in that retrieved context rather than relying solely on its training data.

Think of it like an open-book exam. Instead of forcing the model to answer from memory (which leads to hallucinations), you hand it the relevant pages from the textbook and ask it to answer based on those. The model's job shifts from "recall" to "comprehension and synthesis" — something it's much better at.
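The open-book idea can be sketched without any framework. The snippet below is a toy illustration, not the LangChain pipeline: KNOWLEDGE_BASE, tokenize, retrieve, and build_prompt are hypothetical names, and the word-overlap ranking is a stand-in for the embedding search a real system would use.

```python
import re

# Hypothetical mini knowledge base; a real one would hold document chunks.
KNOWLEDGE_BASE = [
    "Refunds are issued within 14 days of purchase.",
    "Employees may work remotely up to three days per week.",
    "The office is closed on public holidays.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k passages sharing the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda passage: len(q_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str) -> str:
    """Augment the question with retrieved context before generation."""
    context = "\n".join(retrieve(question))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


print(build_prompt("How many days can employees work remotely?"))
```

The printed prompt contains the remote-work passage as context. That string, not the bare question, is what gets sent to the model.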

The 6-Step RAG Pipeline With Code

Here's a full RAG pipeline implementation using LangChain and ChromaDB:

Step 1: Document Loading

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_community.document_loaders import TextLoader

# Load a single PDF
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

# Or load an entire directory of text files
loader = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()

print(f"Loaded {len(documents)} documents")

Step 2: Chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # characters per chunk
    chunk_overlap=200,    # overlap keeps context at boundaries
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")

Step 3: Embedding

from langchain_openai import OpenAIEmbeddings

# Or use a free local model: HuggingFaceEmbeddings
# from langchain_community.embeddings import HuggingFaceEmbeddings
# embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Test: convert a string to a vector
vector = embeddings.embed_query("What is the refund policy?")
print(f"Embedding dimension: {len(vector)}")  # 1536 for text-embedding-3-small
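Why vectors help: texts with similar meaning end up with vectors that point in similar directions, and a vector store ranks chunks by a similarity or distance measure between them. Cosine similarity is one common choice. The sketch below uses made-up 3-dimensional vectors in place of real embeddings, and the function name is illustrative rather than part of LangChain.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings.
query_vec = [0.9, 0.1, 0.0]    # "What is the refund policy?"
refund_vec = [0.8, 0.2, 0.1]   # a chunk about refunds
holiday_vec = [0.0, 0.2, 0.9]  # a chunk about public holidays

print(cosine_similarity(query_vec, refund_vec))   # close to 1
print(cosine_similarity(query_vec, holiday_vec))  # close to 0
```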

Step 4: Vector Storage

from langchain_community.vectorstores import Chroma

# Create vector store and persist to disk
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Later: load existing vector store
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

Step 5: Retrieval

# Create a retriever that fetches top 4 most similar chunks
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

# Test retrieval
query = "What is the company's remote work policy?"
relevant_docs = retriever.invoke(query)
for doc in relevant_docs:
    print(doc.page_content[:200])
    print("---")

Step 6: Generation

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt_template = """Use the following context to answer the question.
If the answer is not in the context, say "I don't have that information."
Never make up an answer.

Context:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate.from_template(prompt_template)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is the remote work policy?"})
print(result["result"])
print("\nSources:", [doc.metadata for doc in result["source_documents"]])

Chunking Strategies: Fixed Size vs Recursive

Chunking is more important than most people realise. Poor chunking is the most common reason a RAG system gives bad answers even when the answer is in the documents.

Fixed-size chunking: Split every N characters with M overlap. Simple to implement, works acceptably for uniform text. Problem: can cut mid-sentence or mid-paragraph, breaking semantic units.
Recursive character splitting: Tries to split on paragraph breaks first, then line breaks, then sentence breaks, then word breaks, then individual characters. This is the splitter LangChain recommends for generic text and the right choice for most use cases.
Semantic chunking: Uses embeddings to find natural break points in meaning. More compute-intensive but produces better chunks for documents with varying structure.
Document-specific splitters: LangChain provides specialised splitters for Markdown (splits on headers), Python code (splits on functions), and HTML (splits on tags). Use these when your content has clear structure.

Rule of thumb: chunk_size=1000 with chunk_overlap=200 is a reasonable starting point. Adjust based on your documents' natural paragraph length and your embedding model's token limit (note that chunk_size here counts characters, not tokens).
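To make the size and overlap parameters concrete, here is a minimal sketch of fixed-size chunking. The chunk_fixed function is a hypothetical helper written for illustration, not LangChain's splitter, and the tiny sizes are chosen only so the overlap is easy to see.

```python
def chunk_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunk_size-character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_fixed(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one. The same mechanism is what lets a sentence that straddles a boundary appear whole in at least one chunk.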

Choosing a Vector Database: Chroma vs Pinecone

Your choice of vector database depends on scale and deployment context:

Chroma (local/free): Runs in-process or as a local server. Zero infrastructure required — just pip install chromadb. Perfect for prototypes, personal projects, and small-scale apps under ~100k documents. The persist_directory parameter saves your vectors to disk between runs.
Pinecone (managed cloud): Fully managed, horizontally scalable to billions of vectors, with built-in metadata filtering, namespaces, and monitoring. Production choice for large-scale apps. Has a free tier sufficient for testing. Pricing is usage-based, so costs grow with stored data and query volume.
Qdrant: Open-source, can be self-hosted or used as a managed service. Excellent filtering capabilities and very fast. Good middle ground when Chroma is too simple for your needs and Pinecone is too expensive for a small team.
pgvector: If you're already using PostgreSQL, pgvector adds vector similarity search as an extension. Fewer moving parts, unified data model, transaction support. Excellent for apps under ~1M vectors.

Advanced Retrieval: Hybrid Search, MMR, Re-ranking

Basic similarity search works but has limitations. Advanced retrieval techniques can significantly improve result quality:

Hybrid search: Combines dense retrieval (embeddings) with sparse retrieval (BM25 keyword search). Dense retrieval is great for semantic similarity; sparse is better for exact keyword matches such as product codes or error messages. Merging the two result lists with Reciprocal Rank Fusion (RRF) often produces better results than either alone.
MMR (Maximal Marginal Relevance): Instead of returning the top-4 most similar chunks (which might all be nearly identical), MMR balances relevance with diversity — ensuring the retrieved chunks cover different aspects of the topic.
Re-ranking: After retrieving the top 20 candidates with cheap similarity search, use a cross-encoder model (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) to re-rank them and return the best 4. The extra stage adds some latency, but it is usually more accurate than similarity search alone and far cheaper than running the cross-encoder over every chunk in the knowledge base.
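The first two techniques are small enough to sketch in plain Python. Everything below is illustrative: the function names, document IDs, and similarity numbers are made up, k=60 is simply a commonly used RRF constant, and in practice you would normally rely on your vector store's built-in hybrid and MMR options rather than hand-rolled versions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def mmr_select(
    query_sim: dict[str, float],
    doc_sim: dict[frozenset[str], float],
    k: int,
    lambda_mult: float = 0.5,
) -> list[str]:
    """Greedily pick k documents, trading relevance against redundancy."""
    selected: list[str] = []
    remaining = set(query_sim)
    while remaining and len(selected) < k:
        def mmr_score(doc_id: str) -> float:
            # Redundancy = highest similarity to anything already selected.
            redundancy = max(
                (doc_sim.get(frozenset((doc_id, s)), 0.0) for s in selected),
                default=0.0,
            )
            return lambda_mult * query_sim[doc_id] - (1 - lambda_mult) * redundancy

        best = max(sorted(remaining), key=mmr_score)  # sorted: stable tie-breaks
        selected.append(best)
        remaining.remove(best)
    return selected


# Hybrid search: fuse a dense (embedding) ranking with a sparse (BM25) ranking.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_c", "doc_a", "doc_d"]
print(reciprocal_rank_fusion([dense, sparse]))
# ['doc_a', 'doc_c', 'doc_b', 'doc_d']

# MMR: "a" and "a_dup" are near-duplicates, so MMR prefers "b" as the second pick.
query_sim = {"a": 0.90, "a_dup": 0.88, "b": 0.60}
doc_sim = {
    frozenset(("a", "a_dup")): 0.95,
    frozenset(("a", "b")): 0.20,
    frozenset(("a_dup", "b")): 0.20,
}
print(mmr_select(query_sim, doc_sim, k=2))
# ['a', 'b']
```

Plain top-2 similarity on the same numbers would return "a" and "a_dup", which is exactly the near-duplicate problem MMR is meant to avoid.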

Common Mistakes and How to Fix Them

Chunks too large: Large chunks overwhelm the model with noise. If your retrieved chunks are 2,000+ characters, try reducing to 500–800 and see if answers improve.
Not enough overlap: If answers are getting cut off at chunk boundaries, increase chunk_overlap from 200 to 300–400.
Retrieving too many or too few chunks: Start with k=4. If answers miss context, try k=6–8. If answers are unfocused, try k=2–3.
Not filtering on metadata: If your knowledge base covers multiple topics or time periods, add metadata filters so the retriever only searches relevant subsets.
No "I don't know" fallback: Always include an explicit instruction in your prompt that if the answer isn't in the retrieved context, the model should say so rather than making something up.
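Metadata filtering is the least obvious fix in this list, so here is a minimal sketch of the idea. ScoredChunk and filtered_top_k are hypothetical names, the scores are invented, and a real vector store would usually apply the filter inside its own query rather than in Python.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredChunk:
    text: str
    score: float  # similarity to the query, assumed already computed
    metadata: dict[str, str] = field(default_factory=dict)


def filtered_top_k(
    chunks: list[ScoredChunk], where: dict[str, str], k: int
) -> list[ScoredChunk]:
    """Keep chunks whose metadata matches every key in `where`, then rank by score."""
    matching = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in where.items())
    ]
    return sorted(matching, key=lambda c: c.score, reverse=True)[:k]


chunks = [
    ScoredChunk("2023 remote work policy", 0.91, {"source": "handbook", "year": "2023"}),
    ScoredChunk("2025 remote work policy", 0.85, {"source": "handbook", "year": "2025"}),
    ScoredChunk("Blog post about remote work", 0.80, {"source": "blog", "year": "2025"}),
]

for chunk in filtered_top_k(chunks, where={"source": "handbook", "year": "2025"}, k=2):
    print(chunk.text)
# 2025 remote work policy
```

Without the filter, the outdated 2023 policy has the highest score and would be handed to the model first.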

The key insight: retrieval quality usually matters more than model choice. As a rough heuristic (not a measured figure), RAG quality is about 60% chunking and retrieval strategy, 30% prompt, and only 10% which LLM you use. Invest time getting the retrieval right before tuning the generation step.

Build Production RAG Applications

Our Generative AI course walks you through building complete RAG systems from document loading to deployed application — with real code, real vector databases, and real-world projects.


Pal C

AI Engineer & Full-Stack Developer

Software engineer and AI specialist with 8+ years of experience. Has taught 500+ students from 15+ countries.
