You build a chatbot. You ask it about your company's internal documentation. It gives you a confident, detailed, completely fabricated answer. Welcome to the two biggest problems with Large Language Models: knowledge cutoffs and hallucinations. RAG (Retrieval-Augmented Generation) is the standard way to mitigate both, and in 2025 it's one of the most in-demand skills an AI developer can have.

The Problem With LLMs: Knowledge Cutoff and Hallucinations

Every LLM is trained on a snapshot of the internet (or other text data) up to a certain date. The original GPT-4's training data ends in September 2021; Claude 3.5 Sonnet's ends in April 2024. This means they genuinely don't know about things that happened after that date, or that were never public in the first place: recent earnings calls, new regulations, your company's Q4 strategy, or the bug fix you shipped last Tuesday.

More dangerously, LLMs hallucinate. When asked a question they don't know the answer to, they often don't say "I don't know"; they generate plausible-sounding text that may be entirely wrong. This follows from how these models work: they produce the most likely next token, with no built-in check against facts, so it is not a bug that will simply be patched out. Confidently wrong answers are worse than no answer in most business contexts.

The core problem: LLMs know a lot about the world in general but nothing about your specific data, your documents, or recent events — and they'll make things up rather than admit ignorance.

What RAG Is and How It Solves This

RAG stands for Retrieval-Augmented Generation. The idea is simple: before the LLM answers a question, you retrieve relevant context from your own knowledge base and include it in the prompt. The LLM then generates an answer grounded in that retrieved context rather than relying solely on its training data.

Think of it like an open-book exam. Instead of forcing the model to answer from memory (which leads to hallucinations), you hand it the relevant pages from the textbook and ask it to answer based on those. The model's job shifts from "recall" to "comprehension and synthesis" — something it's much better at.
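
To make the open-book idea concrete, here is a minimal, dependency-free sketch of the retrieve-then-prompt flow. Everything in it is an illustrative assumption: the three-entry `KNOWLEDGE_BASE`, the word-overlap scoring (a stand-in for the embedding search used later), and the helper names `tokenize`, `retrieve`, and `build_prompt`. No LLM is called; the sketch stops at the augmented prompt.

```python
import re

# Hypothetical mini knowledge base; a real system would hold document chunks.
KNOWLEDGE_BASE = [
    "Employees may work remotely up to three days per week.",
    "Refunds are issued within 14 days of purchase.",
    "The office is closed on public holidays.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k passages sharing the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda passage: len(q_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"


question = "How many days can employees work remotely?"
prompt = build_prompt(question, retrieve(question))
print(prompt)
```

The printed prompt is what would be sent to the model: the question arrives together with the passage that answers it, so the model reads rather than recalls.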

The 6-Step RAG Pipeline With Code

Here's a full RAG pipeline implementation using LangChain and ChromaDB:

Step 1: Document Loading

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader

# Option A: load a single PDF (one Document per page; requires the pypdf package)
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

# Option B: load an entire directory of text files instead
# (this replaces the `documents` loaded above; use one option or the other)
loader = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()

print(f"Loaded {len(documents)} documents")

Step 2: Chunking

# In recent LangChain releases the splitters live in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # characters per chunk
    chunk_overlap=200,    # overlap keeps context at boundaries
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
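
To see what `chunk_size` and `chunk_overlap` mean mechanically, the sketch below implements a plain fixed-size sliding window. It is a simplified illustration, not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on the listed separators); the function name `fixed_size_chunks` and the alphabet sample are assumptions made for this example.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = fixed_size_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

With `chunk_size=1000` and `chunk_overlap=200`, the window advances 800 characters at a time, so each chunk repeats the last 200 characters of the previous one. That repetition is what keeps a sentence that straddles a boundary intact in at least one chunk.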

Step 3: Embedding

from langchain_openai import OpenAIEmbeddings

# Or use a free local model: HuggingFaceEmbeddings
# from langchain_community.embeddings import HuggingFaceEmbeddings
# embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Test: convert a string to a vector
vector = embeddings.embed_query("What is the refund policy?")
print(f"Embedding dimension: {len(vector)}")  # 1536 for text-embedding-3-small

Step 4: Vector Storage

# langchain_community.vectorstores.Chroma is deprecated in favour of the langchain-chroma package
from langchain_chroma import Chroma

# Create vector store and persist to disk
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Later: load existing vector store
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

Step 5: Retrieval

# Create a retriever that fetches top 4 most similar chunks
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

# Test retrieval
query = "What is the company's remote work policy?"
relevant_docs = retriever.invoke(query)
for doc in relevant_docs:
    print(doc.page_content[:200])
    print("---")
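
Conceptually, a similarity retriever compares the query vector with the stored chunk vectors and keeps the closest `k`. The sketch below shows that idea with cosine similarity, one common choice of metric, over hand-made 3-dimensional vectors. The vectors, the chunk labels, and the function names `cosine_similarity` and `top_k` are illustrative assumptions, and real vector stores typically use approximate indexes rather than this brute-force loop.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the k (label, score) pairs most similar to query_vec."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings": hand-made vectors, not output from a real model.
store = {
    "remote work policy": [0.9, 0.1, 0.0],
    "refund policy": [0.1, 0.9, 0.0],
    "holiday calendar": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of a remote-work question

results = top_k(query_vec, store, k=2)
for label, score in results:
    print(f"{label}: {score:.3f}")
```

The chunk whose vector points in nearly the same direction as the query ranks first; raising `k` trades precision for a better chance of including the chunk that holds the answer.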

Step 6: Generation

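from langchain_openai import ChatOpenAI
# RetrievalQA is a legacy chain; newer LangChain releases recommend create_retrieval_chain or LCEL
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate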

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt_template = """Use the following context to answer the question.
If the answer is not in the context, say "I don't have that information."
Never make up an answer.

Context:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate.from_template(prompt_template)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is the remote work policy?"})
print(result["result"])
print("\nSources:", [doc.metadata for doc in result["source_documents"]])

Chunking Strategies: Fixed Size vs Recursive

Chunking is more important than most people realise. Poor chunking is one of the most common reasons a RAG system gives bad answers even when the answer is in the documents.

Rule of thumb: chunk_size=1000, chunk_overlap=200 is a reasonable starting point for most documents. Adjust based on your documents' natural paragraph length and your embedding model's token limit, keeping in mind that chunk_size here is measured in characters while model limits are measured in tokens.
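
The difference between fixed-size and recursive chunking can be sketched in a few lines. Fixed-size chunking cuts every N characters regardless of content, while recursive chunking tries the coarsest separator first (paragraphs) and falls back to finer ones only for pieces that are still too long. The code below is a simplified illustration of that idea, not LangChain's actual implementation: it ignores overlap, drops the separators, and does not merge small pieces back together. The names `fixed_split` and `recursive_split` and the sample `doc` are assumptions for this example.

```python
def fixed_split(text: str, size: int) -> list[str]:
    """Cut every `size` characters, even in the middle of a word."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def recursive_split(text: str, size: int, separators=("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones if a piece is too long."""
    if len(text) <= size:
        return [text] if text.strip() else []
    if not separators:
        return fixed_split(text, size)  # last resort: hard cut
    head, *rest = separators
    pieces = []
    for part in text.split(head):
        pieces.extend(recursive_split(part, size, tuple(rest)))
    return pieces


doc = "Refunds take 14 days. Contact support first.\n\nRemote work is allowed three days a week."

print(fixed_split(doc, 40))      # first chunk ends mid-word
print(recursive_split(doc, 45))  # one chunk per paragraph
```

On this sample the fixed-size split breaks the word "first" across two chunks, while the recursive split returns each paragraph whole. That is the main reason recursive splitting tends to produce chunks that embed and retrieve more cleanly.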

Choosing a Vector Database: Chroma vs Pinecone

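Your choice of vector database depends on scale and deployment context. Chroma runs locally and persists to disk, as in Step 4, which suits prototypes and smaller projects; Pinecone is a managed cloud service aimed at larger production deployments.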

Advanced Retrieval: Hybrid Search, MMR, Re-ranking

Basic similarity search works but has limitations. Advanced retrieval techniques such as hybrid search, MMR, and re-ranking can significantly improve result quality.
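
MMR (Maximal Marginal Relevance) is the easiest of these techniques to show in isolation. It picks results one at a time, rewarding similarity to the query and penalising similarity to results already chosen, so near-duplicate chunks do not crowd out the top `k`. The sketch below is a minimal pure-Python version over hand-made 2-dimensional vectors; the data, the `lambda_mult` default of 0.5, and the function names `cosine` and `mmr` are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(query_vec, candidates, k, lambda_mult=0.5):
    """Select k labels, balancing relevance to the query against redundancy."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def score(label):
            relevance = cosine(query_vec, remaining[label])
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine(remaining[label], candidates[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy vectors: two near-duplicate policy chunks and one related but distinct chunk.
candidates = {
    "policy v1": [1.0, 0.0],
    "policy v2 (near-duplicate)": [0.99, 0.05],
    "equipment guide": [0.6, 0.8],
}
query_vec = [1.0, 0.1]

print(mmr(query_vec, candidates, k=2))                   # diversity-aware
print(mmr(query_vec, candidates, k=2, lambda_mult=1.0))  # pure relevance
```

With `lambda_mult=1.0` the redundancy penalty vanishes and both near-duplicates are returned; at 0.5 the second slot goes to the distinct chunk instead. In LangChain, the retriever from Step 5 supports the same behaviour via `search_type="mmr"` with `search_kwargs` such as `{"k": 4, "fetch_k": 20, "lambda_mult": 0.5}`.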

Common Mistakes and How to Fix Them

The key insight: as a rough rule of thumb, RAG quality is about 60% chunking and retrieval strategy, 30% prompt, and only 10% choice of LLM. Invest time getting the retrieval right before tuning the generation step.

Build Production RAG Applications

Our Generative AI course walks you through building complete RAG systems from document loading to deployed application — with real code, real vector databases, and real-world projects.


Pal C

AI Engineer & Full-Stack Developer

Software engineer and AI specialist with 8+ years of experience. Has taught 500+ students from 15+ countries.