What Is RAG?
Retrieval-augmented generation (RAG) is a pattern where you combine a language model with an external knowledge source. When a user asks a question, the system first searches a document store for relevant passages, then feeds those passages to the LLM as context alongside the question. The model generates its answer grounded in the retrieved content.
This is useful because LLMs have a fixed training cutoff and a limited context window. RAG lets you bring in up-to-date, domain-specific information at query time — company docs, research papers, product manuals, legal contracts — and get answers that reference that material directly.
How the Pipeline Works
A RAG system has two main phases: ingestion (preparing your documents) and retrieval + generation (answering queries).
Ingestion
- Load documents — pull in your source material: PDFs, web pages, markdown files, database records. Libraries like LangChain and LlamaIndex have loaders for most formats.
- Chunk the text — split each document into smaller pieces, typically 200–1000 tokens. The chunk size matters: too large and you dilute relevance, too small and you lose context.
- Generate embeddings — run each chunk through an embedding model (e.g. all-MiniLM-L6-v2, OpenAI's text-embedding-3-small) to produce a dense vector that captures semantic meaning.
- Store in a vector database — save the vectors and their associated text in a vector store like ChromaDB, Pinecone, FAISS, or pgvector. This is your searchable knowledge base.
Retrieval + Generation
- Embed the query — when a user asks a question, embed it using the same model you used for the documents.
- Search for relevant chunks — perform a similarity search (cosine similarity or approximate nearest neighbors) against your vector store. Retrieve the top-k most relevant chunks.
- Build the prompt — construct a prompt that includes the retrieved chunks as context, along with the user's question and any system instructions.
- Generate the answer — send the prompt to the LLM. The model uses the provided context to produce a grounded response.
```python
# Simplified RAG query flow
query = "What are the key risk factors in the 2024 annual report?"

# 1. Embed the query with the same model used at ingestion
query_vector = embedding_model.encode(query)

# 2. Retrieve the top-5 most relevant chunks
results = vector_store.similarity_search(query_vector, k=5)

# 3. Build a prompt with the retrieved context
context = "\n\n".join(r.text for r in results)
prompt = f"Based on the following context, answer the question.\n\n{context}\n\nQuestion: {query}"

# 4. Generate a grounded answer
answer = llm.generate(prompt)
```
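The similarity search in step 2 is simple enough to write by hand for small corpora. A minimal sketch, assuming the store is a list of `(vector, text)` pairs like the one built during ingestion:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, store, k=5):
    """store: list of (vector, text) pairs; returns the k most similar as (score, text)."""
    scored = [(cosine_similarity(query_vec, vec), text) for vec, text in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Production vector stores replace this exact linear scan with approximate nearest-neighbor indexes, which trade a little accuracy for much faster search over millions of vectors.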
Key Decisions When Building a RAG System
Chunking Strategy
Chunking is one of the most impactful decisions. A straightforward approach is fixed-size chunks with some overlap (e.g. 500 tokens with 50-token overlap). The overlap ensures that sentences split across chunk boundaries still appear in at least one chunk.
For structured documents, semantic chunking works better — splitting on section headers, paragraphs, or logical boundaries. This preserves the natural units of the document and tends to produce more coherent retrieval results.
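A minimal sketch of the semantic variant for markdown-style documents, assuming header lines (starting with `#`) mark the logical boundaries; real splitters in LangChain and LlamaIndex handle more formats and nesting:

```python
def semantic_chunks(text):
    """Split a markdown-style document on section headers, keeping each
    header together with its body as one chunk."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

In practice you would also cap each semantic chunk at a maximum token length, falling back to fixed-size splitting for oversized sections.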
Embedding Model
The embedding model determines how well your search captures meaning.
Smaller models like all-MiniLM-L6-v2 are fast and run locally. Larger models like OpenAI's embedding APIs or e5-large-v2 capture more nuance. The trade-off is speed and cost versus retrieval quality.
One important detail: use the same embedding model for both ingestion and querying. Vectors from different models live in different vector spaces and produce meaningless similarity scores when compared.
How Many Chunks to Retrieve
Retrieving too few chunks means you might miss relevant information. Retrieving too many floods the context window with marginally relevant text, which can confuse the model or push out the truly relevant passages.
A good starting point is 3–5 chunks. From there, tune based on your use case. You can also add a re-ranking step: retrieve a larger set (say 20), then use a cross-encoder to score each chunk against the query and keep the top 5.
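The retrieve-then-rerank pattern can be sketched as below. `word_overlap` is a toy stand-in scorer for illustration; a real system would use a cross-encoder model (e.g. sentence-transformers' `CrossEncoder`) to score each (query, chunk) pair:

```python
def rerank(query, candidates, score_fn, keep=5):
    """Re-score first-stage candidates with a more expensive scorer, keep the best.

    candidates: chunk texts from the first-stage retriever (e.g. top-20)
    score_fn:   scores a (query, chunk) pair; stands in for a cross-encoder
    """
    scored = sorted(candidates, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return scored[:keep]

def word_overlap(query, chunk):
    """Toy scorer: count of words shared between query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))
```

The two-stage design works because the bi-encoder search is cheap enough to scan the whole store, while the cross-encoder is accurate enough to be worth running on only the shortlist.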
Vector Store
For prototypes and small datasets, ChromaDB or FAISS work well — they run locally with minimal setup. For production with larger datasets, managed services like Pinecone or Weaviate handle scaling and persistence. If you're already using PostgreSQL, pgvector lets you store embeddings alongside your relational data in a single database.
What I Learned Building One
I built a RAG system for my Financial Risk Copilot project, where users can ask natural language questions about 10-K filings and annual reports. Here are the things that stood out:
Chunk size had the biggest impact on answer quality
I started with 1000-token chunks and the answers were vague — the retrieved context had relevant keywords but too much surrounding noise. Dropping to 400-token chunks with 50-token overlap gave much more focused results. The LLM could zero in on the specific passage that answered the question.
The prompt template matters more than you'd expect
Small changes in how you present the context to the LLM make a real difference. Adding instructions like "Only answer based on the provided context" and "If the context doesn't contain the answer, say so" dramatically reduced hallucination. Numbering the source chunks and asking the model to cite which chunk it used made answers more traceable.
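One way to combine those instructions with numbered chunks; the exact wording here is illustrative, not a canonical template:

```python
def build_prompt(question, chunks):
    """Number the retrieved chunks and ask the model to cite which one it used."""
    numbered = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using ONLY the numbered context below. "
        "Cite the chunk number(s) you used, e.g. [2]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )
```

The numbering does double duty: it makes answers traceable and lets you check after the fact which retrieved chunks actually contributed.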
Evaluation is the hard part
Building the pipeline is relatively fast with LangChain and ChromaDB. The harder work is evaluating whether the system actually gives good answers. I built a small test set of question-answer pairs from the documents and scored retrieval recall (did the right chunk appear in the top-k?) and answer correctness (did the LLM produce an accurate response?). Automated evaluation with an LLM-as-judge helped scale this.
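The retrieval-recall half of that evaluation is simple to score. A sketch, where the test set and retriever interface are assumptions about how you structure your own harness:

```python
def retrieval_recall_at_k(test_set, retrieve_fn, k=5):
    """Fraction of questions whose gold chunk id appears in the top-k retrieved ids.

    test_set:    list of (question, gold_chunk_id) pairs
    retrieve_fn: maps a question to an ordered list of retrieved chunk ids
    """
    hits = sum(
        1 for question, gold_id in test_set
        if gold_id in retrieve_fn(question)[:k]
    )
    return hits / len(test_set)
```

Answer correctness is harder to automate, which is where the LLM-as-judge step comes in: the judge compares the generated answer against the reference answer rather than against the chunks.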
Local embeddings kept things simple
Using all-MiniLM-L6-v2 from sentence-transformers meant everything ran locally with zero API costs during development. The retrieval quality was good enough for my use case. For a production system with more diverse documents, upgrading to a larger embedding model would be worth benchmarking.
When RAG Makes Sense
RAG is a strong fit when you have a specific knowledge base that changes over time and you want an LLM to answer questions grounded in that data. Document QA, customer support over product docs, internal knowledge bases, and research assistants are all natural use cases.
For tasks where the knowledge is static and well-bounded, fine-tuning the model on that data can also work. RAG and fine-tuning complement each other — you can fine-tune for tone and reasoning patterns while using RAG for factual grounding. They solve different parts of the problem.
Resources
- LangChain documentation — the most popular orchestration framework for RAG
- Sentence Transformers — local embedding models
- ChromaDB — lightweight vector store, great for getting started