RAG (Retrieval-Augmented Generation) is how you make an LLM answer using your documents instead of guessing. Done wrong, it confidently hallucinates. Done right, it behaves like a searchable, cited assistant. This guide shows the exact levers that move reliability: chunking, embeddings, retrieval, reranking, and evaluation.
Quickstart: make your RAG noticeably better in 60 minutes
If your chatbot “sometimes uses the docs, sometimes makes things up”, start here. These are the fastest changes with the biggest impact.
1) Fix chunking (the #1 silent killer)
Most “RAG failures” are actually retrieval failures caused by chunks that are too big, too small, or missing context.
- Chunk by meaning (sections, headings), not just characters
- Target 250–600 tokens per chunk (start here, then tune)
- Add overlap (10–20%) so boundary info isn’t lost
- Store the title + section header inside each chunk
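The recipe above can be sketched in a few lines. This is a minimal, illustrative splitter (not a production tokenizer): it assumes markdown-style `#` headings, approximates tokens with whitespace words, and prefixes every chunk with the doc title and section heading so each chunk is self-contained.

```python
import re

def chunk_by_headings(doc_title, markdown, max_words=400, overlap_ratio=0.15):
    """Split a markdown document into heading-aware chunks.

    Each chunk keeps the doc title and section heading so it stays
    understandable when retrieved alone. Word counts stand in for
    tokens here; swap in a real tokenizer for production use.
    """
    chunks = []
    # Split on markdown headings, keeping the heading text itself.
    sections = re.split(r"(?m)^(#{1,6} .+)$", markdown)
    heading = "Introduction"
    for part in sections:
        part = part.strip()
        if not part:
            continue
        if re.match(r"^#{1,6} ", part):
            heading = part.lstrip("# ").strip()
            continue
        words = part.split()
        overlap = int(max_words * overlap_ratio)
        start = 0
        while start < len(words):
            body = " ".join(words[start : start + max_words])
            chunks.append(f"{doc_title} — Section: {heading}\n{body}")
            if start + max_words >= len(words):
                break
            start += max_words - overlap  # overlap so boundary info isn't lost
    return chunks
```

In practice you would tune `max_words` and `overlap_ratio` against Recall@K on real questions, not pick them once.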
2) Add metadata + filtering
Metadata makes retrieval precise. Without it, “vector search” often returns the wrong page that sounds similar.
- Store `source`, `url`, `doc_id`, `section`, and `date`
- Use filters like `doc_id=…` or `product=…` when the query allows
- Keep a stable citation key (so you can display sources)
- Prefer “few clean docs” over “everything dumped in one index”
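To make the filtering idea concrete, here is a minimal in-memory sketch (your vector store will have its own filter syntax; the field names follow the metadata list above):

```python
def filter_chunks(chunks, **required):
    """Keep only chunks whose metadata matches every required field.

    Applying hard filters (doc_id, product, version) before or
    alongside vector scoring prevents 'similar-sounding but wrong
    doc' retrievals.
    """
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in required.items())
    ]

index = [
    {"id": "a", "metadata": {"doc_id": "handbook_v3", "section": "Refund Policy"}},
    {"id": "b", "metadata": {"doc_id": "handbook_v2", "section": "Refund Policy"}},
]
current = filter_chunks(index, doc_id="handbook_v3")  # drops the outdated version
```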
3) Add a reranker (cheap accuracy boost)
Vector search is fast but imperfect. Reranking re-sorts candidates to match the user’s exact question.
- Retrieve top K=20–50 quickly
- Rerank down to N=4–8 best chunks
- Feed only those into the LLM
- Track hit-rate (did the right chunk appear in top N?)
4) Add “answerability” guardrails
The model should say “I don’t know” when retrieval is weak. This is how you stop confident hallucinations.
- Require citations for factual answers (“no sources → no claim”)
- Use a minimum relevance threshold (score or rerank gap)
- When weak: ask a clarifying question or suggest where to look
- Log “no-answer” cases for dataset improvement
If the bot can’t cite at least one retrieved chunk that contains the answer, it should not present the answer as fact.
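A minimal answerability gate might look like this. The threshold value is illustrative, not a recommendation; tune it on your own logged no-answer cases:

```python
def may_answer(reranked, min_score=0.5):
    """Decide whether to answer, given reranked (chunk, score) pairs.

    If no chunk clears the relevance threshold, the bot should
    decline, ask a clarifying question, or suggest where to look,
    rather than present a guess as fact.
    """
    if not reranked or reranked[0][1] < min_score:
        return False, "retrieval too weak: say 'I don't know' or ask a clarifying question"
    return True, "answer, citing the chunks that passed the threshold"
```

Remember to log every `False` result: those queries are exactly the dataset you need for improving chunking and coverage.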
Overview: what RAG is (and why it beats “just prompt it”)
A normal LLM answers from its training data. That’s great for general knowledge, but bad for your internal docs, product policies, prices, or anything that changes. RAG solves this by retrieving relevant context first, then generating an answer using that context.
The RAG pipeline in one sentence
User question → retrieve relevant chunks → (optional) rerank → generate answer with citations → log & improve
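That one-sentence pipeline maps directly onto a small orchestration function. Here `retrieve`, `rerank`, `generate`, and `log` are injected stand-ins for your vector store, reranker, LLM call, and logger (the names are illustrative, not a specific library's API):

```python
def answer_question(question, retrieve, rerank, generate, log, top_k=30, top_n=6):
    """One pass through the RAG pipeline: retrieve → rerank → generate → log."""
    candidates = retrieve(question, k=top_k)       # wide retrieve for recall
    best = rerank(question, candidates)[:top_n]    # narrow rerank for precision
    answer = generate(question, context=best)      # grounded generation with citations
    log(question, candidates, best, answer)        # feed the improvement loop
    return answer
```

Keeping the stages as separate functions makes it easy to evaluate and swap each one independently, which is exactly how the rest of this guide debugs RAG.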
What RAG is good for
| Use case | Why RAG fits | Typical “gotcha” |
|---|---|---|
| Company / internal knowledge base | Answers must match current docs | Docs are messy → chunking matters |
| Product support chatbot | Need citations to reduce “made-up” fixes | Outdated versions in index |
| Legal / policy Q&A | Must cite sources, avoid guessing | Ambiguity → needs clarifying questions |
| Developer docs assistant | Search + explain with examples | Code blocks need chunk-aware splitting |
- RAG is best when facts change, and you need citations.
- Fine-tuning is best for style, formatting, classification, and consistent behavior.
- Many teams do both: fine-tune for behavior, RAG for knowledge.
Core concepts: the parts that actually control quality
1) Embeddings: how text becomes searchable
Embeddings convert text into vectors (numbers). Similar meanings end up near each other in vector space, so a question can retrieve related passages even if it doesn’t share the same keywords.
When embeddings shine
- Queries that paraphrase doc language
- Messy natural questions
- “What’s the process for…?” style questions
When embeddings struggle
- Exact IDs, part numbers, SKUs
- Tables with tiny cells
- Highly structured data (use filters or SQL)
2) Chunking: the real secret of RAG
The model can only use what you retrieve. Chunking decides what’s retrievable. If a chunk is missing the header, the “who/what/where” context may disappear. If it’s too large, retrieval becomes noisy.
A strong default chunk recipe
- Split by headings/sections first
- Then sub-split long sections to ~250–600 tokens
- Include: doc title + section heading + content inside each chunk
- Add overlap 10–20%
3) Retrieval: getting the right context, not “similar vibes”
Retrieval is a ranking problem: you want the best chunks in the top few results. Most RAG systems fail because the right chunk exists but doesn’t appear in the top N.
Practical target metrics
Track Recall@K (did the right chunk appear in top K?) and MRR (how high did it rank?). This is more actionable than “the bot seems smarter.”
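Both metrics are short enough to compute by hand. This sketch assumes you have, per query, the ranked list of retrieved chunk IDs and a labeled set of relevant chunk IDs:

```python
def recall_at_k(results, relevant, k):
    """Fraction of queries with at least one relevant chunk in the top k.

    results:  list of ranked chunk-ID lists, one per query
    relevant: list of sets of relevant chunk IDs, one per query
    """
    hits = sum(1 for r, rel in zip(results, relevant) if rel & set(r[:k]))
    return hits / len(results)

def mrr(results, relevant):
    """Mean reciprocal rank of the first relevant chunk per query."""
    total = 0.0
    for r, rel in zip(results, relevant):
        for rank, chunk_id in enumerate(r, start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)
```

If Recall@K is high but MRR is low, the right chunks exist in the candidate pool but rank poorly; that is the case a reranker fixes.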
4) Generation: make the LLM follow the evidence
Even with perfect retrieval, the model can still ignore context. Your generation prompt should force: cite sources, quote specific lines when needed, and admit uncertainty when the docs don’t contain the answer.
A common failure mode: the right chunk was retrieved, but the model answered from “general knowledge” instead. Fix this with a stronger system instruction, required citations, and a shorter context (rerank → top N).
Step-by-step: build a RAG system that behaves reliably
This section is intentionally practical. You can implement it with most stacks (LangChain, LlamaIndex, custom code). The order matters: start with data quality and retrieval quality before “prompt engineering.”
Step 1 — Prepare your documents
Do this before chunking
- Remove duplicate docs and outdated versions (or tag them clearly)
- Normalize whitespace and fix broken headings
- Keep links, titles, and source info (you’ll need citations)
- Decide what should be searchable (some content shouldn’t be indexed)
Step 2 — Chunk with structure (not just length)
The goal is chunks that are meaningful and self-contained. If someone reads one chunk alone, they should understand what doc it came from and what it’s about.
Good chunk boundaries
- Section headers
- Lists / procedures
- FAQ items
- Code blocks (keep intact)
Bad chunk boundaries
- Splitting mid-list item
- Splitting a table row across chunks
- Dropping the heading from the chunk
- Mixing unrelated sections together
Step 3 — Index with metadata that helps retrieval
At minimum, store enough metadata to reconstruct a citation and apply filters. Your future self (and your users) will thank you.
```json
{
  "id": "chunk_001239",
  "text": "Doc Title — Section: Refund Policy\n...\nactual content here...",
  "metadata": {
    "source": "handbook.pdf",
    "url": "https://example.com/docs/handbook",
    "doc_id": "handbook_v3",
    "section": "Refund Policy",
    "updated_at": "2026-01-02"
  }
}
```
Step 4 — Retrieve, then rerank
A strong default is “wide retrieve, narrow rerank”. Wide retrieve ensures recall. Rerank makes precision high.
A safe starting configuration
- Vector retrieve: topK = 30
- Rerank: keep topN = 6
- Pass those 6 chunks to the LLM
- Show citations (source + section)
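Here is that configuration as a pure-Python sketch. The cosine retrieval stands in for your vector store, and `rerank_score` stands in for a cross-encoder that scores (query, chunk text) pairs; both names are placeholders, not a specific library's API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_then_rerank(query_vec, index, rerank_score, top_k=30, top_n=6):
    """Wide vector retrieve (recall), then narrow rerank (precision).

    index: list of {"id", "vec", "text"} entries.
    """
    by_sim = sorted(index, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    candidates = by_sim[:top_k]                       # wide: ensure recall
    reranked = sorted(candidates, key=lambda c: rerank_score(c["text"]), reverse=True)
    return reranked[:top_n]                           # narrow: keep precision high
```

Only the final `top_n` chunks reach the LLM, which keeps the context small and the citations clean.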
Step 5 — Force grounded answers (with citations)
Your prompt should make it easy for the model to be honest. If the context doesn’t contain the answer, it should ask a question or say it’s not in the docs.
- “Use only the provided context.”
- “Cite sources for each claim.”
- “If missing, say what’s missing and ask a clarifying question.”
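Those three instructions assemble into a simple prompt builder. The exact wording below is a starting point to tighten against your logged failures, not a canonical template:

```python
def grounding_prompt(question, chunks):
    """Build a generation prompt that forces cited, context-only answers.

    chunks: list of (citation_key, text) pairs, e.g. the reranked top N.
    """
    context = "\n\n".join(f"[{key}] {text}" for key, text in chunks)
    return (
        "Answer using ONLY the context below. Cite the [key] of every "
        "source you use. If the context does not contain the answer, say "
        "what is missing and ask one clarifying question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Putting a stable citation key in front of each chunk lets you parse the model's citations back out and display real sources to the user.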
Step 6 — Evaluate retrieval (not vibes)
Don’t start by rating answers subjectively. First, measure if retrieval gets the right chunks. If retrieval is wrong, generation will always be unstable.
A simple evaluation loop
- Create 30–100 representative questions users actually ask.
- For each question, label the correct source chunk(s).
- Measure Recall@K and improve chunking/metadata/reranking.
- Only then evaluate the final answer quality.
Step 7 — Logging + feedback: how RAG improves over time
The fastest way to get “enterprise-grade” reliability is to collect failures and fix the root cause: missing docs, bad chunking, wrong metadata, or ambiguous user questions.
Log these fields
- User query
- Retrieved chunk IDs + scores
- Reranked order
- Final answer + citations
- User feedback (thumbs up/down)
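One append-only JSONL file per environment is enough to start. This sketch logs the fields listed above; the schema is illustrative, so adapt the keys to your stack:

```python
import json

def log_interaction(path, query, retrieved, reranked, answer, citations, feedback=None):
    """Append one JSONL record covering a full RAG interaction.

    retrieved: list of (chunk_id, score) pairs from the wide retrieve.
    JSONL keeps logs greppable and easy to turn into labeled
    evaluation sets later.
    """
    record = {
        "query": query,
        "retrieved": [{"id": i, "score": s} for i, s in retrieved],
        "reranked_order": reranked,
        "answer": answer,
        "citations": citations,
        "feedback": feedback,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Thumbs-down records with weak retrieval scores are your highest-value labels: they point directly at chunking, metadata, or coverage gaps.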
Turn logs into improvements
- Bad retrieval → fix chunking/metadata
- Wrong doc version → add version filters
- Ambiguous question → add clarifying step
- Missing knowledge → ingest the missing doc
Common mistakes (and the fixes that actually work)
If RAG feels “random,” it’s usually one of these issues. Fixing them makes the system predictable.
Mistake 1 — Chunking by characters only
This creates chunks that lose headings, split lists, and drop critical context.
- Fix: split by structure (headings), then by size
- Fix: include “title + section” in every chunk
Mistake 2 — No reranking, low precision
Top-5 vector results are often “close” but not correct.
- Fix: retrieve wide (K=30), rerank to N=6
- Fix: keep the final context small and relevant
Mistake 3 — Letting the LLM answer without evidence
If you don’t require citations, the model will “helpfully” fill gaps.
- Fix: require citations for factual claims
- Fix: add an answerability threshold
Mistake 4 — Mixing unrelated docs in one index without metadata
Similar-sounding content collides. You retrieve the wrong policy, the wrong product, or the wrong year.
- Fix: store doc_id/version/section metadata
- Fix: filter by product/team/version when possible
When an answer is wrong, ask: Was the correct chunk retrieved? If no, it’s a retrieval problem. If yes, it’s a generation/guardrail problem.
FAQ
What does RAG stand for?
RAG stands for Retrieval-Augmented Generation: retrieve relevant context (your docs) first, then generate an answer grounded in that context.
What is the best chunk size for RAG?
A strong starting range is 250–600 tokens with 10–20% overlap. But the real rule is: chunks must be meaningful and self-contained. Tune chunking by measuring Recall@K on real questions.
Why not just use keyword search?
Keyword search is great for exact matches. RAG helps when users ask messy questions, paraphrase, or don’t know the exact terms in the docs. Many systems combine both: keyword/BM25 + vector search + reranking.
How do I stop hallucinations in RAG?
You can’t “prompt” your way out of weak retrieval. The highest-impact fixes are: better chunking, reranking, and answerability guardrails (citations required, thresholds, and “I don’t know” behavior).
How do I evaluate a RAG chatbot properly?
Evaluate retrieval first (Recall@K, MRR). If the right chunk isn’t retrieved, the answer can’t be reliable. Then evaluate answer quality with a test set of real user questions, and keep logs to iterate.
Should I use RAG or fine-tuning?
Use RAG when you need up-to-date knowledge and citations. Use fine-tuning when you need consistent behavior, formatting, or classification. In practice, many production systems combine both.
Cheatsheet: the “make RAG reliable” checklist
Build checklist
- Clean docs + version them
- Chunk by headings + meaning
- 250–600 tokens, 10–20% overlap
- Store metadata (doc_id, section, url)
- Retrieve K=30, rerank to N=6
- Return citations with every answer
Debug checklist
- Was the correct chunk retrieved?
- If no → fix chunking/metadata/reranking
- If yes → tighten grounding prompt
- Add answerability threshold
- Log failures and label a test set
The one rule to remember
RAG reliability is mostly retrieval quality. If retrieval is right and the model must cite evidence, the system becomes predictable and trustworthy.
Wrap-up
RAG is the most practical way to make LLMs useful with your data—without retraining models. The “magic” is not the prompt; it’s the boring, high-leverage work: chunking, metadata, retrieval + reranking, and evaluation. Do those well, and your chatbot stops guessing and starts behaving like a reliable assistant.
- Pick 30 real user questions and label the right chunks.
- Measure Recall@K before changing anything.
- Improve chunking + reranking until Recall@K is consistently strong.
- Then tighten guardrails: citations required + no-answer behavior.
Quiz
A quick self-check to confirm the key ideas stuck before you start building.