AI · RAG

RAG Done Right: Make Chatbots Use Your Data Reliably

Chunking, embeddings, retrieval pitfalls, and evaluation.

Reading time: ~10–14 min
Level: Beginner → Intermediate

RAG (Retrieval-Augmented Generation) is how you make an LLM answer using your documents instead of guessing. Done wrong, it confidently hallucinates. Done right, it behaves like a searchable, cited assistant. This guide shows the exact levers that move reliability: chunking, embeddings, retrieval, reranking, and evaluation.


Quickstart: make your RAG noticeably better in 60 minutes

If your chatbot “sometimes uses the docs, sometimes makes things up”, start here. These are the fastest changes with the biggest impact.

1) Fix chunking (the #1 silent killer)

Most “RAG failures” are actually retrieval failures caused by chunks that are too big, too small, or missing context.

  • Chunk by meaning (sections, headings), not just characters
  • Target 250–600 tokens per chunk (start here, then tune)
  • Add overlap (10–20%) so boundary info isn’t lost
  • Store the title + section header inside each chunk

2) Add metadata + filtering

Metadata makes retrieval precise. Without it, “vector search” often returns the wrong page that sounds similar.

  • Store source, url, doc_id, section, date
  • Use filters like doc_id=… or product=… when the query allows
  • Keep a stable citation key (so you can display sources)
  • Prefer “few clean docs” over “everything dumped in one index”

3) Add a reranker (cheap accuracy boost)

Vector search is fast but imperfect. Reranking re-sorts candidates to match the user’s exact question.

  • Retrieve top K=20–50 quickly
  • Rerank down to N=4–8 best chunks
  • Feed only those into the LLM
  • Track hit-rate (did the right chunk appear in top N?)

4) Add “answerability” guardrails

The model should say “I don’t know” when retrieval is weak. This is how you stop confident hallucinations.

  • Require citations for factual answers (“no sources → no claim”)
  • Use a minimum relevance threshold (score or rerank gap)
  • When weak: ask a clarifying question or suggest where to look
  • Log “no-answer” cases for dataset improvement
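The guardrail above can be sketched as a small gate over reranker scores. This is a minimal sketch assuming relevance scores normalized to roughly [0, 1]; the function name and thresholds are illustrative, not taken from any particular library.

```python
def is_answerable(scores, min_top=0.5, min_gap=0.0):
    """Decide whether retrieval is strong enough to answer.

    scores: reranker relevance scores, sorted descending.
    min_top: the best chunk must clear this absolute threshold.
    min_gap: optional margin between best and second-best score,
             useful when scores cluster just above the threshold.
    """
    if not scores:
        return False
    if scores[0] < min_top:
        return False
    if len(scores) > 1 and (scores[0] - scores[1]) < min_gap:
        return False
    return True

# Weak retrieval: refuse, then ask a clarifying question instead
# of presenting a guess as fact.
```

Tune `min_top` against your own logged queries; a threshold copied from someone else's score distribution rarely transfers.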

A simple rule that improves trust

If the bot can’t cite at least one retrieved chunk that contains the answer, it should not present the answer as fact.

Overview: what RAG is (and why it beats “just prompt it”)

A normal LLM answers from its training data. That’s great for general knowledge, but bad for your internal docs, product policies, prices, or anything that changes. RAG solves this by retrieving relevant context first, then generating an answer using that context.

The RAG pipeline in one sentence

User question → retrieve relevant chunks → (optional) rerank → generate answer with citations → log & improve

What RAG is good for

  • Company / internal knowledge base: answers must match current docs (gotcha: messy docs make chunking critical)
  • Product support chatbot: citations reduce “made-up” fixes (gotcha: outdated versions left in the index)
  • Legal / policy Q&A: must cite sources and avoid guessing (gotcha: ambiguity calls for clarifying questions)
  • Developer docs assistant: search plus explanation with examples (gotcha: code blocks need chunk-aware splitting)

RAG vs fine-tuning (quick intuition)

  • RAG is best when facts change, and you need citations.
  • Fine-tuning is best for style, formatting, classification, and consistent behavior.
  • Many teams do both: fine-tune for behavior, RAG for knowledge.

Core concepts: the parts that actually control quality

1) Embeddings: how text becomes searchable

Embeddings convert text into vectors (numbers). Similar meanings end up near each other in vector space, so a question can retrieve related passages even if it doesn’t share the same keywords.
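A toy sketch of why this works: cosine similarity measures how closely two vectors point in the same direction. The three-dimensional vectors below are made up for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim "embeddings" (invented numbers, for illustration only):
query    = [0.9, 0.1, 0.0]   # "how do I get my money back?"
refunds  = [0.8, 0.2, 0.1]   # passage about the refund policy
shipping = [0.1, 0.1, 0.9]   # passage about shipping times

# The refund passage scores closer to the query than shipping does,
# even though the query never uses the word "refund".
assert cosine_similarity(query, refunds) > cosine_similarity(query, shipping)
```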

When embeddings shine

  • Queries that paraphrase doc language
  • Messy natural questions
  • “What’s the process for…?” style questions

When embeddings struggle

  • Exact IDs, part numbers, SKUs
  • Tables with tiny cells
  • Highly structured data (use filters or SQL)

2) Chunking: the real secret of RAG

The model can only use what you retrieve. Chunking decides what’s retrievable. If a chunk is missing the header, the “who/what/where” context may disappear. If it’s too large, retrieval becomes noisy.

A strong default chunk recipe

  • Split by headings/sections first
  • Then sub-split long sections to ~250–600 tokens
  • Include: doc title + section heading + content inside each chunk
  • Add overlap 10–20%
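The recipe above can be sketched roughly as follows. This is a minimal illustration that assumes '#'-prefixed headings and uses word count as a crude token proxy; in practice, swap in a real tokenizer (to hit the 250–600 token range) and your own heading detection.

```python
def chunk_document(text, doc_title, max_words=300, overlap=0.15):
    """Split by headings first, then sub-split long sections with overlap.

    Each emitted chunk is prefixed with the doc title and section
    heading so it stays self-contained when retrieved alone.
    """
    chunks = []
    section, heading = [], ""
    lines = text.splitlines() + ["# __END__"]  # sentinel flushes the last section
    for line in lines:
        if line.startswith("#"):
            words = " ".join(section).split()
            step = max(1, int(max_words * (1 - overlap)))  # stride with overlap
            for start in range(0, max(len(words), 1), step):
                body = " ".join(words[start:start + max_words])
                if body:
                    chunks.append(f"{doc_title} — {heading}\n{body}")
            section, heading = [], line.lstrip("# ").strip()
        else:
            section.append(line)
    return chunks
```

A 500-word section with `max_words=300` and 15% overlap yields two chunks whose boundaries share about 45 words, so information at the split point is never lost.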

3) Retrieval: getting the right context, not “similar vibes”

Retrieval is a ranking problem: you want the best chunks in the top few results. Most RAG systems fail because the right chunk exists but doesn’t appear in the top N.

Practical target metrics

Track Recall@K (did the right chunk appear in top K?) and MRR (how high did it rank?). This is more actionable than “the bot seems smarter.”
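Both metrics are a few lines of code once you have a labeled question set. The function names here are illustrative:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(relevant_ids) else 0.0

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant chunk (0.0 if absent)."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Averaged over a labeled question set, these tell you whether a
# chunking or metadata change actually helped retrieval.
```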

4) Generation: make the LLM follow the evidence

Even with perfect retrieval, the model can still ignore context. Your generation prompt should force: cite sources, quote specific lines when needed, and admit uncertainty when the docs don’t contain the answer.

The most common production failure

The right chunk was retrieved, but the model answered from “general knowledge” instead. Fix this with: stronger system instruction, citations required, and shorter context (rerank → top N).

Step-by-step: build a RAG system that behaves reliably

This section is intentionally practical. You can implement it with most stacks (LangChain, LlamaIndex, custom code). The order matters: start with data quality and retrieval quality before “prompt engineering.”

Step 1 — Prepare your documents

Do this before chunking

  • Remove duplicate docs and outdated versions (or tag them clearly)
  • Normalize whitespace and fix broken headings
  • Keep links, titles, and source info (you’ll need citations)
  • Decide what should be searchable (some content shouldn’t be indexed)

Step 2 — Chunk with structure (not just length)

The goal is chunks that are meaningful and self-contained. If someone reads one chunk alone, they should understand what doc it came from and what it’s about.

Good chunk boundaries

  • Section headers
  • Lists / procedures
  • FAQ items
  • Code blocks (keep intact)

Bad chunk boundaries

  • Splitting mid-list item
  • Splitting a table row across chunks
  • Dropping the heading from the chunk
  • Mixing unrelated sections together

Step 3 — Index with metadata that helps retrieval

At minimum, store enough metadata to reconstruct a citation and apply filters. Your future self (and your users) will thank you.

{
  "id": "chunk_001239",
  "text": "Doc Title — Section: Refund Policy\n...\nactual content here...",
  "metadata": {
    "source": "handbook.pdf",
    "url": "https://example.com/docs/handbook",
    "doc_id": "handbook_v3",
    "section": "Refund Policy",
    "updated_at": "2026-01-02"
  }
}

Step 4 — Retrieve, then rerank

A strong default is “retrieve wide, rerank narrow”: the wide retrieval step protects recall, and the rerank step restores precision.

A safe starting configuration

  • Vector retrieve: topK = 30
  • Rerank: keep topN = 6
  • Pass those 6 chunks to the LLM
  • Show citations (source + section)
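The configuration above fits in one small function. `vector_search` and `rerank_score` below are placeholders for your vector DB and reranker, not real APIs:

```python
def retrieve_then_rerank(query, vector_search, rerank_score, top_k=30, top_n=6):
    """Wide retrieve for recall, narrow rerank for precision.

    vector_search(query, k) -> list of chunk dicts (stand-in for your
    vector DB client); rerank_score(query, text) -> relevance float
    (stand-in for a cross-encoder reranker). Both are assumptions.
    """
    candidates = vector_search(query, top_k)              # fast, approximate
    scored = [(rerank_score(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # best first
    return [chunk for _, chunk in scored[:top_n]]         # only these reach the LLM
```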

Step 5 — Force grounded answers (with citations)

Your prompt should make it easy for the model to be honest. If the context doesn’t contain the answer, it should ask a question or say it’s not in the docs.

Grounding pattern that works

  • “Use only the provided context.”
  • “Cite sources for each claim.”
  • “If missing, say what’s missing and ask a clarifying question.”
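One way to assemble such a prompt; the exact wording is illustrative, not canonical, and the chunk fields match the metadata schema shown earlier:

```python
def build_grounded_prompt(question, chunks):
    """Assemble a context block with citation keys the model must reuse."""
    context = "\n\n".join(
        f"[{c['id']}] ({c['source']}, {c['section']})\n{c['text']}"
        for c in chunks
    )
    return (
        "Answer using ONLY the context below. Cite the [id] of every "
        "chunk you rely on. If the context does not contain the answer, "
        "say what is missing and ask one clarifying question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Because each chunk carries a stable `id`, the citations the model emits can be mapped back to sources for display.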

Step 6 — Evaluate retrieval (not vibes)

Don’t start by rating answers subjectively. First, measure if retrieval gets the right chunks. If retrieval is wrong, generation will always be unstable.

A simple evaluation loop

  1. Create 30–100 representative questions users actually ask.
  2. For each question, label the correct source chunk(s).
  3. Measure Recall@K and improve chunking/metadata/reranking.
  4. Only then evaluate the final answer quality.

Step 7 — Logging + feedback: how RAG improves over time

The fastest way to get “enterprise-grade” reliability is to collect failures and fix the root cause: missing docs, bad chunking, wrong metadata, or ambiguous user questions.

Log these fields

  • User query
  • Retrieved chunk IDs + scores
  • Reranked order
  • Final answer + citations
  • User feedback (thumbs up/down)
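A minimal sketch of such a log record, serialized as one JSON line per turn. Field names mirror the list above and are otherwise arbitrary:

```python
import json
import time

def log_rag_turn(query, retrieved, reranked_ids, answer, citations, feedback=None):
    """Serialize one RAG turn as a JSON line for later failure analysis."""
    record = {
        "ts": time.time(),
        "query": query,
        "retrieved": [{"id": c["id"], "score": c["score"]} for c in retrieved],
        "reranked_order": reranked_ids,
        "answer": answer,
        "citations": citations,
        "feedback": feedback,   # e.g. "up" / "down", filled in later
    }
    return json.dumps(record)   # append this line to your log file
```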

Turn logs into improvements

  • Bad retrieval → fix chunking/metadata
  • Wrong doc version → add version filters
  • Ambiguous question → add clarifying step
  • Missing knowledge → ingest the missing doc

Common mistakes (and the fixes that actually work)

If RAG feels “random,” it’s usually one of these issues. Fixing them makes the system predictable.

Mistake 1 — Chunking by characters only

This creates chunks that lose headings, split lists, and drop critical context.

  • Fix: split by structure (headings), then by size
  • Fix: include “title + section” in every chunk

Mistake 2 — No reranking, low precision

Top-5 vector results are often “close” but not correct.

  • Fix: retrieve wide (K=30), rerank to N=6
  • Fix: keep the final context small and relevant

Mistake 3 — Letting the LLM answer without evidence

If you don’t require citations, the model will “helpfully” fill gaps.

  • Fix: require citations for factual claims
  • Fix: add an answerability threshold

Mistake 4 — Mixing unrelated docs in one index without metadata

Similar-sounding content collides. You retrieve the wrong policy, the wrong product, or the wrong year.

  • Fix: store doc_id/version/section metadata
  • Fix: filter by product/team/version when possible

The best “debug” question

When an answer is wrong, ask: Was the correct chunk retrieved? If no, it’s a retrieval problem. If yes, it’s a generation/guardrail problem.

FAQ

What does RAG stand for?

RAG stands for Retrieval-Augmented Generation: retrieve relevant context (your docs) first, then generate an answer grounded in that context.

What is the best chunk size for RAG?

A strong starting range is 250–600 tokens with 10–20% overlap. But the real rule is: chunks must be meaningful and self-contained. Tune chunking by measuring Recall@K on real questions.

How is RAG different from keyword search?

Keyword search is great for exact matches. RAG helps when users ask messy questions, paraphrase, or don’t know the exact terms in the docs. Many systems combine both: keyword/BM25 + vector search + reranking.

How do I stop hallucinations in RAG?

You can’t “prompt” your way out of weak retrieval. The highest-impact fixes are: better chunking, reranking, and answerability guardrails (citations required, thresholds, and “I don’t know” behavior).

How do I evaluate a RAG chatbot properly?

Evaluate retrieval first (Recall@K, MRR). If the right chunk isn’t retrieved, the answer can’t be reliable. Then evaluate answer quality with a test set of real user questions, and keep logs to iterate.

Should I use RAG or fine-tuning?

Use RAG when you need up-to-date knowledge and citations. Use fine-tuning when you need consistent behavior, formatting, or classification. In practice, many production systems combine both.

Cheatsheet: the “make RAG reliable” checklist

Build checklist

  • Clean docs + version them
  • Chunk by headings + meaning
  • 250–600 tokens, 10–20% overlap
  • Store metadata (doc_id, section, url)
  • Retrieve K=30, rerank to N=6
  • Return citations with every answer

Debug checklist

  • Was the correct chunk retrieved?
  • If no → fix chunking/metadata/reranking
  • If yes → tighten grounding prompt
  • Add answerability threshold
  • Log failures and label a test set

The one rule to remember

RAG reliability is mostly retrieval quality. If retrieval is right and the model must cite evidence, the system becomes predictable and trustworthy.

Wrap-up

RAG is the most practical way to make LLMs useful with your data—without retraining models. The “magic” is not the prompt; it’s the boring, high-leverage work: chunking, metadata, retrieval + reranking, and evaluation. Do those well, and your chatbot stops guessing and starts behaving like a reliable assistant.

Your next step
  • Pick 30 real user questions and label the right chunks.
  • Measure Recall@K before changing anything.
  • Improve chunking + reranking until Recall@K is consistently strong.
  • Then tighten guardrails: citations required + no-answer behavior.

Quiz

A quick self-check to see whether the key ideas stuck.

1) What does RAG stand for?
2) The biggest cause of “RAG feels random” is usually…
3) A strong default retrieval strategy is…
4) The best way to reduce hallucinations in RAG is…