OvertimeLabs.ai
RAG · 8 min read · 30 May 2026

Stop your RAG system hallucinating

TL;DR

A RAG hallucination is usually a retrieval failure — wrong or empty context — before it is a generation failure, so diagnose which one you have first. Fix it by grounding answers strictly in cited context, forcing the model to abstain when context is weak, and raising retrieval quality with hybrid search plus a cross-encoder re-ranker. Then measure faithfulness separately from relevance and track the hallucination rate over time.

When a RAG system makes something up, the model is rarely the first thing at fault. In most of the systems I've debugged, the model was handed wrong, partial, or empty context and did what models do — filled the gap fluently. So the first move isn't a better prompt or a bigger model. It's to find out whether the right passage was even in the context window.

Is this a retrieval failure or a generation failure?

Split the two before you change anything, because the fixes don't overlap.

A retrieval failure means the chunk that contains the answer never reached the model. Either it wasn't retrieved at all, it was retrieved but ranked below your cut-off, or there's no such chunk in the corpus. A generation failure means the right passage was in context and the model still contradicted it, ignored it, or padded around it.

The diagnostic is cheap. For a batch of questions you know the answers to, log the retrieved chunk IDs alongside each answer, then check two things: was a passage containing the gold answer in the retrieved set, and did the answer actually use it?

For each question:
  retrieved_hit  = gold_passage_id in retrieved_ids   # was the gold passage retrieved?
  answer_correct = answer matches gold answer          # is the output correct?

  retrieved_hit=False, answer_correct=False -> RETRIEVAL failure (fix search)
  retrieved_hit=True,  answer_correct=False -> GENERATION failure (fix grounding/prompt)
  retrieved_hit=False, answer_correct=True  -> model used parametric memory (dangerous; it got lucky)
  retrieved_hit=True,  answer_correct=True  -> OK (retrieved and correct)

That third case is the one people miss. An answer that's correct without the supporting passage in context isn't a win — the model leaned on its training, and next time the fact is slightly different it will be confidently wrong. In my experience the split skews heavily towards retrieval: most "the model lies" complaints turn out to be the retriever handing over rubbish.
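As a rough sketch of that diagnostic, the bucketing can be run over a logged batch in a few lines of Python. The `EvalRecord` fields and the bucket names are hypothetical, and `answer_correct` is assumed to come from a human or a grader you already trust.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One logged question. Field names are illustrative, not a fixed schema."""
    gold_passage_id: str
    retrieved_ids: list[str]
    answer_correct: bool  # judged against the gold answer elsewhere


def classify(record: EvalRecord) -> str:
    """Bucket a record by whether retrieval hit and whether the answer was right."""
    retrieved_hit = record.gold_passage_id in record.retrieved_ids
    if retrieved_hit and record.answer_correct:
        return "ok"
    if retrieved_hit:
        return "generation_failure"   # right passage in context, wrong answer
    if record.answer_correct:
        return "parametric_memory"    # right answer without the passage: lucky
    return "retrieval_failure"        # passage never reached the model


def failure_breakdown(records: list[EvalRecord]) -> Counter:
    """Count how many records land in each bucket."""
    return Counter(classify(r) for r in records)


records = [
    EvalRecord("doc_3", ["doc_1", "doc_3"], True),
    EvalRecord("doc_7", ["doc_1", "doc_2"], False),
    EvalRecord("doc_5", ["doc_5"], False),
    EvalRecord("doc_9", ["doc_4"], True),
]
print(failure_breakdown(records))
```

The ratio of `retrieval_failure` to `generation_failure` in that breakdown tells you which half of the pipeline to work on first.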

How do I make the model ground answers and abstain?

Two rules, enforced in the prompt and checked afterwards: answer only from the retrieved context, and when the context doesn't support an answer, say so rather than guess.

The abstain behaviour is the one that actually moves the hallucination rate, and it's the one teams skip because "I don't have that in the provided documents" feels like a worse product. It isn't. A wrong answer in a legal, financial, or clinical setting costs far more than a refusal.

You answer strictly from the CONTEXT below. Rules:
1. Use ONLY facts stated in the context. Do not use prior knowledge.
2. Cite the source id in square brackets after each claim, e.g. [doc_3].
3. If the context does not contain enough information to answer,
   reply exactly: "I don't have that in the provided documents."
   Do not guess, infer beyond the text, or fill gaps.
4. If sources conflict, say so and cite both.
 
CONTEXT:
{retrieved_chunks_with_ids}
 
QUESTION: {question}

Inline citations do double duty. They give the user something to verify against, and they give you a machine-checkable signal: an answer with no citation, or one citing a doc ID that wasn't in the context, is a red flag you can catch automatically.
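A minimal sketch of that automatic check, assuming citations follow the `[doc_3]` format from the prompt above and that the abstain string is returned verbatim. The function name and return shape are hypothetical.

```python
import re

ABSTAIN = "I don't have that in the provided documents."
CITATION = re.compile(r"\[([A-Za-z0-9_\-]+)\]")


def check_citations(answer: str, context_ids: set[str]) -> dict:
    """Flag answers with no citations, or citations to ids not in the context."""
    if answer.strip() == ABSTAIN:
        # An abstention carries no claims, so there is nothing to cite.
        return {"abstained": True, "cited": set(), "unknown": set(), "ok": True}
    cited = set(CITATION.findall(answer))
    unknown = cited - context_ids
    return {
        "abstained": False,
        "cited": cited,
        "unknown": unknown,
        "ok": bool(cited) and not unknown,
    }


print(check_citations("Refunds take 5 days [doc_9].", {"doc_1", "doc_3"}))
```

This only checks the answer as a whole. It does not confirm that every individual sentence carries a citation, or that the cited chunk actually supports the claim.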

What retrieval levers actually move the needle?

If the diagnosis says retrieval, these are the levers in roughly the order they pay off.

Chunking. Most retrieval misses I see are chunking artefacts — answers split across a boundary, or chunks so large the relevant sentence is diluted by noise. Chunk on structure (headings, sections, list items) rather than a blind character count, keep chunks in the region of 200–500 tokens for prose, and add a small overlap so a fact straddling a boundary survives in at least one chunk. Store enough metadata (source, section, tenant) to filter and cite later.
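One way this could look, as a sketch rather than a finished chunker: it assumes the document has already been split into sections of whole paragraphs, approximates tokens with whitespace-separated words, and treats the size limit as soft. All function names and the metadata keys are hypothetical.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace words stand in for real model tokens.
    return len(text.split())


def chunk_paragraphs(paragraphs: list[str], max_tokens: int = 400,
                     overlap: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks, repeating the last `overlap` paragraphs."""
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        if current and count_tokens(" ".join(current + [para])) > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the tail forward so a fact straddling the boundary
            # survives intact in at least one chunk.
            current = current[-overlap:] if overlap else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def chunk_document(source: str, tenant: str,
                   sections: list[tuple[str, list[str]]],
                   max_tokens: int = 400, overlap: int = 1) -> list[dict]:
    """Chunk section by section and attach metadata for filtering and citing."""
    records = []
    for heading, paragraphs in sections:
        for i, text in enumerate(chunk_paragraphs(paragraphs, max_tokens, overlap)):
            records.append({
                "id": f"{source}:{heading}:{i}",
                "text": text,
                "source": source,
                "section": heading,
                "tenant": tenant,
            })
    return records


demo = chunk_paragraphs(
    ["one two three", "four five six", "seven eight nine"], max_tokens=6, overlap=1
)
print(demo)
```

Because chunks never cross a section boundary, a heading change always starts a new chunk, which is the structural behaviour described above.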

Hybrid lexical plus vector search. Dense embeddings are great at paraphrase and weak at exact tokens — part numbers, error codes, surnames, acronyms. Lexical search (BM25) is the opposite. Run both and fuse the results; reciprocal rank fusion is a sane default and needs no score calibration. This single change rescues the "the answer used a literal string my embedding model smeared away" class of miss.
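Reciprocal rank fusion is short enough to sketch in full. Each document scores the sum of 1 / (k + rank) over the lists it appears in, so only rank positions matter and the raw BM25 and cosine scores never need to be put on a common scale. The value k = 60 is the commonly used default; the function name, the alphabetical tie-break, and the example IDs are my own assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists into one, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["err_42", "doc_a", "doc_b"]    # lexical ranking
dense_hits = ["doc_a", "doc_c", "err_42"]   # vector ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that appear in both lists rise to the top, while a chunk found only by the lexical side (here `doc_b`) still stays in the candidate pool for the re-ranker.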

Cross-encoder re-ranking. Retrieve broad — say the top 50 — then re-score with a cross-encoder that reads the query and each candidate together, and keep the top 5 for the prompt. A bi-encoder embeds query and document separately; a cross-encoder attends across both and is far more discriminating, at the cost of running per candidate. Retrieving wide and re-ranking narrow is the highest-yield quality move after hybrid search, and it directly shrinks the "right passage retrieved but ranked too low to make the cut" failures.
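The retrieve-wide, re-rank-narrow step can be sketched independently of any particular model. Here `score_fn` is a hypothetical callable that scores one (query, passage) pair; in a real system I'd expect it to wrap a cross-encoder model, and the word-overlap scorer below is only a stand-in so the example runs without one.

```python
from typing import Callable, Sequence

ScoreFn = Callable[[str, str], float]


def rerank(query: str, candidates: Sequence[tuple[str, str]],
           score_fn: ScoreFn, keep: int = 5) -> list[tuple[str, str, float]]:
    """Score every (id, text) candidate against the query and keep the best few."""
    scored = [(doc_id, text, score_fn(query, text)) for doc_id, text in candidates]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:keep]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query words present in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    ("d1", "reset your password from the login page"),
    ("d2", "billing cycles run monthly"),
    ("d3", "password reset links expire after one hour"),
]
top = rerank("password reset links", candidates, toy_overlap_score, keep=2)
print([doc_id for doc_id, _, _ in top])
```

The returned scores are worth keeping: the top score is the natural input to a confidence threshold further down the pipeline.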

How do I score faithfulness separately from relevance?

Build an eval set and score two different things, because a system can be relevant and unfaithful at the same time.

  • Faithfulness (a.k.a. groundedness): is every claim in the answer supported by a cited source in the retrieved context? This is the hallucination metric. Decompose the answer into atomic claims and check each against its citation.
  • Answer relevance: does the answer actually address the question, regardless of grounding?
  • Context relevance / recall: did retrieval surface the passages needed? This is your retrieval-vs-generation discriminator at the metric level.

Keep them separate or you'll mask problems. A confidently wrong answer can score well on relevance while failing faithfulness — exactly the case you care about. Build the eval set from real questions, include adversarial ones whose answer is not in the corpus (the correct response is the abstain string), and treat any fabricated answer to those as a hard fail. Frameworks like Ragas or TruLens automate the claim-by-claim scoring; an LLM-as-judge works if you validate the judge against a few hundred human labels first.
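To make the faithfulness metric concrete, here is a minimal sketch of the scoring arithmetic only. It assumes an earlier step has already decomposed the answer into (claim, cited_id) pairs, and it takes the support check as a pluggable `supports` function. The `naive_supports` word-match below is a placeholder for an entailment model or a validated LLM judge, not a recommendation.

```python
from typing import Callable

Claim = tuple[str, str]  # (claim text, id of the chunk it cites)


def faithfulness(claims: list[Claim], context: dict[str, str],
                 supports: Callable[[str, str], bool]) -> float:
    """Share of claims that are supported by the chunk they cite."""
    if not claims:
        # No checkable claims; handle abstentions before calling this.
        return 0.0
    supported = sum(
        1 for text, cited_id in claims
        if cited_id in context and supports(context[cited_id], text)
    )
    return supported / len(claims)


def naive_supports(chunk: str, claim: str) -> bool:
    """Placeholder verifier: every word of the claim appears in the chunk."""
    chunk_words = set(chunk.lower().split())
    return all(word in chunk_words for word in claim.lower().split())


context = {
    "doc_1": "refunds are issued within 5 days",
    "doc_2": "support is open on weekdays",
}
claims = [
    ("refunds are issued within 5 days", "doc_1"),  # supported
    ("support is open on weekends", "doc_2"),       # contradicts its source
    ("refunds are issued", "doc_9"),                # cites a chunk not in context
]
score = faithfulness(claims, context, naive_supports)
print(round(score, 2))
```

Answer relevance and context recall would be computed by separate functions over the same eval records, so that a drop in one never hides behind the others.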

What guardrails catch the ones that slip through?

Evals tell you the rate; guardrails stop the individual bad answer reaching a user.

  • Confidence / score threshold. If the top re-ranker score is below a floor, don't answer: abstain or escalate. In my experience a weak best match is one of the strongest predictors of a hallucination.
  • Refuse-to-answer as a first-class path. Wire the abstain string into the UI as a normal outcome, not an error. Make it cheap to say no.
  • Post-hoc claim verification. After generation, run a second, cheap check: does each cited claim actually appear in the chunk it cites? Strip or flag claims that don't. This catches the model citing a real doc ID for a claim that doc never made.
  • Citation integrity. Reject answers citing IDs not in the supplied context. Trivial to check, and it catches a whole class of fabrication.
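Chained together, the guardrails above might look like the sketch below. The `generate` callable, the `GuardedAnswer` type, and the `min_score` floor of 0.3 are all hypothetical; the floor in particular has to be tuned against your own re-ranker's score distribution.

```python
import re
from dataclasses import dataclass
from typing import Callable

ABSTAIN = "I don't have that in the provided documents."


@dataclass
class GuardedAnswer:
    text: str
    abstained: bool
    reason: str


def answer_with_guardrails(
    question: str,
    ranked: list[tuple[str, str, float]],        # (chunk_id, text, score), best first
    generate: Callable[[str, dict[str, str]], str],
    min_score: float = 0.3,                       # assumed floor; tune per re-ranker
) -> GuardedAnswer:
    """Abstain on weak retrieval or broken citations; otherwise pass the draft through."""
    if not ranked or ranked[0][2] < min_score:
        return GuardedAnswer(ABSTAIN, True, "weak_retrieval")
    context = {chunk_id: text for chunk_id, text, _ in ranked}
    draft = generate(question, context)
    if draft.strip() == ABSTAIN:
        return GuardedAnswer(ABSTAIN, True, "model_abstained")
    cited = set(re.findall(r"\[([^\[\]]+)\]", draft))
    if not cited or not cited <= set(context):
        return GuardedAnswer(ABSTAIN, True, "bad_citations")
    return GuardedAnswer(draft, False, "ok")


ranked = [("doc_1", "Refunds are issued within 5 days.", 0.9)]
result = answer_with_guardrails(
    "How long do refunds take?", ranked,
    lambda q, ctx: "Refunds take 5 days [doc_1].",
)
print(result)
```

Returning a `reason` alongside the abstain string lets the UI treat refusal as a normal outcome while still logging why it happened.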

How does per-tenant access control fit in?

Grounding is worthless if retrieval can pull a document the user isn't allowed to see. In multi-tenant systems the access check belongs inside retrieval, applied as a metadata filter at query time, not bolted on afterwards. A post-filter leaks if a restricted chunk reaches the context before the filter runs, and even when it works it quietly degrades results: restricted hits use up top-k slots and are then dropped, so the model sees fewer candidates than you asked for.

Stamp every chunk with its tenant and ACL metadata at ingestion, and make the retriever require a tenant/permission filter on every query — no filter, no results. Test it adversarially: ask tenant A a question whose only answer lives in tenant B's documents and confirm you get the abstain string, not a cross-tenant leak. That test belongs in the eval suite permanently.
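A toy in-memory version shows the shape of a fail-closed retriever. The `Chunk` and `TenantScopedRetriever` names are hypothetical, and the word-overlap ranking stands in for a real search backend, where the same tenant and group conditions would be passed as a metadata filter on the query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    tenant: str
    allowed_groups: frozenset[str]  # ACL stamped at ingestion


class TenantScopedRetriever:
    """Toy retriever that refuses to search without a tenant filter."""

    def __init__(self, chunks: list[Chunk]):
        self._chunks = list(chunks)

    def search(self, query: str, *, tenant: str, groups: set[str],
               top_k: int = 5) -> list[Chunk]:
        if not tenant:
            raise ValueError("tenant filter is required on every query")
        # Filter BEFORE ranking, so restricted chunks never occupy a top-k slot.
        visible = [
            c for c in self._chunks
            if c.tenant == tenant and c.allowed_groups & set(groups)
        ]
        q = set(query.lower().split())
        scored = [(len(q & set(c.text.lower().split())), c) for c in visible]
        scored = [(s, c) for s, c in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [c for _, c in scored[:top_k]]


index = TenantScopedRetriever([
    Chunk("a1", "tenant a pricing is 10 dollars", "tenant_a", frozenset({"staff"})),
    Chunk("b1", "tenant b pricing is 99 dollars", "tenant_b", frozenset({"staff"})),
])
hits = index.search("pricing 99 dollars", tenant="tenant_a", groups={"staff"})
print([c.chunk_id for c in hits])
```

Making `tenant` and `groups` keyword-only with no defaults means a caller who forgets the filter gets an error rather than an unfiltered result set.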

How do I track the hallucination rate over time?

Pick one headline number — faithfulness failures as a share of answered questions — and put it on a dashboard with the abstain rate next to it. The two trade off: drive abstention to zero and hallucinations climb; choke retrieval and abstention spikes while users get nothing useful. You're tuning the balance, so you need both visible.
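The two dashboard numbers reduce to simple ratios. A small sketch, assuming each evaluated question is logged as a dict with boolean `abstained` and `faithful` keys (a hypothetical schema):

```python
def headline_rates(outcomes: list[dict]) -> dict[str, float]:
    """Abstain rate over all questions; hallucination rate over answered ones."""
    total = len(outcomes)
    answered = [o for o in outcomes if not o["abstained"]]
    unfaithful = sum(1 for o in answered if not o["faithful"])
    return {
        "abstain_rate": (total - len(answered)) / total if total else 0.0,
        "hallucination_rate": unfaithful / len(answered) if answered else 0.0,
    }


outcomes = (
    [{"abstained": True, "faithful": True}] * 2      # refusals
    + [{"abstained": False, "faithful": True}] * 6   # grounded answers
    + [{"abstained": False, "faithful": False}] * 2  # faithfulness failures
)
print(headline_rates(outcomes))
```

Note the different denominators: abstentions are excluded from the hallucination rate, which is why the two numbers have to be read side by side.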

Run the eval on every change to chunking, embeddings, the re-ranker, or the prompt — retrieval quality drifts silently when the corpus grows, an embedding model is swapped, or a new document type is ingested. Sample real production traffic into the eval set continuously so you measure the questions users actually ask, not just your golden set.

One note on multilingual corpora: mixed-language and right-to-left content (Hebrew, Arabic) adds retrieval failure modes, including tokenisation and normalisation differences, weaker cross-lingual embedding alignment, and chunking that mishandles RTL or bidirectional text. If your corpus is multilingual, segment your eval metrics by language; a healthy aggregate score can hide a much worse number in the smaller language.

When is this worth the effort?

The full stack — hybrid retrieval, cross-encoder re-ranking, claim-level faithfulness evals, post-hoc verification, per-tenant filtering — is justified when a wrong answer is expensive or unsafe: regulated domains, customer-facing support over a knowledge base, anything where a confident fabrication carries legal or financial weight.

It's over-engineering for a low-stakes internal tool over a handful of trusted documents, where good chunking, a grounding-and-abstain prompt, and citations get you most of the way. The order of work, though, is the same everywhere: diagnose retrieval versus generation first, because re-ranking won't help a generation problem and a stricter prompt won't help a retrieval one. Spend the effort where the failures actually are.

FAQ

Why are most RAG hallucinations retrieval failures rather than generation failures?

Because models fill gaps fluently. When the passage containing the answer never reaches the context window — not retrieved, ranked too low, or absent from the corpus — the model invents something plausible. Log retrieved chunk IDs against gold answers to confirm whether retrieval or generation is at fault before changing anything.

What is the difference between faithfulness and answer relevance?

Faithfulness asks whether every claim in the answer is supported by a cited source in the retrieved context — it is the hallucination metric. Answer relevance only asks whether the answer addresses the question. A confidently wrong answer can score high on relevance while failing faithfulness, so score them separately.

How do I stop a RAG system leaking restricted documents between tenants?

Apply access control inside retrieval as a metadata filter at query time, not as a post-filter. Stamp every chunk with tenant and ACL metadata at ingestion, require a permission filter on every query, and add an adversarial cross-tenant test to your eval suite permanently.
