Why your RAG pipeline returns wrong chunks (and the 5-minute diagnostic)
Your RAG retrieval looks fine in isolation but the agent answers wrong in production. The cause is rarely the embedding model. Here are the 4 most common reasons retrieval returns the wrong chunks — and how to diagnose which one is yours in under 5 minutes.
The symptom
Your agent answers customer questions using RAG over your documentation. The eval set passes. Real customers get the wrong answer. When you inspect the trace, you can see: the wrong chunks were retrieved. The right chunk was sitting in the index. It was not returned.
You re-rank. You switch embedding models. You change chunk size. The problem moves but does not disappear.
What is actually happening
"Wrong chunks" is a symptom with four common root causes. They look identical at the output level but require different fixes. Most teams cycle through chunk-size and embedding changes hoping to stumble onto the right one. That is a slow way to solve it.
The five-minute diagnostic below identifies which of the four causes is yours.
The 3 things people try first that do not fix it
- Switching from text-embedding-ada-002 to a "better" embedding model. Sometimes helps. Rarely the root cause.
- Cranking up top_k from 3 to 10. Hides the symptom by letting the wrong chunks be present alongside the right ones. Now the synthesizer is the bottleneck.
- Adding a re-ranker. Useful, but if your retrieval is missing the chunk entirely, no re-ranker can save you. You can only re-rank what you retrieved.
The 4 actual causes
Cause 1: Chunk-boundary truncation
Your chunker split the document on a fixed 500-token boundary that happens to land in the middle of the relevant section. The query matches the chunk that contains the question phrasing but not the chunk that contains the answer.
Symptom signature: the right answer exists in your corpus, but spans two adjacent chunks. The retrieval returns the chunk with the question setup but not the answer.
Diagnostic: take the query. Search your corpus with simple keyword search (grep, BM25). Find the chunk that contains the answer. Now look at the adjacent chunks. Is the question phrasing in a different chunk than the answer? Cause 1.
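The adjacency check above can be scripted. This is a minimal sketch using plain substring search rather than BM25; the function names and the sample chunks are made up for illustration, and you would load the chunks from your own index in document order.

```python
from typing import List, Optional


def find_chunk(chunks: List[str], phrase: str) -> Optional[int]:
    """Return the index of the first chunk containing `phrase` (case-insensitive)."""
    needle = phrase.lower()
    for i, text in enumerate(chunks):
        if needle in text.lower():
            return i
    return None


def is_boundary_split(chunks: List[str], question_phrase: str, answer_phrase: str) -> bool:
    """True when the question wording and the answer sit in different, adjacent chunks."""
    q = find_chunk(chunks, question_phrase)
    a = find_chunk(chunks, answer_phrase)
    if q is None or a is None:
        return False  # one of the phrases is not in the corpus at all
    return q != a and abs(q - a) == 1


# Illustrative chunks (invented): the splitter cut the refund section mid-sentence.
chunks = [
    "Billing overview. Plans are invoiced monthly.",
    "Can I get a refund after cancelling? Refund requests are",
    "accepted within 30 days of the charge and paid back to the original card.",
]
print(is_boundary_split(chunks, "refund after cancelling", "within 30 days"))
```

If this returns True for a failing query, you are most likely looking at Cause 1.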
Fix: introduce chunk overlap (the cheap option, 20-30% overlap) and semantic chunking (the right option — split on section boundaries, not token counts). Use a recursive splitter that respects headings:
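from langchain.text_splitter import RecursiveCharacterTextSplitter  # newer releases: langchain_text_splitters

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,     # measured in characters by default, not tokens
    chunk_overlap=200,  # 25% overlap
    # Try heading boundaries first, then paragraphs, lines, sentences, words
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)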
Cause 2: The embedding model does not know your domain's vocabulary
You are using a general-purpose embedding model (OpenAI text-embedding-3-small, Cohere embed-english-v3.0). Your domain uses jargon, product names, or internal terminology that the embedding model has never seen.
Symptom signature: queries with common English phrases work. Queries using your product names or technical terms retrieve unrelated content.
Diagnostic: take 10 failing queries. Manually identify the rare/jargon term in each. Run the same queries with the jargon term replaced by its English description ("Tier 3 license" → "highest pricing tier"). Do the queries now succeed? Cause 2.
Fix: two options:
- Synonym expansion at query time. Before embedding the query, expand jargon terms inline:
    query = "Does the Tier 3 license include API access?"
    expanded = "Does the Tier 3 license (highest pricing tier, enterprise plan) include API access?"
- Fine-tune the embedding model on your domain corpus. Higher leverage, more upfront work. Tools: sentence-transformers fine-tuning, Cohere fine-tuning API, or a learned projection on top of OpenAI embeddings.
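The synonym-expansion option can be driven by a small glossary instead of hand-edited queries. The sketch below is one way to do it; `GLOSSARY` and `expand_query` are hypothetical names, and the single glossary entry is taken from the example above.

```python
import re
from typing import Dict

# Hypothetical glossary: jargon term -> plain-English description.
GLOSSARY: Dict[str, str] = {
    "Tier 3 license": "highest pricing tier, enterprise plan",
}


def expand_query(query: str, glossary: Dict[str, str]) -> str:
    """Append a parenthetical description after each known jargon term."""
    expanded = query
    for term, description in glossary.items():
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        # A function replacement keeps the original casing and avoids
        # backslash-escape surprises in the description text.
        expanded = pattern.sub(
            lambda m, d=description: f"{m.group(0)} ({d})", expanded
        )
    return expanded


print(expand_query("Does the Tier 3 license include API access?", GLOSSARY))
```

Embed the expanded string, not the raw query. Keep descriptions free of other glossary terms, or a later entry may expand text that an earlier entry inserted.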
Cause 3: Stale or out-of-sync index
The query is fine. The embedding model is fine. The chunk that contains the answer was updated last week and the index was not re-built. The index still has the old version, which does not match the query.
Symptom signature: the right answer is in the documentation but missing from retrieval. Recently-edited sections fail more often than old stable ones.
Diagnostic: take the query. Pull the chunk that should match from your live documentation source. Pull the same chunk from your vector index. Compare them byte-for-byte. If they differ, Cause 3.
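A byte-for-byte comparison is easy to script once you have both texts in hand. This sketch assumes you have already fetched the chunk text from your source and from your vector index (how you fetch them depends on your stack); `content_hash` and `compare_chunk` are illustrative names.

```python
import difflib
import hashlib
from typing import List


def content_hash(text: str) -> str:
    """SHA-256 of the UTF-8 bytes, so any byte-level difference changes the hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def compare_chunk(source_text: str, indexed_text: str) -> List[str]:
    """Return [] when the texts are identical, otherwise a unified diff (index -> source)."""
    if content_hash(source_text) == content_hash(indexed_text):
        return []
    diff = list(
        difflib.unified_diff(
            indexed_text.splitlines(),
            source_text.splitlines(),
            fromfile="index",
            tofile="source",
            lineterm="",
        )
    )
    # splitlines() drops line endings, so a pure line-ending change yields no diff lines.
    return diff or ["texts differ only in line endings"]
```

A non-empty result for the chunk that should have matched points to Cause 3.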
Fix: wire your index to a change-feed of the source, not a periodic batch:
- Document edits → trigger embedding update → upsert to index
- Track source_hash per chunk; alert on hash drift between source and index
- Add a "last indexed at" timestamp surface so you can prove freshness in audits
If you cannot do streaming updates, run the rebuild often (daily, not weekly) and monitor the lag as a first-class metric.
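Hash drift and lag can both be computed from data you probably already have. The sketch below assumes two dictionaries mapping chunk id to content hash (one built from the source, one read back from the index) and a pair of timestamps; all names here are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List


def find_drift(source_hashes: Dict[str, str], index_hashes: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify chunk ids as stale (hash differs), missing (never indexed),
    or orphaned (still indexed, but gone from the source)."""
    stale = [cid for cid, h in source_hashes.items()
             if cid in index_hashes and index_hashes[cid] != h]
    missing = [cid for cid in source_hashes if cid not in index_hashes]
    orphaned = [cid for cid in index_hashes if cid not in source_hashes]
    return {"stale": sorted(stale), "missing": sorted(missing), "orphaned": sorted(orphaned)}


def index_lag_seconds(source_updated_at: datetime, last_indexed_at: datetime) -> float:
    """Seconds the index trails the source; 0.0 when the index is current."""
    return max(0.0, (source_updated_at - last_indexed_at).total_seconds())


# Illustrative values only.
report = find_drift(
    {"doc1#0": "aaa", "doc1#1": "bbb", "doc2#0": "ccc"},
    {"doc1#0": "aaa", "doc1#1": "OLD", "doc9#0": "zzz"},
)
print(report)
```

Alert when any of the three lists is non-empty, and export the lag value to whatever metrics system you already use.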
Cause 4: The query embedding is structurally different from the chunk embeddings
The most subtle cause. Your chunks are long, formal, descriptive paragraphs from documentation. Your queries are short, terse, often question-shaped. The embedding model places long descriptive text in a different region of the vector space than short questions, even when they are about the same topic.
Symptom signature: queries phrased like the documentation retrieve well. Short, question-shaped queries about the same topic retrieve poorly.
Diagnostic: for 10 failing queries, compute the cosine similarity between the query embedding and the correct chunk's embedding. Compare to the similarity between the query and a wrong chunk that was retrieved. If both are low (below roughly 0.5, though the absolute scale depends on the embedding model) and close to each other, the model cannot distinguish them. Cause 4.
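The cosine check needs nothing beyond the three embedding vectors. In this sketch the 0.5 cutoff comes from the diagnostic above, while the `margin` value for "close to each other" is an assumption you should tune for your model; the function names are illustrative.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def looks_like_cause_4(
    query_vec: Sequence[float],
    right_vec: Sequence[float],
    wrong_vec: Sequence[float],
    low: float = 0.5,      # "low" threshold from the diagnostic
    margin: float = 0.05,  # assumed definition of "close to each other"
) -> bool:
    """True when both similarities are low and nearly indistinguishable."""
    right = cosine(query_vec, right_vec)
    wrong = cosine(query_vec, wrong_vec)
    return right < low and wrong < low and abs(right - wrong) < margin
```

Run it over all 10 failing queries. One hit proves little; a consistent pattern is the signal.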
Fix: close the shape gap from either side. Make the query look like a chunk at query time, or make the chunks look like queries at index time. Two common patterns:
- HyDE (Hypothetical Document Embeddings): at query time, ask the LLM to generate a hypothetical answer first, then embed the answer (not the question). Retrieves chunks that match the answer's shape.
- Synthetic-question augmentation: at index time, generate 3-5 likely questions for each chunk. Embed both the chunk and the questions. Store under the chunk's id. The query now has multiple chances to hit.
# Synthetic question augmentation at index time
# (llm, index, and embed stand in for your own LLM client, vector store, and embedding call)
for chunk in chunks:
    questions = llm.generate_questions_for_chunk(chunk, n=5)
    index.add(chunk.id, chunk.text, embedding=embed(chunk.text))
    for q in questions:
        index.add(chunk.id, q, embedding=embed(q))  # Same chunk id, multiple entries
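HyDE works from the query side instead. The sketch below shows the shape of it with an in-memory index; `generate_answer` and `embed` are hypothetical callables you would back with your LLM and embedding provider, and the brute-force ranking is only there to keep the example self-contained.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(
    query: str,
    generate_answer: Callable[[str], str],  # hypothetical LLM call
    embed: Callable[[str], Vector],         # hypothetical embedding call
    index: List[Tuple[str, Vector]],        # (chunk_id, chunk_embedding) pairs
    top_k: int = 3,
) -> List[str]:
    """Embed a hypothetical answer instead of the raw query, then rank chunks by similarity."""
    hypothetical = generate_answer(query)
    query_vec = embed(hypothetical)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:top_k]]
```

The hypothetical answer can be factually wrong and still help, because only its shape and vocabulary feed the embedding. The cost is one extra LLM call per query, so measure the added latency before shipping it.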
The 5-minute diagnostic
Given a failing query, run these checks in order:
- Is the right chunk in the corpus at all? Keyword search confirms presence. If not, no retrieval fix will help — content gap.
- Is the chunk boundary splitting the answer? Adjacent chunks contain question and answer separately → Cause 1.
- Does replacing jargon with plain English fix it? → Cause 2.
- Is the index out of sync with the source? Source hash check → Cause 3.
- Are query and chunk embeddings dissimilar despite same topic? Cosine check → Cause 4.
Most production RAG bugs are one of these four. The diagnostic takes longer to read than to execute.
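If you run the diagnostic often, the ordering of the checks can be captured in a few lines. This is only a sketch of the decision order; the `Checks` fields are hypothetical names for the results of the five checks you ran by hand or with your own scripts.

```python
from dataclasses import dataclass


@dataclass
class Checks:
    answer_in_corpus: bool            # keyword search finds the answer
    split_across_chunks: bool         # question and answer sit in adjacent chunks
    plain_english_fixes_it: bool      # replacing jargon makes retrieval succeed
    index_matches_source: bool        # source hash equals indexed hash
    similarities_low_and_close: bool  # cosine check on right vs. wrong chunk


def diagnose(c: Checks) -> str:
    """Apply the checks in order and return the first matching cause."""
    if not c.answer_in_corpus:
        return "content gap"
    if c.split_across_chunks:
        return "cause 1: chunk-boundary truncation"
    if c.plain_english_fixes_it:
        return "cause 2: domain vocabulary"
    if not c.index_matches_source:
        return "cause 3: stale index"
    if c.similarities_low_and_close:
        return "cause 4: query/chunk shape mismatch"
    return "undetermined"
```

The function returns the first match only. If more than one check fires for the same query, fix the earliest cause and re-run.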
What "good" looks like
A well-engineered RAG pipeline has:
- Semantic chunking with 20-30% overlap
- Synonym expansion or domain-tuned embeddings for jargon-heavy corpora
- Streaming or daily index refresh with freshness monitoring
- HyDE or synthetic-question augmentation for short-query workloads
- An eval suite that tests retrieval alone, separately from answer quality
If your eval suite only scores final answers, you cannot tell whether a wrong answer is a retrieval failure or a synthesis failure. They have different fixes. Score them separately.
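Scoring retrieval on its own can start as a single metric. The sketch below computes a hit rate at k, assuming you have a labeled set that maps each query to the one chunk id that should be retrieved; the data layout and names are assumptions, not a standard format.

```python
from typing import Dict, List


def hit_rate_at_k(retrieved: Dict[str, List[str]], expected: Dict[str, str], k: int) -> float:
    """Fraction of labeled queries whose expected chunk id appears in the top-k retrieved ids."""
    if not expected:
        return 0.0
    hits = sum(
        1
        for query, chunk_id in expected.items()
        if chunk_id in retrieved.get(query, [])[:k]
    )
    return hits / len(expected)
```

Track this number alongside your answer-quality score. When the hit rate is high and answers are still wrong, the problem is in synthesis, not retrieval.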
When to get help
If you cannot identify which of the four causes you have, or the diagnostic suggests you have all four interacting, the issue is usually that retrieval was tuned for the wrong workload at the start (long-form docs indexed for short-query traffic, for example). That is fixable but takes a structured re-build, not a hyperparameter tweak.
Our free audit covers this: bring your failing queries, we run the diagnostic live, and you have a fix path within 24 hours.