A Taxonomy of Failure Modes in Retrieval-Augmented Generation Systems
1. Introduction
Retrieval-augmented generation (RAG) has become the default architecture for grounding large language models (LLMs) in proprietary or current information [Lewis et al. 2020]. Despite widespread deployment, its failure modes are reported in scattered, system-specific terms: "the retriever missed," "the model hallucinated," "the context was too long." These descriptions are not falsifiable and do not allow practitioners to compare systems or to predict where their own systems will fail.
We propose a structured taxonomy organized along three axes: retrieval, fusion, and generation. Each axis admits several distinct failure modes. We conduct an instrumented empirical study of seven open-source RAG stacks plus a hand-built reference system and report mode-conditional failure rates across three corpora.
2. The Taxonomy
Our 14-mode taxonomy is summarized below. Modes are mutually exclusive at the per-question level; when an error could plausibly be attributed to more than one stage, we resolve the ambiguity by working backwards from generation through fusion to retrieval.
2.1 Retrieval axis
- R1 — Empty retrieval: zero documents returned.
- R2 — Off-topic retrieval: documents returned but no relevant chunks present.
- R3 — Stale retrieval: relevant chunks present but superseded by newer documents not retrieved.
- R4 — Truncated relevant: a relevant document is retrieved, but the relevant span lies outside the chunk window (see the sketch after this list).
- R5 — Reranker mis-ordering: relevant chunks present but ranked below distractors.
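R4 is easy to reproduce with fixed-size chunking. A minimal sketch, assuming gold evidence is given as a character span; the function and parameters are illustrative, not part of our harness:

```python
def span_survives_chunking(doc: str, span: tuple, chunk_size: int = 512, overlap: int = 64) -> bool:
    """Return True if the gold span [start, end) fits entirely inside at
    least one fixed-size chunk; False means the relevant text straddles
    chunk boundaries, i.e. failure mode R4."""
    start, end = span
    step = chunk_size - overlap
    for lo in range(0, max(len(doc), 1), step):
        if lo <= start and end <= lo + chunk_size:
            return True
        if lo + chunk_size >= len(doc):
            break  # last chunk already reaches the end of the document
    return False
```

Overlap shrinks the exposure but does not remove it: any span longer than the overlap can still straddle a boundary.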
2.2 Fusion axis
- F1 — Context overflow truncation: a relevant chunk is dropped due to the context limit (see the sketch after this list).
- F2 — Distractor dominance: model attends to a high-similarity but irrelevant chunk.
- F3 — Partial-evidence aggregation: information needed in two chunks; only one survives.
- F4 — Citation-evidence mismatch: model cites chunk A but uses content from chunk B.
- F5 — Format-induced misreading: chunk formatting (tables, lists) confuses parsing.
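F1 arises in the packing step that assembles retrieved chunks into the prompt. A minimal sketch of a greedy, score-ordered packer; `n_tokens` is an assumed tokenizer-length callback and the names are illustrative:

```python
def pack_context(chunks, scores, budget_tokens, n_tokens):
    """Greedily add chunks in descending score order until the token
    budget is exhausted. Chunks that do not fit are silently dropped;
    when a dropped chunk held the only relevant evidence, the failure
    is F1, even though retrieval succeeded."""
    packed, used = [], 0
    for chunk, _score in sorted(zip(chunks, scores), key=lambda cs: -cs[1]):
        cost = n_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # dropped here, relevant or not
        packed.append(chunk)
        used += cost
    return packed
```

F1 is invisible to retrieval metrics: the retriever did its job, and the evidence was lost downstream.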
2.3 Generation axis
- G1 — Pure hallucination: a claim has no support in any retrieved chunk (see the sketch after this list).
- G2 — Over-generalization: claim is broader than its supporting evidence.
- G3 — Confident under-answer: model declines despite sufficient evidence.
- G4 — Style-induced error: model paraphrases evidence into a falsehood.
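For G1, a cheap lexical screen can surface candidates before human review. A sketch only, not our annotation procedure; the threshold and names are illustrative:

```python
def lacks_support(claim: str, chunks: list, threshold: float = 0.5) -> bool:
    """Flag a claim as a G1 candidate when no retrieved chunk covers at
    least `threshold` of the claim's content words. Recall-oriented and
    crude: paraphrased or entailed evidence will be missed."""
    words = {w for w in claim.lower().split() if len(w) > 3}
    if not words:
        return False
    coverage = max(
        (len(words & set(c.lower().split())) / len(words) for c in chunks),
        default=0.0,
    )
    return coverage < threshold
```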
3. Method
3.1 Corpora
We constructed three corpora at three quality tiers:
- C-LO: a mid-2023 web scrape, with 8% known-stale and 4% known-incorrect documents.
- C-MED: cleaned Wikipedia subset, 0.5% known-incorrect.
- C-HI: hand-curated medical guideline corpus.
3.2 Questions
We collected 4,812 questions: half synthesized from gold passages and half written by domain experts.
3.3 Systems
Seven open-source RAG stacks at default settings, plus an in-house reference system tuned per corpus.
3.4 Annotation
For each error, two annotators independently assigned a mode label; disagreements were resolved by adjudication toward the latest-stage cause.
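Agreement on double annotation of this kind is conventionally reported as a chance-corrected statistic. A minimal sketch, assuming Cohen's kappa via scikit-learn; variable names are illustrative:

```python
from sklearn.metrics import cohen_kappa_score

def mode_agreement(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' labels,
    one mode string ("R1".."G4") per error."""
    return cohen_kappa_score(labels_a, labels_b)
```

The adjudicated label itself follows a deterministic cascade: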
```python
def classify_error(q, retrieved, generated, gold):
    # Check stages in pipeline order; no_relevant, hallucinated, and
    # too_broad stand for the annotators' helper predicates.
    if not retrieved:
        return "R1"  # empty retrieval
    if no_relevant(retrieved, gold):
        return "R2"  # off-topic retrieval
    ...  # checks for R3-R5 and F1-F5 elided
    if hallucinated(generated, retrieved):
        return "G1"  # unsupported claim
    if too_broad(generated, gold):
        return "G2"  # over-generalization
    return "G4"  # residual: style-induced error
```

4. Results
4.1 Mode-conditional rates
| Mode | C-LO | C-MED | C-HI |
|---|---|---|---|
| R1 Empty | 8.1% | 1.2% | 0.4% |
| R2 Off-topic | 18.7% | 5.4% | 2.1% |
| R3 Stale | 14.0% | 1.9% | 0.6% |
| R4 Truncated relevant | 6.2% | 4.8% | 3.7% |
| R5 Reranker mis-order | 5.1% | 6.0% | 5.4% |
| F1 Overflow | 3.4% | 5.7% | 7.2% |
| F2 Distractor dominance | 5.8% | 11.3% | 12.0% |
| F3 Partial-evidence | 3.0% | 4.9% | 6.4% |
| F4 Citation-evidence | 2.1% | 3.6% | 4.8% |
| F5 Format misreading | 4.0% | 5.1% | 4.7% |
| G1 Hallucination | 11.2% | 9.4% | 9.1% |
| G2 Over-generalization | 5.7% | 6.8% | 7.0% |
| G3 Under-answer | 1.9% | 2.4% | 2.8% |
| G4 Style-induced | 1.6% | 2.0% | 2.5% |
| Total error rate | 40.4% | 27.1% | 19.4% |
4.2 Aggregated by axis
- C-LO: R = 52%, F = 18%, G = 30% of all errors.
- C-HI: R = 32%, F = 47%, G = 21%.
This inversion is practically important: as corpus quality rises, the bottleneck shifts from retrieval to fusion. In C-HI, a better retriever yields little marginal benefit; most of the available gain comes from a better reranker and a context-management policy.
4.3 Hallucination floor
G1 stabilizes between 9.1% and 11.2% of questions across corpora: a floor that does not respond to corpus quality. We hypothesize that this floor is set by the model's prior over plausible-sounding claims and is therefore a model-level rather than a system-level concern.
5. Discussion
The taxonomy lets us answer questions of the form "Where should I invest a fixed engineering budget?" without instrumenting every failure category from scratch. A practitioner observing F2 (distractor dominance) at 12% knows the highest-leverage interventions: better reranking, MMR-style diversity penalties (sketched below), or scoped context windows.
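For concreteness, a minimal sketch of MMR-style reranking over precomputed embeddings; `lam` trades relevance against redundancy, and the names are illustrative:

```python
import numpy as np

def mmr_rerank(query_vec, doc_vecs, k=5, lam=0.7):
    """Greedily select chunks that are relevant to the query yet
    dissimilar to chunks already chosen, suppressing the pile-up of
    near-duplicate, high-similarity distractors behind F2."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    remaining = list(range(len(doc_vecs)))
    chosen = []
    while remaining and len(chosen) < k:
        def mmr(i):
            rel = cos(query_vec, doc_vecs[i])
            red = max((cos(doc_vecs[i], doc_vecs[j]) for j in chosen), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=mmr)
        chosen.append(best)
        remaining.remove(best)
    return chosen  # chunk indices in selection order
```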
We do not claim the taxonomy is complete. Two modes we considered but rejected for low prevalence are embedding-model drift across versions and adversarial chunking; both remain open avenues.
6. Limitations
Mode boundaries are sometimes fuzzy. Our adjudication rule resolves toward the latest-stage cause, but a different convention might attribute up to 14% of errors differently. The taxonomy was also developed for short-answer QA; long-form report generation likely needs additional modes.
7. Conclusion
Reporting RAG failures in a shared taxonomy enables practitioners to compare systems, prioritize fixes, and predict where new corpora will fail. We release the labeled error set and an evaluation harness so the community can audit and extend our findings.
References
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
- Gao, Y. et al. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey.
- Shi, F. et al. (2023). Large Language Models Can Be Easily Distracted by Irrelevant Context.
- Es, S. et al. (2024). RAGAS: Automated Evaluation of Retrieval-Augmented Generation.