A Survey of Citation-Hallucination Patterns Across Model Families and Eras
Introduction
Citation hallucination - the production of references to papers that do not exist or do not contain the claimed content - has been a recurring concern since LLMs began being used for scientific writing. Most published numbers are point-in-time and study-specific. We provide a longitudinal survey: 22 model releases, four families, 30 months, one protocol.
Protocol
Prompts
We used a single, fixed prompt template asking for a 6-citation annotated bibliography on each of 200 topics drawn from a pool stratified across CS, the life sciences, the social sciences, and the humanities. The prompt deliberately does not invoke retrieval, so that it measures parametric behavior.
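A minimal sketch of how the fixed template can be instantiated per topic; the template wording and helper name here are illustrative, not the survey's exact text:

```python
# Illustrative fixed template; the exact wording used in the survey differs.
TEMPLATE = (
    "Provide an annotated bibliography of exactly 6 citations on the topic "
    "of {topic}. For each entry give authors, title, venue, and year, "
    "followed by a one-sentence annotation."
)

def build_prompt(topic: str) -> str:
    """Instantiate the fixed template for one of the 200 survey topics."""
    return TEMPLATE.format(topic=topic)
```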
Ground-Truth Resolution
For each generated reference we issued queries to four indices (Crossref, OpenAlex, Semantic Scholar, and a curated subject-specific index) and resolved a reference as real if at least two indices returned a confident match on (authors, title, year).
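A sketch of the two-of-four agreement rule, assuming each index client exposes a hypothetical `search(ref)` method that returns `None` or a match object with a boolean `confident` attribute:

```python
def resolve_reference(ref, indices):
    """Resolve a generated reference as real when at least two of the four
    indices (Crossref, OpenAlex, Semantic Scholar, the subject-specific
    index) return a confident match on (authors, title, year)."""
    confident_hits = sum(
        1 for idx in indices
        if (match := idx.search(ref)) is not None and match.confident
    )
    return confident_hits >= 2
```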
Coding
A reference is classified as:
- Sound - real, with venue and year matching.
- Partial fabrication - one of (authors, title) is correct but the other is wrong.
- Venue confusion - real paper exists but the venue or year is wrong.
- Fabrication - no matching paper found.
Method
For each release we generated 1,200 references (200 topics x 6) at temperature 0 with seed 17. Total survey volume: 26,400 references.
We summarize per-release behavior by three rates: fabrication, partial fabrication, and venue confusion, each expressed as a percentage of the 1,200 generated references.
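A sketch of the per-release generation loop under these settings; `model.generate` and `parse_bibliography` are stand-ins for each release's inference and parsing details, and `build_prompt` is the template helper sketched earlier:

```python
SEED = 17
TEMPERATURE = 0.0

def collect_references(model, topics, parse_bibliography):
    """Generate 200 topics x 6 citations = 1,200 references for one release."""
    references = []
    for topic in topics:
        # Deterministic decoding so that releases are compared like-for-like.
        output = model.generate(build_prompt(topic),
                                temperature=TEMPERATURE, seed=SEED)
        references.extend(parse_bibliography(output))  # expect 6 entries
    return references
```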
Results
Longitudinal Trends
Across the 30-month window we observed the following medians (across families) by approximate release date:
| Period | Fabrication | Partial fabrication | Venue confusion |
|---|---|---|---|
| 2023 H1 | 38.4% | 21.7% | 17.9% |
| 2023 H2 | 27.1% | 19.0% | 17.4% |
| 2024 H1 | 19.6% | 16.2% | 16.8% |
| 2024 H2 | 13.4% | 14.1% | 16.1% |
| 2025 H1 | 9.8% | 12.5% | 16.4% |
| 2025 H2 | 7.4% | 11.7% | 15.9% |
| 2026 H1 | 6.7% | 11.0% | 15.6% |
Fabrication has roughly halved every 12 months. Partial fabrication has fallen more slowly. Venue confusion is essentially flat across releases - a striking finding that suggests it is being addressed by neither scaling nor the standard alignment recipes.
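The halving-time reading can be checked directly against the table medians; a quick sketch using the fabrication column:

```python
# Median fabrication rates (%) from the table, by half-year period.
fabrication = {
    "2023H1": 38.4, "2023H2": 27.1, "2024H1": 19.6, "2024H2": 13.4,
    "2025H1": 9.8, "2025H2": 7.4, "2026H1": 6.7,
}

periods = list(fabrication)
for earlier, later in zip(periods, periods[2:]):  # 12-month (two-period) gaps
    ratio = fabrication[later] / fabrication[earlier]
    print(f"{earlier} -> {later}: x{ratio:.2f}")
# Ratios run from roughly 0.49 to 0.68: close to halving through 2025,
# slowing somewhat by 2026.
```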
Three Regimes
Fitting Gaussian mixtures over per-topic fabrication rates yields three regimes:
- High-citation-density topics (well-trodden ML subfields, biomedicine): the fabrication rate is small, and the remaining errors are mostly partial fabrications.
- Medium-density topics (most of the social sciences): the fabrication rate is larger, and errors frequently show up as venue confusion.
- Low-density topics (specialized humanities, regional studies): the fabrication rate remains near 2024 levels even on 2026 releases.
Let F(c) denote the expected fabrication rate as a function of a training-coverage proxy c. Empirically, F decreases sharply as coverage increases, and the cross-release improvements are concentrated in the high- and medium-coverage regimes.
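A minimal sketch of the regime fit, assuming per-topic fabrication rates are available as a flat array; scikit-learn's `GaussianMixture` is used for illustration and is not necessarily the tooling behind the reported fit:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_regimes(per_topic_fabrication_rates, n_components=3, seed=0):
    """Fit a 3-component Gaussian mixture over per-topic fabrication rates
    and return each topic's regime label plus the component means."""
    x = np.asarray(per_topic_fabrication_rates, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(x)
    return gmm.predict(x), gmm.means_.ravel()
```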
Interventions
Where we could pair adjacent releases that differed in a documented training intervention, we observed:
- Citation-annotated SFT data: median 32% reduction in the fabrication rate, no significant move in venue confusion.
- RAG-distilled training: 41% reduction in the fabrication rate, 11% reduction in venue confusion.
- Pure scaling (no documented data change): 18% reduction in the fabrication rate, no change in venue confusion.
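Each bullet is a paired comparison between adjacent releases, so the effect sizes reduce to a simple relative change; the numbers in the comment below are illustrative, not the survey's per-pair data:

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction in a rate between a release and its paired
    successor that adds the documented intervention."""
    return 100.0 * (before - after) / before

# Illustrative: a pair whose fabrication rate drops from 14.0% to 9.5%
# shows relative_reduction(14.0, 9.5) ~= 32.1%.
```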
For concreteness, the coding scheme from the Protocol section reduces to the following classification logic (the helper predicates are left abstract):

```python
def classify_reference(ref, indices):
    """Map one generated reference onto the four coding classes."""
    matches = [idx.search(ref) for idx in indices]
    if confidently_real(matches):  # at least two confident index matches
        # Real paper: check whether venue and year also line up.
        return "sound" if venue_year_ok(ref, matches) else "venue_confusion"
    if partial_match(matches):  # authors or title correct, but not both
        return "partial_fabrication"
    return "fabrication"  # no matching paper found in any index
```

Discussion
The field has made real progress on outright fabrication but has barely moved on the more subtle errors. From a reader's perspective these are arguably worse: a citation that names real authors and a real paper is far harder to flag, and venue confusion subtly degrades the citation graph used by downstream services.
The near-flat venue-confusion trace suggests current training pipelines treat venue and year as lower-priority signals than authors/title. A targeted intervention - e.g., venue-conditioned prediction objectives during SFT - looks promising and is, to our knowledge, untried.
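One hypothetical way to instantiate such an objective is an auxiliary SFT task that conditions on authors and title and targets venue and year; the record fields and prompt wording below are illustrative only, not a recipe any surveyed release is known to use:

```python
def make_venue_prediction_example(record):
    """Turn one verified bibliographic record into an auxiliary SFT pair
    whose target is exactly the fields that stay error-prone (venue, year)."""
    prompt = (
        f"Paper: {record['title']}\n"
        f"Authors: {record['authors']}\n"
        "State the publication venue and year."
    )
    completion = f"{record['venue']} ({record['year']})"
    return {"prompt": prompt, "completion": completion}
```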
Limitations
Indices are incomplete; small-conference papers may be undercounted. Our prompt is fixed; behavior under retrieval-augmented prompting is materially different (and generally better) and is out of scope here. We did not control for differences in tokenizer or chat template across releases; we report behavior as deployed.
Conclusion
We present, to our knowledge, the first 30-month longitudinal survey of citation-hallucination in LLMs. Outright fabrication is steadily falling; partial fabrication is falling more slowly; venue confusion is flat. The next gains are likely to come from data interventions targeted at the still-stubborn classes.