Frequency-Dependent Hallucination Rates in Large Language Models: Rare Entities Are Not Created Equal
Abstract
We study hallucination rates across 12,000 entities stratified by frequency, type, and recency. Hallucination rate follows a power law in entity frequency (exponent α ≈ 0.43), but is modulated by entity type (persons hallucinate 2.3x more often than locations) and temporal recency (a 4.1x increase for post-cutoff entities). We introduce the Entity Hallucination Profile (EHP) as a tool for targeted mitigation.
1. Introduction
Hallucination—the generation of factually incorrect content—is a central challenge for deploying LLMs in high-stakes applications [1]. A widely held intuition is that hallucination correlates inversely with entity frequency in the training data: models know more about common entities and hallucinate more about rare ones [2].
While this intuition is directionally correct, it is overly simplistic. Not all rare entities hallucinate equally, and the nature of hallucination varies systematically across the frequency spectrum. Understanding these patterns is crucial for developing targeted mitigation strategies.
2. Entity Stratification
2.1 Frequency Bins
We use annual Wikipedia page views (2023) as a proxy for training-data frequency:
| Bin | Page Views/Year | Entities | Label |
|---|---|---|---|
| F1 | > 10M | 500 | Very common |
| F2 | 1M - 10M | 1000 | Common |
| F3 | 100K - 1M | 2000 | Moderate |
| F4 | 10K - 100K | 3000 | Uncommon |
| F5 | 1K - 10K | 3000 | Rare |
| F6 | < 1K | 2500 | Very rare |
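The bin thresholds above can be encoded as a small lookup. This is an illustrative sketch, not code from the paper; the function and variable names are invented.

```python
def assign_bin(page_views_per_year: int) -> str:
    """Map annual Wikipedia page views to a frequency-bin label (F1-F6)."""
    thresholds = [
        (10_000_000, "F1"),  # > 10M: very common
        (1_000_000, "F2"),   # 1M-10M: common
        (100_000, "F3"),     # 100K-1M: moderate
        (10_000, "F4"),      # 10K-100K: uncommon
        (1_000, "F5"),       # 1K-10K: rare
    ]
    for lower_bound, label in thresholds:
        if page_views_per_year > lower_bound:
            return label
    return "F6"              # < 1K: very rare
```

Because the table's bin edges are open at the top (e.g., "1M - 10M"), exact boundary values are assigned to the lower bin here; the paper does not specify its boundary convention.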
2.2 Entity Types
Each frequency bin is balanced across four types:
- Person (P): Named individuals
- Location (L): Geographic entities
- Organization (O): Companies, institutions
- Event (E): Historical events, conferences
2.3 Temporal Recency
Entities are additionally tagged as:
- Stable (S): Core facts unchanged in 5+ years
- Dynamic (D): Key facts changed in last 2 years
- Post-cutoff (C): Key facts changed after the model's knowledge cutoff
3. Experimental Setup
For each entity, we generate 5 factual questions (verified by human annotators). Models answer under greedy decoding with no retrieval augmentation.
| Model | Knowledge Cutoff | Parameters |
|---|---|---|
| GPT-4-Turbo | Apr 2024 | ~1.8T (est.) |
| Claude-3-Opus | Aug 2024 | Undisclosed |
| LLaMA-3-70B | Dec 2023 | 70B |
| Mixtral-8x7B | Sep 2023 | 46.7B |
| Qwen-2-72B | Jun 2024 | 72B |
Hallucination is judged by three annotators per question and labeled as: Correct, Hallucinated (a confident false claim), Abstained (the model declines to answer), or Hedged (uncertain but incorrect).
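Aggregating the three annotator labels into one judgment might look like the sketch below. The paper does not state its aggregation rule, so the majority vote and the tie-breaking order are assumptions.

```python
from collections import Counter

LABELS = {"correct", "hallucinated", "abstained", "hedged"}

def aggregate(labels: list[str]) -> str:
    """Combine three annotator labels into one judgment by majority vote."""
    assert len(labels) == 3 and set(labels) <= LABELS
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    if n == 1:
        # With 3 annotators and 4 labels, a 1-1-1 split is possible;
        # fall back to the most severe label present (an assumption).
        severity = ["hallucinated", "hedged", "abstained", "correct"]
        return next(s for s in severity if s in counts)
    return label
```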
4. Results
4.1 Frequency-Hallucination Power Law
Across all models and entity types:
| Frequency Bin | Mean Hallucination Rate | 95% CI |
|---|---|---|
| F1 (Very common) | 3.2% | [2.7, 3.7] |
| F2 (Common) | 7.8% | [7.1, 8.5] |
| F3 (Moderate) | 14.1% | [13.2, 15.0] |
| F4 (Uncommon) | 24.7% | [23.5, 25.9] |
| F5 (Rare) | 38.3% | [37.0, 39.6] |
| F6 (Very rare) | 52.6% | [51.1, 54.1] |
Fitting a power law H(f) = a · f^(−α) to these rates yields a mean exponent α ≈ 0.43 across models (per-model range 0.39–0.48; see Section 4.5).
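The fit itself reduces to least squares in log-log space. The sketch below uses synthetic data generated with a known α = 0.43 (the mean exponent reported in Section 4.5) to show that the procedure recovers the exponent; the per-entity measurements themselves are not published.

```python
import math

# Synthetic samples from an exact power law H(f) = a * f**(-alpha).
a, alpha_true = 5.0, 0.43
freqs = [10 ** k for k in range(3, 9)]          # 1e3 .. 1e8 page views
rates = [a * f ** -alpha_true for f in freqs]

# Ordinary least squares on (log f, log H): the slope is -alpha.
xs = [math.log(f) for f in freqs]
ys = [math.log(r) for r in rates]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
print(f"alpha = {-slope:.2f}")                  # prints: alpha = 0.43
```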
4.2 Entity Type Modulation
Hallucination rate by type, controlling for frequency (F4 bin):
| Type | Hall. Rate | Relative to Location | p-value |
|---|---|---|---|
| Person | 31.8% | 2.31x | < 0.001 |
| Organization | 26.4% | 1.92x | < 0.001 |
| Event | 22.1% | 1.61x | 0.003 |
| Location | 13.7% | 1.00x | — |
Locations show the lowest hallucination rate because their core facts (coordinates, country, population) are more constrained and repeated more frequently in training data.
4.3 Temporal Recency Effect
| Recency | Hall. Rate (F4) | Relative to Stable |
|---|---|---|
| Stable | 19.2% | 1.00x |
| Dynamic | 34.8% | 1.81x |
| Post-cutoff | 78.6% | 4.09x |
Post-cutoff entities exhibit by far the highest hallucination rates because models generate outdated information with high confidence rather than abstaining.
4.4 Hallucination Typology
| Type | Common Entities (F1-F2) | Rare Entities (F5-F6) |
|---|---|---|
| Plausible confabulation | 18% | 64% |
| Confident fabrication | 71% | 21% |
| Hedged incorrect | 8% | 11% |
| Temporal confusion | 3% | 4% |
For common entities, the dominant failure is confident fabrication: the model "knows" the entity but misstates specific facts. For rare entities, it is plausible confabulation: the model blends the entity with a more common neighbor.
4.5 Model Comparison
| Model | Mean Hall. Rate | Power-Law Exponent (α) | Type Sensitivity (P/L) |
|---|---|---|---|
| GPT-4-Turbo | 18.4% | 0.41 | 2.08x |
| Claude-3-Opus | 16.7% | 0.39 | 1.94x |
| LLaMA-3-70B | 22.1% | 0.45 | 2.51x |
| Mixtral-8x7B | 26.3% | 0.48 | 2.73x |
| Qwen-2-72B | 20.8% | 0.43 | 2.19x |
All models follow similar power-law exponents (α between 0.39 and 0.48), suggesting this reflects a fundamental property of autoregressive pretraining rather than architecture-specific behavior.
5. The Entity Hallucination Profile (EHP)
We define the EHP as a three-dimensional risk map EHP(e) = (f(e), t(e), r(e)), where f is frequency, t is entity type, and r is temporal recency. Given a new entity e, its predicted hallucination rate is:

Ĥ(e) = H₀(f) · m_t · m_r

where H₀(f) is the baseline rate for frequency bin f, m_t ∈ {1.00, 1.61, 1.92, 2.31} for t ∈ {L, E, O, P} (Section 4.2), and m_r ∈ {1.00, 1.81, 4.09} for r ∈ {S, D, C} (Section 4.3).
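Assuming the baseline is the rate for a Location/Stable entity in the bin (so the L and S multipliers are 1.00), the multiplicative prediction can be sketched as follows. Only the F4 Location baseline (13.7%, Section 4.2) is published, so the example is limited to that bin; all names are illustrative.

```python
H0 = {"F4": 0.137}                                        # Location baseline, F4
TYPE_MULT = {"L": 1.00, "E": 1.61, "O": 1.92, "P": 2.31}  # Section 4.2
RECENCY_MULT = {"S": 1.00, "D": 1.81, "C": 4.09}          # Section 4.3

def predict_rate(freq_bin: str, etype: str, recency: str) -> float:
    """Predicted rate = baseline(frequency) * m_type * m_recency, capped at 1."""
    rate = H0[freq_bin] * TYPE_MULT[etype] * RECENCY_MULT[recency]
    return min(rate, 1.0)

# Sanity check against Section 4.2: a stable F4 person is predicted at
# 0.137 * 2.31 = 0.316, close to the observed 31.8%.
print(round(predict_rate("F4", "P", "S"), 3))
```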
Validation on a held-out set of 2,000 entities: predicted and actual hallucination rates are strongly correlated, with MAE = 4.2 percentage points.
6. Discussion
6.1 Targeted Mitigation
EHP enables targeted retrieval augmentation: instead of applying RAG uniformly (which adds latency and cost), systems can selectively augment queries about entities in high-risk regions of the EHP space (rare + person + post-cutoff).
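A gating policy of this kind is straightforward to sketch: invoke retrieval only when the EHP-predicted risk exceeds a threshold. The threshold value and function names below are assumptions for illustration, not from the paper.

```python
RISK_THRESHOLD = 0.25  # e.g. tolerate at most ~25% predicted hallucination

def should_retrieve(predicted_rate: float,
                    threshold: float = RISK_THRESHOLD) -> bool:
    """Gate retrieval augmentation on the EHP-predicted hallucination rate."""
    return predicted_rate >= threshold

# A rare, post-cutoff person sits in the high-risk region and triggers
# retrieval; a very common, stable location does not.
assert should_retrieve(0.79)
assert not should_retrieve(0.03)
```

In a production system the threshold would trade off retrieval latency and cost against tolerated hallucination risk, and could differ per application.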
6.2 Limitations
- Proxy frequency: Wikipedia page views are an imperfect proxy for training data frequency.
- English-only: Entity frequency distributions differ across languages.
- Binary hallucination: We don't grade hallucination severity.
- Static analysis: Models may reduce hallucination through retrieval or self-verification at inference time.
- No causal mechanism: We characterize the relationship statistically but don't explain why persons hallucinate more than locations at the representation level.
7. Conclusion
Hallucination rates follow a power law with entity frequency (), are modulated by entity type (persons 2.3x more than locations), and spike 4.1x for post-cutoff entities. The Entity Hallucination Profile provides a practical risk assessment tool. These findings argue against uniform hallucination mitigation and for targeted, entity-aware strategies.
References
[1] Z. Ji et al., "Survey of hallucination in natural language generation," ACM Computing Surveys, 2023.
[2] N. Kandpal et al., "Large language models struggle to learn long-tail knowledge," ICML, 2023.
[3] P. Lewis et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," NeurIPS, 2020.
[4] K. Mallen et al., "When not to trust language models: Investigating effectiveness of parametric and non-parametric memories," ACL, 2023.
[5] A. Min et al., "FActScore: Fine-grained atomic evaluation of factual precision," EMNLP, 2023.
[6] N. Muennighoff et al., "Scaling data-constrained language models," NeurIPS, 2023.
[7] S. Li et al., "Inference-time intervention: Eliciting truthful answers from a language model," NeurIPS, 2023.
[8] T. Sun et al., "Head-to-tail: How knowledgeable are large language models?," NAACL, 2024.