Token-Level Entropy as a Hallucination Predictor in Open-Ended Generation
1. Introduction
A persistent question for production LLMs is whether the model's internal uncertainty signal can be used, online and cheaply, to flag likely hallucinations. Several recent works [Kuhn et al. 2023, Manakul et al. 2023] examine sequence-level and self-consistency proxies. We focus instead on the simpler question: can per-token entropy alone do useful work?
Our contributions:
- A controlled study of entropy vs. hallucination on 6,820 labeled outputs.
- A negative result: raw entropy is too noisy on function words to be useful directly.
- A positive result: a role-gated estimator (ENTRO-G) closes much of the gap.
2. Background
For a token $i$ with conditional distribution $p(\cdot \mid x_{<i})$ over the vocabulary $V$, the entropy is

H_i = -\sum_{v \in V} p(v \mid x_{<i}) \log p(v \mid x_{<i})
Function words (e.g., articles, prepositions) tend to have moderate entropy even in well-grounded text, while content words referring to entities have lower entropy when the model is confident and high entropy when it is fabricating.
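This separation between confident and uncertain predictions can be made concrete with a minimal sketch; the distributions below are toy values, and the natural log matches the nats reported in Section 5:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's conditional distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A peaked distribution (a confident prediction) has low entropy...
confident = token_entropy([0.97, 0.01, 0.01, 0.01])
# ...while a uniform one attains the maximum, log(V).
unsure = token_entropy([0.25, 0.25, 0.25, 0.25])
```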
3. Data
We collected 6,820 outputs from three model families (a 7B-class, a 70B-class, and a frontier API model) on three task types: biographical QA, scientific summarization, and code-to-spec explanation. Each output was span-annotated by two reviewers (Cohen's $\kappa$) with the labels grounded, unverifiable, or hallucinated.
4. Method
4.1 Baseline
Mean entropy over a candidate span $s$:

\bar{H}(s) = \frac{1}{|s|} \sum_{i \in s} H_i
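The two raw-entropy baselines compared in the results (mean and max over a span) can be sketched as follows, assuming per-token entropies have already been computed; the span values are illustrative:

```python
def span_mean_entropy(token_entropies):
    # Baseline score: average per-token entropy over the span.
    return sum(token_entropies) / len(token_entropies)

def span_max_entropy(token_entropies):
    # Alternative baseline: the single most uncertain token in the span.
    return max(token_entropies)

span = [0.2, 2.9, 0.4]  # hypothetical per-token entropies (nats)
```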
4.2 ENTRO-G
We partition tokens into roles via a small POS+NER tagger. The gated estimator is
G(s) = \frac{\sum_{i \in s} w_{r_i} H_i}{\sum_{i \in s} w_{r_i}}
with $w_{\text{entity}} = 1.0$, $w_{\text{function}} = 0.0$, $w_{\text{other}} = 0.5$. Weights were chosen on a 20% dev split via grid search.
```python
def entro_g(token_entropies, roles):
    """Role-gated span entropy: weight each token's entropy by its role."""
    w = {"entity": 1.0, "function": 0.0, "other": 0.5}
    num = sum(w[r] * h for r, h in zip(roles, token_entropies))
    den = sum(w[r] for r in roles) or 1e-9  # guard against all-function spans
    return num / den
```

5. Results
5.1 Distributional shift
Mean per-token entropy was 2.31 nats on hallucinated spans vs. 0.89 nats on grounded spans (Mann-Whitney U test).
5.2 Detection performance
| Model | AUROC | F1 @ optimal threshold |
|---|---|---|
| Raw entropy (mean) | 0.71 | 0.62 |
| Raw entropy (max) | 0.69 | 0.59 |
| Self-consistency (5 samples) | 0.81 | 0.71 |
| ENTRO-G | 0.84 | 0.74 |
| ENTRO-G + self-consistency | 0.88 | 0.78 |
ENTRO-G improves AUROC by roughly 0.13 over raw mean entropy (0.71 → 0.84), at no extra inference cost beyond a lightweight tagger.
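The AUROC numbers above can be reproduced on any scored dataset with the standard rank-based estimator; a self-contained sketch (the labels and scores below are toy values, not from our data):

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive (hallucinated) span
    scores above a random negative (grounded) one; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 1 = hallucinated, 0 = grounded; scores are hypothetical
# ENTRO-G values, where higher should indicate hallucination.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.3, 0.9, 0.5, 1.8, 2.4, 0.7]
```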
6. Discussion
Entropy is a local signal and cannot detect hallucinations that arise from globally inconsistent but locally confident generation (e.g., a fluent fabricated citation). For these we recommend ENTRO-G as a first-stage filter upstream of a more expensive consistency check.
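This two-stage arrangement can be sketched as follows; `expensive_check` stands in for any downstream consistency check (e.g., multi-sample agreement), and the gate threshold is an illustrative value, not a tuned one:

```python
def flag_span(token_entropies, roles, expensive_check, gate_threshold=1.2):
    """Two-stage filter: a cheap ENTRO-G gate first, with the expensive
    check run only on spans that clear the gate."""
    w = {"entity": 1.0, "function": 0.0, "other": 0.5}
    num = sum(w[r] * h for r, h in zip(roles, token_entropies))
    den = sum(w[r] for r in roles) or 1e-9
    if num / den < gate_threshold:
        return False  # locally confident: skip the costly second stage
    return expensive_check()
```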
We also observed that on the 70B model, raw mean entropy was nearly useless, because the model is locally confident even when wrong. The role-gated variant, which weights entity tokens more heavily, partially compensates.
7. Limitations
Our tagger is itself imperfect (entity F1 = 0.91 on our data); errors propagate. We did not evaluate non-English text or code, where the role taxonomy needs revision.
8. Conclusion
Per-token entropy contains a real but limited hallucination signal. Filtering it through a simple role gate, however, makes it competitive with a 5-sample self-consistency check at a fraction of the cost. We believe ENTRO-G is a reasonable default for budget-constrained online filtering.
References
- Kuhn, L. et al. (2023). Semantic Uncertainty for LLM Generation.
- Manakul, P. et al. (2023). SelfCheckGPT.
- Lin, S. et al. (2024). Generating with Confidence.
- Honnibal, M. and Montani, I. (2020). spaCy 3.