Diagnostics for Hidden Test-Set Contamination in Large Language Models
1. Introduction
A benchmark provides useful signal only insofar as a model has not previously seen its items during training. As pretraining corpora grow and many benchmarks are publicly indexed, contamination is increasingly likely, and it is often invisible to evaluators who lack training-data access. This paper proposes and evaluates three black-box diagnostics that an outside party can run with API access alone.
Our claim is twofold: where contamination is present, it is detectable; and where it is detected, it is consequential.
2. Threat Model and Definitions
Let B denote a benchmark and D a model's training set. We say a benchmark item x ∈ B is contaminated if x, or a close paraphrase of x, appears in D. We do not assume access to D or to model weights; only black-box prompting at controlled temperature.
We distinguish strong contamination (verbatim presence) from weak contamination (paraphrastic presence). The diagnostics target both.
3. Diagnostics
D1 Order-Sensitivity Probe
For multiple-choice items, randomize the answer-choice order and measure the accuracy delta. Models that have memorized the canonical ordering tend to lose disproportionately. Concretely, define
Δ_order = Acc(canonical order) − Acc(shuffled order),
averaged over several random permutations per item. A delta exceeding a fixed threshold (measured in percentage points) flags suspicion; a sketch of the probe follows.
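A minimal sketch of the probe, assuming a hypothetical model.answer(question, choices) call that returns the chosen option (names are illustrative, not the paper's API):

import random

def order_sensitivity_delta(model, items, n_shuffles=5):
    # D1 sketch: accuracy with the canonical choice order minus mean accuracy over shuffled orders.
    canonical, shuffled = 0.0, 0.0
    for item in items:
        canonical += model.answer(item.question, item.choices) == item.gold
        hits = 0
        for _ in range(n_shuffles):
            perm = random.sample(item.choices, k=len(item.choices))  # a random reordering of the choices
            hits += model.answer(item.question, perm) == item.gold
        shuffled += hits / n_shuffles
    return (canonical - shuffled) / len(items)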
D2 Perturbation-Stability Probe
Apply meaning-preserving perturbations (number obfuscation, named-entity swap, syntactic paraphrase) and measure the accuracy delta. Memorized items often fail to track the perturbation:
Δ_perturb = Acc(original) − Acc(perturbed),
with items flagged when the delta exceeds a fixed threshold. A sketch follows.
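A sketch of the probe under the same illustrative assumptions, with hypothetical perturbation functions that each return a rewritten copy of an item (the perturbations themselves are task-specific and not shown):

def perturbation_stability_delta(model, items, perturbations):
    # D2 sketch: accuracy drop when items are rewritten in meaning-preserving ways.
    orig, pert = 0.0, 0.0
    for item in items:
        orig += model.solve(item) == item.gold
        variants = [p(item) for p in perturbations]  # e.g. number obfuscation, entity swap, paraphrase
        pert += sum(model.solve(v) == v.gold for v in variants) / len(variants)
    return (orig - pert) / len(items)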
D3 Canary-Completion Probe
Given the first k tokens of a benchmark item, ask the model to continue. Score the overlap between the continuation and the actual remainder. High overlap on items not present in any plausible non-benchmark context implies memorization. Let r be the longest-common-subsequence (LCS) ratio between the continuation and the remainder; flag items with r above a fixed threshold.
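The score relies on an LCS-ratio helper. A minimal sketch over whitespace tokens (the choice of tokenization here is an assumption on our part) precedes the canary_score routine itself:

def lcs_ratio(a, b):
    # Longest common subsequence of whitespace tokens, normalized by the length of b.
    x, y = a.split(), b.split()
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xi == yj else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(x)][len(y)] / max(len(y), 1)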
def canary_score(model, item, k=12):
    # D3: prompt with the first k tokens and score how much of the true remainder the model reproduces.
    tokens = tokenize(item.text)
    prefix, remainder = tokens[:k], tokens[k:]
    cont = model.generate(detokenize(prefix), max_new_tokens=64, temperature=0.0)
    return lcs_ratio(cont, detokenize(remainder))
4. Method
We applied the diagnostics to 11 benchmarks (MMLU, MMLU-Pro, GSM8K, MATH, ARC-Challenge, HumanEval, BBH, TruthfulQA, GPQA, HellaSwag, MMMU-mini) across 6 model releases drawn from 4 families. For each (benchmark, model) cell we ran all three diagnostics over a fixed sample of items.
A benchmark item was labeled diagnostic-positive if at least two of the three diagnostics fired at their respective thresholds, as sketched below.
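A minimal sketch of the 2-of-3 vote, assuming per-item diagnostic scores and thresholds are already computed (names are illustrative):

def diagnostic_positive(scores, thresholds):
    # Flag an item if at least two of the three diagnostics exceed their thresholds.
    fired = sum(scores[d] > thresholds[d] for d in ("D1", "D2", "D3"))
    return fired >= 2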
5. Results
Diagnostic-Positive Rates
Diagnostic-positive rates ranged from 4.1% (a recent GPQA release on a smaller model) to 73.6% (an older MCQ benchmark on a flagship model). Rates by benchmark (median across models):
- MMLU: 38.2%
- HellaSwag: 73.6%
- ARC-Challenge: 41.0%
- HumanEval: 27.4%
- GSM8K: 19.3%
- MATH: 12.8%
- GPQA: 6.7%
Score Correction
We estimated a contamination-corrected score by reweighting accuracy on diagnostic-negative items. On the most-affected benchmark, the headline accuracy of one flagship model fell from 0.892 to 0.744 (a drop of 14.8 points). On low-contamination benchmarks the correction was within 0.6 points and statistically indistinguishable from zero.
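Under the simplest reading of the reweighting, the corrected score is accuracy over diagnostic-negative items only; a sketch follows (the paper's exact weighting scheme may differ):

def corrected_score(results):
    # results: iterable of (is_correct, is_diagnostic_positive) pairs, one per item.
    clean = [correct for correct, flagged in results if not flagged]
    return sum(clean) / len(clean) if clean else float("nan")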
Validation Against Known Contamination
For a small subset of items where we had ground-truth contamination labels (via cooperative disclosure from one model provider), our diagnostic battery achieved an AUROC of 0.84.
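One way to reproduce such an AUROC, assuming the per-item battery score is simply the number of diagnostics that fired (an assumption on our part) and using scikit-learn:

from sklearn.metrics import roc_auc_score

def battery_auroc(fired_counts, provider_labels):
    # fired_counts: diagnostics fired per item (0-3); provider_labels: 1 if the provider confirmed contamination.
    return roc_auc_score(provider_labels, fired_counts)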
6. Discussion
No black-box diagnostic can be airtight. A sufficiently large model may memorize an item without exhibiting any of our three signatures; conversely, a low-temperature, high-prior continuation can fire the canary probe spuriously. We treat the diagnostics as evidence-producing, not adjudicating: a flagged item warrants a closer look, not automatic exclusion.
A structural fix that the diagnostics cannot replace: rotating, version-controlled benchmarks with cryptographically attested time-of-construction. Diagnostics are a stopgap.
7. Limitations
The perturbation probe assumes our perturbations are truly meaning-preserving; for some technical items they may not be, biasing toward false positives. The order probe is inapplicable to open-ended tasks. Our threshold choices are calibrated on a held-out set and may need re-tuning across model families.
8. Conclusion
Hidden test-set contamination silently distorts the field's measurement of progress. Black-box diagnostics can surface a substantial fraction of it. We release a diagnostic kit and recommend that benchmark leaderboards report contamination-corrected scores alongside raw accuracy.