Benchmark Contamination Detection via Membership Inference on Training Gradient Residuals
Abstract
Benchmark contamination inflates reported model performance. We propose Gradient Residual Membership Inference (GRMI), which detects contamination by analyzing a model's gradient response to test examples. GRMI achieves 94.7% AUROC for verbatim contamination and 78.3% for paraphrased contamination, outperforming existing methods. Auditing 40 model-benchmark pairs reveals evidence of contamination in 23 (57.5%).
1. Introduction
The integrity of language model evaluation depends on the separation between training and test data. When benchmark examples leak into pretraining corpora—through web scraping, data augmentation, or deliberate inclusion—the resulting contamination inflates performance metrics and renders model comparisons meaningless [1, 2].
Detecting contamination is challenging because:
- Most models do not disclose their training data.
- Paraphrased contamination (where the content is preserved but the wording differs) evades exact-match detection.
- Output-level signals (low perplexity, verbatim completion) are noisy and model-dependent.
We propose a fundamentally different approach: analyzing the model's gradient response to test examples. A contaminated example should produce an abnormally small gradient (the model has already optimized for it) with a distinctive spectral structure (it occupies a well-learned subspace of the parameter space).
2. Gradient Residual Membership Inference
2.1 Gradient Norm Analysis
For a model $f_\theta$ and input $x$ with target $y$, the gradient of the loss is:

$$\mathbf{g}(x) = \nabla_\theta \mathcal{L}(f_\theta(x), y)$$

We compute the gradient norm $\|\mathbf{g}(x)\|_2$ and compare it to control examples $\{x_j\}_{j=1}^{M}$ drawn from the same distribution as $x$ but not present in any benchmark.

The Contamination Likelihood Ratio is:

$$\mathrm{CLR}(x) = \frac{\|\mathbf{g}(x)\|_2}{\frac{1}{M}\sum_{j=1}^{M} \|\mathbf{g}(x_j)\|_2}$$

A low CLR ($\mathrm{CLR} \ll 1$) indicates the model has already learned $x$, suggesting contamination.
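As a minimal sketch of the CLR computation, consider a toy quadratic model whose gradient has a closed form; the toy model, example data, and all names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def grad_norm(theta, x, y):
    # Gradient of the squared loss 0.5 * (theta @ x - y)**2 w.r.t. theta
    # is (theta @ x - y) * x; return its L2 norm.
    return np.linalg.norm((theta @ x - y) * x)

def clr(theta, x, y, controls):
    # CLR(x) = ||g(x)||_2 / mean_j ||g(x_j)||_2 over control examples.
    mean_control = np.mean([grad_norm(theta, xj, yj) for xj, yj in controls])
    return grad_norm(theta, x, y) / mean_control

rng = np.random.default_rng(0)
d = 16
theta_star = rng.normal(size=d)                  # parameters the model converged to
theta = theta_star + 0.01 * rng.normal(size=d)   # trained model, near that optimum

# A "contaminated" example: its label fits the trained model almost exactly,
# so the residual gradient is tiny.
x_mem = rng.normal(size=d)
y_mem = theta_star @ x_mem

# Control examples with labels the model never fit:
controls = [(rng.normal(size=d), rng.normal()) for _ in range(64)]

print(clr(theta, x_mem, y_mem, controls))  # far below 1: memorized example
```

Note that by construction the mean CLR over the control set itself is exactly 1, so the score is self-calibrating against the control distribution.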
2.2 Spectral Signature
Beyond the gradient norm, contaminated examples exhibit a distinctive spectral signature. We compute the gradient matrix $\mathbf{G}(x) \in \mathbb{R}^{p \times q}$ (reshaped from the flattened gradient) and its singular value decomposition:

$$\mathbf{G}(x) = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top$$

The Spectral Contamination Score (SCS) is defined as:

$$\mathrm{SCS}(x) = \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i} \sigma_i^2}$$

where $\sigma_1 \geq \sigma_2 \geq \cdots$ are the singular values and $k$ is a small cutoff. Contaminated examples show higher SCS (more concentrated spectral energy) because they occupy low-dimensional, well-learned subspaces.

The combined GRMI score is:

$$\mathrm{GRMI}(x) = \alpha\,\bigl(1 - \mathrm{CLR}(x)\bigr) + (1 - \alpha)\,\mathrm{SCS}(x)$$

with the weight $\alpha$ determined on held-out validation data.
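A sketch of the SCS and the combined score, assuming the top-$k$ spectral-energy form of SCS and a convex combination with weight `alpha` (function names and the synthetic low-rank/full-rank matrices are illustrative):

```python
import numpy as np

def scs(grad_matrix, k=5):
    # Spectral Contamination Score: fraction of squared spectral energy
    # captured by the k largest singular values.
    s = np.linalg.svd(grad_matrix, compute_uv=False)  # sorted descending
    energy = s ** 2
    return energy[:k].sum() / energy.sum()

def grmi(clr_value, scs_value, alpha=0.5):
    # Combined score: low CLR and high SCS both push GRMI toward 1.
    return alpha * (1.0 - clr_value) + (1.0 - alpha) * scs_value

rng = np.random.default_rng(1)
# Rank-2 gradient matrix (contamination-like: energy in a few directions)
low_rank = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64))
# Full-rank Gaussian matrix (clean-like: energy spread across the spectrum)
full_rank = rng.normal(size=(64, 64))

print(scs(low_rank), scs(full_rank))  # low-rank matrix concentrates energy
print(grmi(clr_value=0.2, scs_value=0.9))
```

The rank-2 matrix yields an SCS of essentially 1.0 (all energy in its first two singular values), while the full-rank matrix spreads energy across all 64, matching the intuition that well-learned examples live in low-dimensional subspaces.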
3. Controlled Experiments
3.1 Setup
We pretrain three models (1.3B, 7B, 13B) on a curated corpus with controlled contamination:
- Clean: 500M tokens, no benchmark data
- Verbatim: Clean + verbatim copies of 1000 GSM8K examples
- Paraphrased: Clean + GPT-4 paraphrases of the same 1000 examples
- Indirect: Clean + solutions from the same domain but different specific problems
3.2 Detection Results
| Method | Verbatim AUROC | Paraphrased AUROC | Indirect AUROC |
|---|---|---|---|
| Perplexity ratio [3] | 0.881 | 0.524 | 0.503 |
| Min-K% Prob [4] | 0.862 | 0.571 | 0.512 |
| Completion prefix | 0.843 | 0.489 | 0.501 |
| GRMI (ours) | 0.947 | 0.783 | 0.614 |
GRMI outperforms all baselines, with the largest gain on paraphrased contamination (+21.2 points over the best baseline).
3.3 Scale Dependence
| Model Size | CLR (contaminated) | CLR (clean) | AUROC |
|---|---|---|---|
| 1.3B | 0.42 ± 0.18 | 0.98 ± 0.21 | 0.912 |
| 7B | 0.31 ± 0.14 | 0.97 ± 0.19 | 0.947 |
| 13B | 0.24 ± 0.11 | 0.96 ± 0.17 | 0.968 |
Detection improves with model scale because larger models show stronger memorization (lower CLR for contaminated examples).
4. Audit of Public Benchmarks
We apply GRMI to audit 5 benchmarks against 8 open-weight models:
| Model | GSM8K | MATH | HumanEval | MMLU | HellaSwag |
|---|---|---|---|---|---|
| LLaMA-2-7B | 0.31 | 0.18 | 0.42 | 0.22 | 0.28 |
| LLaMA-2-13B | 0.38 | 0.21 | 0.51 | 0.29 | 0.34 |
| LLaMA-3-8B | 0.52 | 0.34 | 0.61 | 0.38 | 0.41 |
| LLaMA-3-70B | 0.67 | 0.48 | 0.73 | 0.44 | 0.52 |
| Mistral-7B | 0.29 | 0.15 | 0.38 | 0.19 | 0.24 |
| Qwen-2-7B | 0.44 | 0.31 | 0.55 | 0.33 | 0.37 |
| Phi-3-mini | 0.58 | 0.42 | 0.68 | 0.41 | 0.48 |
| Gemma-7B | 0.35 | 0.22 | 0.44 | 0.26 | 0.31 |
Values show fraction of benchmark examples flagged as potentially contaminated (GRMI > 0.7).
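Given per-example GRMI scores, the table's entries reduce to a thresholded fraction, sketched here (`flagged_fraction` is an illustrative helper, not released code):

```python
import numpy as np

def flagged_fraction(grmi_scores, threshold=0.7):
    # Fraction of benchmark examples whose GRMI score exceeds the flag threshold.
    scores = np.asarray(grmi_scores, dtype=float)
    return float((scores > threshold).mean())

print(flagged_fraction([0.9, 0.5, 0.8, 0.2]))  # 0.5
```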
Key findings:
- GSM8K and HumanEval show the highest contamination rates, consistent with their widespread availability on the web.
- Newer models (LLaMA-3, Phi-3) show higher contamination, likely due to larger and more recent web crawls.
- MMLU shows moderate contamination despite its size, possibly because individual questions are short and appear in many contexts.
5. Discussion
5.1 Implications
Our finding that 57.5% of model-benchmark pairs show evidence of contamination raises serious concerns about the validity of current leaderboard rankings. The contamination rate has increased across model generations, suggesting that the problem is worsening as models train on larger web crawls.
5.2 Limitations
Gradient access: GRMI requires computing gradients, which is possible only for open-weight models. A distillation-based approximation for closed-source models is left to future work.
Control distribution: The quality of detection depends on having good control examples. Distribution mismatch between controls and test examples can produce false positives.
Indirect contamination: GRMI achieves only 61.4% AUROC for indirect contamination (related but not identical examples), limiting its utility for detecting "soft" contamination.
Computational cost: Computing gradients for all benchmark examples is expensive (~4 GPU-hours for 1000 examples at 7B scale).
No ground truth for public models: Our audit results are probabilistic flags, not confirmed contamination. Without access to training data manifests, false positives cannot be ruled out.
6. Conclusion
GRMI provides a principled, gradient-based approach to benchmark contamination detection that outperforms existing output-level methods, particularly for paraphrased contamination. Our audit of 40 model-benchmark pairs finds evidence of contamination in 57.5%, with GSM8K and HumanEval most affected. We recommend that model developers report GRMI scores alongside benchmark results and that benchmark designers periodically update test sets to mitigate contamination.
References
[1] I. Magar and R. Schwartz, "Data contamination: From memorization to exploitation," ACL, 2022.
[2] Y. Oren et al., "Proving test set contamination in black box language models," ICLR, 2024.
[3] N. Carlini et al., "Quantifying memorization across neural language models," ICLR, 2023.
[4] W. Shi et al., "Detecting pretraining data from large language models," ICLR, 2024.
[5] A. Jacovi et al., "Stop uploading test data in plain text," EMNLP, 2023.
[6] C. Dodge et al., "Fine-tuning, quantization, and LLMs: Navigating unintended outcomes," arXiv:2404.04392, 2024.
[7] T. Deng et al., "Investigating data contamination in modern benchmarks for large language models," NAACL, 2024.
[8] S. Golchin and M. Surdeanu, "Time travel in LLMs: Tracing data contamination in large language models," ICLR, 2024.