Reproducibility Risks in LLM-Generated Code Patches
1. Introduction
LLM coding agents are increasingly evaluated on benchmarks such as SWE-bench [Jimenez et al. 2024] and LiveCodeBench [Jain et al. 2024]. A patch that passes on the evaluator's machine is taken as evidence the agent succeeded. But how often does the same patch pass when re-run elsewhere? We attempt a careful, large-scale answer.
2. Background and Threat Model
Let $p$ be a patch produced by an agent against a target repository at commit $c$. Let $T(p, E)$ denote the test outcome under environment $E$. The benchmark reports $T(p, E_0)$ for the evaluator's environment $E_0$. Reproducibility asks: is $T(p, E') = T(p, E_0)$ for an independent evaluator's environment $E' \neq E_0$?
We focus on Python, JavaScript, and Go ecosystems, where dependency resolution is dynamic and time-dependent.
3. Method
3.1 Re-evaluation harness
For each patch in our corpus we:
- Spin up a fresh container from the original Dockerfile.
- Apply the patch.
- Run the prescribed test command.
- Compare against the benchmark's recorded outcome.
We ran $k$ independent re-evaluations per patch, in environments $E_1, \dots, E_k$ spanning different host machines and staggered wall-clock times.
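Concretely, one re-evaluation of one patch looks roughly like the following (a minimal sketch: the `meta.json` fields, file layout, and image handling are placeholders for our corpus format, and the image's working directory is assumed to be the repository checkout):

```bash
#!/usr/bin/env bash
# Sketch of one re-evaluation of one patch; names and paths are illustrative.
set -euo pipefail
PATCH_DIR="$1"                                      # contains patch.diff, meta.json
IMAGE=$(jq -r .image_tag "$PATCH_DIR/meta.json")    # image built from the original Dockerfile
TEST_CMD=$(jq -r .test_cmd "$PATCH_DIR/meta.json")  # prescribed test command
EXPECTED=$(jq -r .expected "$PATCH_DIR/meta.json")  # benchmark's recorded outcome: pass/fail

docker build -t "$IMAGE" "$PATCH_DIR/context"       # fresh image from the original Dockerfile
CID=$(docker run -d "$IMAGE" sleep infinity)
docker cp "$PATCH_DIR/patch.diff" "$CID:/tmp/patch.diff"
docker exec "$CID" git apply /tmp/patch.diff        # apply the agent's patch

if docker exec "$CID" sh -c "$TEST_CMD"; then ACTUAL=pass; else ACTUAL=fail; fi
docker rm -f "$CID" >/dev/null

[ "$ACTUAL" = "$EXPECTED" ] && echo reproduced || echo mismatch
```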
3.2 Reproducibility metric
Define the reproduction rate of a patch $p$ over $k$ re-runs as

$$ r(p) = \frac{1}{k} \sum_{i=1}^{k} \mathbf{1}\!\left[\, T(p, E_i) = T(p, E_0) \,\right]. $$

A patch is reproducible if $r(p) = 1$, flaky if $0 < r(p) < 1$, and broken if $r(p) = 0$.
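As an illustration (numbers chosen for exposition, not drawn from the corpus): a patch whose outcome matches the recorded one in 7 of $k = 10$ re-runs has

$$ r(p) = \frac{7}{10} = 0.7, $$

so it is classified as flaky; only $r(p) = 1$ counts as reproducible.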
4. Results
Across 2,318 patches from 7 public benchmarks:
- 71.4% reproducible
- 14.2% flaky
- 14.4% broken
4.1 Failure-mode taxonomy
| Cause | Share of failures |
|---|---|
| Implicit env (locale, TZ, $PATH) | 31.2% |
| Dependency drift (pip, npm) | 27.8% |
| Time / date sensitivity | 14.6% |
| Test ordering | 11.0% |
| Network reliance | 9.4% |
| Other | 6.0% |
Dependency drift is concentrated in patches dated more than 30 days before our re-run; a Cox-like hazard fit yields a half-life of approximately 47 days for pip-based projects.
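To make the half-life concrete (a back-of-envelope reading that assumes a constant hazard rate, which the Cox-like fit does not require):

$$ \lambda = \frac{\ln 2}{t_{1/2}} \approx \frac{0.693}{47\ \text{days}} \approx 0.0147\ \text{day}^{-1}, \qquad S(t) = e^{-\lambda t}, $$

so the probability that a pip-based patch still resolves to its original dependency set is roughly $S(30) \approx 0.64$ after 30 days and $S(90) \approx 0.27$ after 90 days.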
4.2 The repro-shim
We propose a repro-shim that wraps test execution and pins:
- `TZ=UTC`, `LANG=C.UTF-8`, and a frozen `$PATH`
- A read-only mirror of the dependency index at patch creation time (see the index-pinning sketch below)
- `faketime` [Lampel 2013] to fix the wall clock to the benchmark's recorded date
- `PYTHONHASHSEED=0` and per-language deterministic seeds
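For the dependency mirror, one possible setup (a sketch: the mirror URL is hypothetical, and any PEP 503 "simple" index snapshotted at the patch date, e.g. via devpi or bandersnatch, would serve):

```bash
# Point pip at a read-only index snapshot taken at the patch's creation date.
export PIP_INDEX_URL="https://mirror.internal/simple/${BENCH_DATE%% *}/"  # date part of BENCH_DATE
export PIP_NO_CACHE_DIR=1   # don't reuse wheels resolved against a newer index
pip install -r requirements.txt
```

The remaining pins are applied by the shim itself: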
```bash
#!/usr/bin/env bash
# repro-shim: run a test command under TZ/locale/time/seed pinning.
# BENCH_DATE and FROZEN_PATH are supplied by the harness.
# Note: env -i must come first; running it after faketime would strip the
# LD_PRELOAD that faketime needs to reach the wrapped command.
env -i HOME="$HOME" PATH="$FROZEN_PATH" \
    TZ=UTC LANG=C.UTF-8 PYTHONHASHSEED=0 \
    faketime "$BENCH_DATE" "$@"
```

With the shim in place, the broken+flaky rate drops from 28.6% to 4.1% (n = 2,318; McNemar's test). Median wall-clock overhead is 6.8%.
5. Discussion and Limitations
Our method cannot disentangle agent-introduced flakiness from flakiness already present in the upstream test suite. We mitigated this by also re-running the unpatched baseline; baseline flakiness explains roughly one third of observed flakiness at the patch level.
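One way to implement the attribution (a sketch; file names and the one-test-id-per-line format are hypothetical):

```bash
# A test is baseline-flaky if its outcome differs across the k unpatched
# baseline runs; each baseline_run_*.txt holds "test_id outcome" lines.
sort -u baseline_run_*.txt | cut -d' ' -f1 | sort | uniq -d > baseline_flaky_ids.txt
# Flaky-under-patch tests that are also baseline-flaky count as inherited.
grep -F -x -f baseline_flaky_ids.txt patched_flaky_ids.txt > inherited_ids.txt
```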
We did not evaluate Java or C++ projects; their build systems present different reproducibility hazards (e.g. timestamped artifacts in JARs).
6. Conclusion
A non-trivial fraction of LLM-generated patches reported as successful fail under independent re-evaluation. The cause is rarely a logic error in the patch; more often it is environmental coupling. A small shim removes most of the gap, and we recommend that benchmark organizers adopt it.
References
- Jimenez, C. et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- Jain, N. et al. (2024). LiveCodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code.
- Lampel, J. (2013). libfaketime documentation.
- Pinto, G. et al. (2020). A Large-Scale Study on Test Flakiness.