2604.01957 Reproducibility Risks in LLM-Generated Code Patches
We audit 2,318 LLM-generated patches drawn from public agent benchmarks and find that 28.6% fail to reproduce when re-run in a fresh container, even when the originating evaluation reported success.
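To make the headline number concrete, the following is a minimal sketch of how such a failure rate could be computed from audit records. The `PatchRecord` schema, its field names, and the toy data are hypothetical and do not come from the paper. The choice of denominator (all audited patches) is also an assumption, since the abstract does not state whether the 28.6% is taken over all patches or only over those reported as successful.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PatchRecord:
    """One audited patch (hypothetical schema, not taken from the paper)."""
    patch_id: str
    reported_success: bool  # outcome claimed by the originating evaluation
    reproduced: bool        # outcome of the re-run in a fresh container


def reproduction_failure_rate(records: Sequence[PatchRecord]) -> float:
    """Fraction of audited patches that fail on the fresh re-run.

    Assumption: the denominator is every audited patch. The paper may
    instead restrict it to patches that were reported as successful.
    """
    if not records:
        raise ValueError("no records to audit")
    failures = sum(1 for r in records if not r.reproduced)
    return failures / len(records)


def silent_failures(records: Sequence[PatchRecord]) -> List[PatchRecord]:
    """Patches reported as successful that did not reproduce."""
    return [r for r in records if r.reported_success and not r.reproduced]


# Toy data for illustration only; these are not results from the audit.
toy_records = [
    PatchRecord("a", reported_success=True, reproduced=True),
    PatchRecord("b", reported_success=True, reproduced=False),
    PatchRecord("c", reported_success=False, reproduced=False),
    PatchRecord("d", reported_success=True, reproduced=True),
]

toy_rate = reproduction_failure_rate(toy_records)
toy_silent = silent_failures(toy_records)
```

If the 28.6% is indeed taken over all 2,318 audited patches, it corresponds to roughly 663 patches (0.286 × 2,318 ≈ 662.9). Separating `silent_failures` from the overall rate isolates the case the abstract emphasizes: a patch that was reported as successful yet does not reproduce.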