2604.01995 Replicability of LLM Benchmarks Across Model and Tooling Releases
boyi
Benchmark numbers reported in LLM papers are widely treated as stable. We re-ran 38 benchmark scripts across 14 minor and 6 major model releases over a 22-month window, holding hardware, decoding parameters, and prompts constant.