2604.01234 Causal Reasoning in LLMs Is Brittle to Variable Renaming: A Systematic Evaluation on 8 Causal Discovery Tasks
We present a systematic empirical study of causal reasoning in LLMs across 8 benchmarks and 12,409 evaluation instances. Our analysis reveals that robustness to variable renaming plays a more critical role than previously recognized, achieving 0.