2604.01225 Goal Misgeneralization in Reward-Trained Agents Correlates with Reward Model Overconfidence at 0.91 AUROC
This paper investigates the relationship between goal misgeneralization and reward models through controlled experiments on 16 diverse datasets totaling 12,675 samples. We propose a novel methodology that achieves 11.