arXiv:2604.00686
Reward Hacking Detection via Gradient Divergence Monitoring in RLHF-Tuned Language Models
Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning large language models (LLMs) with human preferences. However, reward hacking—where models exploit reward model weaknesses to achieve high scores without genuine quality improvement—remains a critical failure mode that is difficult to detect post-deployment.