Influence-Function Diagnostics for Reward Models in RLHF
1. Introduction
Reward models in RLHF are trained on preference pairs via a Bradley-Terry log-likelihood. When the reward model misbehaves — assigning high reward to a known-bad response, or refusing to discriminate between completions of obviously different quality — the standard remedies are to add more data, change the loss, or change the architecture. A more direct question is: which training examples are responsible for this misbehavior?
Influence functions [Koh & Liang 2017] answer that question for differentiable models, in principle. In practice they are dominated by the cost of inverting (or implicitly inverting) the Hessian of the training loss. We give a low-rank Gauss-Newton approximation specific to the BT reward head that brings the per-example cost to a few seconds on a 7B-parameter model.
2. Setup
Let $r_\theta$ be a reward model with parameters $\theta$. The training loss on pair $z_i = (x_i, y_i^+, y_i^-)$ is
$$\ell_i(\theta) = -\log \sigma\big(r_\theta(x_i, y_i^+) - r_\theta(x_i, y_i^-)\big).$$
The influence of training point $z_i$ on a test query $z_q$ is approximately
$$\mathcal{I}(z_i, z_q) \approx -\nabla_\theta \ell(z_q, \hat\theta)^\top H^{-1} \nabla_\theta \ell(z_i, \hat\theta),$$
where $H$ is the Hessian of the training loss at the trained parameters $\hat\theta$.
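For concreteness, a minimal numpy sketch of the pairwise loss above, assuming the margin $\Delta_i$ has already been computed from two reward-model forward passes (the function name is ours):

```python
import numpy as np

def bt_pair_loss(delta):
    # Bradley-Terry loss -log sigmoid(delta) for margin delta = r(x, y+) - r(x, y-).
    # logaddexp(0, -delta) is the numerically stable form of log(1 + exp(-delta)).
    return np.logaddexp(0.0, -delta)
```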
3. Low-Rank Gauss-Newton Approximation
The Bradley-Terry loss has a clean Gauss-Newton structure:
$$H \approx G = \sum_i w_i \, g_i g_i^\top,$$
where $w_i = \sigma(\Delta_i)\big(1 - \sigma(\Delta_i)\big)$ and $g_i = \nabla_\theta \Delta_i$, with margin $\Delta_i = r_\theta(x_i, y_i^+) - r_\theta(x_i, y_i^-)$.
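As a dense reference implementation (for intuition only: at $p$ in the billions this matrix is never materialized, which is what motivates the projection below; `grads` holding the margin gradients $g_i$ as rows is our assumed layout):

```python
import numpy as np

def gauss_newton_dense(grads, deltas):
    # G = sum_i w_i g_i g_i^T with w_i = sigmoid(delta_i) * (1 - sigmoid(delta_i))
    s = 1.0 / (1.0 + np.exp(-deltas))
    return (grads * (s * (1.0 - s))[:, None]).T @ grads
```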
We take a random projection $\Pi \in \mathbb{R}^{k \times p}$ with $k \ll p$ and form the projected Gauss-Newton matrix $\tilde G = \Pi G \Pi^\top \in \mathbb{R}^{k \times k}$, which is cheap to invert. The resulting estimator $-(\Pi \nabla_\theta \ell_q)^\top \tilde G^{-1} (\Pi \nabla_\theta \ell_i)$ is a Johnson-Lindenstrauss-style approximation to $\mathcal{I}(z_i, z_q)$ with relative error $O(\sqrt{(\log n)/k})$ in the typical regime.
Memory drops from $O(p^2)$ to $O(kp)$: for a 7B-parameter model with our choice of $k$, this is roughly 7 GB rather than the prohibitive 50 PB required for the dense Hessian.
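A minimal sketch of the construction, assuming a Gaussian $\Pi$ and a small damping term $\lambda I$ for invertibility (both choices are ours; the text above does not pin them down):

```python
import numpy as np

def build_projected_gn(grad_stream, deltas, p, k, lam=1e-3, seed=0):
    # Pi: k x p Gaussian JL projection, scaled so E[Pi^T Pi] = I.
    rng = np.random.default_rng(seed)
    Pi = rng.standard_normal((k, p)) / np.sqrt(k)
    G_proj = lam * np.eye(k)  # damped projected Gauss-Newton matrix
    for g, delta in zip(grad_stream, deltas):  # stream margin gradients one pair at a time
        s = 1.0 / (1.0 + np.exp(-delta))
        pg = Pi @ g                                 # k-dim projected gradient
        G_proj += s * (1.0 - s) * np.outer(pg, pg)
    return Pi, np.linalg.inv(G_proj)
```

The returned pair feeds directly into the `projected_influence` helper at the end of section 4.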
4. Experiments
4.1 Setup
We apply the diagnostic to three reward-model checkpoints:
- RM-A: 7B model, 280K preference pairs, in-distribution helpfulness.
- RM-B: 7B model, the 280K pairs above plus 50K hard coding-task pairs.
- RM-C: 1.5B model, 95K pairs (older internal run).
For each model we compute the influence of each training pair on a held-out evaluation set of 4,000 prompts.
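The text above does not spell out how per-pair scores are aggregated over the 4,000 prompts; one plausible sketch, assuming a summed query gradient, is:

```python
import numpy as np

def influence_on_eval_set(train_grads_proj, eval_grads_proj, H_proj_inv):
    # Rows of both gradient matrices are already projected (Pi @ grad).
    # Score each training pair by its total influence across eval queries.
    q = eval_grads_proj.sum(axis=0) @ H_proj_inv   # aggregate query direction, k-dim
    return -train_grads_proj @ q                   # one signed score per training pair
```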
4.2 Variance concentration
In all three runs, the top 1% of training pairs by absolute influence account for 25-29% of the variance in evaluation-time reward predictions. The top 0.1% accounts for 7-9%.
4.3 Influential outliers
Manual inspection of the top-100 most-influential training pairs reveals systematic failure modes: in RM-A, 12 of the top 100 pairs have $y^+$ and $y^-$ that are nearly identical (cosine similarity $> 0.97$ in embedding space) and were apparently mis-labeled by automated tooling. In RM-B, 7 of the top 100 are pairs where the "chosen" response is clearly worse on close reading.
4.4 Targeted ablation
We retrain RM-A after removing the 1,150 highest-|influence| pairs (0.4% of the dataset). On a held-out adversarial set of 600 prompts, the new model eliminates 2 of 3 regressions identified in the original RM-A:
| Adversarial subset | Original RM-A | After ablation |
|---|---|---|
| Code-deception probes | 64.1% | 73.8% |
| Subtle-refusal probes | 71.3% | 78.4% |
| Sycophancy probes | 58.7% | 60.1% |
The sycophancy regression persists, consistent with the finding of [Sharma et al. 2024] that sycophancy is broadly distributed across the training data rather than concentrated in a few pairs.
In code, the per-example estimator reduces to a few matrix-vector products:

```python
def projected_influence(grad_q, grad_train, H_proj_inv, Pi):
    # -(Pi g_q)^T G_proj^{-1} (Pi g_i): projected influence of one training pair on one query
    return -float(grad_q @ Pi.T @ H_proj_inv @ Pi @ grad_train)
```

5. Limitations
- The Gauss-Newton approximation drops second-derivative terms; in regions where the BT logits are far from $0$, the weighting $w_i$ damps these terms but does not eliminate them.
- Influence is a local notion: it estimates the effect of an infinitesimal perturbation around the trained model, not the effect of dropping a point and retraining from scratch. The latter can differ by an order of magnitude on heavy-tailed loss landscapes [Bae et al. 2022].
- Random projection variance becomes nontrivial for small $k$; we recommend choosing $k$ comfortably above the Johnson-Lindenstrauss minimum for the desired error.
6. Conclusion
A tractable per-example influence diagnostic is feasible for modern reward models, fits into a standard training pipeline, and surfaces concrete data-quality issues that are otherwise invisible. The 0.4%-removal experiment suggests substantial returns on relatively small data interventions when guided by influence rather than random sampling.
References
- Koh, P. W., & Liang, P. (2017). Understanding black-box predictions via influence functions.
- Bae, J. et al. (2022). If influence functions are the answer, then what is the question?
- Sharma, M. et al. (2024). Towards understanding sycophancy in language models.
- Pruthi, G. et al. (2020). Estimating training data influence by tracing gradient descent.
- Achiam, J. et al. (2023). GPT-4 Technical Report.