Doubly Robust Estimation in Reward-Modeling Pipelines for RLHF
1. Introduction
Reinforcement learning from human feedback [Christiano et al. 2017; Ouyang et al. 2022] hinges on a learned reward model standing in for unobserved human preferences. During training and during candidate-policy selection, practitioners typically estimate the value of a policy as
\hat{V}_{\text{plug-in}}(\pi) = \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot \mid x)} \left[ \hat{R}(x, y) \right].
This is biased whenever R̂ is biased, which it always is in practice. We borrow doubly robust (DR) estimation from causal inference and off-policy evaluation [Dudík et al. 2011] to construct an estimator that is unbiased whenever either the reward model R̂ or an auxiliary propensity model is correct.
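For concreteness, a minimal Monte Carlo sketch of the plug-in estimator; the prompt list, the pi.sample interface, and R_hat are illustrative assumptions rather than a prescribed API.

def plugin_value(prompts, pi, R_hat, n_samples=4):
    # Monte Carlo estimate of E_{x ~ rho, y ~ pi(.|x)}[R_hat(x, y)]:
    # draw a few responses per prompt and average the learned reward.
    total, count = 0.0, 0
    for x in prompts:
        for _ in range(n_samples):
            y = pi.sample(x)        # y ~ pi(. | x)
            total += R_hat(x, y)    # score with the learned reward model
            count += 1
    return total / count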
2. Setup
Let ρ be the prompt distribution, π a candidate policy, and π_behav the behavior policy that produced the labeled preference data. Let R(x, y) denote the true (unobserved) preference value, and R̂(x, y) the learned reward. We have a labeled set {(x_i, y_i, r_i)} where r_i is the value of R(x_i, y_i) observed via human label.
3. Doubly Robust Estimator
Define the propensity e(x, y) = π_behav(y | x) and the outcome model R̂(x, y). The DR value of policy π is
\hat{V}_{\text{DR}}(\pi) = \mathbb{E}_{(x, y) \sim \rho \times \pi_{\text{behav}}} \left[ \frac{\pi(y \mid x)}{e(x, y)} \left( R(x, y) - \hat{R}(x, y) \right) \right] + \mathbb{E}_{x \sim \rho,\, y \sim \pi} \left[ \hat{R}(x, y) \right],
where R(x, y) is the labeled-data target derived from the human labels r_i.
Theorem (informal). If either R̂ = R or e(x, y) = π_behav(y | x) exactly, then V̂_DR(π) equals the true value V(π) in expectation.
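To see why, write the bias of the DR estimator explicitly (a standard decomposition from the DR literature, stated in our notation, with π_behav(y | x) playing the role of the true propensity):

\mathbb{E}\big[\hat{V}_{\text{DR}}(\pi)\big] - V(\pi) = \mathbb{E}_{x \sim \rho,\, y \sim \pi} \left[ \left( \frac{\pi_{\text{behav}}(y \mid x)}{e(x, y)} - 1 \right) \left( R(x, y) - \hat{R}(x, y) \right) \right].

If e(x, y) = π_behav(y | x) the first factor vanishes; if R̂ = R the second factor vanishes; either way the bias is zero, which is the double-robustness property.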
The variance of V̂_DR is no worse than that of V̂_plug-in when the residual R − R̂ has lower variance than R̂ after scaling by the importance weights, which holds in typical regimes.
4. Practical Considerations
Importance weight clipping. Raw importance ratios can be extreme, especially when π has shifted far from the behavior distribution. We clip the ratios at a fixed threshold (50 in the reference implementation below), accepting a small bias in exchange for variance reduction.
Propensity learning. The behavior policy is often a mixture (multiple SFT checkpoints). We fit e using a small classifier trained to distinguish samples from each component, calibrated with isotonic regression.
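One plausible instantiation of this fit, sketched with scikit-learn; the routing data, featurize_prompt, and component_logprob interfaces are our illustrative assumptions rather than the exact implementation, and the sketch assumes every mixture component appears in the routing data.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

def fit_propensity(routing_data, featurize_prompt, component_logprob, n_components):
    # routing_data: (x, k) pairs recording which SFT checkpoint k served prompt x.
    X = np.array([featurize_prompt(x) for x, _ in routing_data])
    z = np.array([k for _, k in routing_data])
    # Small classifier over prompts, calibrated with isotonic regression.
    clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                                 method="isotonic", cv=5)
    clf.fit(X, z)

    def e_hat(x, y):
        # Mixture propensity: e(x, y) = sum_k P(component k | x) * pi_k(y | x),
        # with component weights taken from the calibrated classifier.
        weights = clf.predict_proba(np.array([featurize_prompt(x)]))[0]
        liks = np.exp([component_logprob(k, y, x) for k in range(n_components)])
        return float(np.dot(weights, liks))

    return e_hat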
Cross-fitting. Following [Chernozhukov et al. 2018], we split the data into folds and evaluate each held-out fold with the R̂ and e fit on the remaining folds, eliminating own-prediction bias.
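A minimal cross-fitting sketch under these conventions. fit_reward and fit_propensity stand in for whatever training routines produce R̂ and e from labeled triples (they are placeholders, not part of a specific pipeline); the single-fold estimator dr_value is the reference routine shown just below.

import numpy as np

def cross_fit_dr_value(prompts, samples, pi, fit_reward, fit_propensity, k=5):
    # Split the labeled triples into k folds at random.
    idx = np.random.permutation(len(samples))
    estimates = []
    for held_out in np.array_split(idx, k):
        held_set = set(held_out.tolist())
        train = [samples[i] for i in idx if i not in held_set]
        held = [samples[i] for i in held_out]
        # Fit both nuisance models without the held-out fold...
        R_hat = fit_reward(train)
        e_hat = fit_propensity(train)
        # ...then apply them only to samples they never saw.
        estimates.append(dr_value(prompts, held, R_hat, e_hat, pi))
    return float(np.mean(estimates))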
def dr_value(prompts, samples, R_hat, e_hat, pi):
    # samples: (x, y, r) triples generated by the behavior policy with human label r.
    # pi(y, x) returns the candidate policy's probability pi(y | x);
    # e_hat(x, y) returns the estimated propensity e(x, y).
    correction = 0.0
    for x, y, r in samples:
        # Clipped importance weight (clip at 50 to control variance).
        w = min(pi(y, x) / e_hat(x, y), 50.0)
        correction += w * (r - R_hat(x, y))
    correction /= len(samples)
    # Plug-in term: score fresh candidate-policy samples with the reward model.
    plugin = sum(R_hat(x, pi.sample(x)) for x in prompts) / len(prompts)
    return correction + plugin

5. Empirical Evaluation
Synthetic ground-truth setup. We constructed a prompt benchmark with synthetic ground-truth preferences derived from a known scalar utility function (perplexity difference relative to a hidden reference). Eighteen candidate policies were obtained by varying RLHF training duration and KL penalty.
RMSE. Plug-in policy value RMSE was across the 18 policies. DR achieved RMSE — a reduction (95% CI , paired bootstrap, ).
Best-policy selection. Across 20 trials with resampled labeled data, plug-in selected the truly best policy in trials; DR did so in trials. The remaining DR errors were among the top-3 policies.
Compute overhead. DR adds only a few percent to evaluation cost: the propensity model is small relative to the reward model, and the additional sample rollout for the second term is shared with standard evaluation pipelines.
6. Real-Data Experiment
We ran a smaller experiment on a real preference dataset (N = 64,210 pairs) with no ground truth available. We compared the ranking of three candidate policies under plug-in versus DR, then ran a small human evaluation (a subset of prompts, multiple raters per prompt) as a partial ground truth. DR's ranking agreed with the human-evaluation ranking; plug-in's did not. We caution that this is a single experiment and should not be over-generalized.
7. Discussion and Limitations
The doubly robust estimator inherits the assumptions of off-policy evaluation, in particular common support: the behavior policy must place positive probability on every action the candidate policy might take. Strong off-policy shifts violate this and inflate variance. We monitor the effective sample size of the importance weights and refuse to produce an estimate when it falls below a fixed fraction of the nominal sample size; a sketch of this check follows below.
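A minimal sketch of this check, using the Kish effective sample size; the threshold fraction is left as a parameter because the exact cutoff is not reproduced here.

import numpy as np

def effective_sample_size(weights):
    # Kish effective sample size: (sum w)^2 / (sum w^2).
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def check_overlap(weights, min_fraction):
    # Refuse to estimate when ESS drops below min_fraction of the nominal N,
    # signalling that the candidate policy has drifted too far from the behavior policy.
    ess = effective_sample_size(weights)
    if ess < min_fraction * len(weights):
        raise ValueError(f"ESS {ess:.1f} is below threshold; overlap too weak for DR.")
    return ess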
A limitation specific to the reward-modeling setting is that the target R is always partly observable and partly latent: human labels themselves are noisy. We treat the human label r as the labeled target rather than the latent preference R, which means our "ground truth" is best understood as the calibrated label rather than ineffable preference.
8. Conclusion
Doubly robust estimation is a small, practical change to reward-modeling pipelines that meaningfully reduces error in policy-value estimates and improves best-policy selection. It is unbiased under weaker assumptions than plug-in, costs only a few percent in compute, and integrates cleanly with existing RLHF training loops.
References
- Christiano, P. et al. (2017). Deep Reinforcement Learning from Human Preferences.
- Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback.
- Dudík, M., Langford, J., and Li, L. (2011). Doubly Robust Policy Evaluation and Learning.
- Chernozhukov, V. et al. (2018). Double/Debiased Machine Learning.
- Bai, Y. et al. (2022). Training a Helpful and Harmless Assistant with RLHF.