HaF-RM: A Hybrid Alignment Framework for Reward Model Training

Shujun Liu, Xiaoyu Shen, Yuhang Lai, Siyuan Wang, Shengbin Yue
Zengfeng Huang, Xuanjing Huang, Zhongyu Wei

Fudan University
University of Southern California

HaF is a reward model training framework for high-quality alignment.

Abstract

The reward model has become increasingly important in alignment, assessment, and data construction for large language models (LLMs). Most existing research focuses on enhancing reward models through data improvements, following the conventional training framework for reward models that directly optimizes the predicted rewards.

In this paper, we propose HaF-RM, a hybrid alignment framework for reward model training that adds a constraint on token-level policy probabilities to the usual supervision on the reward score. It simultaneously supervises the internal preference model at the token level and optimizes the mapping layer of the reward model at the sequence level.

Theoretical justifications and experimental results on five datasets show the validity and effectiveness of our proposed hybrid framework for training a high-quality reward model. By decoupling the reward modeling procedure and incorporating hybrid supervision, our HaF-RM framework offers a principled and effective approach to enhancing the performance and alignment of reward models, a critical component in the responsible development of powerful language models.

Motivation

Training the reward model involves aligning the model preference and learning a "preference-reward" projection. When initialized from a well-trained LLM, the reward model already has a nearly aligned preference, but it has to learn the projection from scratch.

The DPO loss shares the same premise as the reward loss. A DPO model can be regarded as a token-wise reward model that functions similarly to the standard (sequence-wise) reward model. We can therefore use the DPO loss as an auxiliary loss function to improve the stability of the training process and further boost the reward model's performance.
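As a rough illustration of how a sequence-level reward loss and a token-level DPO loss could be combined, the sketch below uses the standard Bradley-Terry reward loss and the standard DPO loss. The mixing weight `alpha`, the temperature `beta`, and all function names are assumptions made for this example; the exact weighting used by HaF-RM is not stated on this page.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Sequence-level Bradley-Terry loss on scalar reward scores."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss; each argument is a list of per-token log-probabilities."""
    margin = (sum(pol_chosen) - sum(ref_chosen)) - (sum(pol_rejected) - sum(ref_rejected))
    return -log_sigmoid(beta * margin)


def hybrid_loss(r_chosen, r_rejected, pol_chosen, pol_rejected,
                ref_chosen, ref_rejected, alpha=0.5, beta=0.1):
    """Weighted sum of both losses; alpha is a hypothetical mixing weight."""
    seq = reward_loss(r_chosen, r_rejected)
    tok = dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta)
    return (1.0 - alpha) * seq + alpha * tok
```

With identical rewards and identical log-probabilities for both responses, each term equals log 2, so the hybrid loss is also log 2 regardless of `alpha`; the loss drops as the chosen response gains reward or relative likelihood.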

Preliminary Study: Overall Performance

(Left) The reward difference between the HaF-generated response and the other response, as predicted by the HaF model, tends to be higher than it should be (points lie above y=x).

(Right) The token-wise perplexities of the DPO model are positively correlated with the rewards of the reward model when both are trained on the same dataset.

Takeaway 1

The internal preference model determines the token probabilities and the rewards.

Intrinsic: Overall Performance

The HaF model is more sensitive to whether an answer is good and gives a more accurate high (or low) score.

The HaF model is generally no weaker than the baseline under various circumstances.

Intrinsic: Mixed Data

HaF can better learn the diversity present in the combined datasets for generalization.

Intrinsic: OOD Data

The DPO model converges to approximately 50% in a highly exaggerated manner, indicating a complete loss of modeling capability for OOD data.

HaF possesses a strong ability to learn preferences and effectively generalize them to similar preference distributions, despite great differences in language style and topic.
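To see why a value near 50% signals lost modeling capability, the sketch below assumes the reported metric is pairwise preference accuracy, which is an assumption since the metric is not named here. The function name and the toy scores are hypothetical. A scorer that cannot separate the chosen from the rejected response lands at chance level.

```python
import random


def pairwise_accuracy(pairs):
    """pairs: iterable of (score_chosen, score_rejected); ties count as half."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    total = 0.0
    for chosen, rejected in pairs:
        if chosen > rejected:
            total += 1.0
        elif chosen == rejected:
            total += 0.5
    return total / len(pairs)


# A scorer that assigns the same score to both responses sits at chance level.
uninformative = [(0.0, 0.0)] * 4
# Toy scores where three of the four pairs are ranked correctly.
informative = [(1.2, 0.3), (0.8, -0.1), (0.5, 0.9), (2.0, 1.0)]

# Scores drawn independently of the preference label also hover near 0.5.
rng = random.Random(0)
random_scores = [(rng.random(), rng.random()) for _ in range(10_000)]
```

Here `pairwise_accuracy(uninformative)` is exactly 0.5 and `pairwise_accuracy(informative)` is 0.75, while the randomly scored pairs come out close to 0.5.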

Takeaway 2

HaF has better intrinsic performance among reward models.

DPO is sensitive to distribution shift.

Downstream Task: Best-of-N

HaF demonstrates significant advantages over the baseline, especially for the Phi-2 model.

Top-k recall for the baseline is close to the result of random selection, while HaF can effectively distinguish between responses even when the quality differences are minimal.
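The following minimal sketch shows best-of-N selection and one possible reading of top-k recall: the fraction of prompts for which the response rated best by a gold scorer appears among the reward model's top k candidates. This definition, the gold scorer, and the function names are assumptions for illustration and may differ from the metric used in the paper.

```python
def best_of_n(responses, rm_scores):
    """Return the response with the highest reward-model score."""
    if not responses or len(responses) != len(rm_scores):
        raise ValueError("responses and rm_scores must be non-empty and aligned")
    best = max(range(len(responses)), key=lambda i: rm_scores[i])
    return responses[best]


def top_k_recall(batch_rm_scores, batch_gold_scores, k):
    """Fraction of prompts whose gold-best response is in the reward model's top k."""
    if not batch_rm_scores:
        raise ValueError("need at least one prompt")
    hits = 0
    for rm, gold in zip(batch_rm_scores, batch_gold_scores):
        gold_best = max(range(len(gold)), key=lambda i: gold[i])
        top_k = sorted(range(len(rm)), key=lambda i: rm[i], reverse=True)[:k]
        hits += gold_best in top_k
    return hits / len(batch_rm_scores)


# Toy example: one prompt with three candidate responses.
candidates = ["a", "b", "c"]
rm_scores = [0.1, 0.9, 0.5]    # hypothetical reward-model scores
gold_scores = [0.2, 0.3, 0.8]  # hypothetical gold scores
```

Under this definition, selecting k of N candidates uniformly at random gives an expected recall of k/N, which is the reference point for "close to random selection".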

Downstream Task: RLHF

The model trained with HaF has higher win rates.

Takeaway 3

HaF generates more precise reward values during training and inference.

BibTeX

@article{liu2024hafrm,
      title={HaF-RM: A Hybrid Alignment Framework for Reward Model Training}, 
      author={Liu, Shujun and Shen, Xiaoyu and Lai, Yuhang and Wang, Siyuan and Yue, Shengbin and Huang, Zengfeng and Huang, Xuanjing and Wei, Zhongyu},
      journal={arXiv preprint arXiv:2407.04185},
      year={2024}
}