Direct Preference Optimization: Your Language Model is Secretly a Reward Model
AI-generated Key Points
- Large-scale unsupervised language models (LMs) can learn broad world knowledge and reasoning skills, but controlling their behavior is challenging due to the unsupervised nature of their training.
- Existing methods for controlling LMs collect human labels of the relative quality of model generations and fine-tune the model to match those preferences, often with reinforcement learning from human feedback (RLHF), which is complex and often unstable.
- Direct Preference Optimization (DPO) eliminates the need for fitting a reward model, sampling from the LM during fine-tuning, or significant hyperparameter tuning by solving a classification problem on human preference data.
- Experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods, exceeding RLHF's ability to control sentiment of generations and improving response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
- The experiments explore three open-ended text generation tasks: controlled sentiment generation using movie reviews from the IMDb dataset; summarization using the Reddit TL;DR summarization dataset; and single-turn dialogue using the Anthropic Helpful and Harmless dialogue dataset.
- The evaluation uses two approaches. In the controlled sentiment generation setting, each algorithm is judged by how effectively it optimizes the constrained reward maximization objective, measured by the frontier of reward it achieves. In the summarization and single-turn dialogue settings, each algorithm is judged by its win rate against a baseline policy, using GPT-4 as a proxy for human evaluation of summary quality and response helpfulness, respectively.
- The results show that DPO tends to perform as well as or better than strong baselines, such as RLHF with PPO and returning the best of N sampled trajectories under a learned reward function, with almost no hyperparameter tuning. Additionally, DPO is the only method that improves over the chosen responses in the Anthropic-HH single-turn dialogue task, and GPT-4 judgments are found to correlate strongly with human judgments.
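The best-of-N baseline and the win-rate metric mentioned in the key points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `sample` and `reward` are hypothetical callables standing in for an LM sampler and a learned reward function, and the judgments passed to `win_rate` stand in for per-example GPT-4 verdicts. None of these names come from the paper.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],          # hypothetical: draws one response from the LM
    reward: Callable[[str, str], float],   # hypothetical: learned reward r(prompt, response)
    n: int = 4,
) -> str:
    """Sample n responses and return the one the reward function scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))


def win_rate(judgments: Sequence[bool]) -> float:
    """Fraction of comparisons in which the evaluated policy beat the baseline.

    Each entry is True when the judge (here assumed to be a GPT-4 style
    evaluator) preferred the evaluated policy's output over the baseline's.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)
```

Best-of-N needs N forward samples and N reward evaluations per prompt at inference time, which is why it serves as a strong but expensive reference point rather than a practical training method.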
Authors: Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Abstract: While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper, we leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds RLHF's ability to control sentiment of generations and improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
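To make the "classification problem on the human preference data" concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes the sequence log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses have already been computed under both the policy being trained and the frozen reference model. The function and argument names, and the default `beta`, are illustrative choices and are not taken from this page.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how strongly the policy is tied to the reference
) -> float:
    """Binary cross-entropy on the preference pair, using log-ratios as implicit rewards."""
    # Implicit reward of each response: beta * log(pi(y|x) / pi_ref(y|x)).
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss is small when the chosen response has the larger implicit reward.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy equals the reference model, both log-ratios are zero and the loss is log 2, the value of a classifier that cannot tell the two responses apart. Training lowers the loss by raising the relative log-probability of the chosen response, with no separate reward model and no sampling from the LM.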