Scaling Laws for Reward Model Overoptimization

AI-generated keywords: Reinforcement Learning, Human Feedback, Proxy Model, AI Alignment, Optimization

AI-generated Key Points

The paper's license does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

  • Optimizing too strongly against a reward model trained to predict human preferences can cause overoptimization and hurt ground-truth performance, in accordance with Goodhart's law
  • The study uses a synthetic setup in which a fixed "gold-standard" reward model stands in for humans and provides the labels used to train a proxy reward model
  • Two methods, reinforcement learning and best-of-$n$ sampling, are used to optimize against the proxy reward model
  • The way the gold reward model score changes during optimization follows a different functional form for each method
  • In both cases, the coefficients of this relationship scale smoothly with the number of reward model parameters
  • The relationship is also studied as a function of the reward model dataset size, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup
  • The authors explore what these empirical results imply for theoretical considerations in AI alignment
  • The effect had been observed frequently but not measured carefully, because collecting human preference data is expensive
  • The results bear on how far a policy can be optimized against an imperfect proxy before ground-truth performance suffers
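
The best-of-$n$ procedure in the key points can be illustrated with a small simulation. This is only a toy sketch and not the paper's setup: it assumes each candidate has a Gaussian gold score and that the proxy score equals the gold score plus independent Gaussian noise. The names `best_of_n` and `simulate`, and the parameters `noise_std`, `trials`, and `seed`, are hypothetical and chosen for illustration.

```python
import random


def best_of_n(candidates, proxy_score):
    """Return the candidate that the proxy reward model scores highest."""
    return max(candidates, key=proxy_score)


def simulate(n, trials=2000, noise_std=1.0, seed=0):
    """Average proxy and gold scores of the best-of-n pick.

    Assumption (not from the paper): gold ~ N(0, 1) and
    proxy = gold + N(0, noise_std^2).
    """
    rng = random.Random(seed)
    proxy_sum = 0.0
    gold_sum = 0.0
    for _ in range(trials):
        candidates = []
        for _ in range(n):
            gold = rng.gauss(0.0, 1.0)
            proxy = gold + rng.gauss(0.0, noise_std)
            candidates.append((gold, proxy))
        # Selection only ever looks at the proxy score (index 1).
        gold, proxy = best_of_n(candidates, proxy_score=lambda c: c[1])
        gold_sum += gold
        proxy_sum += proxy
    return proxy_sum / trials, gold_sum / trials


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        proxy_mean, gold_mean = simulate(n)
        print(f"n={n:3d}  proxy={proxy_mean:.2f}  gold={gold_mean:.2f}")
```

In this toy model the proxy score of the selected sample grows faster than its gold score as $n$ increases, so the gap between the two widens with optimization pressure. It does not attempt to reproduce the functional forms fitted in the paper.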

Authors: Leo Gao, John Schulman, Jacob Hilton

Abstract: In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart's law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.

Submitted to arXiv on 19 Oct. 2022

Results of the summarizing process for the arXiv paper: 2210.10760v1

This paper's license does not allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

In reinforcement learning from human feedback, it is common to optimize a policy against a reward model trained to predict human preferences. Because that reward model is only an imperfect proxy, optimizing against it too strongly can hurt ground-truth performance, in accordance with Goodhart's law. The effect has often been observed but not carefully measured, because collecting human preference data is expensive. To get around that cost, the authors use a synthetic setup in which a fixed "gold-standard" reward model stands in for humans and provides the labels used to train a proxy reward model.

The study tracks how the gold reward model score changes as the policy is optimized against the proxy reward model using two methods: reinforcement learning and best-of-$n$ sampling. The relationship follows a different functional form for each method, and in both cases its coefficients scale smoothly with the number of reward model parameters. The authors also examine how the relationship depends on the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. Finally, they explore what these empirical results imply for theoretical considerations in AI alignment.
Created on 03 Jul. 2023
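
The KL penalty mentioned in the summary can be sketched as follows. This is a minimal illustration, assuming the common RLHF formulation in which the penalty is the coefficient times a single-sample estimate of the KL divergence between the current policy and the initial policy. The metadata summarized here does not spell out the authors' exact formulation, so the function name `kl_penalized_reward` and all of its parameters are hypothetical.

```python
def kl_penalized_reward(proxy_reward, logp_policy, logp_init, kl_coef):
    """Proxy reward minus a KL penalty (illustrative assumption).

    logp_policy: log-probability of the sampled response under the
                 policy being optimized.
    logp_init:   log-probability of the same response under the
                 initial (pre-optimization) policy.
    kl_coef:     coefficient of the KL penalty.
    """
    # Single-sample estimate of KL(policy || initial policy).
    kl_estimate = logp_policy - logp_init
    return proxy_reward - kl_coef * kl_estimate


if __name__ == "__main__":
    # A response the policy now favors more than the initial policy did
    # pays a penalty proportional to kl_coef.
    print(kl_penalized_reward(1.0, logp_policy=-2.0, logp_init=-3.0, kl_coef=0.1))
```

With `kl_coef` set to zero the policy is trained on the raw proxy reward. Larger values discourage the policy from drifting away from its starting point, which is why the coefficient is one of the factors whose effect on overoptimization the study examines.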

