Axiomatic Preference Modeling for Longform Question Answering

AI-generated keywords: Axiomatic Preference Modeling, Longform Question Answering, Reinforcement Learning, Human Feedback

AI-generated Key Points

  • Study trains a model for scoring answers to longform questions, motivated by how large language models (LLMs) like GPT-4 are post-trained
  • Focus on post-training process of Reinforcement Learning from Human Feedback (RLHF)
  • RLHF involves human preferences encoded in a reward model (RM)
  • Identified principles to guide RMs and developed an axiomatic framework for generating preference signals
  • Trained a standalone preference model with approximately 220M parameters using axiomatic signals
  • Preference Model can score both human- and LLM-generated answers on the same scale
  • Contributions: a standalone preference model that agrees with gold human-annotated preference labels more often than GPT-4; an axiomatic framework for generating training data pairs tailored to certain principles; and evidence that a small amount of axiomatic signals can help small models outperform GPT-4 in preference scoring
  • Model released on huggingface: https://huggingface.co/corbyrosset/axiomatic_preference_model
  • Paper accepted for presentation at EMNLP 2023.

Authors: Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul Bennett

Accepted to EMNLP 2023
License: CC BY 4.0

Abstract: The remarkable abilities of large language models (LLMs) like GPT-4 partially stem from post-training processes like Reinforcement Learning from Human Feedback (RLHF) involving human preferences encoded in a reward model. However, these reward models (RMs) often lack direct knowledge of why, or under what principles, the preferences annotations were made. In this study, we identify principles that guide RMs to better align with human preferences, and then develop an axiomatic framework to generate a rich variety of preference signals to uphold them. We use these axiomatic signals to train a model for scoring answers to longform questions. Our approach yields a Preference Model with only about 220M parameters that agrees with gold human-annotated preference labels more often than GPT-4. The contributions of this work include: training a standalone preference model that can score human- and LLM-generated answers on the same scale; developing an axiomatic framework for generating training data pairs tailored to certain principles; and showing that a small amount of axiomatic signals can help small models outperform GPT-4 in preference scoring. We release our model on huggingface: https://huggingface.co/corbyrosset/axiomatic_preference_model
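
The abstract's headline claim is that the Preference Model "agrees with gold human-annotated preference labels more often than GPT-4". One way to read that metric is as pairwise agreement: over a set of human-labeled answer pairs, count how often the scorer ranks the human-preferred answer higher. The sketch below illustrates that reading only. The function names, the tuple layout, and the toy scorer are all hypothetical, and the paper's own evaluation protocol may differ.

```python
from typing import Callable, Sequence, Tuple

# Assumed layout: (question, preferred_answer, rejected_answer), where the
# preference comes from a gold human annotation.
PreferencePair = Tuple[str, str, str]


def preference_agreement(
    score: Callable[[str, str], float],
    pairs: Sequence[PreferencePair],
) -> float:
    """Fraction of pairs where `score` ranks the human-preferred answer higher."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    agree = sum(
        1
        for question, preferred, rejected in pairs
        if score(question, preferred) > score(question, rejected)
    )
    return agree / len(pairs)


def toy_score(question: str, answer: str) -> float:
    """Hypothetical stand-in scorer: counts answer words that echo the question."""
    question_words = set(question.lower().split())
    return float(sum(1 for w in answer.lower().split() if w in question_words))


toy_pairs = [
    ("why is the sky blue", "the sky is blue because of scattering", "i do not know"),
    ("what is rain", "bananas", "rain is water"),
]
print(preference_agreement(toy_score, toy_pairs))
```

Because the scorer takes a single (question, answer) pair and returns a scalar, the same function can be applied to human-written and LLM-generated answers alike, which is what scoring "on the same scale" requires.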

Submitted to arXiv on 02 Dec. 2023

Results of the summarizing process for the arXiv paper: 2312.02206v1

In "Axiomatic Preference Modeling for Longform Question Answering," the authors study how to score answers to longform questions. Their starting point is Reinforcement Learning from Human Feedback (RLHF), a post-training process that partially accounts for the abilities of large language models (LLMs) like GPT-4 and that encodes human preferences in a reward model (RM). Such RMs often lack direct knowledge of why, or under what principles, the preference annotations were made. To address this, the authors identify principles that guide RMs to better align with human preferences and develop an axiomatic framework that generates a variety of preference signals to uphold those principles. They use these axiomatic signals to train a standalone Preference Model with approximately 220M parameters. The model scores human- and LLM-generated answers on the same scale and agrees with gold human-annotated preference labels more often than GPT-4. The stated contributions are: a standalone preference model that scores both kinds of answers on one scale; an axiomatic framework for generating training data pairs tailored to certain principles; and evidence that a small amount of axiomatic signals can help small models outperform GPT-4 in preference scoring. The model is released on huggingface at https://huggingface.co/corbyrosset/axiomatic_preference_model. The paper has been accepted to EMNLP 2023.
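
To make "training on preference pairs" concrete, here is a minimal sketch of a pairwise objective for a scalar scorer. This page does not state which loss the authors use, so the Bradley-Terry style logistic loss below is an assumption chosen because it is a common choice for reward models, not a description of the paper's method. The function name is hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Assumed Bradley-Terry style loss: -log(sigmoid(s_preferred - s_rejected)).

    The loss is small when the preferred answer outscores the rejected one
    and grows roughly linearly when the ordering is wrong.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Correctly ordered pair: low loss. Wrongly ordered pair: high loss.
print(pairwise_preference_loss(3.0, 1.0))
print(pairwise_preference_loss(1.0, 3.0))
```

Under this assumed objective, each axiomatic signal would contribute (preferred, rejected) answer pairs for a question, and minimizing the loss pushes the scorer to separate them.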
Created on 06 Dec. 2023


