Axiomatic Preference Modeling for Longform Question Answering
AI-generated Key Points
- Study explores the use of large language models (LLMs) like GPT-4 in longform question answering
- Focus on post-training process of Reinforcement Learning from Human Feedback (RLHF)
- RLHF involves human preferences encoded in a reward model (RM)
- Identified principles to guide RMs and developed an axiomatic framework for generating preference signals
- Trained a standalone preference model with approximately 220M parameters using axiomatic signals
- Preference Model can score both human- and LLM-generated answers on the same scale
- Contributions include outperforming GPT-4 in preference scoring, developing axiomatic framework for tailored training data pairs, and demonstrating improvement over GPT-4 with small amount of axiomatic signals
- Model released on huggingface, providing access to research findings and approach
- Paper accepted for presentation at EMNLP 2023.
Authors: Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul Bennett
Abstract: The remarkable abilities of large language models (LLMs) like GPT-4 partially stem from post-training processes like Reinforcement Learning from Human Feedback (RLHF) involving human preferences encoded in a reward model. However, these reward models (RMs) often lack direct knowledge of why, or under what principles, the preferences annotations were made. In this study, we identify principles that guide RMs to better align with human preferences, and then develop an axiomatic framework to generate a rich variety of preference signals to uphold them. We use these axiomatic signals to train a model for scoring answers to longform questions. Our approach yields a Preference Model with only about 220M parameters that agrees with gold human-annotated preference labels more often than GPT-4. The contributions of this work include: training a standalone preference model that can score human- and LLM-generated answers on the same scale; developing an axiomatic framework for generating training data pairs tailored to certain principles; and showing that a small amount of axiomatic signals can help small models outperform GPT-4 in preference scoring. We release our model on huggingface: https://huggingface.co/corbyrosset/axiomatic_preference_model
Ask questions about this paper to our AI assistant
You can also chat with multiple papers at once here.
Assess the quality of the AI-generated content by voting
Score: 0
Why do we need votes?
Votes are used to determine whether we need to re-run our summarizing tools. If the count reaches -10, our tools can be restarted.
The previous summary was created more than a year ago and can be re-run (if necessary) by clicking on the Run button below.
Similar papers summarized with our AI tools
Navigate through even more similar papers through a
tree representationLook for similar papers (in beta version)
By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.
Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.