Axiomatic Preference Modeling for Longform Question Answering

AI-generated keywords: Axiomatic Preference Modeling, Longform Question Answering, Reinforcement Learning, Human Feedback

AI-generated Key Points

- The study concerns scoring answers to longform questions, in the context of large language models (LLMs) like GPT-4
- It focuses on Reinforcement Learning from Human Feedback (RLHF), a post-training process
- RLHF involves human preferences encoded in a reward model (RM)
- RMs often lack direct knowledge of why, or under what principles, preference annotations were made
- The authors identify principles to guide RMs and develop an axiomatic framework for generating preference signals
- They train a standalone Preference Model with approximately 220M parameters using these axiomatic signals
- The Preference Model can score both human- and LLM-generated answers on the same scale
- It agrees with gold human-annotated preference labels more often than GPT-4, showing that a small amount of axiomatic signals can help small models outperform GPT-4 in preference scoring
- The model is released on huggingface: https://huggingface.co/corbyrosset/axiomatic_preference_model
- The paper was accepted to EMNLP 2023

Authors: Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul Bennett

arXiv: 2312.02206v1 (cs.AI)

Accepted to EMNLP 2023

License: CC BY 4.0

Abstract: The remarkable abilities of large language models (LLMs) like GPT-4 partially stem from post-training processes like Reinforcement Learning from Human Feedback (RLHF) involving human preferences encoded in a reward model. However, these reward models (RMs) often lack direct knowledge of why, or under what principles, the preferences annotations were made. In this study, we identify principles that guide RMs to better align with human preferences, and then develop an axiomatic framework to generate a rich variety of preference signals to uphold them. We use these axiomatic signals to train a model for scoring answers to longform questions. Our approach yields a Preference Model with only about 220M parameters that agrees with gold human-annotated preference labels more often than GPT-4. The contributions of this work include: training a standalone preference model that can score human- and LLM-generated answers on the same scale; developing an axiomatic framework for generating training data pairs tailored to certain principles; and showing that a small amount of axiomatic signals can help small models outperform GPT-4 in preference scoring. We release our model on huggingface: https://huggingface.co/corbyrosset/axiomatic_preference_model

Submitted to arXiv on 02 Dec. 2023

Results of the summarizing process for the arXiv paper: 2312.02206v1

Comprehensive Summary

In the study "Axiomatic Preference Modeling for Longform Question Answering," the authors examine how answers to longform questions can be scored in line with human preferences. They start from Reinforcement Learning from Human Feedback (RLHF), a post-training process used for large language models (LLMs) like GPT-4 in which human preferences are encoded in a reward model (RM). Because RMs often lack direct knowledge of why, or under what principles, preference annotations were made, the authors identify principles that guide RMs to better align with human preferences and develop an axiomatic framework that generates a variety of preference signals to uphold those principles.

They use these axiomatic signals to train a standalone Preference Model with approximately 220M parameters. The model scores both human- and LLM-generated answers on the same scale and agrees with gold human-annotated preference labels more often than GPT-4. The stated contributions are the standalone preference model, the axiomatic framework for generating training data pairs tailored to certain principles, and evidence that a small amount of axiomatic signals can help small models outperform GPT-4 in preference scoring. The authors have released the model on huggingface, and the paper has been accepted to EMNLP 2023.

Layman's Summary

A study looked at how to judge the answers that big computer programs give to long questions. The researchers focused on how such programs learn from feedback given by humans. This feedback works like a reward for the program. They found some rules that help make the feedback better and built a fairly small model (about 220 million parameters) that learns from feedback generated with these rules. The model can score answers written by humans and answers written by computer programs on the same scale. The study showed that their model agreed with human judgments more often than another program called GPT-4. They shared their model through a website called huggingface, and their paper was accepted at a conference called EMNLP 2023.

Definitions
- Large language models (LLMs): Big computer programs that understand and generate human-like language.
- Reinforcement Learning from Human Feedback (RLHF): A process where a computer program learns by getting rewards or feedback from humans.
- Reward model (RM): A way to show what is good or bad for the computer program, like giving it points for doing well.
- Axiomatic framework: A set of rules or principles used to guide decision-making in the computer program.
- Parameters: Settings or variables that determine how the computer program works.
- Preference signals: Signals or information about what humans prefer or like better.
- Outperforming: Doing better than something else.
- Access: Being able to see or use something.
- EMNLP 2023: A conference where people share research about natural language processing, which is how computers understand and generate human-like language.

Axiomatic Preference Modeling for Longform Question Answering

In recent years, large language models (LLMs) such as GPT-4 have been used to generate answers to longform questions. Part of their ability stems from post-training processes such as Reinforcement Learning from Human Feedback (RLHF), but the reward models used in that process often lack direct knowledge of why, or under what principles, preference annotations were made. To address this issue, the authors have developed an axiomatic framework for preference modeling that generates preference signals intended to align better with human preferences. The paper has been accepted to EMNLP 2023, and the resulting model is available on huggingface.

Background

The authors focus on Reinforcement Learning from Human Feedback (RLHF), a post-training process in which human preferences are encoded in a reward model (RM). However, RMs often lack direct knowledge of why, or under what principles, the preference annotations were made. The authors identify principles that guide RMs towards better alignment with human preferences and develop an axiomatic framework that generates a variety of preference signals to uphold these principles.
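
As a rough illustration of how a principle can be turned into training pairs, the sketch below builds preference pairs from one hypothetical principle (an answer written for a question should be preferred over an answer written for a different question) and evaluates a pair with a pairwise logistic loss, a common choice for reward models. The principle, the `PreferencePair` type, and both function names are assumptions made for illustration; this summary does not specify the authors' actual axioms or training loss.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One training example: `preferred` should score above `rejected`."""
    question: str
    preferred: str
    rejected: str


def relevance_pairs(qa: dict) -> list:
    """Build pairs from a hypothetical relevance principle.

    For every question, its own answer is preferred over each answer
    that was written for a different question.
    """
    pairs = []
    for question, answer in qa.items():
        for other_question, other_answer in qa.items():
            if other_question != question:
                pairs.append(PreferencePair(question, answer, other_answer))
    return pairs


def pairwise_logistic_loss(score_preferred: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_preferred - score_rejected)).

    Written as a numerically stable softplus of the negated margin.
    The loss is small when the preferred answer scores well above the
    rejected one, and large when the order is reversed.
    """
    margin = score_preferred - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    toy_qa = {
        "Why is the sky blue?": "Air scatters short wavelengths of sunlight more strongly.",
        "How do vaccines work?": "They train the immune system to recognize a pathogen.",
    }
    for pair in relevance_pairs(toy_qa):
        print(pair.question, "->", pair.preferred, "|", pair.rejected)
```

In a real training loop the two scores would come from the model being trained, and the loss would be averaged over many such pairs.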

Methodology

The authors used the axiomatic signals to train a standalone Preference Model with approximately 220M parameters. The model scores both human- and LLM-generated answers on the same scale, and it was compared against GPT-4 on how often each agrees with gold human-annotated preference labels. The results showed that a small amount of axiomatic signals can help small models outperform GPT-4 in preference scoring.
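
Agreement with human preference labels can be measured as the fraction of answer pairs on which a scorer ranks the two answers the same way the annotators did. The sketch below is a minimal version of that metric under stated assumptions: the tuple layout, the `agreement_rate` function, and the word-count `toy_score` placeholder are all hypothetical and stand in for the trained model and the authors' evaluation data, which this summary does not describe in detail.

```python
from typing import Callable, Iterable, Tuple

# (question, answer_a, answer_b, gold) where gold is 0 if annotators
# preferred answer_a and 1 if they preferred answer_b.
LabeledPair = Tuple[str, str, str, int]


def agreement_rate(
    score: Callable[[str, str], float],
    pairs: Iterable[LabeledPair],
) -> float:
    """Fraction of pairs where the scorer's ranking matches the gold label.

    Ties in score are counted as a prediction for answer_a.
    """
    agree = 0
    total = 0
    for question, answer_a, answer_b, gold in pairs:
        predicted = 0 if score(question, answer_a) >= score(question, answer_b) else 1
        agree += int(predicted == gold)
        total += 1
    if total == 0:
        raise ValueError("at least one labeled pair is required")
    return agree / total


def toy_score(question: str, answer: str) -> float:
    """Placeholder scorer (word count); a real preference model goes here."""
    return float(len(answer.split()))


if __name__ == "__main__":
    labeled = [
        ("q", "a detailed answer", "short", 0),
        ("q", "short", "a longer answer", 1),
        ("q", "long but wrong answer", "right", 1),
    ]
    print(f"agreement: {agreement_rate(toy_score, labeled):.2f}")
```

Because every answer receives a single scalar score, the same function applies whether the answers were written by humans or generated by an LLM.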

Conclusion

This study demonstrates that an axiomatic framework for generating training data pairs tailored to certain principles makes it possible to train a small preference model that outperforms GPT-4 in preference scoring. The authors have also released the model on huggingface, which makes it easier for other researchers to build on the approach when exploring RLHF and LLMs more generally.

Created on 06 Dec. 2023
