BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information

AI-generated keywords: Defeasible Reasoning

AI-generated Key Points

  • Automated reasoning with unstructured natural text is important for many NLP and AI applications.
  • Existing evaluation assumes consistent information, but real-world information is often inconsistent or contradictory.
  • Google Research developed the BoardgameQA dataset to measure LM reasoning capacity in the presence of conflicting input sources.
  • BoardgameQA incorporates reasoning with implicit background knowledge and scenarios where additional information needs to come from the model itself.
  • LMs perform poorly when reasoning with conflicting inputs; smaller models also perform poorly when additional knowledge must come from the model itself.
  • Resolving conflicts by adopting the source with higher preference can be effective.
  • BoardgameQA highlights an important gap in current LMs' understanding capacity and can guide future work to improve their understanding ability under this setup.

Authors: Mehran Kazemi, Quan Yuan, Deepti Bhatia, Najoung Kim, Xin Xu, Vaiva Imbrasaite, Deepak Ramachandran

License: CC BY 4.0

Abstract: Automated reasoning with unstructured natural text is a key requirement for many potential applications of NLP and for developing robust AI systems. Recently, Language Models (LMs) have demonstrated complex reasoning capacities even without any finetuning. However, existing evaluation for automated reasoning assumes access to a consistent and coherent set of information over which models reason. When reasoning in the real-world, the available information is frequently inconsistent or contradictory, and therefore models need to be equipped with a strategy to resolve such conflicts when they arise. One widely-applicable way of resolving conflicts is to impose preferences over information sources (e.g., based on source credibility or information recency) and adopt the source with higher preference. In this paper, we formulate the problem of reasoning with contradictory information guided by preferences over sources as the classical problem of defeasible reasoning, and develop a dataset called BoardgameQA for measuring the reasoning capacity of LMs in this setting. BoardgameQA also incorporates reasoning with implicit background knowledge, to better reflect reasoning problems in downstream applications. We benchmark various LMs on BoardgameQA and the results reveal a significant gap in the reasoning capacity of state-of-the-art LMs on this problem, showing that reasoning with conflicting information does not surface out-of-the-box in LMs. While performance can be improved with finetuning, it nevertheless remains poor.

Submitted to arXiv on 13 Jun. 2023

Results of the summarizing process for the arXiv paper: 2306.07934v1

Automated reasoning with unstructured natural text is a crucial requirement for many potential applications of Natural Language Processing (NLP) and for developing robust Artificial Intelligence (AI) systems. While Language Models (LMs) have demonstrated complex reasoning capacities even without any finetuning, existing evaluation for automated reasoning assumes access to a consistent and coherent set of information over which models reason. In the real world, however, available information is frequently inconsistent or contradictory, so models need a strategy to resolve such conflicts when they arise. One widely applicable way of resolving conflicts is to impose preferences over information sources, based on source credibility or recency, and adopt the source with higher preference.

To address this issue, researchers at Google Research have formulated the problem of reasoning with contradictory information guided by preferences over sources as the classical problem of defeasible reasoning, and developed a dataset called BoardgameQA for measuring the reasoning capacity of LMs in this setting. BoardgameQA incorporates reasoning with implicit background knowledge to better reflect reasoning problems in downstream applications. The dataset includes scenarios where the knowledge required for reasoning is only partially provided as input and the remaining information needs to come from the model itself.

The researchers benchmarked various LMs on BoardgameQA and observed that LMs perform poorly when reasoning with conflicting inputs. For smaller models, performance was also poor when additional knowledge from the LM was needed. Since reasoning over contradictory and incomplete sets of information is a common scenario in real-world applications, these results highlight an important gap in current LMs' reasoning capacity.

The work spans three dimensions: text-based logical reasoning, reasoning with conflicting sources, and reasoning with incomplete information. Earlier works on natural language logical reasoning finetuned LMs to directly provide answers to logical questions. Later work showed that explicitly generating the entire proof leads to substantial improvements in both accuracy and interpretability.

In conclusion, BoardgameQA provides a valuable resource for measuring the natural language reasoning ability of LMs in the presence of conflicting input sources. The dataset highlights an important gap in current LMs' understanding capacity and can guide future work on developing methodology to improve their understanding ability under this setup, or on finding alternative formulations of conflict resolution that better facilitate LM understanding.
Created on 14 Jun. 2023
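
The conflict-resolution strategy described in the summary (adopt the source with higher preference) can be illustrated with a small sketch. This is not the dataset's generation code or the authors' method: the `Rule` class, the `resolve` function, the three answer strings, and the dog-and-cat example are all assumptions made for illustration, and the sketch handles only a single reasoning step with no chaining of rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: if all premises hold, conclude (or deny) `conclusion`."""
    name: str
    premises: frozenset
    conclusion: str
    positive: bool = True  # False means the rule concludes "not conclusion"


def resolve(facts, rules, preferences, query):
    """Answer `query` as 'proved', 'disproved', or 'unknown' (labels assumed here).

    `preferences` is a set of (preferred_rule_name, other_rule_name) pairs.
    """
    # Rules about the query whose premises are all satisfied by the facts.
    applicable = [
        r for r in rules if r.conclusion == query and r.premises <= facts
    ]
    # Drop a rule when an applicable rule with the opposite conclusion
    # is preferred over it.
    undefeated = [
        r for r in applicable
        if not any(
            o.positive != r.positive and (o.name, r.name) in preferences
            for o in applicable
        )
    ]
    signs = {r.positive for r in undefeated}
    if signs == {True}:
        return "proved"
    if signs == {False}:
        return "disproved"
    # No applicable rule, or a conflict that the preferences do not settle.
    return "unknown"


# Illustrative scenario (invented for this sketch).
facts = {"dog has a card", "dog is young"}
rules = [
    Rule("R1", frozenset({"dog has a card"}), "dog attacks cat"),
    Rule("R2", frozenset({"dog is young"}), "dog attacks cat", positive=False),
]

# R2 is preferred over R1, so the negative conclusion wins.
print(resolve(facts, rules, {("R2", "R1")}, "dog attacks cat"))  # disproved
# Without a preference, the conflict stays unresolved.
print(resolve(facts, rules, set(), "dog attacks cat"))  # unknown
```

Both rules apply in the example, so the answer depends entirely on the stated preference. That is the behaviour the benchmark probes: a model must notice the contradiction and then use the preference between sources to decide which conclusion to keep.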
