Can Large Language Models Infer and Disagree Like Humans?

AI-generated keywords: Large Language Models, Natural Language Inference, Human Disagreement Distribution, Model Performance, Evaluation Benchmarks

AI-generated Key Points

  • Large Language Models (LLMs) have impressive performance in various tasks
  • LLMs' ability to align with human disagreement distribution and solve Natural Language Inference (NLI) tasks is not well-studied
  • Authors evaluate LLMs' performance and alignment using Monte Carlo Reconstruction (MCR) and Log Probability Reconstruction (LPR)
  • LLMs exhibit limited ability in solving NLI tasks and fail to capture human disagreement distribution effectively
  • Concerns raised about their natural language understanding (NLU) ability and representativeness of human users
  • Generative LLMs compared with fully fine-tuned smaller models on an NLI task, but none achieves performance closely aligned with human accuracy and disagreement levels
  • Further research needed to improve LLM performance in capturing human distribution at a population level
  • Limitations include the possibility that 100 annotators may not be sufficient to represent the full range of human disagreement accurately
  • Future studies could benefit from datasets with more diverse label variations to create evaluation benchmarks for measuring disagreement levels
  • Discrepancy between continuous scores provided by humans and discretized scores used during training may contribute to misalignment between LLMs and human performance in NLI tasks
  • Limited ability of billion-scale LLMs highlighted in solving NLI tasks and capturing human disagreement distribution effectively
  • Need for further research to improve LLM performance and understand factors contributing to disagreement in LLMs compared to humans

Authors: Noah Lee, Na Min An, James Thorne

License: CC BY 4.0

Abstract: Large Language Models (LLMs) have shown stellar achievements in solving a broad range of tasks. When generating text, it is common to sample tokens from these models: whether LLMs closely align with the human disagreement distribution has not been well-studied, especially within the scope of Natural Language Inference (NLI). In this paper, we evaluate the performance and alignment of LLM distribution with humans using two different techniques: Monte Carlo Reconstruction (MCR) and Log Probability Reconstruction (LPR). As a result, we show LLMs exhibit limited ability in solving NLI tasks and simultaneously fail to capture human disagreement distribution, raising concerns about their natural language understanding (NLU) ability and their representativeness of human users.

Submitted to arXiv on 23 May 2023

Results of the summarizing process for the arXiv paper: 2305.13788v1

Large Language Models (LLMs) have gained significant attention for their impressive performance in various tasks. However, their ability to align with the human disagreement distribution and to solve Natural Language Inference (NLI) tasks accurately has not been thoroughly studied. In this paper, the authors evaluate the performance of LLMs and their alignment with humans using two different techniques: Monte Carlo Reconstruction (MCR) and Log Probability Reconstruction (LPR).

The results show that LLMs exhibit limited ability in solving NLI tasks and fail to capture the human disagreement distribution effectively. This raises concerns about their natural language understanding (NLU) ability and how well they represent human users.

The authors compare the performance of generative LLMs with that of fully fine-tuned smaller models on a fundamental NLI task. Although the model distribution varies with the reconstruction method and prompt type, none of the models achieves performance that closely aligns with the accuracy and disagreement levels observed in the human population. This indicates that further research is needed to improve how well LLMs capture the human distribution at a population level.

The study acknowledges some limitations, including the possibility that 100 annotators may not be sufficient to represent the full range of human disagreement accurately. Future studies could benefit from datasets with more diverse label variations, which would cover a wider range of model types and support evaluation benchmarks for measuring disagreement levels.

The authors also discuss a potential reason for LLMs' underperformance in NLI tasks compared to humans. They hypothesize that it may stem from how the data was collected: annotators were asked to predict numerical strength scores, which were later discretized. This discrepancy between the continuous scores provided by humans and the discretized scores used during training might contribute to the misalignment between LLMs and human performance.

In conclusion, this paper highlights the limited ability of billion-scale LLMs to solve NLI tasks and to capture the human disagreement distribution effectively. It emphasizes the need for further research to improve LLM performance and to understand the latent factors that contribute to disagreement in LLMs compared to humans.
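
The summary names MCR and LPR without describing how they work, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that MCR estimates a label distribution by sampling the model's answer many times and counting labels, and that LPR normalizes the model's log-probabilities for the label tokens. The three-way label set, the function names, the stand-in sampler, the example numbers, and the use of Jensen-Shannon divergence as the alignment measure are all assumptions introduced for illustration.

```python
import math
import random
from collections import Counter

# Assumed three-way NLI label set (not stated in the summary).
LABELS = ("entailment", "neutral", "contradiction")


def monte_carlo_reconstruction(sample_label, n_samples=100):
    """Estimate a label distribution by repeatedly sampling the model.

    `sample_label` is a hypothetical zero-argument callable that returns
    one sampled label per call. Answers outside LABELS are ignored.
    """
    counts = Counter(sample_label() for _ in range(n_samples))
    valid = sum(counts[label] for label in LABELS)
    if valid == 0:
        raise ValueError("no valid labels were sampled")
    return [counts[label] / valid for label in LABELS]


def log_probability_reconstruction(label_logprobs):
    """Turn per-label log-probabilities into a distribution (softmax)."""
    top = max(label_logprobs[label] for label in LABELS)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp(label_logprobs[label] - top) for label in LABELS]
    total = sum(weights)
    return [w / total for w in weights]


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    # Made-up human label distribution for a single NLI item.
    human = [0.60, 0.30, 0.10]

    # Stand-in for an LLM sampler; a real one would query the model.
    rng = random.Random(0)
    fake_model = lambda: rng.choices(LABELS, weights=[0.9, 0.08, 0.02])[0]

    mcr = monte_carlo_reconstruction(fake_model, n_samples=500)
    lpr = log_probability_reconstruction(
        {"entailment": -0.1, "neutral": -2.6, "contradiction": -4.0}
    )
    print("MCR:", mcr, "JSD vs human:", jensen_shannon_divergence(human, mcr))
    print("LPR:", lpr, "JSD vs human:", jensen_shannon_divergence(human, lpr))
```

Under these assumptions, a model that is far more confident than the annotators yields a peaked distribution and a larger divergence from the human distribution, which is the kind of misalignment the summary describes.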
Created on 18 Sep. 2023
