What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length

AI-generated keywords: LM Acceptability Judgments

AI-generated Key Points

  • Factors such as sequence length and unigram frequency significantly impact LM probabilities
  • Humans are more robust than LMs to the effects of sequence length and unigram frequency
  • Introduction of MORCELA, a new linking theory that estimates optimal adjustments for length and unigram frequency
  • MORCELA outperforms the commonly used SLOR theory in predicting acceptability across two transformer LM families
  • Larger models require lower adjustment for unigram frequency but still need significant adjustments overall
  • Larger LMs' reduced susceptibility to frequency effects may be due to their ability to predict rarer words in context
  • Evaluations of probability-based LM acceptability judgments should consider model-specific qualities related to factors like frequency and length
  • There is a gap between the maximum correlation between LMs and human judgments and inter-annotator agreement
  • Future research could explore additional factors or transformations to enhance alignment between LMs and human judgment
  • The study's findings have broader implications for understanding how different factors influence LM acceptability judgments

Authors: Lindia Tjuatja, Graham Neubig, Tal Linzen, Sophie Hao

License: CC BY-NC-SA 4.0

Abstract: When comparing the linguistic capabilities of language models (LMs) with humans using LM probabilities, factors such as the length of the sequence and the unigram frequency of lexical items have a significant effect on LM probabilities in ways that humans are largely robust to. Prior works in comparing LM and human acceptability judgments treat these effects uniformly across models, making a strong assumption that models require the same degree of adjustment to control for length and unigram frequency effects. We propose MORCELA, a new linking theory between LM scores and acceptability judgments where the optimal level of adjustment for these effects is estimated from data via learned parameters for length and unigram frequency. We first show that MORCELA outperforms a commonly used linking theory for acceptability--SLOR (Pauls and Klein, 2012; Lau et al. 2017)--across two families of transformer LMs (Pythia and OPT). Furthermore, we demonstrate that the assumed degrees of adjustment in SLOR for length and unigram frequency overcorrect for these confounds, and that larger models require a lower relative degree of adjustment for unigram frequency, though a significant amount of adjustment is still necessary for all models. Finally, our subsequent analysis shows that larger LMs' lower susceptibility to frequency effects can be explained by an ability to better predict rarer words in context.
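To make the two linking theories in the abstract concrete, here is a minimal sketch. SLOR is the standard length-normalized difference between the LM log-probability and the unigram log-probability of a sentence. The parameterized variant below is an assumption based on the abstract's description (learned parameters for unigram frequency and length), not a formula quoted from the paper; the names `parameterized_score`, `beta`, and `gamma` are illustrative.

```python
def slor(logp_lm: float, logp_unigram: float, length: int) -> float:
    """SLOR: (log p_LM(s) - log p_unigram(s)) / |s|."""
    return (logp_lm - logp_unigram) / length


def parameterized_score(
    logp_lm: float,
    logp_unigram: float,
    length: int,
    beta: float = 1.0,   # assumed weight on the unigram-frequency adjustment
    gamma: float = 0.0,  # assumed additive term controlling the length adjustment
) -> float:
    """Hypothetical MORCELA-style score: (log p_LM - beta * log p_unigram + gamma) / |s|.

    With beta=1 and gamma=0 this reduces to SLOR, so SLOR can be read as
    one fixed choice of adjustment among many.
    """
    return (logp_lm - beta * logp_unigram + gamma) / length


# Toy numbers (not from the paper): an 8-token sentence.
print(slor(-30.0, -50.0, 8))                            # 2.5
print(parameterized_score(-30.0, -50.0, 8, beta=0.5))   # -0.625
```

Under this reading, a fitted `beta` below 1 would mean that SLOR's full unigram correction is stronger than the data supports, which is the overcorrection the abstract describes.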

Submitted to arXiv on 04 Nov. 2024


Results of the summarizing process for the arXiv paper: 2411.02528v1

In their study "What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length," Tjuatja, Neubig, Linzen, and Hao examine which factors shape language model (LM) acceptability judgments and how these differ from human judgments. They find that sequence length and unigram frequency significantly affect LM probabilities, whereas humans are largely robust to these effects.

The researchers introduce MORCELA, a new linking theory that estimates the optimal level of adjustment for these effects from data, using learned parameters for length and unigram frequency. Their analysis shows that MORCELA outperforms the commonly used SLOR linking theory in predicting acceptability across two transformer LM families (Pythia and OPT). They also find that larger models require a lower relative degree of adjustment for unigram frequency, although all models still need a significant amount of adjustment. Further analysis suggests that larger LMs are less susceptible to frequency effects because they are better at predicting rarer words in context.

The authors argue that evaluations of probability-based LM acceptability judgments should take model-specific sensitivity to frequency and length into account. Once these factors are properly controlled, LM scores may correlate more closely with human judgments than previously assumed. Even so, a notable gap remains between the best LM-human correlation and inter-annotator agreement. Future research could explore additional factors or transformations to improve the alignment between LMs and human judgments, which may in turn inform the development of more cognitively plausible models. The evaluations are limited to English data annotated by English speakers, but the findings have broader implications for understanding how different factors influence LM acceptability judgments.
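The idea of estimating the adjustment from data, and then checking correlation with human ratings, can be sketched as follows. This is an illustration only: it uses a plain grid search that maximizes Pearson correlation, which is a stand-in and not the authors' estimation procedure. The score form, the helper names, and the data (which is synthetic, generated from known parameters) are all assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def score(logp_lm, logp_unigram, length, beta, gamma):
    """Assumed score form: (log p_LM - beta * log p_unigram + gamma) / |s|."""
    return (logp_lm - beta * logp_unigram + gamma) / length


def grid_search(items, ratings, betas, gammas):
    """Return (best_r, beta, gamma) maximizing correlation with the ratings.

    items: list of (logp_lm, logp_unigram, length) tuples, one per sentence.
    """
    best = None
    for beta in betas:
        for gamma in gammas:
            scores = [score(lp, lu, n, beta, gamma) for lp, lu, n in items]
            r = pearson(scores, ratings)
            if best is None or r > best[0]:
                best = (r, beta, gamma)
    return best


# Synthetic sentences: (LM log-prob, unigram log-prob, length in tokens).
items = [(-30, -50, 8), (-42, -61, 10), (-18, -33, 5),
         (-55, -70, 12), (-25, -52, 7), (-37, -48, 9)]
# Synthetic "human ratings" generated with beta=0.5, gamma=2.0 (not real data).
ratings = [score(lp, lu, n, 0.5, 2.0) for lp, lu, n in items]

best_r, best_beta, best_gamma = grid_search(
    items, ratings, betas=[0.0, 0.25, 0.5, 0.75, 1.0], gammas=[0.0, 1.0, 2.0, 3.0]
)
print(best_r, best_beta, best_gamma)
```

Because Pearson correlation ignores overall scale and offset, only the relative adjustments matter in this sketch. Running the same search separately for each model is one way to see the model-specific adjustment the summary refers to.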
Created on 04 Jul. 2025

