Can Large Language Models Infer Causation from Correlation?

AI-generated keywords: Causal Inference, Corr2Cause, LLMs, Reasoning Skills, Generalizability

Authors: Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, Bernhard Schölkopf

License: CC BY-SA 4.0

Abstract: Causal inference is one of the hallmarks of human intelligence. While the field of CausalNLP has attracted much interest in the recent years, existing causal inference datasets in NLP primarily rely on discovering causality from empirical knowledge (e.g., commonsense knowledge). In this work, we propose the first benchmark dataset to test the pure causal inference skills of large language models (LLMs). Specifically, we formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We curate a large-scale dataset of more than 400K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming of LLMs in terms of their causal inference skills, and show that these models achieve almost close to random performance on the task. This shortcoming is somewhat mitigated when we try to re-purpose LLMs for this skill via finetuning, but we find that these models still fail to generalize -- they can only perform causal inference in in-distribution settings when variable names and textual expressions used in the queries are similar to those in the training set, but fail in out-of-distribution settings generated by perturbing these queries. Corr2Cause is a challenging task for LLMs, and would be helpful in guiding future research on improving LLMs' pure reasoning skills and generalizability. Our data is at https://huggingface.co/datasets/causalnlp/corr2cause. Our code is at https://github.com/causalNLP/corr2cause.

Submitted to arXiv on 09 Jun. 2023

Results of the summarizing process for the arXiv paper: 2306.05836v1

Causal inference is a fundamental aspect of human intelligence, and the field of CausalNLP has gained significant interest in recent years. Existing causal inference datasets in NLP primarily rely on discovering causality from empirical knowledge such as commonsense knowledge, so they do not isolate a model's ability to reason about causation itself. To address this gap, a team of researchers proposed Corr2Cause, the first benchmark dataset to test the pure causal inference skills of large language models (LLMs). The task takes a set of correlational statements and asks the model to determine the causal relationship between the variables.

The team curated a large-scale dataset of more than 400K samples and evaluated seventeen existing LLMs on it. The models achieved close to random performance on the task, which the authors identify as a key shortcoming in their causal inference skills. Finetuning somewhat mitigated this shortcoming, but the finetuned models still failed to generalize: they performed causal inference only in in-distribution settings, where the variable names and textual expressions in the queries were similar to those in the training set, and failed in out-of-distribution settings generated by perturbing these queries.

Corr2Cause is therefore a challenging task for LLMs and can help guide future research on improving their pure reasoning skills and generalizability. The dataset and code are available online. Additionally, the researchers proposed several potential ways to improve LLMs' performance on this task, including incorporating domain-specific information or developing novel architectures that explicitly model causality. Overall, this work provides insight into the limitations of current LLMs in causal inference and points to opportunities for future research in this area.
Created on 12 Jun. 2023
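
To make the task format and the out-of-distribution perturbation more concrete, here is a minimal Python sketch. It is not the authors' code or the actual dataset schema: the `Example` fields, the example wording, and the `rename_variables` and `perturb` helpers are all hypothetical, and variable renaming is only one assumed form of the query perturbation described above.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Example:
    """Hypothetical record layout; the real dataset fields may differ."""
    premise: str     # correlational statements about the variables
    hypothesis: str  # candidate causal claim to verify
    label: int       # 1 if the claim follows from the premise, else 0


def rename_variables(text: str, mapping: dict[str, str]) -> str:
    """Rename whole-word variable names in a single pass.

    A single regex pass keeps swaps such as A<->B from chaining.
    """
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in mapping) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


def perturb(example: Example, mapping: dict[str, str]) -> Example:
    """Return a copy with renamed variables; the label is left unchanged."""
    return replace(
        example,
        premise=rename_variables(example.premise, mapping),
        hypothesis=rename_variables(example.hypothesis, mapping),
    )


# Illustrative example (assumed wording). B and C are stated to be independent,
# so under the usual faithfulness assumption the causal claim does not follow.
original = Example(
    premise="Variable A correlates with B. Variable A correlates with C. "
            "However, B is independent of C.",
    hypothesis="B directly causes C.",
    label=0,
)

shifted = perturb(original, {"A": "X", "B": "Y", "C": "Z"})
print(shifted.premise)
print(shifted.hypothesis, shifted.label)
```

The renamed example states the same statistical relations and keeps the same label, so a model that reasons over the relations rather than the surface form should answer both versions identically.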
