To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
AI-generated Key Points
- Chain-of-Thought (CoT) via prompting is a prominent method for eliciting reasoning capabilities from large language models (LLMs).
- A quantitative meta-analysis of over 100 papers, together with the authors' own evaluations of 20 datasets across 14 models, shows that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other kinds of tasks.
- Much of CoT's gain comes from improved symbolic execution, but CoT still underperforms a symbolic solver.
- CoT is beneficial for applications requiring long-horizon planning and symbolic reasoning, but there is ongoing debate about its effectiveness.
- Variants like tree-of-thought have been explored to address complex planning problems better.
- Leveraging additional calls to LLMs could enhance CoT further, but careful benchmarking is essential to determine the most effective approach.
- Dataset contamination may bias the results, since models may have memorized the answers.
- The meta-analysis concludes that CoT improves performance mainly on tasks involving symbolic reasoning, math, and logical reasoning.
Authors: Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett
Abstract: Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
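The equals-sign observation and the idea of applying CoT selectively can be illustrated with a short sketch. This is not the authors' code: the function names, the record fields, and the two prompt suffixes are assumptions made for illustration only. Note also that the abstract's split uses the question or the model's response, whereas a router only sees the question before generation, so the routing rule below is a rough proxy at best.

```python
from collections import defaultdict


def has_equals_sign(*texts: str) -> bool:
    """Return True if any of the given strings contains '='."""
    return any("=" in text for text in texts)


def choose_prompt(question: str) -> str:
    """Hypothetical router: request step-by-step reasoning only when the
    question itself looks symbolic (contains '='); otherwise ask for a
    direct answer to save inference cost. Both suffixes are illustrative."""
    if has_equals_sign(question):
        return question + "\nLet's think step by step."
    return question + "\nGive only the final answer."


def cot_gain_by_group(records):
    """Split evaluation records by whether the question or response contains
    '=', then return accuracy(CoT) - accuracy(direct) for each group.

    Each record is assumed to be a dict with the hypothetical keys
    'question', 'response', 'cot_correct', and 'direct_correct'.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [count, CoT correct, direct correct]
    for record in records:
        symbolic = has_equals_sign(record["question"], record["response"])
        group = "with_equals" if symbolic else "without_equals"
        totals[group][0] += 1
        totals[group][1] += int(record["cot_correct"])
        totals[group][2] += int(record["direct_correct"])
    return {
        group: (cot - direct) / count
        for group, (count, cot, direct) in totals.items()
    }
```

Calling cot_gain_by_group on a model's per-question results would show whether the gap between CoT and direct answering is concentrated in the "with_equals" group, which is the pattern the abstract reports for MMLU.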