Self-Consistency Improves Chain of Thought Reasoning in Language Models

AI-generated keywords: Decoding Strategy

AI-generated Key Points

  • Introduction of self-consistency decoding strategy to enhance chain-of-thought prompting in complex reasoning tasks
  • Self-consistency method simulates diverse human thinking by sampling multiple reasoning paths from language models
  • Demonstrated improvement in accuracy across arithmetic and commonsense reasoning benchmarks with self-consistency
  • Benefits of self-consistency include aiding in collecting rationales, providing better uncertainty estimates, and improving calibration of language model outputs
  • Use of a small number of paths (e.g., 5 or 10) can yield substantial gains without significant overhead
  • Potential for leveraging self-consistency to generate better supervised data for model fine-tuning and more accurate predictions with fewer inference runs
  • Inclusion of various language models in experiments, including UL2 and GPT-3, with detailed information on result reproduction using publicly available resources
  • Ethical considerations raised regarding biases or inaccuracies in language model outputs and the importance of ongoing efforts to improve model factuality and safety

Authors: Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou

Published at ICLR 2023. V2: added PaLM results; V3: added UL2 results; V4: camera ready version at ICLR 2023
License: CC BY 4.0

Abstract: Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%).
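The sample-then-select procedure in the abstract can be illustrated with a minimal sketch. It assumes that a plain majority vote over final answers stands in for "marginalizing out" the reasoning paths. The names `sample_fn`, `extract_answer`, and `self_consistent_answer` are hypothetical, as is the "The answer is X." answer format; none of them come from the paper's code.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(reasoning_path: str) -> Optional[str]:
    """Pull the final answer out of one sampled reasoning path.

    Assumption: each path ends with a sentence like "The answer is X."
    """
    match = re.search(r"The answer is\s+(.+?)\.?\s*$", reasoning_path.strip())
    return match.group(1) if match else None


def self_consistent_answer(
    sample_fn: Callable[[str], str],  # hypothetical: returns one sampled path
    prompt: str,
    num_paths: int = 10,
) -> Optional[str]:
    """Sample several reasoning paths and return the most frequent answer."""
    answers = []
    for _ in range(num_paths):
        answer = extract_answer(sample_fn(prompt))
        if answer is not None:  # skip paths with no parseable answer
            answers.append(answer)
    if not answers:
        return None
    # Majority vote: the reasoning text is discarded, only answers are counted.
    return Counter(answers).most_common(1)[0][0]
```

In practice `sample_fn` would wrap a language model call with sampling enabled (for example, a non-zero temperature) so that repeated calls return different reasoning paths. Greedy decoding would return the same path every time and make the vote pointless.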

Submitted to arXiv on 21 Mar. 2022


Results of the summarizing process for the arXiv paper: 2203.11171v4

The authors introduce self-consistency, a decoding strategy that replaces greedy decoding in chain-of-thought prompting for complex reasoning tasks. Rather than following a single reasoning path, the method samples multiple reasoning paths from the language model, much as different people might reason about the same problem in different ways, and selects the answer on which those paths most often agree. This reflects the observation that a complex reasoning problem often admits several valid routes to the same correct answer. The study reports that self-consistency improves accuracy across a range of arithmetic and commonsense reasoning benchmarks when applied to different large language models. Beyond accuracy, self-consistency helps in collecting rationales during reasoning tasks and yields better uncertainty estimates and better-calibrated language model outputs. Sampling multiple paths adds computation cost, but the authors suggest that a small number of paths (e.g., 5 or 10) still yields substantial gains without significant overhead. As future work, self-consistency could be used to generate better supervised data for fine-tuning, so that a model gives more accurate predictions with fewer inference runs. The experiments cover four language models of varying scale, including public models such as UL2 and GPT-3, and the authors detail how to reproduce their results with publicly available resources. They also raise ethical considerations about potential biases and inaccuracies in language model outputs, urging caution when interpreting results and continued work on model factuality and safety for real-world applications.
Overall, this paper presents a compelling argument for incorporating self-consistency into chain-of-thought prompting for improved performance on complex reasoning tasks while also addressing important considerations around reproducibility and ethics in utilizing language models for decision-making processes.
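The summary's point about uncertainty estimates can be made concrete with a small sketch: the share of sampled paths that agree with the winning answer serves as a rough confidence signal. The function name `vote_with_agreement` is hypothetical, and the use of the raw agreement fraction is an illustrative assumption rather than the paper's exact measure.

```python
from collections import Counter
from typing import Sequence, Tuple


def vote_with_agreement(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the fraction of paths that gave it.

    `answers` holds the final answers extracted from the sampled reasoning
    paths. A low fraction suggests the model is uncertain about the question.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)
```

A caller could, for instance, flag questions whose agreement falls below a chosen threshold for manual review. Any such threshold is an assumption that would need tuning per task.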
Created on 16 Apr. 2026

