Confidence Improves Self-Consistency in LLMs

AI-generated keywords: Confidence-Informed Self-Consistency

AI-generated Key Points

Introduction of Confidence-Informed Self-Consistency (CISC) as a decoding strategy to enhance Large Language Models (LLMs) performance on reasoning tasks
CISC uses confidence scores obtained directly from the model to implement a weighted majority vote, prioritizing high-confidence paths for identifying correct answers with smaller sample sizes
Experiments on nine models and four datasets showed that CISC outperformed traditional self-consistency in nearly all configurations, reducing required reasoning paths by over 40% on average
Within-question confidence evaluation introduced, after showing that standard evaluation methods are poor predictors of success in distinguishing correct and incorrect answers to the same question
Qualitative analysis revealed agreement between model confidence scores and human assessments of reasoning paths' quality, suggesting LLMs can self-assess their responses


Authors: Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, Gal Yona

arXiv: 2502.06233v1 (cs.CL)

License: CC BY 4.0

Abstract: Self-consistency decoding enhances LLMs' performance on reasoning tasks by sampling diverse reasoning paths and selecting the most frequent answer. However, it is computationally expensive, as sampling many of these (lengthy) paths is required to increase the chances that the correct answer emerges as the most frequent one. To address this, we introduce Confidence-Informed Self-Consistency (CISC). CISC performs a weighted majority vote based on confidence scores obtained directly from the model. By prioritizing high-confidence paths, it can identify the correct answer with a significantly smaller sample size. When tested on nine models and four datasets, CISC outperforms self-consistency in nearly all configurations, reducing the required number of reasoning paths by over 40% on average. In addition, we introduce the notion of within-question confidence evaluation, after showing that standard evaluation methods are poor predictors of success in distinguishing correct and incorrect answers to the same question. In fact, the most calibrated confidence method proved to be the least effective for CISC. Lastly, beyond these practical implications, our results and analyses show that LLMs can effectively judge the correctness of their own outputs, contributing to the ongoing debate on this topic.

Submitted to arXiv on 10 Feb. 2025


Results of the summarizing process for the arXiv paper: 2502.06233v1


In this study, the researchers introduce Confidence-Informed Self-Consistency (CISC), a decoding strategy that improves the performance of Large Language Models (LLMs) on reasoning tasks. Standard self-consistency decoding is effective but computationally expensive: many reasoning paths must be sampled before the correct answer reliably emerges as the most frequent one. CISC addresses this by performing a weighted majority vote based on confidence scores obtained directly from the model. By prioritizing high-confidence paths, CISC can identify the correct answer with a significantly smaller sample size, which reduces computational cost.

The researchers compared CISC and self-consistency across several confidence extraction methods, reasoning tasks, and LLMs, covering nine models and four datasets. CISC outperformed self-consistency in nearly all configurations, reducing the required number of reasoning paths by over 40% on average. They also introduced within-question confidence evaluation, after showing that standard evaluation methods are poor predictors of success in distinguishing correct and incorrect answers to the same question. In fact, the most calibrated confidence method proved to be the least effective for CISC.

A qualitative analysis further revealed significant agreement between model confidence scores and human assessments of the quality of reasoning paths. Responses identified by the model as low-confidence were more likely to be flagged by human evaluators as exhibiting low-quality reasoning patterns. This suggests that LLMs are capable of self-assessing their responses.

Overall, the study contributes both a practical method and foundational insight. It proposes CISC as an efficient alternative to self-consistency, introduces within-question confidence evaluation as a way to assess confidence methods, and provides empirical evidence that LLMs can judge the correctness of their own outputs, contributing to the ongoing debate on this topic.


Summary

Confidence-Informed Self-Consistency (CISC) is a way to help big language models do better at thinking tasks. The model answers the same question several times, and CISC uses how sure the model is about each answer to decide which one to trust. Tests showed that CISC works better than the older method, needing fewer tries to get the right answer. The researchers also found a better way to test confidence scores: check whether, for the same question, the model is more sure about its right answers than its wrong ones. Human reviewers tended to agree with the model: answers it was unsure about more often showed weak reasoning.

Definitions

- Confidence: How sure you are about something.
- Self-consistency: Asking the model the same question many times and keeping the answer that comes up most often.
- Decoding strategy: A plan for how a model turns its predictions into a final answer.
- Large Language Models (LLMs): Big computer programs that understand and generate human language.
- Reasoning tasks: Figuring things out by thinking logically.

Introduction

Natural Language Processing (NLP) has made significant advancements in recent years, with the development of Large Language Models (LLMs) being at the forefront. These models have shown impressive capabilities in various tasks such as language translation, text summarization, and question-answering. However, one area where LLMs still struggle is reasoning tasks. Reasoning involves understanding complex relationships between different pieces of information and using that understanding to arrive at a logical conclusion. While humans excel at this task, it remains a challenge for machines. In the research paper titled "Confidence Improves Self-Consistency in LLMs," the authors introduce a new decoding strategy called Confidence-Informed Self-Consistency (CISC) to improve LLM performance on reasoning tasks. Standard self-consistency samples diverse reasoning paths and selects the most frequent answer. It is effective but computationally expensive, because many (lengthy) paths must be sampled to increase the chances that the correct answer emerges as the most frequent one. CISC addresses this issue by performing a weighted majority vote based on confidence scores obtained directly from the model, so that high-confidence paths count for more.
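To make the difference between the two voting schemes concrete, here is a minimal Python sketch. It assumes each sampled reasoning path has already been reduced to a final answer string and a single confidence score; the function names and the toy numbers are illustrative and do not come from the paper.

```python
from collections import Counter, defaultdict


def self_consistency(answers):
    """Plain majority vote: return the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]


def cisc(answers, confidences):
    """Weighted majority vote: sum the confidence of every path that
    produced a given answer and return the answer with the highest total."""
    if len(answers) != len(confidences):
        raise ValueError("answers and confidences must have the same length")
    totals = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Toy data (invented for illustration): three low-confidence paths agree
# on "41", two high-confidence paths agree on "42".
answers = ["41", "41", "41", "42", "42"]
confidences = [0.2, 0.3, 0.2, 0.9, 0.8]

print(self_consistency(answers))   # majority vote picks "41"
print(cisc(answers, confidences))  # weighted vote picks "42"
```

In this toy case the two methods disagree because the minority answer carries more total confidence. How the raw scores are extracted and whether they are rescaled before voting are design choices this summary does not specify; the sketch uses the raw scores as weights.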

The Study

The researchers conducted experiments comparing CISC and self-consistency across various confidence extraction methods, reasoning tasks, and LLMs. In total, the evaluation covered nine models and four datasets, and in every configuration the confidence scores were obtained directly from the model itself. The results showed that CISC outperformed self-consistency in nearly all configurations. On average, CISC reduced the required number of reasoning paths by over 40%. This is a meaningful saving, because the cost of self-consistency grows with every additional lengthy reasoning path that has to be sampled.
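One plausible way to read the "required number of reasoning paths" statistic is sketched below in Python: find the smallest CISC sample budget whose accuracy matches self-consistency at a given budget, then report the relative saving. This is an assumption about the bookkeeping, not the paper's stated procedure, and the accuracy curves are made-up numbers used only to exercise the function.

```python
def budget_reduction(sc_accuracy, cisc_accuracy, sc_budget):
    """Fraction of reasoning paths saved when CISC matches the accuracy
    that self-consistency reaches with `sc_budget` paths.

    Both accuracy arguments map a sample budget (int) to an accuracy.
    Returns None if CISC never reaches the target accuracy.
    """
    target = sc_accuracy[sc_budget]
    matching = [b for b, acc in cisc_accuracy.items() if acc >= target]
    if not matching:
        return None
    return 1 - min(matching) / sc_budget


# Hypothetical accuracy-vs-budget curves (NOT results from the paper).
sc_curve = {1: 0.60, 5: 0.70, 10: 0.74}
cisc_curve = {1: 0.60, 3: 0.68, 5: 0.72, 6: 0.74, 10: 0.77}

saving = budget_reduction(sc_curve, cisc_curve, sc_budget=10)
print(f"{saving:.0%}")  # CISC needs 6 paths instead of 10
```

With these invented curves, CISC reaches the 10-path self-consistency accuracy using 6 paths, which corresponds to a 40% reduction.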

Within-Question Confidence Evaluation

One of the key contributions of this study is the notion of within-question confidence evaluation. The researchers found that standard methods for evaluating confidence scores are poor predictors of success in distinguishing correct and incorrect answers to the same question. That distinction is exactly what CISC relies on, since its weighted vote is taken over several answers to a single question. Within-question evaluation therefore compares the confidence scores of different reasoning paths for the same question and asks whether the correct paths receive higher scores than the incorrect ones. Strikingly, the most calibrated confidence method proved to be the least effective for CISC, which shows that a confidence method can look good by standard measures and still be a poor fit for weighted voting.
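A within-question evaluation of this kind could be sketched as follows. The Python function below counts, over all pairs of one correct and one incorrect response to the same question, how often the correct response has the higher confidence. The function name, the tie handling, and the example data are assumptions made for illustration; the summary does not give the paper's exact metric.

```python
from itertools import product


def within_question_discrimination(questions):
    """Fraction of (correct, incorrect) response pairs, drawn from the
    same question, in which the correct response has higher confidence.

    `questions` is a list with one entry per question; each entry is a
    list of (confidence, is_correct) tuples. Ties count as half a win.
    Returns NaN if no question has both a correct and an incorrect response.
    """
    wins = 0.0
    pairs = 0
    for responses in questions:
        correct = [c for c, ok in responses if ok]
        incorrect = [c for c, ok in responses if not ok]
        for c_good, c_bad in product(correct, incorrect):
            pairs += 1
            if c_good > c_bad:
                wins += 1.0
            elif c_good == c_bad:
                wins += 0.5
    return wins / pairs if pairs else float("nan")


# Illustrative data: two questions with a mix of correct and incorrect paths.
example = [
    [(0.9, True), (0.4, False), (0.6, False)],  # correct path wins both pairs
    [(0.3, True), (0.5, False)],                # correct path loses its pair
]
score = within_question_discrimination(example)
print(round(score, 3))
```

A score of 0.5 would mean the confidence scores are no better than chance at separating correct from incorrect answers within a question. Because pairs are never formed across questions, a method that is well calibrated overall can still score poorly here.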

Model Self-Assessment

Another interesting finding from this study is that there was a significant agreement between model confidence scores and human assessments of reasoning paths' quality. Responses identified by the model as low-confidence were more likely to be flagged by human evaluators as exhibiting signs of low-quality reasoning patterns. This suggests that LLMs are capable of self-assessing their responses. This finding has practical implications for improving LLM performance on reasoning tasks. By identifying low-confidence responses, we can potentially improve models' training data or fine-tune them to better handle specific types of questions or concepts.

Conclusion

In conclusion, "Confidence Improves Self-Consistency in LLMs" introduces CISC as an efficient alternative to self-consistency for LLMs on reasoning tasks. It also introduces within-question confidence evaluation as a better way to judge which confidence methods are useful for this purpose, and it provides evidence supporting LLMs' ability to self-assess their outputs. These findings have practical implications for reducing the cost of reasoning with LLMs and contribute to the ongoing debate about whether LLMs can judge the correctness of their own outputs. The research also opens avenues for future work on using confidence scores as a tool for model self-assessment.

Created on 18 Oct. 2025


