Robustness Assessment of Mathematical Reasoning in the Presence of Missing and Contradictory Conditions
AI-generated Key Points
- Few-shot prompting techniques have enhanced the performance of Large Language Models (LLMs) on reasoning tasks.
- Existing evaluations focus on well-structured benchmarks and neglect real-world reasoning problems with missing and contradictory conditions, known as ill-defined problems.
- Because of this evaluation gap, a weakness has gone largely unnoticed: current few-shot prompting methods struggle with ill-defined problems, often producing overconfident answers or hallucinations.
- A new benchmark called Problems with Missing and Contradictory Conditions (PMC) has been developed to assess few-shot prompting methods' performance on ill-defined problems, introducing two novel metrics for evaluation.
- A trade-off dilemma exists between mathematical reasoning for well-defined problems and recognizing ill-defined problems when using few-shot prompting methods.
- The SMT-LIB Prompting (SLP) method has been proposed to address these challenges by utilizing the SMT-LIB language to model problems instead of solving them directly. It employs a double-check solving strategy for verifying solutions' satisfiability and uniqueness, leading to more accurate feedback.
- Extensive experiments have demonstrated the superiority of the SLP approach compared to existing few-shot prompting methods in tackling problems with missing and contradictory conditions.
- Related work includes advances in few-shot prompting (CoT-type, program-type, and ensemble-optimized methods); work on enhancing mathematical reasoning, such as MetaMath, WizardMath, MuggleMath, and MathVista; and research on LLM robustness, including perturbations to model inputs and prompting with noisy ground truth.
Authors: Shi-Yu Tian, Zhi Zhou, Lin-Han Jia, Lan-Zhe Guo, Yu-Feng Li
Abstract: Large language models (LLMs) have demonstrated impressive performance on reasoning tasks, which can be further improved through few-shot prompting techniques. However, the current evaluation primarily focuses on carefully constructed benchmarks and neglects the consideration of real-world reasoning problems that present missing and contradictory conditions, known as ill-defined problems. Our observations suggest that existing few-shot prompting techniques are ineffective in such scenarios, often providing overconfident answers or hallucination. To further study this problem, we develop a benchmark called Problems with Missing and Contradictory conditions (PMC) and introduce two novel metrics to evaluate the performance of few-shot prompting methods in these scenarios. Our analysis using the PMC benchmark reveals a trade-off dilemma between the performance of mathematical reasoning for well-defined problems and the ability to recognize ill-defined problems. To address the challenges posed by PMC, we propose a novel few-shot prompting method called SMT-LIB Prompting (SLP), which utilizes the SMT-LIB language to model the problems instead of solving them directly. Subsequently, a double-check solving strategy checks the satisfiability and uniqueness of the solution and provides final feedback. Extensive experiments demonstrate the superiority of our SLP approach compared to existing few-shot prompting methods when dealing with problems with missing and contradictory conditions. We will open-source our benchmark and code to facilitate future research.
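The double-check strategy described in the abstract (first test satisfiability, then test uniqueness) can be illustrated with a minimal Python sketch. This is not the authors' implementation: SLP models problems in the SMT-LIB language and presumably hands them to a solver, whereas this sketch substitutes brute-force enumeration over a small integer range. The function name `double_check`, the bounded `domain`, the label strings, and the toy apples-and-oranges problem are all assumptions made for illustration.

```python
from itertools import product


def double_check(variables, constraints, target, domain=range(0, 51)):
    """Classify a problem by satisfiability, then uniqueness of the target value.

    Hypothetical stand-in for a solver: enumerates every assignment of the
    variables over a bounded integer domain (an assumption of this sketch).
    """
    answers = set()
    for values in product(domain, repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(constraint(env) for constraint in constraints):
            answers.add(env[target])
            if len(answers) > 1:
                # Satisfiable but not unique: some condition is missing.
                return ("missing", None)
    if not answers:
        # No assignment satisfies all constraints: conditions contradict.
        return ("contradictory", None)
    return ("well-defined", answers.pop())


# Toy problem: "There are 10 fruits in total and 4 more apples than oranges."
names = ["apples", "oranges"]
total = lambda e: e["apples"] + e["oranges"] == 10
gap = lambda e: e["apples"] == e["oranges"] + 4
clash = lambda e: e["apples"] == 2  # extra condition that conflicts with the others

well_defined = double_check(names, [total, gap], "apples")
missing = double_check(names, [total], "apples")
contradictory = double_check(names, [total, gap, clash], "apples")
```

In this sketch, an answer is returned only when exactly one target value survives both checks; otherwise the problem is flagged as ill-defined rather than answered.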