Are Your LLMs Capable of Stable Reasoning?

AI-generated keywords: Large Language Models, Stable Reasoning, Evaluation Metrics, G-Pass@k, LiveMathBench

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors address the gap between benchmark performances and real-world applications in Large Language Models (LLMs)
  • Current evaluation protocols and metrics fail to fully capture the diverse capabilities of LLMs, especially in complex reasoning tasks
  • Introduce G-Pass@k as a novel evaluation metric for continuous assessment of model performance and stability
  • Present LiveMathBench as a dynamic benchmark for challenging mathematical problems to minimize data leakage risks during evaluation
  • Extensive experiments using G-Pass@k on cutting-edge LLMs with LiveMathBench provide insights into maximum capabilities and operational consistency of these models
  • Findings highlight room for improvement in LLMs' "realistic" reasoning abilities and emphasize the need for more robust evaluation methods

Authors: Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen

Preprint

Abstract: The rapid advancement of Large Language Models (LLMs) has demonstrated remarkable progress in complex reasoning tasks. However, a significant discrepancy persists between benchmark performances and real-world applications. We identify this gap as primarily stemming from current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, particularly in complex reasoning tasks where both accuracy and consistency are crucial. This work makes two key contributions. First, we introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance across multiple sampling attempts, quantifying both the model's peak performance potential and its stability. Second, we present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems designed to minimize data leakage risks during evaluation. Through extensive experiments using G-Pass@k on state-of-the-art LLMs with LiveMathBench, we provide comprehensive insights into both their maximum capabilities and operational consistency. Our findings reveal substantial room for improvement in LLMs' "realistic" reasoning capabilities, highlighting the need for more robust evaluation methods. The benchmark and detailed results are available at: https://github.com/open-compass/GPassK.

Submitted to arXiv on 17 Dec. 2024


Results of the summarizing process for the arXiv paper: 2412.13147v2

In their paper "Are Your LLMs Capable of Stable Reasoning?", authors Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang and Kai Chen address the gap between benchmark performance and real-world application of Large Language Models (LLMs). They attribute this gap primarily to current evaluation protocols and metrics, which fail to capture the full range of LLM capabilities, particularly in complex reasoning tasks where both accuracy and consistency matter. The paper makes two contributions. First, it proposes G-Pass@k, a novel evaluation metric that gives a continuous assessment of model performance across multiple sampling attempts, quantifying both a model's peak performance potential and its stability across those attempts. Second, it presents LiveMathBench, a dynamic benchmark of challenging, contemporary mathematical problems designed to minimize data leakage risks during evaluation. Through extensive experiments using G-Pass@k on state-of-the-art LLMs with LiveMathBench, the authors report on both the maximum capabilities and the operational consistency of these models. Their findings show substantial room for improvement in LLMs' "realistic" reasoning abilities and point to the need for more robust evaluation methods. The benchmark and detailed results are available at: https://github.com/open-compass/GPassK.
Created on 21 Dec. 2024
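
The metadata summarized above describes G-Pass@k only in words and does not give its formula. As a rough illustration of how a metric can score peak performance and stability together, the sketch below assumes a hypergeometric formulation: for one problem with n generated answers of which c are correct, it computes the probability that at least ceil(tau * k) of k answers drawn without replacement are correct. The function names and the threshold parameter tau are illustrative assumptions, not definitions taken from this page; the authors' repository remains the reference for the actual metric.

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Assumed formulation: probability that at least ceil(tau * k) of k
    answers, drawn without replacement from n generations of which c are
    correct, are correct (a hypergeometric tail probability)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if not (0.0 <= tau <= 1.0):
        raise ValueError("tau must lie in [0, 1]")
    # Require at least one correct answer, so tau near 0 reduces to pass@k.
    need = max(1, ceil(tau * k))
    favorable = sum(
        comb(c, j) * comb(n - c, k - j)  # j correct draws, k - j incorrect
        for j in range(need, min(c, k) + 1)
    )
    return favorable / comb(n, k)


def mean_g_pass_at_k(counts: list[tuple[int, int]], k: int, tau: float) -> float:
    """Average the per-problem score over (n, c) pairs, one pair per problem."""
    return sum(g_pass_at_k(n, c, k, tau) for n, c in counts) / len(counts)
```

Under this assumed formulation, a model that answers a problem correctly in 8 of 16 generations scores about 0.96 with k = 4 at a very small tau (at least one of four draws correct) but only about 0.04 at tau = 1.0 (all four draws correct), which is the kind of gap between peak capability and consistency that the summary describes.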
