SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation

AI-generated keywords: SpreadsheetBench; benchmark; large language models (LLMs); evaluation metric; multi-round prompting

AI-generated Key Points

  • SpreadsheetBench is a benchmark designed to test large language models (LLMs) in manipulating real-world spreadsheet scenarios
  • Constructed from 912 real questions sourced from online Excel forums, reflecting complex user needs and challenges
  • Benchmark includes diverse tabular data structures such as multiple tables, non-standard relational tables, and non-textual elements
  • Proposed evaluation metric involves creating multiple spreadsheet files as test cases for each instruction to ensure robust solutions
  • Significant performance gap between state-of-the-art models and human users revealed through comprehensive evaluation under single-round and multi-round inference settings
  • Comparison with previous benchmarks like SheetCopilotBench highlights complexity and realism of instructions in SpreadsheetBench
  • Spreadsheets in SpreadsheetBench feature flexible data organization with non-standard relational tables, textual information, and non-textual elements like colors
  • Evaluation includes different types of models showing varying performance scores ranging from 0.05% to 23.65%
  • Importance of enhancing coding abilities within LLMs for spreadsheet manipulation tasks emphasized
  • Potential benefits of multi-round prompting to improve response accuracy highlighted

Authors: Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, Jie Tang

Homepage: https://spreadsheetbench.github.io/
License: CC BY-SA 4.0

Abstract: We introduce SpreadsheetBench, a challenging spreadsheet manipulation benchmark exclusively derived from real-world scenarios, designed to immerse current large language models (LLMs) in the actual workflow of spreadsheet users. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is built from 912 real questions gathered from online Excel forums, which reflect the intricate needs of users. The associated spreadsheets from the forums contain a variety of tabular data such as multiple tables, non-standard relational tables, and abundant non-textual elements. Furthermore, we propose a more reliable evaluation metric akin to online judge platforms, where multiple spreadsheet files are created as test cases for each instruction, ensuring the evaluation of robust solutions capable of handling spreadsheets with varying values. Our comprehensive evaluation of various LLMs under both single-round and multi-round inference settings reveals a substantial gap between the state-of-the-art (SOTA) models and human performance, highlighting the benchmark's difficulty.
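The online-judge-style metric described in the abstract can be illustrated with a small sketch. This is not the authors' evaluation code. It assumes a spreadsheet can be stood in for by a plain dictionary mapping cell addresses to values, and the names `run_test_cases`, `sum_column_a`, and `hardcoded` are hypothetical. Reporting both the fraction of passed cases and an all-cases-passed flag is an illustrative choice, not something stated in the text above.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a spreadsheet: cell address -> value.
Sheet = Dict[str, object]
# One test case: (input sheet, expected values of the answer cells).
TestCase = Tuple[Sheet, Sheet]


def run_test_cases(solution: Callable[[Sheet], Sheet],
                   test_cases: List[TestCase]) -> Tuple[float, bool]:
    """Run one solution on every test case of an instruction.

    Returns (fraction of cases passed, whether all cases passed).
    """
    passed = 0
    for input_sheet, expected_cells in test_cases:
        try:
            # Copy the input so one case cannot leak state into the next.
            output = solution(dict(input_sheet))
        except Exception:
            continue  # a crashing solution fails this test case
        if all(output.get(addr) == value
               for addr, value in expected_cells.items()):
            passed += 1
    total = len(test_cases)
    fraction = passed / total if total else 0.0
    return fraction, total > 0 and passed == total


def sum_column_a(sheet: Sheet) -> Sheet:
    """A general solution: write the sum of column A into B1."""
    sheet["B1"] = sum(v for k, v in sheet.items()
                      if k.startswith("A") and isinstance(v, (int, float)))
    return sheet


def hardcoded(sheet: Sheet) -> Sheet:
    """A brittle solution that only fits the first spreadsheet's values."""
    sheet["B1"] = 6
    return sheet


# Same instruction, three spreadsheets with different values.
cases: List[TestCase] = [
    ({"A1": 1, "A2": 2, "A3": 3}, {"B1": 6}),
    ({"A1": 10, "A2": 5}, {"B1": 15}),
    ({"A1": 0}, {"B1": 0}),
]

general_result = run_test_cases(sum_column_a, cases)
brittle_result = run_test_cases(hardcoded, cases)
```

With a single test file, the hardcoded solution would look correct. Spreading the same instruction across several spreadsheets with different values is what exposes it.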

Submitted to arXiv on 21 Jun. 2024


Results of the summarizing process for the arXiv paper: 2406.14991v1

In this study, the researchers introduce SpreadsheetBench, a challenging benchmark designed to test the capabilities of large language models (LLMs) in manipulating real-world spreadsheets. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is constructed from 912 real questions sourced from online Excel forums, reflecting the complex needs and challenges faced by users. The associated spreadsheets contain diverse tabular data structures such as multiple tables, non-standard relational tables, and various non-textual elements.

The researchers propose a more reliable evaluation metric inspired by online judge (OJ) platforms. For each instruction, multiple spreadsheet files are created as test cases, so that a solution only counts as robust if it handles spreadsheets with varying values. A comprehensive evaluation of various LLMs under both single-round and multi-round inference settings reveals a significant performance gap between state-of-the-art models and human users, which underscores the difficulty of the benchmark.

Comparisons with previous benchmarks such as SheetCopilotBench highlight the complexity and realism of the instructions in SpreadsheetBench. Its spreadsheets feature flexible data organization, with non-standard relational tables and cells containing both textual information and non-textual elements like colors. Additionally, the manipulation categories involve spreadsheets with tables extending beyond 100 columns and 20,000 rows.

The study evaluates several types of models, including TableQA models, open-source LLMs for general tasks and coding tasks, advanced closed-source models, and spreadsheet-specific LLMs. Results show varying performance scores ranging from 0.05% to 23.65% according to the proposed OJ-style evaluation metric. Some methods even score as low as 0%, emphasizing the difficulty of the benchmark.

Overall, SpreadsheetBench stands out for its real-world instructions, diverse spreadsheet formats, and comprehensive testing strategy. The results suggest the importance of enhancing coding abilities within LLMs for spreadsheet manipulation tasks and highlight the potential benefits of multi-round prompting to improve response accuracy.
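The summary does not describe how the multi-round setting is implemented, so the following is only a minimal sketch of one plausible form of multi-round prompting: generated code is executed, and any error message is fed back to the model for another attempt. The names `multi_round_solve` and `fake_model` are hypothetical, the model is replaced by a stub function, and the spreadsheet is again assumed to be a plain dictionary of cell values.

```python
from typing import Callable, Dict, Optional, Tuple

Sheet = Dict[str, object]  # hypothetical stand-in: cell address -> value


def multi_round_solve(generate: Callable[[Optional[str]], str],
                      sheet: Sheet,
                      max_rounds: int = 3) -> Tuple[Optional[Sheet], int]:
    """Ask `generate` for code, run it, and retry with the error as feedback.

    Returns (resulting sheet or None if every round failed, rounds used).
    """
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        code = generate(feedback)
        env = {"sheet": dict(sheet)}  # fresh copy of the sheet each round
        try:
            # NOTE: exec on model output is unsafe outside a sandbox;
            # it is used here only to keep the sketch short.
            exec(code, env)
            return env["sheet"], round_no
        except Exception as exc:
            # The execution error becomes the next round's feedback.
            feedback = f"{type(exc).__name__}: {exc}"
    return None, max_rounds


def fake_model(feedback: Optional[str]) -> str:
    """Stub for an LLM: the first attempt reads a cell that does not exist."""
    if feedback is None:
        return "sheet['C1'] = sheet['A1'] + sheet['B9']"
    return "sheet['C1'] = sheet['A1'] + sheet['B1']"


result, rounds_used = multi_round_solve(fake_model, {"A1": 2, "B1": 3})
```

In the single-round setting the first attempt would simply fail. The second round succeeds here only because the error message reaches the generator, which is the kind of benefit the summary attributes to multi-round prompting.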
Created on 09 Oct. 2024


