SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation

AI-generated keywords: SpreadsheetBench, benchmark, large language models (LLMs), evaluation metric, multi-round prompting

AI-generated Key Points

SpreadsheetBench is a benchmark designed to test large language models (LLMs) in manipulating real-world spreadsheet scenarios
Constructed from 912 real questions sourced from online Excel forums, reflecting complex user needs and challenges
Benchmark includes diverse tabular data structures such as multiple tables, non-standard relational tables, and non-textual elements
Proposed evaluation metric involves creating multiple spreadsheet files as test cases for each instruction to ensure robust solutions
Significant performance gap between state-of-the-art models and human users revealed through comprehensive evaluation under single-round and multi-round inference settings
Comparison with previous benchmarks like SheetCopilotBench highlights complexity and realism of instructions in SpreadsheetBench
Spreadsheets in SpreadsheetBench feature flexible data organization with non-standard relational tables, textual information, and non-textual elements like colors
Evaluation includes different types of models showing varying performance scores ranging from 0.05% to 23.65%
Importance of enhancing coding abilities within LLMs for spreadsheet manipulation tasks emphasized
Potential benefits of multi-round prompting to improve response accuracy highlighted

Authors: Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, Jie Tang

arXiv: 2406.14991v1 (cs.CL)

Homepage: https://spreadsheetbench.github.io/

License: CC BY-SA 4.0

Abstract: We introduce SpreadsheetBench, a challenging spreadsheet manipulation benchmark exclusively derived from real-world scenarios, designed to immerse current large language models (LLMs) in the actual workflow of spreadsheet users. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is built from 912 real questions gathered from online Excel forums, which reflect the intricate needs of users. The associated spreadsheets from the forums contain a variety of tabular data such as multiple tables, non-standard relational tables, and abundant non-textual elements. Furthermore, we propose a more reliable evaluation metric akin to online judge platforms, where multiple spreadsheet files are created as test cases for each instruction, ensuring the evaluation of robust solutions capable of handling spreadsheets with varying values. Our comprehensive evaluation of various LLMs under both single-round and multi-round inference settings reveals a substantial gap between the state-of-the-art (SOTA) models and human performance, highlighting the benchmark's difficulty.

Submitted to arXiv on 21 Jun. 2024

Results of the summarizing process for the arXiv paper: 2406.14991v1

Comprehensive Summary

In this study, the researchers introduce SpreadsheetBench, a challenging benchmark designed to test how well large language models (LLMs) manipulate real-world spreadsheets. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is constructed from 912 real questions sourced from online Excel forums, which reflect the complex needs and challenges faced by users. The associated spreadsheets contain diverse tabular data structures such as multiple tables, non-standard relational tables, and various non-textual elements.

The researchers propose a more reliable evaluation metric inspired by online judge (OJ) platforms. Multiple spreadsheet files are created as test cases for each instruction, so that a solution is credited only if it is robust enough to handle spreadsheets with varying values. A comprehensive evaluation of various LLMs under both single-round and multi-round inference settings reveals a significant performance gap between state-of-the-art models and human users, which underscores the difficulty of the benchmark.

Comparisons with previous benchmarks such as SheetCopilotBench highlight the complexity and realism of the instructions in SpreadsheetBench. Its spreadsheets feature flexible data organization, with non-standard relational tables and cells that contain both textual information and non-textual elements such as colors. Some manipulation categories involve spreadsheets with tables extending beyond 100 columns and 20,000 rows.

The study evaluates several types of models: TableQA models, open-source LLMs for general tasks and for coding tasks, advanced closed-source models, and spreadsheet-specific LLMs. Performance scores range from 0.05% to 23.65% under the proposed OJ-style evaluation metric. Some methods even score as low as 0%, emphasizing the difficulty of the benchmark.

Overall, SpreadsheetBench stands out for its real-world instructions, diverse spreadsheet formats, and comprehensive testing strategy. The results suggest the importance of enhancing coding abilities within LLMs for spreadsheet manipulation tasks and highlight the potential benefits of multi-round prompting for improving response accuracy.
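The OJ-style idea described above, where one instruction is checked against several spreadsheet files, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: sheets are modelled as plain dictionaries rather than real spreadsheet files, the example instruction is invented, and the two aggregation functions (`pass_fraction` and `all_pass`) are hypothetical names, not the scoring code used by the benchmark.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a "sheet" is a dict from cell address to value, not a real file.
Sheet = Dict[str, float]
# One test case pairs an input sheet with the expected output sheet.
TestCase = Tuple[Sheet, Sheet]


def run_test_cases(solution: Callable[[Sheet], Sheet],
                   cases: List[TestCase]) -> List[bool]:
    """Run the solution on every input sheet and compare with the expected sheet."""
    results = []
    for input_sheet, expected in cases:
        try:
            actual = solution(dict(input_sheet))  # copy so the case stays untouched
        except Exception:
            results.append(False)  # a crash counts as a failed test case
            continue
        results.append(actual == expected)
    return results


def pass_fraction(results: List[bool]) -> float:
    """Hypothetical lenient score: share of test cases passed."""
    if not results:
        raise ValueError("at least one test case is required")
    return sum(results) / len(results)


def all_pass(results: List[bool]) -> float:
    """Hypothetical strict score: 1.0 only if every test case passes."""
    if not results:
        raise ValueError("at least one test case is required")
    return 1.0 if all(results) else 0.0


# Invented instruction: "Put the sum of A1 and A2 into A3."
cases: List[TestCase] = [
    ({"A1": 1, "A2": 2}, {"A1": 1, "A2": 2, "A3": 3}),
    ({"A1": 5, "A2": 7}, {"A1": 5, "A2": 7, "A3": 12}),
    ({"A1": 0, "A2": 4}, {"A1": 0, "A2": 4, "A3": 4}),
]


def hard_coded(sheet: Sheet) -> Sheet:
    sheet["A3"] = 3  # happens to be right for the first file only
    return sheet


def general(sheet: Sheet) -> Sheet:
    sheet["A3"] = sheet["A1"] + sheet["A2"]  # works for any values
    return sheet


if __name__ == "__main__":
    for name, solution in [("hard_coded", hard_coded), ("general", general)]:
        results = run_test_cases(solution, cases)
        print(name, results, pass_fraction(results), all_pass(results))
```

With a single test file the hard-coded answer would look correct; with three files it passes only one of them, which is the kind of brittle solution that multiple test cases are meant to expose.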

Summary

SpreadsheetBench is a test to see how well big computer programs can work with real-life spreadsheets. It uses 912 real questions from the internet about Excel to make sure it's challenging. The test includes different kinds of data like tables and non-text items. To check if the programs are good, they have to solve many spreadsheet problems in different files. The test shows that computers still need to get better at this compared to people.

Definitions

- Benchmark: A standard or point of reference used for comparison or evaluation.
- Large language models (LLMs): Computer programs designed to understand and generate human language.
- Tabular data structures: Data arranged in rows and columns like a table.
- Relational tables: Data tables that are connected or related to each other through common fields.
- Evaluation metric: A way to measure or judge how well something performs.
- Instruction: A set of steps or rules given for completing a task accurately.
- Inference settings: Conditions under which conclusions are drawn based on available information.
- Prompting: Providing cues or hints to guide responses or actions.
- Coding abilities: Skills related to writing, understanding, and using computer programming code.

Introduction

Spreadsheets are an essential tool for data analysis and manipulation. As data and tasks grow more complex, there is increasing demand for spreadsheet tools that can handle real-world scenarios efficiently. Large language models (LLMs), which can follow natural-language instructions, are a natural candidate for carrying out complex spreadsheet operations. However, evaluating LLMs on spreadsheet manipulation has been difficult because realistic benchmarks are lacking. Existing benchmarks rely on synthesized queries and simplified spreadsheet files, which do not reflect the complexities users face in practice. To address this, a team of researchers introduced SpreadsheetBench, a challenging benchmark designed to test the capabilities of LLMs in manipulating real-world spreadsheets.

The Study

The study introduces SpreadsheetBench together with an evaluation metric inspired by online judge platforms. The benchmark consists of 912 real questions sourced from online Excel forums, reflecting the diverse needs and challenges faced by users. These questions cover various categories such as data cleaning, formula writing, conditional formatting, and table manipulation. Unlike previous benchmarks such as SheetCopilotBench, which use one input-output pair per question, SpreadsheetBench attaches multiple input-output pairs to each question and evaluates models under both single-round and multi-round inference settings. This allows a more thorough evaluation of how well LLMs handle complex instructions and different types of spreadsheets. SpreadsheetBench also features diverse tabular data structures such as multiple tables, non-standard relational tables, and various non-textual elements like colors. The associated spreadsheets have flexible data organization, with cells containing textual information along with non-textual elements like images or charts.

Evaluation Process

To ensure that solutions are robust across different spreadsheets, the researchers create multiple spreadsheet files as test cases for each instruction in SpreadsheetBench. This gives a more reliable evaluation than previous benchmarks, which only check the final output of a single spreadsheet file. The study evaluates various types of LLMs, including TableQA models, open-source LLMs for general tasks and coding tasks, advanced closed-source models, and spreadsheet-specific LLMs. The results are measured using the proposed OJ-style evaluation metric, which scores a method by how many of the test cases it solves correctly.

Results

The results reveal a significant performance gap between state-of-the-art LLMs and human users. The average performance score for human users is 95%, while some methods scored as low as 0% on SpreadsheetBench. This highlights the difficulty and complexity of the benchmark. Comparisons between SpreadsheetBench and previous benchmarks like SheetCopilotBench show that SpreadsheetBench provides more realistic instructions and more diverse spreadsheet formats for evaluating LLMs.

Implications

The results have several implications for researchers and developers working on spreadsheet manipulation with LLMs. Firstly, they emphasize the importance of enhancing coding abilities within LLMs to handle complex real-world scenarios effectively. Secondly, they highlight the potential benefits of multi-round prompting for improving response accuracy.

Conclusion

SpreadsheetBench stands out as a comprehensive benchmark for evaluating how well LLMs manipulate real-world spreadsheets. Its combination of multiple test cases per instruction, single-round and multi-round inference settings, and diverse tabular data structures makes it a challenging yet realistic representation of user needs. The results suggest that there is still considerable room for improvement in how accurately LLMs handle complex instructions. Future research could focus on more advanced techniques for enhancing the coding abilities of these models.
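The text above does not spell out how the multi-round setting works. One common reading, assumed here, is a loop in which the model proposes code, the code is executed, and any error message is fed back for another attempt. The sketch below illustrates that loop only; `generate`, `fake_model`, and the feedback wording are hypothetical stand-ins, and no real model API is called.

```python
from typing import Callable, List, Optional, Tuple


def execute(code: str) -> Tuple[bool, str]:
    """Run generated code in a fresh namespace and return (ok, feedback)."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # illustration only; real use needs a sandbox
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, repr(namespace.get("result"))


def multi_round(generate: Callable[[List[str]], str],
                instruction: str,
                max_rounds: int = 3) -> Optional[str]:
    """Ask for code, run it, and feed errors back until it runs or rounds run out."""
    history = [instruction]
    for _ in range(max_rounds):
        code = generate(history)
        ok, feedback = execute(code)
        if ok:
            return code
        # Assumed feedback format: the failed code plus the error message.
        history.append(code)
        history.append("Execution failed: " + feedback)
    return None


def fake_model(history: List[str]) -> str:
    """Hypothetical stand-in for an LLM: buggy first answer, fixed after feedback."""
    if len(history) == 1:
        return "result = total / count"  # NameError: variables are undefined
    return "total, count = 10, 4\nresult = total / count"


if __name__ == "__main__":
    final_code = multi_round(fake_model, "Compute the average of the column.")
    print(final_code)
```

Setting `max_rounds=1` reduces the loop to a single attempt with no feedback, which is how the single-round setting is modelled in this sketch.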

Created on 09 Oct. 2024
