In this study, the researchers introduce SpreadsheetBench, a challenging benchmark designed to test how well large language models (LLMs) manipulate spreadsheets in real-world scenarios. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is constructed from 912 real questions sourced from online Excel forums, which reflect the complex needs and challenges faced by users. The associated spreadsheets contain diverse tabular data structures such as multiple tables, non-standard relational tables, and various non-textual elements.
The researchers propose a more reliable evaluation metric inspired by online judge (OJ) platforms. It involves creating multiple spreadsheet files as test cases for each instruction, to ensure that solutions are robust and can handle different types of spreadsheets. A comprehensive evaluation of various LLMs under both single-round and multi-round inference settings reveals a significant performance gap between state-of-the-art models and human users, which underscores the difficulty of the benchmark.
Comparisons with previous benchmarks such as SheetCopilotBench highlight the complexity and realism of the instructions in SpreadsheetBench. Its spreadsheets feature flexible data organization, with non-standard relational tables and cells that contain both textual information and non-textual elements like colors. Additionally, the manipulation categories involve spreadsheets with tables extending beyond 100 columns and 20,000 rows.
The study evaluates different types of models, including TableQA models, open-source LLMs for general tasks and coding tasks, advanced closed-source models, and spreadsheet-specific LLMs. Results show performance scores ranging from 0.05% to 23.65% according to the proposed OJ-style evaluation metric. Some methods even score as low as 0%, emphasizing the difficulty of the benchmark. Overall, SpreadsheetBench stands out for its real-world instructions, diverse spreadsheet formats, and comprehensive testing strategy. The results suggest the importance of enhancing coding abilities within LLMs for spreadsheet manipulation tasks and highlight the potential benefits of multi-round prompting to improve response accuracy.
- SpreadsheetBench is a benchmark designed to test large language models (LLMs) in manipulating real-world spreadsheet scenarios
- Constructed from 912 real questions sourced from online Excel forums, reflecting complex user needs and challenges
- Benchmark includes diverse tabular data structures such as multiple tables, non-standard relational tables, and non-textual elements
- Proposed evaluation metric involves creating multiple spreadsheet files as test cases for each instruction to ensure robust solutions
- Significant performance gap between state-of-the-art models and human users revealed through comprehensive evaluation under single-round and multi-round inference settings
- Comparison with previous benchmarks like SheetCopilotBench highlights complexity and realism of instructions in SpreadsheetBench
- Spreadsheets in SpreadsheetBench feature flexible data organization with non-standard relational tables, textual information, and non-textual elements like colors
- Evaluation includes different types of models showing varying performance scores ranging from 0.05% to 23.65%
- Importance of enhancing coding abilities within LLMs for spreadsheet manipulation tasks emphasized
- Potential benefits of multi-round prompting to improve response accuracy highlighted
Summary
SpreadsheetBench is a test to see how well big computer programs can work with real-life spreadsheets. It uses 912 real questions from the internet about Excel to make sure it's challenging. The test includes different kinds of data like tables and non-text items. To check if the programs are good, they have to solve many spreadsheet problems in different files. The test shows that computers still need to get better at this compared to people.
Definitions
- Benchmark: A standard or point of reference used for comparison or evaluation.
- Large language models (LLMs): Computer programs designed to understand and generate human language.
- Tabular data structures: Data arranged in rows and columns like a table.
- Relational tables: Data tables that are connected or related to each other through common fields.
- Evaluation metric: A way to measure or judge how well something performs.
- Instruction: A set of steps or rules given for completing a task accurately.
- Inference settings: Conditions under which conclusions are drawn based on available information.
- Prompting: Providing cues or hints to guide responses or actions.
- Coding abilities: Skills related to writing, understanding, and using computer programming code.
Introduction
In today's digital age, spreadsheets have become an essential tool for data analysis and manipulation. With the increasing complexity of data and tasks, there is a growing demand for more advanced spreadsheet tools that can handle real-world scenarios efficiently. This has led to the development of large language models (LLMs) capable of understanding natural language instructions and performing complex spreadsheet operations.
However, evaluating the performance of LLMs in spreadsheet manipulation tasks has been a challenge due to the lack of realistic benchmarks. Existing benchmarks rely on synthesized queries and simplified spreadsheet files, which do not accurately reflect the complexities faced by users in real-world scenarios. To address this issue, a team of researchers introduced SpreadsheetBench, a challenging benchmark specifically designed to test the capabilities of LLMs in manipulating real-world spreadsheet scenarios.
The Study
The study introduces SpreadsheetBench as a comprehensive benchmark, paired with an evaluation metric inspired by online judge platforms. The benchmark consists of 912 real questions sourced from online Excel forums, reflecting the diverse needs and challenges faced by users. These questions cover various categories such as data cleaning, formula writing, conditional formatting, and table manipulation.
Unlike previous benchmarks such as SheetCopilotBench, which check a solution against a single spreadsheet file, SpreadsheetBench pairs each instruction with multiple spreadsheet files as test cases. Models are also evaluated under both single-round and multi-round inference settings. Together, these allow for a more thorough evaluation of LLMs' abilities to handle complex instructions and manipulate different types of spreadsheets.
Additionally, SpreadsheetBench features diverse tabular data structures such as multiple tables and non-standard relational tables. The associated spreadsheets have flexible data organization, with cells containing textual information along with non-textual elements like colors.
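To make the layouts described above concrete, the following Python sketch shows why a sheet holding several tables cannot be read as one header-plus-rows grid. It is only an illustration: the `split_tables` helper, the list-of-lists sheet representation, and the "blank row separates tables" rule are assumptions introduced here, not part of SpreadsheetBench.

```python
from typing import List, Optional

Row = List[Optional[object]]  # assumed representation: one list of cell values per row


def split_tables(rows: List[Row]) -> List[List[Row]]:
    """Group consecutive non-empty rows into blocks, treating blank rows as separators."""
    tables: List[List[Row]] = []
    current: List[Row] = []
    for row in rows:
        if any(cell is not None and cell != "" for cell in row):
            current.append(row)
        elif current:
            # A blank row closes the block collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


# A hypothetical sheet: a title row, then two unrelated tables.
sheet = [
    ["Sales report", None, None],  # title, not a header row
    [None, None, None],
    ["Region", "Q1", "Q2"],
    ["North", 10, 12],
    [None, None, None],
    ["Product", "Units", None],
    ["Widget", 5, None],
]

tables = split_tables(sheet)
print(len(tables))   # number of blocks found
print(tables[1][0])  # header row of the first real table
```

A tool that assumes the header sits in the first row would treat "Sales report" as a column name here, which is the kind of mistake that simplified benchmark files never expose.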
Evaluation Process
To ensure robust solutions capable of handling different types of spreadsheets effectively, the researchers propose creating multiple spreadsheet files as test cases for each instruction in SpreadsheetBench. This approach provides a more reliable evaluation metric compared to previous benchmarks, which only evaluate the final output of a single spreadsheet file.
The study evaluates various types of models, including TableQA models, open-source LLMs for general tasks and coding tasks, advanced closed-source models, and spreadsheet-specific LLMs. The results are measured using the proposed OJ-style evaluation metric, which scores each solution against the multiple test-case spreadsheets created for its instruction.
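The OJ-style idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual evaluation code: the function names, the representation of a spreadsheet as a dict from cell address to value, and the decision to report both a fraction-passed score and an all-or-nothing score are assumptions made here.

```python
from typing import Callable, Dict, List, Tuple

Sheet = Dict[str, object]        # assumed: cell address -> value
TestCase = Tuple[Sheet, Sheet]   # (input sheet, expected cells after manipulation)


def passes(solution: Callable[[Sheet], Sheet], case: TestCase) -> bool:
    """Run the solution on one input sheet and compare the expected cells."""
    sheet_in, expected = case
    try:
        result = solution(dict(sheet_in))  # copy so test cases stay independent
    except Exception:
        return False  # a crashing solution fails the test case
    return all(result.get(cell) == value for cell, value in expected.items())


def score_instruction(solution: Callable[[Sheet], Sheet],
                      cases: List[TestCase]) -> Tuple[float, int]:
    """Return (fraction of cases passed, 1 if every case passed else 0)."""
    passed = sum(passes(solution, case) for case in cases)
    return passed / len(cases), int(passed == len(cases))


def hardcoded_total(sheet: Sheet) -> Sheet:
    # Only correct when column A has exactly two rows.
    sheet["B1"] = sheet["A1"] + sheet["A2"]
    return sheet


def robust_total(sheet: Sheet) -> Sheet:
    # Sums every cell whose address starts with "A", however many rows there are.
    sheet["B1"] = sum(v for k, v in sheet.items() if k.startswith("A"))
    return sheet


# Two hypothetical test cases for the instruction "put the sum of column A in B1".
cases = [
    ({"A1": 1, "A2": 2}, {"B1": 3}),
    ({"A1": 1, "A2": 2, "A3": 4}, {"B1": 7}),
]

print(score_instruction(hardcoded_total, cases))  # → (0.5, 0)
print(score_instruction(robust_total, cases))     # → (1.0, 1)
```

The hard-coded solution looks correct on the first file alone, which is why checking a single spreadsheet can overstate how well a solution generalizes.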
Results
The results of the study reveal a significant performance gap between state-of-the-art LLMs and human users. The average performance score for human users is 95%, while some methods in SpreadsheetBench scored as low as 0%. This highlights the difficulty and complexity of the benchmark.
Furthermore, comparisons are made between SpreadsheetBench and previous benchmarks like SheetCopilotBench. These comparisons demonstrate that SpreadsheetBench provides more realistic instructions and diverse spreadsheet formats for evaluating LLMs' capabilities accurately.
Implications
The results of this study have several implications for both researchers and developers working on improving LLMs' abilities in spreadsheet manipulation tasks. Firstly, they emphasize the importance of enhancing coding abilities within LLMs to handle complex real-world scenarios effectively. Secondly, they highlight the potential benefits of multi-round prompting, in which the model can refine its response over several rounds, for improving response accuracy.
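One way multi-round prompting could work is sketched below in Python. The summary does not say what is passed between rounds, so this sketch assumes that each round feeds the previous execution error back to the model. The `multi_round` function, the `generate` and `execute` callables, and the stub model are all hypothetical stand-ins, not an API from the study.

```python
from typing import Callable, List, Optional, Tuple


def multi_round(
    instruction: str,
    generate: Callable[[str, List[str]], str],  # hypothetical model call
    execute: Callable[[str], Optional[str]],    # returns an error message, or None on success
    max_rounds: int = 3,
) -> Tuple[str, bool, int]:
    """Ask for code, run it, and feed any error back until it runs or rounds run out.

    Returns (last generated code, whether it ran cleanly, rounds used).
    """
    feedback: List[str] = []
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = generate(instruction, feedback)
        error = execute(code)
        if error is None:
            return code, True, round_no
        feedback.append(error)  # assumed feedback signal: the execution error
    return code, False, max_rounds


def run_snippet(code: str) -> Optional[str]:
    """Execute generated code in a fresh namespace (demo only, not a sandbox)."""
    try:
        exec(code, {})
        return None
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"


def stub_model(instruction: str, feedback: List[str]) -> str:
    # Stand-in for an LLM: the first draft has a type bug, the retry fixes it.
    return "total = sum([1, 2, 3])" if feedback else "total = sum([1, 2, '3'])"


code, ok, rounds = multi_round("Sum the values in column A", stub_model, run_snippet)
print(ok, rounds)  # → True 2
```

With `max_rounds=1` the loop reduces to the single-round setting, so the same harness can compare both settings on identical instructions.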
Conclusion
In conclusion, SpreadsheetBench stands out as a comprehensive benchmark for evaluating LLMs' capabilities in manipulating real-world spreadsheet scenarios. Its real-world instructions, diverse tabular data structures, and multiple test cases per instruction make it a challenging yet realistic representation of user needs. The results from this study suggest that there is still room for improvement in LLMs' abilities to handle complex instructions accurately. Future research could focus on developing more advanced techniques to further enhance coding abilities within these models.