M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models

AI-generated keywords: M3Exam LLMs Multilingualism Multimodality Multilevel

AI-generated Key Points

  • M3Exam is a benchmark for evaluating the general intelligence of large language models (LLMs).
  • The authors argue that human exams are more suitable for evaluating LLMs, as they demand a wider range of abilities such as language understanding, domain knowledge, and problem-solving skills.
  • The M3Exam dataset consists of 12,317 questions in 9 diverse languages across three educational levels.
  • Approximately 23% of the questions require processing images to be solved.
  • The dataset includes context information, main question text, candidate options, correct answers, and meta information such as language, level, subject, and image requirements.
  • The selected languages in M3Exam range from high-resource to extremely low-resource languages.
  • Various top-performing LLMs in multilingual or multimodal settings are evaluated on M3Exam.
  • Current models, including GPT-4, struggle with multilingual text, especially in low-resource and non-Latin-script languages.
  • Multimodal LLMs perform poorly on complex multimodal questions.
  • M3Exam addresses the limitations of task-specific benchmarks by incorporating multilingualism, multimodality, and a multilevel structure.
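
The record structure described in the key points above can be sketched as a small Python data class. This is an illustrative sketch only: the class name `ExamQuestion`, the helper `image_fraction`, and every field name are assumptions made for this example and are not taken from the M3Exam repository, whose actual schema may differ. The sample questions are invented toy data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExamQuestion:
    """Hypothetical record layout; field names are assumptions, not the official schema."""
    context: str          # background passage for the question (may be empty)
    question_text: str    # main question text
    options: List[str]    # candidate options
    answer: str           # label of the correct option, e.g. "B"
    language: str         # meta information: language
    level: str            # meta information: educational level
    subject: str          # meta information: subject
    needs_image: bool     # meta information: whether an image is required


def image_fraction(questions: List[ExamQuestion]) -> float:
    """Return the share of questions that require an image (0.0 for an empty list)."""
    if not questions:
        return 0.0
    return sum(q.needs_image for q in questions) / len(questions)


# Toy data invented for illustration; these are not real M3Exam questions.
sample = [
    ExamQuestion("", "2 + 2 = ?", ["A. 3", "B. 4"], "B",
                 "english", "low", "math", False),
    ExamQuestion("", "Which shape is shown?", ["A. circle", "B. square"], "A",
                 "english", "low", "math", True),
]

print(image_fraction(sample))
```

Applied to the full dataset, a helper of this kind would be expected to return a value near the 23% image ratio reported by the authors.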

Authors: Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, Lidong Bing

License: CC BY-NC-SA 4.0

Abstract: Despite the existence of various benchmarks for evaluating natural language processing models, we argue that human exams are a more suitable means of evaluating general intelligence for large language models (LLMs), as they inherently demand a much wider range of abilities such as language understanding, domain knowledge, and problem-solving skills. To this end, we introduce M3Exam, a novel benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. M3Exam exhibits three unique characteristics: (1) multilingualism, encompassing questions from multiple countries that require strong multilingual proficiency and cultural knowledge; (2) multimodality, accounting for the multimodal nature of many exam questions to test the model's multimodal understanding capability; and (3) multilevel structure, featuring exams from three critical educational periods to comprehensively assess a model's proficiency at different levels. In total, M3Exam contains 12,317 questions in 9 diverse languages with three educational levels, where about 23% of the questions require processing images for successful solving. We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text, particularly in low-resource and non-Latin script languages. Multimodal LLMs also perform poorly with complex multimodal questions. We believe that M3Exam can be a valuable resource for comprehensively evaluating LLMs by examining their multilingual and multimodal abilities and tracking their development. Data and evaluation code is available at https://github.com/DAMO-NLP-SG/M3Exam.

Submitted to arXiv on 08 Jun. 2023

Results of the summarizing process for the arXiv paper: 2306.05179v1

This study introduces M3Exam, a novel benchmark for evaluating the general intelligence of large language models (LLMs). The researchers argue that human exams are a more suitable means of evaluating LLMs, as they require a wider range of abilities such as language understanding, domain knowledge, and problem-solving skills. M3Exam is sourced from real and official human exam questions and exhibits three unique characteristics: multilingualism, multimodality, and a multilevel structure.

The M3Exam dataset consists of 12,317 questions in 9 diverse languages across three educational levels. Approximately 23% of the questions require processing images to be solved. Each question includes context information, the main question text, candidate options, the correct answer, and meta information such as language, level, subject, and image requirements. The selected languages span a wide range, from high-resource languages like English and Chinese to extremely low-resource languages like Javanese. This diversity allows for a comprehensive assessment of the multilingual capabilities of LLMs. The ratio of questions requiring images varies across countries.

For the experiments, various top-performing LLMs in either multilingual or multimodal settings are selected, including ChatGPT (gpt-3.5-turbo), GPT-4 (gpt-4), Claude (Claude-instant), BLOOM (176B), and Vicuna (13B). Some models are open source, while others are accessed via API calls. The evaluation reveals that current models still struggle with multilingual text, particularly in low-resource and non-Latin-script languages, while multimodal LLMs also perform poorly on complex multimodal questions.

Overall, this study presents M3Exam as a benchmark for evaluating LLMs' general intelligence, addressing the limitations of task-specific benchmarks by incorporating multilingualism, multimodality, and a multilevel structure. The data and evaluation code are available on GitHub, making it possible to track the development of LLMs over time and to assess their capabilities comprehensively.
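
To make the per-language comparison concrete, here is a minimal sketch of how scores could be broken down by language. It assumes plain accuracy over multiple-choice answer labels as the metric, which this summary does not state explicitly; the function name `accuracy_by_language`, the triple layout, and the toy results are all hypothetical and do not come from the M3Exam evaluation code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_language(results: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute accuracy per language.

    `results` holds (language, predicted_label, gold_label) triples.
    Labels are compared case-insensitively after stripping whitespace.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, predicted, gold in results:
        total[language] += 1
        if predicted.strip().upper() == gold.strip().upper():
            correct[language] += 1
    return {language: correct[language] / total[language] for language in total}


# Toy predictions invented for illustration; not real model outputs.
toy_results = [
    ("english", "A", "A"),
    ("english", "B", "B"),
    ("javanese", "C", "A"),
    ("javanese", "a", "A"),
]

scores = accuracy_by_language(toy_results)
for language, score in scores.items():
    print(f"{language}: {score:.2f}")
```

Grouping by language rather than reporting a single pooled score is what would expose a gap between high-resource and low-resource languages of the kind the study describes.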
Created on 07 Nov. 2023
