MERA: A Comprehensive LLM Evaluation in Russian

AI-generated keywords: Artificial Intelligence, Large Language Models, Multimodal Evaluation, Russian-language Architectures, Benchmark

AI-generated Key Points

  • Significant advancements in the field of AI
  • Development of foundation models (FMs) and language models (LMs)
  • Improvements in measurable aspects and introduction of new qualitative features
  • Need for better understanding of capabilities, limitations, and risks
  • Introduction of MERA benchmark for evaluating foundation models focused on the Russian language
  • Structured as a black-box test to prevent data leakage
  • Methodology for evaluating FMs and LMs in zero- and few-shot fixed instruction settings
  • Key contributions include reproducible methodology, 21 textual tasks formatted as instruction datasets, scoring system, open leaderboard, baseline solutions
  • Proposals for new benchmarks like BIG-bench, HELM, MT-Bench to evaluate LLMs in challenging settings
  • Shift towards using LLMs as judges for scoring model answers instead of relying solely on automatic metrics or human evaluation
  • Criticisms of standard metrics for generative evaluation leading to the development of benchmarks like INSTRUCTEVAL tailored specifically for instruction-tuned LLMs
  • Aim of MERA to guide future research efforts by providing standardized evaluation procedure

Authors: Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia Kozlova, Maria Tikhonova, Albina Akhmetgareeva, Anton Emelyanov, Denis Shevelev, Pavel Lebedev, Leonid Sinev, Ulyana Isaeva, Katerina Kolomeytseva, Daniil Moskovskiy, Elizaveta Goncharova, Nikita Savushkin, Polina Mikhailova, Denis Dimitrov, Alexander Panchenko, Sergei Markov

Comments: The paper version is comparable with release code v.1.1.0 of the benchmark; https://mera.a-ai.ru/en
License: CC BY 4.0

Abstract: Over the past few years, one of the most notable advancements in AI research has been in foundation models (FMs), headlined by the rise of language models (LMs). As the models' size increases, LMs demonstrate enhancements in measurable aspects and the development of new qualitative features. However, despite researchers' attention and the rapid growth in LM application, the capabilities, limitations, and associated risks still need to be better understood. To address these issues, we introduce an open Multimodal Evaluation of Russian-language Architectures (MERA), a new instruction benchmark for evaluating foundation models oriented towards the Russian language. The benchmark encompasses 21 evaluation tasks for generative models in 11 skill domains and is designed as a black-box test to ensure the exclusion of data leakage. The paper introduces a methodology to evaluate FMs and LMs in zero- and few-shot fixed instruction settings that can be extended to other modalities. We propose an evaluation methodology, an open-source code base for the MERA assessment, and a leaderboard with a submission system. We evaluate open LMs as baselines and find that they are still far behind the human level. We publicly release MERA to guide forthcoming research, anticipate groundbreaking model features, standardize the evaluation procedure, and address potential societal drawbacks.

Submitted to arXiv on 09 Jan. 2024

Results of the summarizing process for the arXiv paper: 2401.04531v1

The field of AI has seen significant advancements in foundation models (FMs), particularly with the development of language models (LMs). These models have shown improvements in measurable aspects and have developed new qualitative features as their size increases. However, their capabilities, limitations, and associated risks still need to be better understood. To address these challenges, an open benchmark, Multimodal Evaluation of Russian-language Architectures (MERA), has been introduced.

The MERA benchmark consists of 21 evaluation tasks across 11 skill domains, designed for evaluating foundation models oriented towards the Russian language. It is structured as a black-box test to prevent data leakage and includes a methodology for evaluating FMs and LMs in zero- and few-shot fixed instruction settings that can be extended to other modalities. The key contributions of this work are: a reproducible methodology for evaluating LLMs with a fixed experimental setup; 21 textual tasks formatted as instruction datasets, covering text sub-modalities such as code; a platform with a scoring system and an open leaderboard for LLM evaluation; and baseline solutions, including open-source models and human baselines.

Existing benchmarks like GLUE and SuperGLUE have been criticized as shallow and potentially outdated given the emergence of LLMs and FMs. Newer benchmarks such as BIG-bench, HELM, and MT-Bench have been proposed to evaluate LLMs in more challenging settings. These benchmarks aim to assess models' generalization abilities across multiple languages, expert knowledge in various domains, and coding skills, among other capabilities. There is also a shift towards using LLMs as judges for scoring model answers instead of relying solely on automatic metrics or human evaluation. Because standard metrics for generative evaluation have been criticized as not representative enough, benchmarks like INSTRUCTEVAL offer comprehensive evaluation methodologies tailored specifically for instruction-tuned LLMs.

Overall, MERA aims to guide future research by providing a standardized evaluation procedure that anticipates groundbreaking model features while addressing potential societal drawbacks associated with AI adoption.
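
To make the zero- and few-shot fixed instruction setting mentioned above more concrete, here is a minimal sketch in Python. It is an illustration only, not MERA's actual code: the instruction template, the function names (build_prompt, exact_match), the toy model, and the exact-match metric are all assumptions introduced for this example. The idea is that the same instruction wording is used for every model, and only the number of solved examples placed before the question changes.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (question, gold answer)

# Hypothetical fixed instruction template; MERA's real prompts are not reproduced here.
INSTRUCTION = "Question: {question}\nAnswer:"


def build_prompt(question: str, shots: Sequence[Example] = ()) -> str:
    """Build a prompt with k solved examples in front (k = 0 is zero-shot)."""
    parts = [INSTRUCTION.format(question=q) + " " + a for q, a in shots]
    parts.append(INSTRUCTION.format(question=question))
    return "\n\n".join(parts)


def exact_match(model: Callable[[str], str],
                dataset: Sequence[Example],
                shots: Sequence[Example] = ()) -> float:
    """Fraction of items where the model output equals the gold answer."""
    hits = sum(model(build_prompt(q, shots)).strip() == gold
               for q, gold in dataset)
    return hits / len(dataset)


def toy_model(prompt: str) -> str:
    """Stand-in for a real LM: looks up the last question in the prompt."""
    last_question = prompt.rsplit("Question: ", 1)[1].split("\nAnswer:")[0]
    return {"2 + 2?": "4"}.get(last_question, "unknown")


dataset = [("2 + 2?", "4"), ("3 + 3?", "6")]
shots = [("1 + 1?", "2")]
print(exact_match(toy_model, dataset))         # zero-shot score
print(exact_match(toy_model, dataset, shots))  # one-shot score
```

In a real evaluation, toy_model would be replaced by a call to the language model under test. Keeping the template and the shots fixed across models is what makes the resulting scores comparable.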
Created on 30 Jan. 2025

