A Survey on Evaluation of Large Language Models

AI-generated keywords: Large Language Models (LLMs), Evaluation, Performance, Success, Failure

AI-generated Key Points

  • Large language models (LLMs) have remarkable capabilities in natural language processing tasks
  • The authors provide a comprehensive review of evaluation methods for LLMs
  • Three key dimensions of evaluation: what to evaluate, where to evaluate, and how to evaluate
  • Overview of evaluation tasks in areas such as reasoning, medical usage, ethics, education, natural and social sciences, and agent applications
  • Discussion on evaluation methods and benchmarks used in assessing LLM performance
  • Success and failure cases of LLMs in different tasks
  • Importance of treating evaluation as an essential discipline for developing more proficient LLMs
  • Discussion of future challenges in LLM evaluation

Authors: Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie

23 pages
License: CC BY 4.0

Abstract: Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the 'where' and 'how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.

Submitted to arXiv on 06 Jul. 2023

Results of the summarizing process for the arXiv paper: 2307.03109v1

Large language models (LLMs) have gained significant attention in recent years due to their remarkable capabilities in natural language processing tasks. In this paper, the authors provide a comprehensive review of evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. They first provide an overview of evaluation tasks including general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications and other areas. Then they dive into the evaluation methods and benchmarks used in assessing the performance of LLMs. The paper also discusses success and failure cases of LLMs in different tasks and highlights the importance of treating evaluation as an essential discipline to aid the development of more proficient LLMs. The paper concludes with a discussion on future challenges in LLMs evaluation.
Created on 08 Dec. 2023
