A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

AI-generated keywords: Large Language Models

AI-generated Key Points

  • Researchers evaluated large language models (LLMs) in natural language generation (NLG) tasks
  • Evaluated models included ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models
  • Used EmpatheticDialogue dataset for Dialogue Generation and CNN/DailyMail dataset along with XSum dataset for Text Summarization
  • Employed LCCC dataset for Chinese dialogue generation and THUCNews along with LCSTS for Chinese text summarization
  • Highlighted capabilities of Transformer-based LLMs with billions of parameters in understanding natural language and solving complex tasks
  • Noted few-shot learning properties of these models allowing them to perform new tasks based on textual instructions or minimal examples
  • Discussed trends in scaling these models further to enhance their capabilities

Authors: Xuanfan Ni, Piji Li

CCL2023
License: CC BY 4.0

Abstract: Recent efforts have evaluated large language models (LLMs) in areas such as commonsense reasoning, mathematical reasoning, and code generation. However, to the best of our knowledge, no work has specifically investigated the performance of LLMs in natural language generation (NLG) tasks, a pivotal criterion for determining model excellence. Thus, this paper conducts a comprehensive evaluation of well-known and high-performing LLMs, namely ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models, in the context of NLG tasks. We select English and Chinese datasets encompassing Dialogue Generation and Text Summarization. Moreover, we propose a common evaluation setting that incorporates input templates and post-processing strategies. Our study reports automatic results, accompanied by a detailed analysis.
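To make the "common evaluation setting" more concrete, here is a minimal sketch of what an input template plus a post-processing step could look like. The template wording, the function names `build_input` and `post_process`, and the specific cleanup rules (dropping an echoed prompt, keeping the first non-empty line) are all assumptions for illustration; this page does not reproduce the paper's actual templates or strategies.

```python
import re

# Hypothetical templates: the wording is an assumption, not taken from the paper.
TEMPLATES = {
    "dialogue": "Continue the conversation with one reply.\n{source}\nListener:",
    "summarization": "Summarize the following article.\n{source}\nSummary:",
}


def build_input(task: str, source: str) -> str:
    """Fill the template for the given task with the source text."""
    return TEMPLATES[task].format(source=source.strip())


def post_process(prompt: str, raw_output: str) -> str:
    """Reduce a raw generation to one clean line (an assumed strategy)."""
    text = raw_output
    # Some models echo the prompt before the answer; drop it if present.
    if text.startswith(prompt):
        text = text[len(prompt):]
    # Keep only the first non-empty line of the remaining text.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    first = lines[0] if lines else ""
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", first)
```

Applying the same template and the same cleanup to every model is one way to keep outputs comparable before scoring; a real setup might need per-model rules, for example for models that emit special tokens.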

Submitted to arXiv on 16 May 2024

Results of the summarizing process for the arXiv paper: 2405.10251v1

In this study, the researchers conducted a comprehensive evaluation of large language models (LLMs) on natural language generation (NLG) tasks. The evaluation covered well-known, high-performing LLMs: ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models.

The English dataset used for Dialogue Generation was EmpatheticDialogue, which consists of 25k empathetic conversations between a speaker and a listener. For Text Summarization, the researchers used the CNN/DailyMail dataset, containing 93k articles from the CNN website and 220k from the Daily Mail website, along with the XSum dataset, which contains extreme summarization examples drawn from BBC articles. For Chinese NLG tasks, the researchers employed the LCCC dataset for dialogue generation and THUCNews along with LCSTS for text summarization.

The evaluation setting incorporated input templates and post-processing strategies to provide common ground for comparison across different LLMs. The study reported automatic results together with a detailed analysis of how the models perform on NLG tasks.

The paper highlighted that LLMs, particularly Transformer-based models with billions of parameters trained on extensive text corpora, have shown significant capabilities in understanding natural language and solving complex tasks. These models exhibit few-shot learning properties: they can perform new tasks given only textual instructions or a handful of examples. The research also discussed the trend of scaling these models further to enhance their capabilities.

In conclusion, the assessment of various LLMs on NLG tasks revealed notable trends and phenomena that shed light on their multilingual capabilities. The automatic results and the accompanying analysis provided insights into how these models perform across different languages and datasets.
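The few-shot behavior mentioned in the summary (performing a new task from textual instructions or minimal examples) can be sketched as simple prompt assembly. The function name `few_shot_prompt` and the `Input:`/`Output:` layout are hypothetical choices for illustration, not a format taken from the paper.

```python
from typing import List, Tuple


def few_shot_prompt(
    instruction: str,
    examples: List[Tuple[str, str]],
    query: str,
) -> str:
    """Build a prompt from an instruction, worked examples, and a new input."""
    parts = [instruction.strip()]
    # Each demonstration pairs a source text with its desired output.
    for source, target in examples:
        parts.append(f"Input: {source.strip()}\nOutput: {target.strip()}")
    # The final block leaves the output open for the model to complete.
    parts.append(f"Input: {query.strip()}\nOutput:")
    return "\n\n".join(parts)
```

With an empty `examples` list the same function yields an instruction-only (zero-shot) prompt, so one code path can cover both settings.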
Created on 20 Oct. 2024
