Benchmarking Large Language Models for News Summarization

AI-generated keywords: LLMs, Automatic Summarization, Instruction Tuning, High-Quality References, Freelance Writers

AI-generated Key Points

  • Large language models (LLMs) have potential in automatic summarization
  • Human evaluation conducted on ten LLMs across different pretraining methods, prompts, and model scales
  • Instruction tuning, rather than model size, is key to LLMs' zero-shot summarization capability
  • Existing studies are limited by low-quality references, which lead to underestimates of human performance and to lower few-shot and fine-tuning performance
  • High-quality summaries collected from freelance writers for human evaluation
  • Freelance writer summaries, as rated by Mechanical Turk annotators, found to be of much higher quality than the original reference summaries in CNN/DM and XSUM
  • Little difference in quality between freelance writer summaries and those generated by the Instruct Davinci model
  • LLM-generated summaries and freelance writer-generated summaries had distinctive styles with regard to paraphrasing and copying from source articles
  • Annotators recruited to compare Instruct Davinci-generated summaries with those written by freelance writers; in aggregate, Instruct Davinci rated as comparable to freelance writers, but individual annotators showed varying preferences
  • Instruction tuning, rather than model scale, is crucial for LLMs' summarization capability
  • Issues with the low-quality references used in previous studies highlighted
  • Collecting higher-quality summaries from freelance writers proposed as a solution
  • Findings contribute to improving LLM evaluation techniques for future research in automatic summarization

Authors: Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, Tatsunori B. Hashimoto

License: CC BY 4.0

Abstract: Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human-written summaries.

Submitted to arXiv on 31 Jan. 2023

Results of the summarizing process for the arXiv paper: 2301.13848v1

Large language models (LLMs) have shown great potential in automatic summarization, but the reasons behind their success are not well understood. To address this, a human evaluation was conducted on ten LLMs across different pretraining methods, prompts, and model scales. The study made two main observations. First, instruction tuning, rather than model size, is the key to LLMs' zero-shot summarization capability. Second, existing studies have been limited by low-quality references, which lead to underestimates of human performance and to lower few-shot and fine-tuning performance.

To better evaluate LLMs, high-quality summaries were collected from freelance writers for human evaluation. To keep summary length consistent between the freelance writer summaries and those generated by the Instruct Davinci model, a new prompt was introduced that elicited summaries of around 50 words. Mechanical Turk annotators rated the freelance writer summaries as much higher in quality than the original reference summaries in CNN/DM and XSUM. There was little difference in quality between the freelance writer summaries and those generated by Instruct Davinci.

Despite similar performance in the quality control studies, LLM-generated summaries and freelance writer summaries had distinctive styles with regard to paraphrasing and copying from source articles. Extractiveness measures (coverage and density) were used to compare the two types of summaries. Annotators were then recruited to compare Instruct Davinci-generated summaries with those written by freelance writers. In aggregate, Instruct Davinci was rated as comparable to freelance writers; however, individual annotators showed varying preferences for one or the other.

Overall, this study identified instruction tuning, rather than model scale, as crucial for LLMs' summarization capability. It also highlighted issues with the low-quality references used in previous studies and proposed collecting higher-quality summaries from freelance writers as a solution. These findings contribute to improving LLM evaluation techniques for future research in automatic summarization.
Created on 16 May. 2023
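
The summary mentions extractiveness measures (coverage and density) without defining them. The sketch below shows one common way such measures are computed, assuming the extractive-fragment definitions widely used in the summarization literature: coverage is the fraction of summary tokens that fall inside spans copied from the article, and density is the average squared length of those spans per summary token. The function names, the lowercased whitespace tokenization, and the example sentences are illustrative assumptions, not details taken from the paper.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect the longest token spans shared with the article.

    For each summary position, find the longest run of tokens that also
    appears contiguously in the article, record it, and skip past it.
    """
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0  # length of the longest match starting at summary position i
        j = 0
        while j < len(article_tokens):
            if summary_tokens[i] == article_tokens[j]:
                k = 0
                while (i + k < len(summary_tokens)
                       and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
        i += max(best, 1)
    return fragments


def coverage_and_density(article, summary):
    """Return (coverage, density) for a summary with respect to its article.

    Assumption: simple lowercased whitespace tokenization; a real study
    would likely use a proper tokenizer.
    """
    article_tokens = article.lower().split()
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0, 0.0
    fragments = extractive_fragments(article_tokens, summary_tokens)
    n = len(summary_tokens)
    coverage = sum(len(f) for f in fragments) / n
    density = sum(len(f) ** 2 for f in fragments) / n
    return coverage, density


if __name__ == "__main__":
    article = "the cat sat on the mat"
    print(coverage_and_density(article, "the cat sat"))     # fully copied span
    print(coverage_and_density(article, "the dog sat on"))  # partly paraphrased
```

Under these definitions, a summary that copies long spans from the article scores high on both measures, while a heavily paraphrased summary scores low, which is the kind of stylistic contrast the summary describes between the two types of summaries.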

