Benchmarking Large Language Models for News Summarization

AI-generated keywords: LLMs, Automatic Summarization, Instruction Tuning, High-Quality References, Freelance Writers

AI-generated Key Points

Large language models (LLMs) have potential in automatic summarization
Human evaluation conducted on 10 LLMs across different pretraining methods, prompts, and model scales
Instruction tuning is key to LLM's zero-shot summarization capability rather than model size
Existing studies limited by low-quality references leading to underestimates of human performance and lower few-shot and fine-tuning performance
High-quality summaries collected from freelance writers for human evaluation
Quality of freelance writer summaries evaluated using Mechanical Turkers found to be much higher than original reference summaries in CNN/DM and XSUM
Little difference between quality of freelance writer summaries and those generated by Instruct Davinci model
LLM-generated summaries and freelance writer-generated summaries had distinctive styles with regard to paraphrasing and copying from source articles
Annotators recruited to compare Instruct Davinci-generated summaries with those written by freelance writers; overall, Instruct Davinci rated as comparable to freelance writers but individual annotators showed varying preferences
Instruction tuning crucial for LLMs' summarization capability rather than model scale
Issues with low-quality references used in previous studies highlighted
Proposed collecting better quality summaries from freelance writers as a solution
Findings contribute towards improving LLM evaluation techniques for future research in automatic summarization.

Authors: Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, Tatsunori B. Hashimoto

arXiv: 2301.13848v1 (cs.CL)

License: CC BY 4.0

Abstract: Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human-written summaries.

Submitted to arXiv on 31 Jan. 2023

Results of the summarizing process for the arXiv paper: 2301.13848v1

Comprehensive Summary

Large language models (LLMs) have shown great potential in automatic summarization, but the reasons behind their success are not well understood. To address this, a human evaluation was conducted on ten LLMs across different pretraining methods, prompts, and model scales. The study made two important observations. First, instruction tuning, rather than model size, is the key to an LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and fine-tuning performance.

To better evaluate LLMs, high-quality summaries were collected from freelance writers for human evaluation. To keep summary length consistent between the freelance writer summaries and those generated by the Instruct Davinci model, a new prompt was introduced that elicited summaries around 50 words long. The quality of the freelance writer summaries was evaluated by Mechanical Turk workers and found to be much higher than that of the original reference summaries in CNN/DM and XSUM. There was also little difference in quality between the freelance writer summaries and those generated by Instruct Davinci.

Despite similar performance in the quality control study, LLM-generated and freelance writer summaries had distinctive styles with regard to paraphrasing and copying from source articles. Extractiveness measures (coverage and density) were used to compare how much each type of summary copies from its source article. Annotators were also recruited to compare Instruct Davinci-generated summaries with those written by freelance writers. On aggregate, Instruct Davinci was rated as comparable to freelance writers; however, individual annotators showed varying preferences for either Instruct Davinci or freelance writers.

Overall, this study identified instruction tuning, rather than model scale, as crucial for LLMs' summarization capability. It also highlighted issues with the low-quality references used in previous studies and proposed collecting better-quality summaries from freelance writers as a solution. These findings contribute towards improving LLM evaluation techniques for future research in automatic summarization.

Summary: Large language models (LLMs) can help summarize text. Researchers tested 10 different LLMs to see how well they worked. They found that instruction tuning (training a model to follow written instructions) matters more than making the model bigger. Some previous studies compared models against poor-quality example summaries, so their results were not accurate. Freelance writers wrote better summaries than those examples. The Instruct Davinci model also wrote good summaries, although its style differed somewhat from what people wrote.

Definitions:
- Large language models (LLMs): computer programs that can understand and generate human language
- Automatic summarization: using a computer program to create a short summary of a longer piece of text
- Pretraining methods: ways of teaching an LLM how to understand language before it starts summarizing
- Zero-shot summarization: when an LLM creates a summary without first being shown any example summaries
- Few-shot and fine-tuning performance: how well an LLM can make summaries after being shown just a few examples, or after further training on example summaries

Exploring the Success of Large Language Models in Automatic Summarization

The development of large language models (LLMs) has revolutionized natural language processing and enabled machines to generate human-like summaries. However, the reasons behind their success are not well understood. To address this issue, a recent study conducted a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales. The results of this evaluation revealed two important observations about LLMs: instruction tuning is the key to the LLM's zero-shot summarization capability rather than model size; and existing studies have been limited by low-quality references leading to underestimates of human performance and lower few-shot and fine-tuning performance.

Collecting High Quality Summaries from Freelance Writers

To better evaluate LLMs, high-quality summaries were collected from freelance writers for human evaluation. To keep summary length consistent between the freelance writer summaries and those generated by the Instruct Davinci model, a new prompt was introduced that elicited summaries around 50 words long. The quality of the freelance writer summaries was evaluated by Mechanical Turk workers and found to be much higher than that of the original reference summaries in the CNN/DM and XSUM datasets. Additionally, there was little difference in quality between the freelance writer summaries and those generated by Instruct Davinci.

Distinctive Styles Between Freelance Writer Summaries & LLM Generated Summaries

Despite similar performance in the quality control study, LLM-generated and freelance writer summaries had distinctive styles with regard to paraphrasing and copying from source articles. Extractiveness measures (coverage and density) were used to compare the two types of summaries, and annotators were recruited to compare Instruct Davinci-generated summaries side by side with those written by freelance writers. On aggregate, Instruct Davinci was rated as comparable to freelance writers; however, individual annotators showed varying preferences for either Instruct Davinci or freelance writers.
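The text above does not define the extractiveness measures, so the following is only a minimal sketch. It assumes they are the extractive fragment coverage and density commonly used in summarization work (introduced with the Newsroom dataset by Grusky et al.): coverage is the fraction of summary tokens that fall inside spans copied from the article, and density is the average squared length of those spans. The function names and the lowercase whitespace tokenization are illustrative assumptions, not the authors' code.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily find the longest article spans that each summary position starts."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0  # length of the longest shared span starting at summary position i
        j = 0
        while j < len(article_tokens):
            if summary_tokens[i] == article_tokens[j]:
                k = 0
                while (i + k < len(summary_tokens)
                       and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
        # Skip past the copied span, or past a single novel token.
        i += max(best, 1)
    return fragments


def coverage_and_density(article, summary):
    """Return (coverage, density) for a summary with respect to its article."""
    # Assumption: lowercase whitespace tokenization, chosen only for simplicity.
    article_tokens = article.lower().split()
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0, 0.0
    fragments = extractive_fragments(article_tokens, summary_tokens)
    n = len(summary_tokens)
    coverage = sum(len(f) for f in fragments) / n
    density = sum(len(f) ** 2 for f in fragments) / n
    return coverage, density


if __name__ == "__main__":
    # Toy strings for illustration only; they are not taken from the study.
    article = "the cat sat on the mat near the door"
    summary = "the cat sat quietly near the door"
    print(coverage_and_density(article, summary))
```

Under these definitions, a summary that copies long passages verbatim scores high on both measures, while a heavily paraphrased summary scores low on density even when it reuses many individual words from the article.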

Conclusion

Overall, this study identified instruction tuning, rather than model scale, as crucial for LLMs' summarization capability. It also highlighted issues with the low-quality references used in previous studies, which led researchers to underestimate human performance when evaluating automatic summarization systems against such references. As a solution, it proposed collecting better-quality reference summaries from freelance writers. These findings contribute towards improving LLM evaluation techniques for future research in automatic summarization, and they offer insight into how best to use large language models to generate accurate, concise summaries that can at times rival professional writing.

Created on 16 May. 2023
