News Summarization and Evaluation in the Era of GPT-3

AI-generated keywords: GPT-3, Text Summarization, News Summarization, A/B Test, Evaluation Metrics

AI-generated Key Points

  • Recent advances in large language models like GPT-3 have led to a paradigm shift in NLP research
  • This study explores the impact of GPT-3 on text summarization, specifically news summarization
  • Humans overwhelmingly prefer GPT-3 summaries prompted using only a task description, and these summaries do not suffer from common dataset-specific issues such as poor factuality
  • An A/B test was designed to collect preference annotations from human annotators for three summarization systems: BRIO, T0, and GPT3-D2
  • Neither reference-based nor reference-free automatic metrics can reliably evaluate GPT-3 summaries
  • The study also evaluated models beyond generic summarization, focusing on keyword-based summarization, and compared dominant fine-tuning approaches to prompting
  • The authors released a corpus of 10K generated summaries from fine-tuned and prompt-based models across four standard summarization benchmarks, along with 1K human preference judgments comparing different systems for generic and keyword-based summarization
  • Overall, the study highlights both the potential of prompt-based GPT-3 models to generate high-quality news summaries and the limitations of current evaluation metrics in assessing them accurately

Authors: Tanya Goyal, Junyi Jessy Li, Greg Durrett

All data shared at: https://tagoyal.github.io/zeroshot-news-annotations.html
License: CC BY 4.0

Abstract: The recent success of prompting large language models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that both reference-based and reference-free automatic metrics cannot reliably evaluate GPT-3 summaries. Finally, we evaluate models on a setting beyond generic summarization, specifically keyword-based summarization, and show how dominant fine-tuning approaches compare to prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and prompt-based models across 4 standard summarization benchmarks, (b) 1K human preference judgments comparing different systems for generic- and keyword-based summarization.

Submitted to arXiv on 26 Sep. 2022

Results of the summarizing process for the arXiv paper: 2209.12356v2

Recent advances in large language models like GPT-3 have led to a paradigm shift in natural language processing (NLP) research. This study explores the impact of GPT-3 on text summarization, focusing on news summarization as a classic benchmark domain. It compares GPT-3 against fine-tuned models trained on large summarization datasets and shows that humans overwhelmingly prefer GPT-3 summaries prompted using only a task description, and that these summaries do not suffer from common dataset-specific issues such as poor factuality.

To compare summarization systems, the authors designed an A/B test to collect preference annotations from human annotators. For each article, annotators were shown summaries from three systems (BRIO, T0, and GPT3-D2) and asked to select their most and least preferred summary or summaries, judging as non-expert consumers of news summaries.

The study also found that neither reference-based nor reference-free automatic metrics can reliably evaluate GPT-3 summaries. Beyond generic summarization, it evaluated models on keyword-based summarization and compared dominant fine-tuning approaches to prompting. To support further research, the authors released a corpus of 10K generated summaries from fine-tuned and prompt-based models across four standard summarization benchmarks, along with 1K human preference judgments comparing different systems for generic and keyword-based summarization. Overall, the study highlights both the potential of prompt-based GPT-3 models to generate high-quality news summaries and the limitations of current evaluation metrics in assessing them accurately.
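
As an illustration of the annotation protocol described above, where each annotator marks the most and least preferred summary or summaries among the three systems, the following minimal sketch shows one way such votes could be tallied. The record layout (`"best"` and `"worst"` lists), the function name `tally_preferences`, and the toy votes are all assumptions made for this example; they are not the format of the released corpus, and the numbers are not results from the paper.

```python
from collections import Counter

# System names as given in the summary above.
SYSTEMS = ("BRIO", "T0", "GPT3-D2")


def tally_preferences(annotations, systems=SYSTEMS):
    """Return, for each system, the fraction of annotations that named it
    among the most preferred ("best") and least preferred ("worst") summaries.

    Each annotation is a dict with "best" and "worst" lists of system names.
    This layout is hypothetical, chosen only for this sketch.
    """
    total = len(annotations)
    if total == 0:
        raise ValueError("no annotations to tally")

    best = Counter()
    worst = Counter()
    for ann in annotations:
        # An annotator may select more than one summary; set() ensures each
        # system is counted at most once per annotation.
        best.update(set(ann["best"]))
        worst.update(set(ann["worst"]))

    return {
        name: {"best": best[name] / total, "worst": worst[name] / total}
        for name in systems
    }


# Invented toy votes, for illustration only.
toy_annotations = [
    {"best": ["GPT3-D2"], "worst": ["T0"]},
    {"best": ["GPT3-D2", "BRIO"], "worst": ["T0"]},
    {"best": ["BRIO"], "worst": ["GPT3-D2"]},
    {"best": ["GPT3-D2"], "worst": ["BRIO", "T0"]},
]

result = tally_preferences(toy_annotations)
for name, rates in result.items():
    print(f"{name}: best {rates['best']:.2f}, worst {rates['worst']:.2f}")
```

Because annotators may select several summaries at once, the per-system fractions need not sum to 1 across systems; each fraction is read on its own as "how often this system was named best (or worst)".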
Created on 18 Jun. 2023
