Comparing Abstractive Summaries Generated by ChatGPT to Real Summaries Through Blinded Reviewers and Text Classification Algorithms

AI-generated keywords: Large Language Models, ChatGPT, Abstractive Summarization, Automated Metrics, Human Reviewers

AI-generated Key Points

  • Large Language Models (LLMs) have gained significant attention for their exceptional performance across various tasks.
  • ChatGPT, developed by OpenAI, is a recent addition to the LLM family and has been hailed as a disruptive technology due to its human-like text generation capabilities.
  • A study focused on evaluating ChatGPT's performance in Abstractive Summarization using automated metrics and blinded human reviewers.
  • Limitations of the study included comparing only 50 summaries, not exploring different prompts for generating summaries, lack of comparison with other models or baselines, reliance on native English-speaking reviewers, and potential for improving automatic summary detection accuracy through more advanced algorithms.
  • Text classification algorithms could differentiate between real and generated summaries, but human reviewers struggled to tell them apart, because the prompts were deliberately chosen to produce summaries closely resembling the originals.
  • Automatic detection of ChatGPT-generated summaries reached 90% accuracy.
  • Previous studies have evaluated ChatGPT's performance in various tasks such as machine translation and medical examinations, showing competitive results in some areas but limitations in others.
  • In summarization specifically, interactive approaches have improved ROUGE scores, while comparisons with original content have found generated outputs believable yet still detectable by AI tools and by skeptical human reviewers.

Authors: Mayank Soni, Vincent Wade

License: CC BY 4.0

Abstract: Large Language Models (LLMs) have gathered significant attention due to their impressive performance on a variety of tasks. ChatGPT, developed by OpenAI, is a recent addition to the family of language models and is being called a disruptive technology by a few, owing to its human-like text-generation capabilities. Although many anecdotal examples across the internet have evaluated ChatGPT's strengths and weaknesses, only a few systematic research studies exist. To contribute to the body of literature of systematic research on ChatGPT, we evaluate the performance of ChatGPT on Abstractive Summarization by means of automated metrics and blinded human reviewers. We also build automatic text classifiers to detect ChatGPT-generated summaries. We found that while text classification algorithms can distinguish between real and generated summaries, humans are unable to distinguish between real summaries and those produced by ChatGPT.
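
As a rough illustration of what "automatic text classifiers to detect ChatGPT-generated summaries" can look like, the sketch below trains a bag-of-words Naive Bayes classifier to label a summary as "real" or "generated". This is an assumption for illustration only: the abstract does not say which algorithms or features the authors used, and the class name, helper function, and toy sentences are all hypothetical and are not taken from the paper's data.

```python
import math
from collections import Counter


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace tokens (illustrative only)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        # Shared vocabulary across both classes, used for Laplace smoothing.
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            # Log prior plus add-one smoothed log likelihood of each token.
            score = math.log(self.doc_counts[c] / total_docs)
            for tok in tokens:
                count = self.word_counts[c][tok] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = c, score
        return best_label


def accuracy(model, texts, labels):
    """Fraction of texts whose predicted label matches the given label."""
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Hypothetical toy data, invented for this sketch (not from the study).
TOY_TEXTS = [
    "overall the report highlights key findings",
    "overall the article highlights important points",
    "council votes to shut library",
    "minister quits after budget row",
]
TOY_LABELS = ["generated", "generated", "real", "real"]

if __name__ == "__main__":
    model = NaiveBayesTextClassifier().fit(TOY_TEXTS, TOY_LABELS)
    print(model.predict("overall the study highlights results"))
```

A real evaluation would measure `accuracy` on held-out summaries rather than on the training sentences, and would likely use richer features than raw word counts.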

Submitted to arXiv on 30 Mar. 2023

Results of the summarizing process for the arXiv paper: 2303.17650v3

Large Language Models (LLMs) have gained significant attention for their strong performance across a variety of tasks. ChatGPT, developed by OpenAI, is a recent addition to the LLM family and has been described as a disruptive technology because of its human-like text generation. Anecdotal examples on the internet have highlighted both strengths and weaknesses of ChatGPT, but systematic research remains limited. To help fill that gap, this study evaluated ChatGPT's performance on abstractive summarization using automated metrics and blinded human reviewers.

The results showed that text classification algorithms could differentiate between real and generated summaries, whereas human reviewers could not reliably tell whether a summary was produced by ChatGPT or by a human writer. This difficulty was attributed to the lack of distinguishing features between the two sources, a consequence of deliberately selecting prompts whose outputs closely resembled the original summaries. Automatic detection of ChatGPT-generated summaries reached 90% accuracy.

The study had several limitations: only 50 summaries were compared, different prompts for generating summaries were not explored, there was no comparison with other models or baselines, the evaluation relied on native English-speaking reviewers, and the accuracy of automatic summary detection could potentially be improved with more advanced algorithms.

In related work, previous studies have evaluated ChatGPT on tasks such as machine translation and medical examinations, showing competitive results in some areas and limitations in others. In summarization specifically, interactive approaches have improved ROUGE scores, while comparisons with original content have found generated outputs believable yet still detectable by AI tools and by skeptical human reviewers. Summarization itself means shortening a large text while preserving its key information, using either extractive or abstractive methods.

The study used a specific dataset for evaluation and identified directions for future research, such as exploring different prompts for generating summaries and comparing ChatGPT with other models. Overall, it offers insight into ChatGPT's capabilities in abstractive summarization and shows that human reviewers struggle to separate machine-generated summaries from human-written ones even when automated classifiers can.
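
For readers unfamiliar with the ROUGE scores mentioned above, the following is a minimal sketch of ROUGE-N computed as clipped n-gram overlap between a reference summary and a candidate summary. It assumes plain lowercase whitespace tokenization with no stemming, so its numbers will not match an official ROUGE toolkit exactly; the function name `rouge_n` and the example sentences are hypothetical, and the summary does not state which ROUGE variants or implementation the study used.

```python
from collections import Counter


def rouge_n(reference: str, candidate: str, n: int = 1) -> dict:
    """Return ROUGE-N precision, recall and F1 from clipped n-gram overlap."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((ref & cand).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Invented example sentences, not drawn from the study's dataset.
    print(rouge_n("the cat sat on the mat", "the cat lay on the mat"))
```

Recall measures how much of the reference the candidate covers, precision how much of the candidate appears in the reference; abstractive summaries can score low on both while still being faithful, which is one reason such studies pair ROUGE with human review.
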
Created on 13 Sep. 2024
