Podcast Summary Assessment: A Resource for Evaluating Summary Assessment Methods

AI-generated keywords: Automatic Summary Assessment, Evaluation Dataset, Podcast Corpus, Data Selection

AI-generated Key Points

Automatic summary assessment is valuable for evaluating machine-generated and human-produced summaries.
It helps in developing summary generation systems and detecting inappropriate summaries.
Summary assessment can be done by ranking summary generation systems, ranking specific document summaries, or estimating the quality of a document-summary pair on an absolute scale.
Existing datasets for summary assessment are typically based on news summarization datasets like CNN/DailyMail or XSum.
The podcast summary assessment corpus is a new dataset of summaries for long-input, speech-based podcast documents; the summaries were evaluated by human experts at TREC2020.
This dataset is unique as it allows identification of inappropriate reference summaries within the podcast corpus.
The authors explore existing assessment methods including model-free and model-based approaches and present benchmark results using this dataset.
Summary assessment is also applied for data selection to filter reference summary-document pairings for training purposes.
The experimental results provide insights into both the summary assessment and generation tasks.
The podcast summary assessment data is publicly available.

Authors: Potsawee Manakul, Mark J. F. Gales

arXiv: 2208.13265v1 (cs.CL)

License: CC BY 4.0

Abstract: Automatic summary assessment is useful for both machine-generated and human-produced summaries. Automatically evaluating the summary text given the document enables, for example, summary generation system development and detection of inappropriate summaries. Summary assessment can be run in a number of modes: ranking summary generation systems; ranking summaries of a particular document; and estimating the quality of a document-summary pair on an absolute scale. Existing datasets with annotation for summary assessment are usually based on news summarization datasets such as CNN/DailyMail or XSum. In this work, we describe a new dataset, the podcast summary assessment corpus, a collection of podcast summaries that were evaluated by human experts at TREC2020. Compared to existing summary assessment data, this dataset has two unique aspects: (i) long-input, speech podcast based, documents; and (ii) an opportunity to detect inappropriate reference summaries in podcast corpus. First, we examine existing assessment methods, including model-free and model-based methods, and provide benchmark results for this long-input summary assessment dataset. Second, with the aim of filtering reference summary-document pairings for training, we apply summary assessment for data selection. The experimental results on these two aspects provide interesting insights on the summary assessment and generation tasks. The podcast summary assessment data is available.

Submitted to arXiv on 28 Aug. 2022

Automatic summary assessment is a valuable tool for evaluating both machine-generated and human-produced summaries. It allows for the development of summary generation systems and helps detect inappropriate summaries. Summary assessment can be conducted in various modes such as ranking summary generation systems, ranking summaries of a specific document or estimating the quality of a document-summary pair on an absolute scale. Existing datasets for summary assessment are typically based on news summarization datasets like CNN/DailyMail or XSum. This work introduces a new dataset called the podcast summary assessment corpus which consists of podcast summaries that were evaluated by human experts at TREC2020. This dataset is unique as it contains long-input speech-based documents and provides an opportunity to identify inappropriate reference summaries within the podcast corpus. The authors explore existing assessment methods including model-free and model-based approaches and present benchmark results using this long-input summary assessment dataset. Additionally, they apply summary assessment for data selection to filter reference summary-document pairings for training purposes. The experimental results shed light on both the summary assessment and generation tasks. The podcast summary assessment data is publicly available.
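
The three assessment modes named above can be illustrated with a minimal sketch. The score table, system names, and document ids below are hypothetical toy values, not data from the corpus, and the scores stand in for the output of whatever assessment method is being used.

```python
from statistics import mean

# Hypothetical toy scores: metric[system][doc_id] is the assessment score
# assigned to that system's summary of that document.
metric = {
    "sys_a": {"d1": 0.62, "d2": 0.40},
    "sys_b": {"d1": 0.55, "d2": 0.71},
    "sys_c": {"d1": 0.30, "d2": 0.35},
}

def rank_systems(scores):
    """Mode 1: rank systems by their mean score over all documents."""
    return sorted(scores, key=lambda s: mean(scores[s].values()), reverse=True)

def rank_summaries(scores, doc_id):
    """Mode 2: rank the summaries of one document (identified by system)."""
    return sorted(scores, key=lambda s: scores[s][doc_id], reverse=True)

def absolute_quality(scores, system, doc_id):
    """Mode 3: return the score of one document-summary pair as is."""
    return scores[system][doc_id]

print(rank_systems(metric))
print(rank_summaries(metric, "d1"))
print(absolute_quality(metric, "sys_b", "d2"))
```

The same per-pair scores feed all three modes; what changes is whether they are averaged per system, compared within one document, or read directly.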

Automatic summary assessment is a way to check if summaries made by machines or people are good. It helps make better summary systems and find summaries that are not right. There are different ways to do summary assessment, like ranking systems or estimating the quality of a summary. Most datasets for summary assessment are based on news summaries, but there is a new dataset for podcast summaries that was checked by experts. This dataset is special because it can be used to find bad reference summaries in the podcast collection. The authors of the study used different methods to assess the summaries and shared their results. Summary assessment is also used to pick out good examples for training purposes. The experiment gave us new information about making and checking summaries, and the podcast dataset is available for everyone.

Definitions

- Automatic: happening without needing someone to control it
- Summarize: make something shorter while keeping important points
- Assessment: checking how good something is
- Machine-generated: made by a machine
- Human-produced: made by a person
- Inappropriate: not suitable or right

Exploring the Benefits of Automatic Summary Assessment for Podcasts

Automatic summary assessment is a valuable tool for evaluating both machine-generated and human-produced summaries. It supports the development of summary generation systems and helps detect inappropriate summaries. Summary assessment can be conducted in various modes: ranking summary generation systems, ranking summaries of a specific document, or estimating the quality of a document-summary pair on an absolute scale. In this article, we explore the benefits of automatic summary assessment for podcasts by introducing a new dataset, the podcast summary assessment corpus, which consists of podcast summaries that were evaluated by human experts at TREC2020. We also discuss existing datasets used for summary assessment, existing model-free and model-based assessment methods, benchmark results on this long-input dataset, and how summary assessment can be applied to data selection to filter reference summary-document pairings for training.

Existing Datasets Used For Summary Assessment

Existing datasets with annotations for summary assessment are typically based on news summarization datasets such as CNN/DailyMail or XSum. These datasets pair comparatively short written documents with summaries that carry human quality annotations. They are useful for assessing summaries of short inputs but say little about much longer, speech-based inputs such as podcasts.

Introducing The Podcast Summary Assessment Corpus

The podcast summary assessment corpus is a collection of podcast summaries that were evaluated by human experts at TREC2020. Compared with existing summary assessment data, it has two unique aspects: its documents are long-input and speech-based, and it provides an opportunity to identify inappropriate reference summaries within the podcast corpus.

Evaluating Existing Methods For Summary Assessment

The authors explore existing assessment methods, both model-free and model-based. Widely used model-free metrics score a text against a reference: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) uses n-gram overlap with an emphasis on recall; BLEU (Bilingual Evaluation Understudy) uses n-gram precision; METEOR (Metric for Evaluation of Translation with Explicit ORdering) uses a harmonic mean of unigram precision and recall; CIDEr (Consensus-based Image Description Evaluation) uses cosine similarity between TF-IDF-weighted n-gram vectors; SARI (System output Against References and against the Input sentence) was designed for text simplification and compares system output against both the references and the input; and SPICE (Semantic Propositional Image Caption Evaluation) compares semantic propositions. Model-based approaches instead score summaries with neural network models trained on large text corpora.
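
To make the n-gram overlap idea behind ROUGE concrete, here is a minimal sketch of a ROUGE-N style score. It is a simplification under stated assumptions: whitespace tokenization, lowercasing, and no stemming or stopword handling, so its numbers will not match a full ROUGE implementation. The function names and example sentences are illustrative only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between two texts."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

candidate = "the host talks about cooking"
reference = "the host talks about italian cooking at home"
print(rouge_n(candidate, reference, n=1))
print(rouge_n(candidate, reference, n=2))
```

In this example every candidate unigram appears in the reference, so unigram precision is 1.0 while recall is 5/8, which shows why recall-oriented scores penalize summaries that leave reference content out.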

Benchmark Results Using The Podcast Summary Assessment Data

The authors present benchmark results for existing assessment methods on this long-input summary assessment dataset. They also apply summary assessment to data selection, using it to filter the reference summary-document pairings used for training, which gives an opportunity to detect inappropriate reference summaries in the podcast corpus.
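
Data selection by summary assessment can be sketched as a simple filter over document-summary pairs. Everything specific here is an assumption for illustration: the `coverage_score` scorer is a hypothetical stand-in, the threshold of 0.5 is arbitrary, and the example texts are invented. The source does not state which assessment method or threshold the authors use.

```python
def select_pairs(pairs, score_fn, threshold):
    """Keep (document, summary) pairs whose assessment score reaches the threshold."""
    return [(doc, summ) for doc, summ in pairs if score_fn(doc, summ) >= threshold]

def coverage_score(document, summary):
    """Hypothetical stand-in scorer: fraction of summary words found in the document."""
    doc_words = set(document.lower().split())
    summ_words = summary.lower().split()
    if not summ_words:
        return 0.0
    return sum(w in doc_words for w in summ_words) / len(summ_words)

transcript = "in this episode we talk about training for a first marathon"
pairs = [
    (transcript, "we talk about training for a marathon"),
    (transcript, "follow us on social media and subscribe"),
]

kept = select_pairs(pairs, coverage_score, threshold=0.5)
print(len(kept), "of", len(pairs), "pairs kept")
```

Passing the scorer in as a parameter keeps the filter independent of the assessment method, so a model-free or a model-based scorer can be swapped in without changing the selection step.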

Conclusion

The experimental results shed light on both the summary assessment and summary generation tasks for long-input, speech-based documents such as podcasts. Furthermore, the podcast summary assessment dataset is publicly available, making it easier for researchers to explore similar topics.

Created on 11 Dec. 2023
