FABLES: Evaluating faithfulness and content selection in book-length summarization

AI-generated keywords: FABLES

AI-generated Key Points

⚠ The license of the paper does not allow us to build upon its content, and the key points are generated using the paper metadata rather than the full article.

Authors conducted a large-scale human evaluation on summaries of books published in 2023 or 2024
Created the FABLES dataset at a cost of $5.2K USD
Ranked LLM summarizers based on faithfulness, with Claude-3-Opus significantly outperforming all other closed-source LLMs and the open-source Mixtral on par with GPT-3.5-Turbo
Most unfaithful claims in the summaries related to events and character states, requiring indirect reasoning to invalidate
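LLM-based auto-raters have proven reliable for factuality and coherence in other settings, but none of the LLM faithfulness raters the authors implemented correlated strongly with human annotations, especially when detecting unfaithful claims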
Importance of detecting unfaithful claims highlighted as a crucial future direction for summarization evaluation
Explored content selection errors by developing a typology of omission errors related to crucial narrative elements
Systematic over-emphasis on events occurring towards the end of the book

Authors: Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, Mohit Iyyer

arXiv: 2404.01261v1 (cs.CL)

preprint - 39 pages

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: While long-context large language models (LLMs) can technically summarize book-length documents (>100K tokens), the length and complexity of the documents have so far prohibited evaluations of input-dependent aspects like faithfulness. In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books. Our study mitigates the issue of data contamination by focusing on summaries of books published in 2023 or 2024, and we hire annotators who have fully read each book prior to the annotation task to minimize cost and cognitive burden. We collect FABLES, a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD, which allows us to rank LLM summarizers based on faithfulness: Claude-3-Opus significantly outperforms all closed-source LLMs, while the open-source Mixtral is on par with GPT-3.5-Turbo. An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate. While LLM-based auto-raters have proven reliable for factuality and coherence in other settings, we implement several LLM raters of faithfulness and find that none correlates strongly with human annotations, especially with regard to detecting unfaithful claims. Our experiments suggest that detecting unfaithful claims is an important future direction not only for summarization evaluation but also as a testbed for long-context understanding. Finally, we move beyond faithfulness by exploring content selection errors in book-length summarization: we develop a typology of omission errors related to crucial narrative elements and also identify a systematic over-emphasis on events occurring towards the end of the book.

Submitted to arXiv on 01 Apr. 2024

Results of the summarizing process for the arXiv paper: 2404.01261v1

Comprehensive Summary

In their paper "FABLES: Evaluating Faithfulness and Content Selection in Book-Length Summarization," authors Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, and Mohit Iyyer address the challenge of evaluating faithfulness and content selection in summaries of fictional books generated by long-context large language models (LLMs). The authors conducted a large-scale human evaluation focusing on summaries of books published in 2023 or 2024 to mitigate data contamination. To minimize cost and cognitive burden, they hired annotators who had fully read each book before the annotation task. They collected annotations on 3,158 claims made in LLM-generated summaries of 26 books, producing the FABLES dataset at a cost of $5.2K USD. This dataset allowed them to rank LLM summarizers by faithfulness: Claude-3-Opus significantly outperformed all other closed-source LLMs, and the open-source Mixtral performed on par with GPT-3.5-Turbo. Analysis of the annotations revealed that most unfaithful claims relate to events and character states and generally require indirect reasoning over the narrative to invalidate. Although LLM-based auto-raters have proven reliable for factuality and coherence in other settings, none of the LLM faithfulness raters implemented by the authors correlated strongly with human annotations, especially when detecting unfaithful claims. The experiments suggest that detecting unfaithful claims is an important future direction both for summarization evaluation and as a testbed for long-context understanding. Finally, the authors explored content selection errors in book-length summarization by developing a typology of omission errors related to crucial narrative elements, and they identified a systematic over-emphasis on events occurring towards the end of the book.
Overall, this research clarifies the challenges of evaluating faithfulness and content selection in LLM-generated summaries of fictional books and points to concrete directions for future work on summarization.

Layman's Summary

The authors studied computer-written summaries of books to see if they were accurate. They made a dataset called FABLES for $5.2K USD. They compared different summarizers and found that one named Claude-3-Opus was the best at being faithful. Some summaries had wrong information about events and characters, and spotting these mistakes needed careful thinking about the story. Automatic checkers built to judge faithfulness did not match human judgment well. Finding mistakes in summaries is an important problem for future work.

Definitions

- Authors: People who write books or research papers.
- Summaries: Short descriptions that tell the main points of a story or article.
- Dataset: A collection of data used for analysis or research.
- Faithfulness: Being true or accurate to the original source.
- Unfaithful: Not truthful or accurate.
- Coherence: Making sense or being logical.
- Evaluation: Assessing something to determine its value or quality.
- Omission errors: Leaving out important details or parts.

Introduction

With the rise of large language models (LLMs) such as GPT-3, there has been a growing interest in using these models for text summarization tasks. However, evaluating the faithfulness and content selection of LLM-generated summaries remains a challenge. In their paper titled "FABLES: Evaluating Faithfulness and Content Selection in Book-Length Summarization," authors Yekyung Kim et al. address this issue by conducting a large-scale human evaluation on summaries of fictional books generated by LLMs.

The FABLES Dataset

To evaluate faithfulness and content selection in book-length summarization, the authors created the FABLES dataset. It consists of human annotations on 3,158 claims made in LLM-generated summaries of 26 books published in 2023 or 2024, a choice of recent books that mitigates data contamination. To minimize cost and cognitive burden, the authors hired annotators who had already fully read each book before the annotation task. Collecting the annotations cost $5.2K USD, and the resulting claim-level labels allowed the authors to rank LLM summarizers based on faithfulness.
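As a rough illustration of how claim-level labels can turn into a ranking, the sketch below computes the share of claims labeled faithful for each summarizer. The label names, the (model, label) record layout, and the toy data are all assumptions made for this example; they are not taken from FABLES, and the paper may aggregate its annotations differently.

```python
from collections import defaultdict


def faithfulness_by_model(annotations):
    """Return {model: fraction of claims labeled 'faithful'}, highest first.

    `annotations` is an iterable of (model_name, label) pairs. The label
    scheme ('faithful' / 'unfaithful') is an assumption for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # model -> [faithful, total]
    for model, label in annotations:
        counts[model][1] += 1
        if label == "faithful":
            counts[model][0] += 1
    rates = {m: faithful / total for m, (faithful, total) in counts.items()}
    # Sort so that the most faithful summarizer comes first.
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical toy annotations, not real FABLES data.
toy_annotations = [
    ("model_a", "faithful"),
    ("model_a", "faithful"),
    ("model_a", "unfaithful"),
    ("model_a", "faithful"),
    ("model_b", "faithful"),
    ("model_b", "unfaithful"),
]

ranking = faithfulness_by_model(toy_annotations)
for model, rate in ranking.items():
    print(f"{model}: {rate:.2f}")
```

A per-model rate like this is only one possible summary statistic; testing whether differences between models are significant would need additional analysis that this sketch does not attempt.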

Faithfulness Evaluation Results

The faithfulness evaluation produced a clear ranking. Claude-3-Opus, a closed-source model developed by Anthropic, significantly outperformed all other closed-source LLMs. Mixtral, an open-source model, performed on par with GPT-3.5-Turbo. Further analysis of the annotations showed that most unfaithful claims in the summaries relate to events and character states, and that invalidating them generally requires indirect reasoning over the narrative. This makes detecting unfaithful claims a crucial future direction for summarization evaluation.

Challenges with Auto-Raters

The authors also explored the use of LLM-based auto-raters for evaluating faithfulness. While these auto-raters have shown reliability in other contexts, they did not correlate strongly with human annotations when detecting unfaithful claims in book-length summarization. This highlights the need for further research and development in this area.
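One way to see why detecting unfaithful claims deserves separate attention is to score an auto-rater on the unfaithful class alone rather than by overall accuracy. The sketch below is a minimal example under assumed conditions: the label names and the toy labels are hypothetical, and the metrics shown (precision, recall, F1) are common choices, not necessarily the ones the authors report.

```python
def class_scores(human, auto, target="unfaithful"):
    """Precision, recall, and F1 of `auto` labels for one target class,
    treating `human` labels as the reference."""
    if len(human) != len(auto):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human, auto))
    tp = sum(h == target and a == target for h, a in pairs)
    fp = sum(h != target and a == target for h, a in pairs)
    fn = sum(h == target and a != target for h, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels: most claims are faithful, a few are not.
human_labels = ["faithful"] * 8 + ["unfaithful"] * 2
auto_labels = ["faithful"] * 9 + ["unfaithful"] * 1

accuracy = sum(h == a for h, a in zip(human_labels, auto_labels)) / len(human_labels)
precision, recall, f1 = class_scores(human_labels, auto_labels)
print(f"accuracy={accuracy:.2f} unfaithful recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the rater agrees with the human labels on 9 of 10 claims yet finds only half of the unfaithful ones, which is the kind of gap that overall agreement can hide when unfaithful claims are rare.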

Content Selection Errors

In addition to evaluating faithfulness, the authors also looked at content selection errors in LLM-generated summaries. They developed a typology of omission errors related to crucial narrative elements and found a systematic over-emphasis on events occurring towards the end of the book. This highlights another challenge in using LLMs for summarization - ensuring that important information from throughout the text is included in the summary.
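A simple way to probe for this kind of positional skew is to map each summarized event to its relative position in the book and compare how many fall in each part. The sketch below assumes such a mapping already exists (positions between 0.0 for the start and 1.0 for the end); the function name, the three-way split, and the toy positions are hypothetical, and the page does not describe how the authors actually measured the effect.

```python
def position_shares(positions, n_bins=3):
    """Share of summarized events falling in each equal-width part of the book.

    `positions` holds relative locations in [0.0, 1.0]; how an event is
    mapped to a location is assumed to be done elsewhere.
    """
    counts = [0] * n_bins
    for p in positions:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"position out of range: {p}")
        # A position of exactly 1.0 belongs to the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(positions)
    return [c / total for c in counts] if total else [0.0] * n_bins


# Hypothetical positions of events mentioned in one summary.
toy_positions = [0.05, 0.40, 0.70, 0.85, 0.90, 0.95]

shares = position_shares(toy_positions)
for name, share in zip(["beginning", "middle", "end"], shares):
    print(f"{name}: {share:.2f}")
```

A summary that drew evenly on the whole book would give roughly equal shares; a large share in the final bin, as in this toy input, would be consistent with over-emphasis on the ending.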

Conclusion

Overall, this research paper provides valuable insights into evaluating faithfulness and content selection in LLM-generated summaries of fictional books. The creation of the FABLES dataset and analysis of its results shed light on both challenges and opportunities for future advancements in summarization technology. The findings from this study highlight the importance of detecting unfaithful claims as well as ensuring comprehensive content selection when using LLMs for summarization tasks. It also emphasizes the need for further research and development to improve auto-raters' performance in detecting unfaithful claims. As more advanced language models continue to emerge, it will be interesting to see how they perform on these evaluation metrics and what new challenges may arise. The FABLES dataset serves as a valuable resource for future studies on evaluating faithfulness and content selection in book-length summarization, providing a solid foundation for further advancements in this field.

Created on 19 Apr. 2024
