FABLES: Evaluating faithfulness and content selection in book-length summarization

AI-generated keywords: FABLES

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors conducted a large-scale human evaluation on summaries of books published in 2023 or 2024
  • Created the FABLES dataset at a cost of $5.2K USD
  • Ranked LLM summarizers based on faithfulness: Claude-3-Opus significantly outperformed all other closed-source LLMs, while the open-source Mixtral was on par with GPT-3.5-Turbo
  • Most unfaithful claims in the summaries related to events and character states, requiring indirect reasoning to invalidate
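  • LLM-based auto-raters have proven reliable for factuality and coherence in other settings, but none of the LLM raters of faithfulness implemented by the authors correlated strongly with human annotations, especially when detecting unfaithful claims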
  • Importance of detecting unfaithful claims highlighted as a crucial future direction for summarization evaluation
  • Explored content selection errors by developing a typology of omission errors related to crucial narrative elements
  • Identified a systematic over-emphasis in the summaries on events occurring towards the end of the book

Authors: Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, Mohit Iyyer

preprint - 39 pages

Abstract: While long-context large language models (LLMs) can technically summarize book-length documents (>100K tokens), the length and complexity of the documents have so far prohibited evaluations of input-dependent aspects like faithfulness. In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books. Our study mitigates the issue of data contamination by focusing on summaries of books published in 2023 or 2024, and we hire annotators who have fully read each book prior to the annotation task to minimize cost and cognitive burden. We collect FABLES, a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD, which allows us to rank LLM summarizers based on faithfulness: Claude-3-Opus significantly outperforms all closed-source LLMs, while the open-source Mixtral is on par with GPT-3.5-Turbo. An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate. While LLM-based auto-raters have proven reliable for factuality and coherence in other settings, we implement several LLM raters of faithfulness and find that none correlates strongly with human annotations, especially with regard to detecting unfaithful claims. Our experiments suggest that detecting unfaithful claims is an important future direction not only for summarization evaluation but also as a testbed for long-context understanding. Finally, we move beyond faithfulness by exploring content selection errors in book-length summarization: we develop a typology of omission errors related to crucial narrative elements and also identify a systematic over-emphasis on events occurring towards the end of the book.

Submitted to arXiv on 01 Apr. 2024

Results of the summarizing process for the arXiv paper: 2404.01261v1

This paper's license does not allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

In their paper titled "FABLES: Evaluating Faithfulness and Content Selection in Book-Length Summarization," authors Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, and Mohit Iyyer address the challenges of evaluating faithfulness and content selection in summaries of fictional books generated by long-context large language models (LLMs). To mitigate data contamination, the authors conducted a large-scale human evaluation focusing on summaries of books published in 2023 or 2024. They hired annotators who had fully read each book prior to the annotation task, which minimized cost and cognitive burden. The authors collected annotations on 3,158 claims made in LLM-generated summaries of 26 books, creating the FABLES dataset at a cost of $5.2K USD. This dataset allowed them to rank LLM summarizers by faithfulness: Claude-3-Opus significantly outperformed all other closed-source LLMs, while the open-source Mixtral performed on par with GPT-3.5-Turbo.

Analysis of the annotations revealed that most unfaithful claims in the summaries related to events and character states, and that invalidating them generally required indirect reasoning over the narrative. Although LLM-based auto-raters have proven reliable for factuality and coherence in other settings, none of the LLM raters of faithfulness implemented by the authors correlated strongly with human annotations, especially when detecting unfaithful claims. The experiments suggest that detecting unfaithful claims is an important future direction, both for summarization evaluation and as a testbed for long-context understanding. Finally, the authors explored content selection errors in book-length summarization by developing a typology of omission errors related to crucial narrative elements, and they identified a systematic over-emphasis on events occurring towards the end of the book.

Overall, this research sheds light on the challenges and opportunities in evaluating faithfulness and content selection in LLM-generated summaries of fictional books, providing valuable insights for future advancements in summarization technology.
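
To make the idea of ranking summarizers from claim-level annotations concrete, here is a minimal sketch. It assumes each annotated claim carries a summarizer name and a binary human label, and it scores a summarizer by the fraction of its claims labeled faithful. The paper's exact metric and label set are not stated in the metadata, so this scoring rule, the model names `model_a` and `model_b`, and every row of data below are hypothetical and are not taken from FABLES.

```python
from collections import defaultdict

# Hypothetical claim-level annotations: (summarizer name, human label).
# These rows are invented for illustration and are NOT FABLES data.
annotations = [
    ("model_a", "faithful"),
    ("model_a", "faithful"),
    ("model_a", "unfaithful"),
    ("model_a", "faithful"),
    ("model_b", "faithful"),
    ("model_b", "unfaithful"),
    ("model_b", "unfaithful"),
    ("model_b", "faithful"),
]


def faithfulness_rates(rows):
    """Return {model: fraction of its claims labeled 'faithful'}."""
    counts = defaultdict(lambda: [0, 0])  # model -> [faithful, total]
    for model, label in rows:
        counts[model][1] += 1
        if label == "faithful":
            counts[model][0] += 1
    return {model: faithful / total for model, (faithful, total) in counts.items()}


def rank_models(rows):
    """Sort models from highest to lowest faithfulness rate."""
    rates = faithfulness_rates(rows)
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


ranking = rank_models(annotations)
for model, rate in ranking:
    print(f"{model}: {rate:.0%} of claims labeled faithful")
```

A per-claim rate like this treats every claim equally, so a summarizer that produces more claims per summary contributes more rows; averaging per summary or per book instead would be an equally plausible design choice.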
Created on 19 Apr. 2024
