TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models

AI-generated keywords: Factual Consistency, Natural Language Inference, Generative Summarization, TrueTeacher, Synthetic Data

AI-generated Key Points

Generative summarization models often produce summaries that are factually inconsistent with their input documents.
Natural Language Inference (NLI) models are commonly used to evaluate factual consistency, but they have limited success in evaluating summaries.
Previous work improved NLI models with synthetic training data generated by perturbing human-written summaries, but such data often differs in its characteristics from real model-generated summaries and has limited coverage of possible factual errors.
Large language models (LLMs) have shown promising results in directly evaluating generative tasks, including factual consistency in summarization, but they are too computationally expensive for practical use.
TrueTeacher is introduced as a method for generating synthetic data by annotating diverse model-generated summaries using an LLM.
TrueTeacher does not rely on human-written summaries and is multilingual by nature.
Experiments on the TRUE benchmark show that a student model trained using TrueTeacher data substantially outperforms both state-of-the-art models with similar capacity and the LLM teacher.
A systematic study demonstrates TrueTeacher's superiority and robustness to domain-shift compared to existing synthetic data generation methods.
The method also generalizes to multilingual scenarios using the mFACE dataset.
Finally, a large-scale synthetic dataset with 1.4M examples generated using TrueTeacher is released, highlighting the usefulness of this method for improving factual consistency evaluation in summarization models.

Authors: Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, Idan Szpektor

arXiv: 2305.11171v1 (cs.CL)

License: CC BY 4.0

Abstract: Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries, and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using our data substantially outperforms both the state-of-the-art model with similar capacity, and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain-shift. Using the mFACE dataset, we also show that our method generalizes to multilingual scenarios. Finally, we release a large-scale synthetic dataset with 1.4M examples generated using TrueTeacher.

Submitted to arXiv on 18 May 2023

Results of the summarizing process for the arXiv paper: 2305.11171v1

Comprehensive Summary

Generative summarization models often generate summaries that are factually inconsistent with their input documents. Natural Language Inference (NLI) models are commonly used to evaluate factual consistency, but they have limited success in evaluating summaries. Previous work improved NLI models with synthetic training data generated by perturbing human-written summaries, but such data often differs in its characteristics from real model-generated summaries and has limited coverage of possible factual errors. Large language models (LLMs) have shown promising results in directly evaluating generative tasks, including factual consistency in summarization; however, they are too computationally expensive for practical use. To address these limitations, TrueTeacher is introduced as a method for generating synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using TrueTeacher data substantially outperforms both the state-of-the-art model with similar capacity and the LLM teacher. A systematic study demonstrates TrueTeacher's superiority and robustness to domain shift compared to existing synthetic data generation methods. The method also generalizes to multilingual scenarios, as shown using the mFACE dataset. Finally, a large-scale synthetic dataset with 1.4M examples generated using TrueTeacher is released.

Layman's Summary

1. Sometimes computer programs that summarize information make mistakes.
2. People use a type of computer program called a Natural Language Inference (NLI) model to check whether the summaries are correct, but it doesn't always work well.
3. Some people tried to improve NLI models by training them on examples made by altering human-written summaries, but this method has some problems.
4. Another type of computer program, called a Large Language Model (LLM), can check whether summaries are correct, but it is too slow and expensive to use all the time.
5. A new method called TrueTeacher was created to generate examples for training smaller models to check whether summaries are correct.

Definitions

- Generative summarization models: Computer programs that create summaries of text automatically
- Factual consistency: When everything a summary says is supported by the original text
- Natural Language Inference (NLI): A task in which a computer program checks whether one piece of text logically follows from another
- Synthetic training data: Examples created by a computer program instead of being written by humans
- Large language models (LLMs): Very powerful and complex computer programs used for natural language processing tasks such as summarization
- Multilingual: Able to understand and process multiple languages
- Benchmark: A standard set of tests used to compare different methods or systems

Improving Factual Consistency Evaluation in Summarization Models with TrueTeacher

Generative summarization models are used to create summaries of documents, but they often generate summaries that are factually inconsistent with their input documents. Natural Language Inference (NLI) models are commonly used to evaluate the factual consistency of these summaries; however, they have limited success on this task. Previous work attempted to improve NLI models with synthetic training data generated by perturbing human-written summaries, but such data often differs in its characteristics from real model-generated summaries and has limited coverage of possible factual errors. Large language models (LLMs) can evaluate summaries directly, but they are too computationally expensive for practical use. To address these limitations, a new method called TrueTeacher generates synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using TrueTeacher data substantially outperforms both the state-of-the-art model with similar capacity and the LLM teacher. A systematic study demonstrates TrueTeacher's superiority and robustness to domain shift compared to existing synthetic data generation methods. The method also generalizes to multilingual scenarios, as shown using the mFACE dataset. Finally, a large-scale synthetic dataset with 1.4 million examples generated using TrueTeacher is released.

Natural Language Inference Models

Natural Language Inference (NLI) models are commonly used to evaluate the factual consistency of generative summarization models. An NLI model judges whether a hypothesis follows from (is entailed by) a premise; for summarization, the input document plays the role of the premise and the summary plays the role of the hypothesis. In practice, these models have limited success in evaluating summaries. A summary can differ from its source in subtle ways that do not plainly contradict the document but are still not supported by it, and such cases need careful analysis before the summary can be accepted as consistent with the original content.
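To make the framing concrete, here is a minimal Python sketch of consistency checking posed as an entailment decision, with the document as the premise and the summary as the hypothesis. The names `is_consistent`, `entailment_score`, and `toy_overlap_score`, as well as the 0.5 threshold, are illustrative assumptions rather than anything taken from the paper. The toy scorer is only a token-overlap stand-in so that the example runs; it is not an NLI model.

```python
from typing import Callable


def is_consistent(
    document: str,
    summary: str,
    entailment_score: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, chosen only for illustration
) -> bool:
    """Treat the document as the premise and the summary as the hypothesis."""
    return entailment_score(document, summary) >= threshold


def toy_overlap_score(premise: str, hypothesis: str) -> float:
    """Stand-in scorer (NOT a real NLI model): the fraction of hypothesis
    tokens that also appear in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    hits = sum(token in premise_tokens for token in hypothesis_tokens)
    return hits / len(hypothesis_tokens)


document = "the cat sat on the mat"
supported = is_consistent(document, "the cat sat", toy_overlap_score)
unsupported = is_consistent(document, "the dog ran", toy_overlap_score)
```

In a real setup, `entailment_score` would be replaced by a trained NLI model's entailment probability; the rest of the interface could stay the same.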

Synthetic Training Data Generation

Previous work attempted to improve NLI performance with synthetic training data generated by perturbing human-written summaries. This approach has two limitations: the perturbed summaries often differ in their characteristics from real model-generated summaries, and they cover only a limited range of possible factual errors. TrueTeacher was developed as an alternative method for generating synthetic training data. It works by annotating diverse model-generated summaries using a large language model (LLM). Unlike prior approaches, TrueTeacher does not rely on human-written summaries, which also makes it applicable across languages.
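The data generation idea can be sketched in a few lines of Python: several summarization models each summarize a document, and a teacher labels every (document, summary) pair. This is a simplified illustration under stated assumptions, not the authors' implementation. `Example`, `build_synthetic_dataset`, and the toy summarizers and teacher below are hypothetical names; in the method described above, the teacher would be an LLM and the summarizers would be real summarization models.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence


@dataclass
class Example:
    document: str
    summary: str
    label: int  # 1 = factually consistent, 0 = inconsistent


Summarizer = Callable[[str], str]
Teacher = Callable[[str, str], int]


def build_synthetic_dataset(
    documents: Iterable[str],
    summarizers: Sequence[Summarizer],
    teacher: Teacher,
) -> List[Example]:
    """Summarize each document with every summarizer, then let the teacher
    annotate each (document, summary) pair with a consistency label."""
    examples = []
    for document in documents:
        for summarize in summarizers:
            summary = summarize(document)
            examples.append(Example(document, summary, teacher(document, summary)))
    return examples


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def lead_summarizer(document: str) -> str:
    """Return the first sentence of the document."""
    return document.split(".")[0] + "."


def faulty_summarizer(document: str) -> str:
    """Return the first sentence with a deliberately injected factual error."""
    return lead_summarizer(document).replace("France", "Spain")


def toy_teacher(document: str, summary: str) -> int:
    """Placeholder for the LLM teacher: exact substring match only."""
    return 1 if summary in document else 0


dataset = build_synthetic_dataset(
    ["Paris is in France. It is a large city."],
    [lead_summarizer, faulty_summarizer],
    toy_teacher,
)
```

The resulting labeled examples are what a smaller student model would be trained on. Because no human-written reference summary appears anywhere in the loop, the same procedure can in principle be run on documents in any language the summarizers and teacher support.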

Experimental Results

Experiments conducted on the TRUE benchmark showed that a student model trained using TrueTeacher data substantially outperformed both the state-of-the-art model with similar capacity and the LLM teacher itself. Furthermore, a systematic comparison against existing synthetic data generation methods demonstrated TrueTeacher's superiority and its robustness to domain shift, that is, to evaluation data that differs in kind from the training data. Multilingual experiments on the mFACE dataset showed that the method also generalizes across languages. Finally, a large-scale dataset of 1.4 million examples created with TrueTeacher was released for use in evaluating factual consistency in summarization.
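Comparing consistency classifiers requires a metric. The text above does not say which metric the paper reports, so the Python sketch below assumes ROC AUC, a threshold-free measure often used for binary classifiers of this kind. The function `roc_auc` is a plain-Python illustration computed from pairwise comparisons, and the labels and scores are made-up values, not results from the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a randomly chosen consistent example (label 1) is
    scored higher than a randomly chosen inconsistent one (label 0),
    counting ties as half a win."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up gold labels and model scores, for illustration only.
gold = [1, 0, 1, 0]
model_scores = [0.9, 0.8, 0.4, 0.1]
auc = roc_auc(gold, model_scores)
```

A score of 1.0 means every consistent summary outranks every inconsistent one, while 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of examples, which is fine for a sketch but not for large benchmarks.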

Conclusion

In conclusion, TrueTeacher offers an effective way to improve factual consistency evaluation for summarization. It produces synthetic training data by having an LLM annotate model-generated summaries, so it does not depend on human-written summaries as previous approaches did. A student model trained on this data is cheaper to run than the LLM teacher while outperforming it on the TRUE benchmark.

Created on 19 May 2023

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.