Generative summarization models often generate summaries that are factually inconsistent with their input documents. To evaluate factual consistency, Natural Language Inference (NLI) models are commonly used but have limited success due to the lack of entailment phenomena in NLI datasets. Previous work improved NLI models with synthetic training data generated by perturbing human-written summaries, but this approach has limitations in coverage and style differences from real model-generated summaries. Large language models (LLMs) have shown promising results in directly evaluating generative tasks such as factual consistency in summarization; however, they are too computationally expensive for practical use. To address these limitations, TrueTeacher is introduced as a method for generating synthetic data by annotating diverse model-generated summaries using a LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using TrueTeacher data substantially outperforms both state-of-the-art models with similar capacity and the LLM teacher. A systematic study demonstrates TrueTeacher's superiority and robustness to domain-shift compared to existing synthetic data generation methods. The method also generalizes to multilingual scenarios using the mFACE dataset. Finally, a large-scale synthetic dataset with 1.4M examples generated using TrueTeacher is released, highlighting the usefulness of this method for improving factual consistency evaluation in summarization models.
- Generative summarization models often produce summaries that are factually inconsistent with their input documents.
- Natural Language Inference (NLI) models are commonly used to evaluate factual consistency, but have limited success due to the lack of entailment phenomena in NLI datasets.
- Previous work improved NLI models with synthetic training data generated by perturbing human-written summaries, but this approach has limitations in coverage and style differences from real model-generated summaries.
- Large language models (LLMs) have shown promising results in directly evaluating generative tasks such as factual consistency in summarization, but they are too computationally expensive for practical use.
- TrueTeacher is introduced as a method for generating synthetic data by annotating diverse model-generated summaries using a LLM.
- TrueTeacher does not rely on human-written summaries and is multilingual by nature.
- Experiments on the TRUE benchmark show that a student model trained using TrueTeacher data substantially outperforms both state-of-the-art models with similar capacity and the LLM teacher.
- A systematic study demonstrates TrueTeacher's superiority and robustness to domain-shift compared to existing synthetic data generation methods.
- The method also generalizes to multilingual scenarios using the mFACE dataset.
- Finally, a large-scale synthetic dataset with 1.4M examples generated using TrueTeacher is released, highlighting the usefulness of this method for improving factual consistency evaluation in summarization models.
1. Sometimes computer programs that summarize information can make mistakes.
2. People use a type of computer program called Natural Language Inference (NLI) to check if the summaries are correct, but it doesn't always work well.
3. Some people tried to improve NLI by training it on examples made by changing human-written summaries, but this method has some problems.
4. There is another type of computer program called Large Language Models (LLMs) that can check if summaries are correct, but they are too slow and expensive to use all the time.
5. A new method called TrueTeacher was created to generate examples for training programs that check whether summaries are correct.
Definitions
- Generative summarization models: Computer programs that create summaries of text automatically
- Factual consistency: When everything a summary states is supported by the original text
- Natural Language Inference (NLI): A type of computer program that checks if one sentence logically follows from another sentence
- Synthetic training data: Examples created by a computer program instead of being written by humans
- Large language models (LLMs): Very powerful and complex computer programs used for natural language processing tasks such as summarization
- Multilingual: Able to understand and process multiple languages
- Benchmark: A standard set of tests used to compare different methods or systems
Improving Factual Consistency Evaluation in Summarization Models with TrueTeacher
Generative summarization models create summaries of documents, but the summaries are often factually inconsistent with the input. Natural Language Inference (NLI) models have been used to evaluate the factual consistency of generated summaries; however, they have limited success due to the lack of entailment phenomena in NLI datasets. Previous work has attempted to improve NLI models with synthetic training data generated by perturbing human-written summaries; however, this approach has limited coverage, and the perturbed summaries differ in style from real model-generated summaries.
To address these limitations, TrueTeacher is introduced as a method for generating synthetic data by annotating diverse model-generated summaries using a large language model (LLM). Unlike prior work, TrueTeacher does not rely on human-written summaries and is multilingual by nature. Experiments conducted on the TRUE benchmark show that a student model trained using TrueTeacher data substantially outperforms both state-of-the-art models with similar capacity and the LLM teacher. A systematic study demonstrates TrueTeacher's superiority and robustness to domain shift compared to existing synthetic data generation methods. The method also generalizes well to multilingual scenarios using the mFACE dataset. Finally, a large-scale synthetic dataset with 1.4 million examples generated using TrueTeacher is released, which highlights its usefulness for improving factual consistency evaluation in summarization models.
Natural Language Inference Models
Natural Language Inference (NLI) models are commonly used to evaluate the factual consistency of generative summarization models; however, they have limited success due to the lack of entailment phenomena in NLI datasets. An NLI model decides whether a premise entails a hypothesis; for summarization, the document serves as the premise and the summary as the hypothesis. A summary can differ subtly from its source without directly contradicting it, and a model trained on data that does not cover such cases struggles to decide whether the summary is actually supported by the document.
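The NLI framing of consistency checking can be sketched as follows. This is a minimal illustration, not an actual NLI model: `toy_entailment_prob` is a hypothetical word-overlap stand-in for a trained model's entailment probability, and the `threshold` parameter is likewise an assumption made for the example.

```python
from typing import Callable


def toy_entailment_prob(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an NLI model: the fraction of hypothesis
    tokens that also appear in the premise. A real system would return the
    entailment probability of a trained NLI classifier instead."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    matched = sum(token in premise_tokens for token in hypothesis_tokens)
    return matched / len(hypothesis_tokens)


def is_consistent(
    document: str,
    summary: str,
    entailment_prob: Callable[[str, str], float],
    threshold: float = 0.5,
) -> bool:
    """Treat the document as the premise and the summary as the hypothesis,
    and call the summary consistent if the entailment score is high enough."""
    return entailment_prob(document, summary) >= threshold


document = "the cat sat on the mat"
print(is_consistent(document, "the cat sat", toy_entailment_prob, threshold=0.9))
print(is_consistent(document, "the dog sat", toy_entailment_prob, threshold=0.9))
```

Passing the scorer in as a function keeps the decision rule separate from the model, so the toy scorer can be swapped for a real one without changing `is_consistent`.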
Synthetic Training Data Generation
Previous work attempted to improve NLI performance with synthetic training data generated from human-written summaries; however, this approach has limited coverage, and the resulting summaries differ in style from real model-generated ones. To address these issues, TrueTeacher was developed as an alternative method for generating synthetic training data. It works by annotating diverse model-generated summaries using a large language model (LLM). Unlike prior approaches, TrueTeacher does not rely on human-written summaries, which makes it more versatile than earlier methods.
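The data generation loop described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `lead_sentence`, `faulty_summarizer`, and `toy_judge` are hypothetical placeholders for real summarization models and the LLM annotator, and the `Example` record is an illustrative data layout.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:
    document: str
    summary: str
    label: int  # 1 = factually consistent, 0 = inconsistent


def build_synthetic_dataset(
    documents: Iterable[str],
    summarizers: Iterable[Callable[[str], str]],
    llm_judge: Callable[[str, str], bool],
) -> List[Example]:
    """Summarize each document with several models, then let the judge
    (the LLM in the described method) label each summary."""
    summarizers = list(summarizers)
    examples = []
    for document in documents:
        for summarize in summarizers:
            summary = summarize(document)
            label = 1 if llm_judge(document, summary) else 0
            examples.append(Example(document, summary, label))
    return examples


# Hypothetical placeholders so the sketch runs without any model.
def lead_sentence(document: str) -> str:
    return document.split(".")[0].strip() + "."


def faulty_summarizer(document: str) -> str:
    # Simulates a model that introduces a factual error.
    return lead_sentence(document).replace("Paris", "Rome")


def toy_judge(document: str, summary: str) -> bool:
    # Stand-in for the LLM: accepts only summaries copied from the document.
    return summary.rstrip(".") in document


docs = ["The meeting was held in Paris. It lasted two hours."]
dataset = build_synthetic_dataset(docs, [lead_sentence, faulty_summarizer], toy_judge)
for example in dataset:
    print(example.label, example.summary)
```

The labeled pairs would then serve as training data for a smaller student model; using several summarizers is what provides the diversity of model-generated summaries.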
Experimental Results
Experiments conducted on the TRUE benchmark showed that student models trained using TrueTeacher data substantially outperformed both state-of-the-art models with similar capacity and even the LLM teacher itself. Furthermore, a systematic study demonstrated superior results compared to existing synthetic data generation methods, especially under domain shift across different types of text sources, such as news articles versus web pages. Additionally, multilingual experiments using the mFACE dataset showed that the method generalizes across languages without any additional modifications. Finally, a large-scale dataset of 1.4 million examples created with TrueTeacher was released, highlighting its potential usefulness for evaluating factual consistency in summarization at scale.
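To make the comparison between models concrete, the following sketch scores a consistency classifier with ROC AUC. The choice of metric is an assumption here, since the text above does not name one; the labels and scores are made-up illustrative values, not results from the experiments.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a randomly chosen consistent example
    (label 1) receives a higher score than a randomly chosen inconsistent
    one (label 0); ties count as half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one example of each class")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative values only: gold labels and a model's consistency scores.
gold_labels = [1, 0, 1, 0]
model_scores = [0.9, 0.2, 0.4, 0.6]
print(roc_auc(gold_labels, model_scores))  # 0.75
```

Because ROC AUC depends only on how the scores rank the examples, it allows a student model and an LLM teacher to be compared without choosing a decision threshold for either.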
Conclusion
In conclusion, TrueTeacher provides an effective way to improve factual consistency evaluation in summarization by creating high-quality synthetic datasets without relying on human-written summaries, as previous approaches did. Student models trained on this data are cheaper to run than the LLM teacher and outperform models trained with earlier synthetic data generation methods.