BARTScore: Evaluating Generated Text as Text Generation

AI-generated keywords: BARTScore, Evaluation Metrics, Pre-trained models, NLP

AI-generated Key Points

  • Authors focus on evaluating the quality of generated text in NLP applications
  • Introduce BARTScore, a method that uses pre-trained models to evaluate fluency, accuracy, and effectiveness of generated texts
  • BARTScore outperforms existing metrics in 16 out of 22 test settings across 16 datasets and 7 perspectives
  • BARTScore performs better than other metrics in most settings for text summarization tasks
  • Fine-tuning tasks can improve BARTScore's performance on some datasets but not others
  • BARTScore performs well in evaluating factuality compared to other metrics like FactCC and QAGS when the underlying BART model is fine-tuned on the CNN/DailyMail summarization dataset
  • Human evaluation is important for assessing text quality from different perspectives
  • Existing evaluation metrics have limitations, while BARTScore aims to provide a comprehensive approach
  • Code and an interactive leaderboard are provided for researchers to evaluate different metrics

Authors: Weizhe Yuan, Graham Neubig, Pengfei Liu

Demo at http://explainaboard.nlpedia.ai/leaderboard/task-meval/
License: CC ZERO 1.0

Abstract: A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference output or the source text will achieve higher scores when the generated text is better. We operationalize this idea using BART, an encoder-decoder based pre-trained model, and propose a metric BARTScore with a number of variants that can be flexibly applied in an unsupervised fashion to evaluation of text from different perspectives (e.g. informativeness, fluency, or factuality). BARTScore is conceptually simple and empirically effective. It can outperform existing top-scoring metrics in 16 of 22 test settings, covering evaluation of 16 datasets (e.g., machine translation, text summarization) and 7 different perspectives (e.g., informativeness, factuality). Code to calculate BARTScore is available at https://github.com/neulab/BARTScore, and we have released an interactive leaderboard for meta-evaluation at http://explainaboard.nlpedia.ai/leaderboard/task-meval/ on the ExplainaBoard platform, which allows us to interactively understand the strengths, weaknesses, and complementarity of each metric.
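
The scoring idea in the abstract can be sketched in a few lines. BARTScore treats the quality of a target text as its weighted log-likelihood under a sequence-to-sequence model conditioned on some other text: the source, the reference, or the generated text itself, depending on the perspective. The sketch below is illustrative only and is not the authors' implementation. `toy_prob_fn` is a hypothetical stand-in for a real pre-trained model such as BART, and every function name here is an assumption made for the example.

```python
import math
from typing import Callable, Optional, Sequence

# Hypothetical interface: given (conditioning tokens, target tokens), return
# p(y_t | y_<t, x) for every target token. A real system would obtain these
# probabilities from a pre-trained seq2seq model such as BART.
TokenProbFn = Callable[[Sequence[str], Sequence[str]], Sequence[float]]


def weighted_log_likelihood(
    token_probs: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum of w_t * log p(y_t | y_<t, x); uniform weights by default."""
    if weights is None:
        weights = [1.0] * len(token_probs)
    if len(weights) != len(token_probs):
        raise ValueError("weights and token_probs must have the same length")
    return sum(w * math.log(p) for w, p in zip(weights, token_probs))


def bartscore(
    prob_fn: TokenProbFn,
    condition: Sequence[str],
    target: Sequence[str],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Score `target` by how likely the model is to generate it from `condition`."""
    return weighted_log_likelihood(prob_fn(condition, target), weights)


def bartscore_f(
    prob_fn: TokenProbFn, reference: Sequence[str], hypothesis: Sequence[str]
) -> float:
    """Average of both directions: reference -> hypothesis and hypothesis -> reference."""
    precision = bartscore(prob_fn, reference, hypothesis)
    recall = bartscore(prob_fn, hypothesis, reference)
    return 0.5 * (precision + recall)


def toy_prob_fn(condition: Sequence[str], target: Sequence[str]) -> Sequence[float]:
    """Toy stand-in for a model: tokens seen in the conditioning text are likely."""
    seen = set(condition)
    return [0.9 if tok in seen else 0.1 for tok in target]


source = "the cat sat on the mat".split()
faithful = "the cat sat".split()
unfaithful = "the dog ran".split()

score_faithful = bartscore(toy_prob_fn, source, faithful)
score_unfaithful = bartscore(toy_prob_fn, source, unfaithful)
print(score_faithful > score_unfaithful)
```

Scores are log-probabilities, so they are negative and a higher (less negative) value indicates a better text. Swapping the roles of `condition` and `target` gives the different directions the abstract alludes to with "to/from a reference output or the source text".
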

Submitted to arXiv on 22 Jun. 2021


Results of the summarizing process for the arXiv paper: 2106.11520v1

In this work, the authors focus on evaluating the quality of generated text in natural language processing (NLP) applications such as machine translation, summarization, and dialog. They propose BARTScore, a metric that uses a pre-trained sequence-to-sequence model to evaluate the fluency, accuracy, and effectiveness of generated text. The metric is built on BART, an encoder-decoder pre-trained model. The authors report that BARTScore outperforms existing top-scoring metrics in 16 of 22 test settings, covering 16 datasets and 7 different perspectives.

The authors analyze BARTScore's performance on text summarization in particular. Compared with metrics such as BERTScore and MoverScore, vanilla BARTScore performs significantly better in most settings, the exception being the informativeness perspective on the SummEval dataset. Introducing a fine-tuning task further improves performance on some datasets but not on others.

They also analyze how well BARTScore evaluates the factuality of short generated summaries, compared with human baselines and with other factuality metrics such as FactCC and QAGS. Here, fine-tuning BART on the CNN/DailyMail summarization dataset improves BARTScore's performance significantly relative to the other metrics.

The authors discuss human evaluation as the gold standard for assessing text quality from different perspectives, such as informativeness, relevance, fluency, coherence, factuality, semantic coverage, and adequacy. Existing automatic metrics were designed to cover only a subset of these perspectives, or require separate judgments for each type. BARTScore aims to address these limitations with a single approach that can be applied across perspectives.

Overall, this work presents a conceptually simple method for evaluating generated text with pre-trained models like BART, and introduces the BARTScore metric, which outperforms existing top-scoring metrics in most of the tested settings. The authors provide code to calculate BARTScore and an interactive leaderboard for meta-evaluation, allowing researchers to understand the strengths, weaknesses, and complementarity of different evaluation metrics.
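
Meta-evaluation of this kind is commonly done by checking how well a metric's scores agree with human judgments on the same outputs. As a minimal sketch, assuming Spearman rank correlation as the agreement measure (the summary above does not state which measures were used in each setting), the comparison could look like the following. The human ratings and metric scores are made-up illustrative values, not results from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values only: human ratings for five system outputs, and the
# scores two hypothetical metrics assigned to the same outputs.
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_a = [-4.0, -3.5, -2.0, -1.5, -0.5]  # orders outputs like the humans do
metric_b = [-1.0, -2.0, -3.0, -4.0, -5.0]  # orders outputs in reverse

corr_a = spearman(human_ratings, metric_a)
corr_b = spearman(human_ratings, metric_b)
print(corr_a, corr_b)
```

A metric whose correlation with human ratings is higher is considered the better metric for that dataset and perspective; counting the settings in which one metric has the highest correlation yields summaries of the "16 of 22 settings" kind.
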
Created on 26 Sep. 2023

