BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text

AI-generated keywords: Sentiment-critical errors, Quality metrics, BLEU, METEOR, BERTScore

AI-generated Key Points

  • The research explores the effectiveness of automatic quality metrics in detecting sentiment-critical machine translation errors.
  • Three canonical metrics (BLEU, METEOR, and BERTScore) are compared in terms of their performance on meaningless translations and meaningful translations with critical errors that distort sentiment.
  • Current metrics are not sensitive enough to penalize sentiment-critical errors and can even give high scores to mistranslations.
  • Fine-tuning these metrics is necessary to accurately assess sentiment-oriented text.
  • Developing a sentiment-targeted evaluation measure is suggested to address this issue.
  • BLEU compares n-grams between candidate and reference translations, while METEOR incorporates semantic information and applies importance weighting.
  • Both BLEU and METEOR have limitations in assessing sentiment-critical errors.
  • The authors emphasize improving quality metrics so that they capture the sentiment-critical lexicon, which would enhance their performance on sentiment-oriented text.
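
To make the n-gram point concrete, here is a minimal sketch of the clipped n-gram precision that BLEU is built on. It is not the full BLEU score (there is no brevity penalty and no geometric mean over n-gram orders), and the three sentences are invented for illustration, not taken from the paper. It suggests why a translation that drops a negation can still overlap heavily with the reference, while a meaningless one does not.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped
    to the number of times each n-gram occurs in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Invented example sentences (assumptions, not data from the paper).
reference = "i do not like this movie at all"
flipped = "i do like this movie at all"    # negation dropped: sentiment reversed
garbled = "movie the at like all green"    # meaningless word salad

for name, sentence in [("flipped", flipped), ("garbled", garbled)]:
    p1 = clipped_precision(sentence, reference, 1)
    p2 = clipped_precision(sentence, reference, 2)
    print(f"{name}: unigram precision={p1:.2f}, bigram precision={p2:.2f}")
```

In this toy case the sentiment-reversing candidate keeps every unigram and five of its six bigrams, so an overlap-based score barely registers the error, whereas the garbled candidate shares no bigrams with the reference.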

Authors: Hadeel Saadany, Constantin Orasan

TRITON (2021) 48-56
Accepted for TRITON (TRanslation and Interpreting Technology ONline) 2021
License: CC BY 4.0

Abstract: Social media companies as well as authorities make extensive use of artificial intelligence (AI) tools to monitor postings of hate speech, celebrations of violence or profanity. Since AI software requires massive volumes of data to train computers, Machine Translation (MT) of the online content is commonly used to process posts written in several languages and hence augment the data needed for training. However, MT mistakes are a regular occurrence when translating sentiment-oriented user-generated content (UGC), especially when a low-resource language is involved. The adequacy of the whole process relies on the assumption that the evaluation metrics used give a reliable indication of the quality of the translation. In this paper, we assess the ability of automatic quality metrics to detect critical machine translation errors which can cause serious misunderstanding of the affect message. We compare the performance of three canonical metrics on meaningless translations where the semantic content is seriously impaired as compared to meaningful translations with a critical error which exclusively distorts the sentiment of the source text. We conclude that there is a need for fine-tuning of automatic metrics to make them more robust in detecting sentiment critical errors.

Submitted to arXiv on 29 Sep. 2021


Results of the summarizing process for the arXiv paper: 2109.14250v1

In this research, the authors explore how effectively automatic quality metrics detect sentiment-critical machine translation errors in user-generated content (UGC). They compare three canonical metrics (BLEU, METEOR, and BERTScore) on two kinds of output: meaningless translations, and meaningful translations containing a critical error that exclusively distorts the sentiment. The results show that current metrics are not sensitive enough to penalize sentiment-critical errors and can even give high scores to mistranslations. This highlights the need to fine-tune these metrics so that they assess sentiment-oriented text accurately, and the authors suggest developing a sentiment-targeted evaluation measure to address the issue. BLEU compares n-grams between candidate and reference translations, while METEOR incorporates semantic information and applies importance weighting. However, both metrics have limitations in assessing sentiment-critical errors. The authors emphasize the importance of improving quality metrics to capture the sentiment-critical lexicon and so enhance their performance on sentiment-oriented text.
Created on 10 Jul. 2023
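
The summary mentions a sentiment-targeted evaluation measure without describing one. As a purely hypothetical illustration of the idea, and not the authors' proposal, the sketch below flags a candidate whose lexicon-based polarity is opposite to that of the reference. The word lists, the negation rule, and the function names are all assumptions made up for this example.

```python
# Toy lexicons: invented for illustration, not taken from the paper.
POSITIVE = {"like", "love", "great", "happy"}
NEGATIVE = {"hate", "awful", "sad", "angry"}
NEGATORS = {"not", "never", "no"}


def polarity(sentence):
    """Sum word polarities; a negator flips the next sentiment word found."""
    score = 0
    negate = False
    for token in sentence.lower().split():
        if token in NEGATORS:
            negate = True
            continue
        value = 1 if token in POSITIVE else -1 if token in NEGATIVE else 0
        if value:
            score += -value if negate else value
            negate = False
    return score


def sentiment_mismatch(candidate, reference):
    """True when candidate and reference have opposite (non-zero) polarity."""
    return polarity(candidate) * polarity(reference) < 0


# Invented example pair: the candidate drops the negation.
reference = "i do not like this movie at all"
candidate = "i do like this movie at all"
print(sentiment_mismatch(candidate, reference))
```

A check of this kind could be combined with an overlap-based score as an extra penalty term, but a fixed word list is far too crude for real user-generated content; it only shows the kind of signal that n-gram overlap alone does not capture.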


