BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text
AI-generated Key Points
- The research explores the effectiveness of automatic quality metrics in detecting sentiment-critical machine translation errors.
- Three canonical metrics (BLEU, METEOR, and BERTScore) are compared in terms of their performance on meaningless translations and meaningful translations with critical errors that distort sentiment.
- Current metrics are not sensitive enough to penalize sentiment-critical errors and can even give high scores to mistranslations.
- Fine-tuning these metrics is necessary to accurately assess sentiment-oriented text.
- Developing a sentiment-targeted evaluation measure is suggested to address this issue.
- BLEU scores the n-gram overlap between candidate and reference translations, while METEOR also matches stems and synonyms to bring in semantic information and applies importance weighting.
- Both BLEU and METEOR have limitations in assessing sentiment-critical errors.
- The authors emphasize that quality metrics need to capture the sentiment-critical lexicon in order to perform well on sentiment-oriented text.
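The point about n-gram metrics can be illustrated with a small sketch. The code below is a simplified, sentence-level BLEU-style score (clipped unigram and bigram precision, geometric mean, brevity penalty), not the exact configuration evaluated in the paper. The function names and the example sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of n-gram precisions times a brevity penalty."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    precisions = [clipped_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch: any zero precision gives zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_mean)


# Hypothetical sentences, not taken from the paper's data.
reference = "I am not happy with this service"
flipped = "I am happy with this service"   # fluent, but the sentiment is reversed
scrambled = "service the happy am table"   # meaningless word salad

print(simple_bleu(flipped, reference))
print(simple_bleu(scrambled, reference))
```

Under these assumptions, the candidate that drops "not" keeps nearly all of its n-grams in common with the reference, so it scores far above the scrambled candidate even though it reverses the sentiment. This is the kind of insensitivity the key points describe.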
Authors: Hadeel Saadany, Constantin Orasan
Abstract: Social media companies as well as authorities make extensive use of artificial intelligence (AI) tools to monitor postings of hate speech, celebrations of violence or profanity. Since AI software requires massive volumes of data to train computers, Machine Translation (MT) of the online content is commonly used to process posts written in several languages and hence augment the data needed for training. However, MT mistakes are a regular occurrence when translating sentiment-oriented user-generated content (UGC), especially when a low-resource language is involved. The adequacy of the whole process relies on the assumption that the evaluation metrics used give a reliable indication of the quality of the translation. In this paper, we assess the ability of automatic quality metrics to detect critical machine translation errors which can cause serious misunderstanding of the affect message. We compare the performance of three canonical metrics on meaningless translations where the semantic content is seriously impaired as compared to meaningful translations with a critical error which exclusively distorts the sentiment of the source text. We conclude that there is a need for fine-tuning of automatic metrics to make them more robust in detecting sentiment-critical errors.