Large Language Models Are State-of-the-Art Evaluators of Translation Quality

AI-generated keywords: GEMBA, GPT-based models, zero-shot prompting, WMT22 Metrics, document-level evaluation

AI-generated Key Points

  • Introducing GEMBA, a metric for assessing translation quality using GPT-based models
  • Evaluation of the metric through zero-shot prompting with four prompt variants and two modes (with and without reference translation)
  • The method works only with GPT 3.5 and larger models
  • State-of-the-art system-level accuracy against MQM-based human labels for English into German, English into Russian, and Chinese into English
  • Potential of pre-trained large language models for assessing translation quality
  • Release of code, prompt templates, and scoring results for external validation and reproducibility
  • Future research plans include few-shot learning and model fine-tuning, as well as prompt modifications to support error-based evaluation or post-editing
  • Possibility of using GPT-enhanced metrics for document-level evaluation due to their ability to handle larger context windows

Authors: Tom Kocmi, Christian Federmann

10 pages, 8 tables, one figure
License: CC BY 4.0

Abstract: We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate seven versions of GPT models, including ChatGPT. We show that our method for translation quality assessment only works with GPT 3.5 and larger models. Comparing to results from WMT22's Metrics shared task, our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all our code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.
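The abstract reports accuracy on the system level without defining it. A minimal sketch follows, assuming it refers to pairwise accuracy: the share of system pairs that the metric ranks in the same order as the human labels. The function name, the example system names, and the scores are hypothetical, and both score sets are assumed to be oriented so that higher means better.

```python
from itertools import combinations
from typing import Dict


def pairwise_accuracy(metric: Dict[str, float], human: Dict[str, float]) -> float:
    """Fraction of system pairs ranked in the same order by metric and humans.

    Assumption: higher is better in both dictionaries. Ties in either
    score count as disagreements in this sketch.
    """
    systems = sorted(set(metric) & set(human))
    agree = 0
    total = 0
    for a, b in combinations(systems, 2):
        metric_delta = metric[a] - metric[b]
        human_delta = human[a] - human[b]
        total += 1
        # Same sign of both differences means the pair is ranked consistently.
        if metric_delta * human_delta > 0:
            agree += 1
    if total == 0:
        raise ValueError("need at least two systems scored by both sources")
    return agree / total


# Hypothetical system-level scores, for illustration only.
human_scores = {"sysA": 0.9, "sysB": 0.5, "sysC": 0.1}
metric_scores = {"sysA": 80.0, "sysB": 85.0, "sysC": 40.0}
accuracy = pairwise_accuracy(metric_scores, human_scores)
```

In this toy example the metric swaps sysA and sysB but orders the other two pairs correctly, so two of three pairs agree.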

Submitted to arXiv on 28 Feb. 2023


Results of the summarizing process for the arXiv paper: 2302.14520v1

The paper introduces GEMBA, a metric for assessing translation quality using GPT-based models. The authors evaluate the metric through zero-shot prompting, comparing four prompt variants in two modes: with and without a reference translation. They investigate seven versions of GPT models, including ChatGPT, and find that their method only works with GPT 3.5 and larger models. Compared with the results of the WMT22 Metrics shared task, GEMBA achieves state-of-the-art system-level accuracy in both modes when measured against MQM-based human labels for three language pairs: English into German, English into Russian, and Chinese into English. This study highlights the potential of pre-trained large language models for assessing translation quality. The authors release their code, prompt templates, and scoring results for external validation and reproducibility.

For future work, the authors plan to continue researching the application of GPT models to quality assessment by exploring few-shot learning and model fine-tuning to improve accuracy. They also suggest modifying the prompts to support error-based evaluation or post-editing efforts. Finally, they mention the possibility of using GPT-enhanced metrics for document-level evaluation, an area that currently lacks sufficient research, because these models can handle larger context windows.
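The two evaluation modes described above can be sketched as a single prompt builder that optionally includes the reference. This is only an illustration: the prompt wording, the 0 to 100 scale, and the names `build_prompt` and `parse_score` are assumptions made for this sketch, not the authors' released templates, and no model is actually called.

```python
import re
from typing import Optional


def build_prompt(source_lang: str, target_lang: str, source: str,
                 hypothesis: str, reference: Optional[str] = None) -> str:
    """Build a zero-shot scoring prompt (illustrative wording only)."""
    with_ref = reference is not None
    lines = [
        f"Score the following translation from {source_lang} to {target_lang}"
        + (" with respect to the human reference" if with_ref else "")
        + " on a scale from 0 to 100, where 0 means no meaning is preserved"
        " and 100 means perfect meaning and grammar.",
        "",
        f"{source_lang} source: {source}",
    ]
    if with_ref:
        # Reference-based mode: show the human translation to the model.
        lines.append(f"{target_lang} human reference: {reference}")
    lines.append(f"{target_lang} translation: {hypothesis}")
    lines.append("Score:")
    return "\n".join(lines)


def parse_score(reply: str) -> Optional[float]:
    """Extract the first number from a model reply; None if unusable."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    # Reject values outside the scale this sketch assumes.
    return value if 0.0 <= value <= 100.0 else None


# Reference-free mode simply omits the `reference` argument.
prompt_no_ref = build_prompt("English", "German", "Good morning.", "Guten Morgen.")
prompt_with_ref = build_prompt("English", "German", "Good morning.",
                               "Guten Morgen.", reference="Guten Morgen!")
```

Segment scores parsed this way would then need to be aggregated per system, for example by averaging, before any system-level comparison; that aggregation step is also an assumption of this sketch.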
Created on 08 Sep. 2023

