Large Language Models Are State-of-the-Art Evaluators of Translation Quality

AI-generated keywords: GEMBA GPT-based models zero-shot prompting WMT22 Metrics document-level evaluation

AI-generated Key Points

Introducing GEMBA, a metric for assessing translation quality using GPT-based models
Evaluation of the metric through zero-shot prompting with four prompt variants and two modes (with and without reference translation)
The method works only with GPT 3.5 and larger models
State-of-the-art accuracy compared to MQM-based human labels in English-German, English-Russian, and Chinese-English translation pairs
Potential of pre-trained large language models for assessing translation quality
Release of code, prompt templates, and scoring results for external validation and reproducibility
Future research plans include exploring few-shot learning, model fine-tuning, error-based evaluation or post-editing efforts as areas for improvement
Possibility of using GPT-enhanced metrics for document-level evaluation due to their ability to handle larger context windows


Authors: Tom Kocmi, Christian Federmann

arXiv: 2302.14520v1 (cs.CL)

10 pages, 8 tables, one figure

License: CC BY 4.0

Abstract: We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate seven versions of GPT models, including ChatGPT. We show that our method for translation quality assessment only works with GPT 3.5 and larger models. Comparing to results from WMT22's Metrics shared task, our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all our code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.

Submitted to arXiv on 28 Feb. 2023


Results of the summarizing process for the arXiv paper: 2302.14520v1

Comprehensive Summary

The paper introduces GEMBA, a metric for assessing translation quality using GPT-based models. The authors evaluate the metric through zero-shot prompting, comparing four prompt variants in two modes: with and without a reference translation. They investigate seven versions of GPT models, including ChatGPT, and find that their method only works with GPT 3.5 and larger models. Compared with the results of the WMT22 Metrics shared task, GEMBA achieves state-of-the-art accuracy in both modes when measured against MQM-based human labels. The results hold at the system level for all three language pairs of the shared task: English into German, English into Russian, and Chinese into English. The study offers a first glimpse of the potential of pre-trained, generative large language models for assessing translation quality. The authors release their code, prompt templates, and scoring results for external validation and reproducibility.

The authors plan to continue researching the application of GPT models to quality assessment by exploring few-shot learning and model fine-tuning to improve accuracy. They also suggest modifying the prompts to support error-based evaluation or post-editing efforts. Finally, they mention the possibility of using GPT-enhanced metrics for document-level evaluation, an area that currently lacks sufficient research, because these models can handle larger context windows.


Layman's Summary

GEMBA is a way to measure how good translations are using large AI language models. The researchers tested it by asking the models, in four differently worded ways, to rate translations, without showing them any worked examples first. It only worked well with big models such as GPT 3.5 and larger. For English-German, English-Russian, and Chinese-English translations, GEMBA's ratings of translation systems agreed with the judgments of human experts better than earlier automatic methods did. The researchers shared the code and results so other people can check whether it works too. In the future, they want to make it even better by trying new things, like showing the model a few examples first or training the model a little more for this task. They also think GEMBA could be used for checking whole documents instead of just one sentence.

Definitions

- Metric: A way to measure something.
- Translation: Changing words from one language into another language.
- Models: Special computer programs that can do specific tasks.
- Reference translation: A correct translation that is used for comparing with other translations.
- Pre-trained: Already trained on a large amount of text beforehand.
- Validation: Checking if something works correctly.
- Reproducibility: Being able to do something again in the same way and get the same result.
- Few-shot learning: Learning from only a handful of examples.
- Fine-tuning: Training an existing model a little more so that it does a specific task better.
- Error-based evaluation: Judging something by finding and counting the mistakes in it.
- Post-editing efforts: The work needed to correct a translation after it has been produced.
- Document-level evaluation: Checking a whole document instead of just part of it.

Exploring GEMBA: A Metric for Assessing Translation Quality Using GPT-Based Models

In recent years, natural language processing (NLP) has advanced rapidly, largely thanks to pre-trained large language models such as OpenAI's Generative Pre-trained Transformer (GPT) family. These models can produce high-quality translations and have been used in a variety of applications. In this paper, Tom Kocmi and Christian Federmann of Microsoft propose GEMBA (GPT Estimation Metric Based Assessment), a metric for assessing translation quality using GPT-based models. They evaluate the metric through zero-shot prompting, comparing four prompt variants in two modes: with and without a reference translation. They investigate seven versions of GPT models, including ChatGPT, and find that their method only works with GPT 3.5 and larger models. Compared with the results of the WMT22 Metrics shared task, GEMBA achieves state-of-the-art accuracy in both modes when measured against human labels based on MQM (Multidimensional Quality Metrics) for three language pairs: English into German, English into Russian, and Chinese into English.

Methodology

The authors use zero-shot prompting: the model receives only an instruction, the source sentence, the candidate translation and, optionally, a human reference translation, and is asked to rate the translation. The prompt contains no worked examples, and the model is not fine-tuned. They compare four prompt variants: two that ask for a score from 0 to 100 (one modeled on direct assessment, one on scalar quality metrics) and two that ask for a label (a rating of one to five stars, or one of five quality classes). Each variant is run in two modes, with and without a reference translation, to assess how well it performs under each condition. The authors also investigate seven versions of GPT models of different sizes, including ChatGPT, a model tuned for conversational use.
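To make the zero-shot setup more concrete, here is a minimal Python sketch of how a scoring prompt for the two modes might be assembled and how a model's reply might be turned into a number. The template wording is an illustrative paraphrase and not the authors' released template, the function names `build_prompt` and `parse_score` are hypothetical, and no model API is called.

```python
import re
from typing import Optional


def build_prompt(source: str, hypothesis: str, src_lang: str, tgt_lang: str,
                 reference: Optional[str] = None) -> str:
    """Build a zero-shot scoring prompt; no worked examples are included."""
    # Assumption: the instruction text below is illustrative wording only.
    ref_clause = " with respect to the human reference" if reference is not None else ""
    lines = [
        f"Score the following translation from {src_lang} to {tgt_lang}"
        f"{ref_clause} on a continuous scale from 0 to 100, where 0 means "
        "no meaning preserved and 100 means perfect meaning and grammar.",
        "",
        f'{src_lang} source: "{source}"',
    ]
    # The reference line is only present in the reference-based mode.
    if reference is not None:
        lines.append(f'{tgt_lang} human reference: "{reference}"')
    lines.append(f'{tgt_lang} translation: "{hypothesis}"')
    lines.append("Score:")
    return "\n".join(lines)


def parse_score(answer: str) -> Optional[float]:
    """Return the first number in the reply if it lies in [0, 100], else None."""
    match = re.search(r"\d+(?:\.\d+)?", answer)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 100.0 else None
```

Leaving `reference` as `None` gives the reference-free mode, while passing a reference string gives the reference-based mode. Returning `None` for replies that contain no usable number lets the caller decide how to handle invalid answers, for example by retrying.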

Results

The results show that the method only works with GPT 3.5 and larger models; smaller models do not produce useful quality assessments. Compared against the WMT22 Metrics shared task results, GEMBA achieves state-of-the-art accuracy in both modes when measured against MQM-based human labels. These results hold at the system level for all three language pairs: English into German, English into Russian, and Chinese into English.
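As a rough illustration of what system-level accuracy can mean, the sketch below assumes a pairwise ranking accuracy of the kind used in the WMT Metrics shared tasks: the fraction of system pairs that the metric orders the same way as the human labels. The function name `pairwise_accuracy` and all scores are made up for illustration and are not results from the paper.

```python
from itertools import combinations
from typing import Dict


def pairwise_accuracy(metric: Dict[str, float], human: Dict[str, float]) -> float:
    """Fraction of system pairs that the metric ranks in the same order as humans."""
    systems = sorted(set(metric) & set(human))
    agree = 0
    total = 0
    for a, b in combinations(systems, 2):
        metric_delta = metric[a] - metric[b]
        human_delta = human[a] - human[b]
        total += 1
        # Agreement means both differences have the same sign.
        # Assumption: ties in either score count as disagreement here.
        if metric_delta * human_delta > 0:
            agree += 1
    if total == 0:
        raise ValueError("need at least two systems scored by both metric and humans")
    return agree / total


# Hypothetical system-level scores (higher is better in both dictionaries).
metric_scores = {"sysA": 82.0, "sysB": 75.5, "sysC": 78.0}
human_scores = {"sysA": -1.2, "sysB": -2.5, "sysC": -3.0}
print(pairwise_accuracy(metric_scores, human_scores))
```

In this toy example the metric agrees with the human ranking on the pairs (sysA, sysB) and (sysA, sysC) but not on (sysB, sysC), so two of the three pairs count as correct.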

Conclusion & Future Work

In conclusion, the authors plan to continue researching the application of GPT-based models to quality assessment by exploring few-shot learning and model fine-tuning, which could improve accuracy further. They also suggest modifying the prompt templates to support error-based evaluation or post-editing efforts. Lastly, they mention the possibility of using GPT-enhanced metrics for document-level evaluation, since these models can handle larger context windows; this area currently lacks sufficient research. The authors publicly release their code, prompt templates, and scoring results to allow external validation and reproducibility.

Created on 08 Sep. 2023


