The paper introduces GEMBA, a metric for assessing translation quality using GPT-based models. The authors evaluate the metric through zero-shot prompting, comparing four prompt variants in two modes: with and without a reference translation. They investigate seven versions of GPT models, including ChatGPT, and find that their method only works effectively with GPT 3.5 and larger models. Comparing their results to the WMT22 Metrics shared task, they achieve state-of-the-art accuracy in both modes when compared to MQM-based human labels for three language pairs: English into German, English into Russian, and Chinese into English. This study highlights the potential of pre-trained large language models for assessing translation quality. The authors release their code, prompt templates, and scoring results for external validation and reproducibility. In conclusion, the authors plan to continue researching the application of GPT models for quality assessment by exploring few-shot learning and model fine-tuning to improve accuracy. They also suggest modifying prompts to support error-based evaluation or post-editing efforts as potential areas for improvement. Additionally, they mention the possibility of using GPT-enhanced metrics for document-level evaluation due to their ability to handle larger context windows which is an area that currently lacks sufficient research.
- Introducing GEMBA, a metric for assessing translation quality using GPT-based models
- Evaluation of the metric through zero-shot prompting with four prompt variants and two modes (with and without reference translation)
- Effective performance of the method with GPT 3.5 and larger models
- State-of-the-art accuracy compared to MQM-based human labels in English-German, English-Russian, and Chinese-English translation pairs
- Potential of pre-trained large language models for assessing translation quality
- Release of code, prompt templates, and scoring results for external validation and reproducibility
- Future research plans include exploring few-shot learning, model fine-tuning, error-based evaluation or post-editing efforts as areas for improvement
- Possibility of using GPT-enhanced metrics for document-level evaluation due to their ability to handle larger context windows
GEMBA is a way to measure how good translations are using special computer models. They tested GEMBA by asking the models, in a few different ways, to rate translations, and then checking whether the ratings matched what human judges said. It only worked well with big computer models like GPT 3.5. GEMBA agreed with the human judges better than other automatic measures for English-German, English-Russian, and Chinese-English translations. They shared the code and results so other people can check if it works too. In the future, they want to make it even better by trying new things like learning from a little bit of information or making small changes to the model. They also think GEMBA could be used for checking whole documents instead of just one sentence.
Definitions
- Metric: A way to measure something.
- Translation: Changing words from one language into another language.
- Models: Special computer programs that can do specific tasks.
- Reference translation: A correct translation that is used for comparing with other translations.
- Pre-trained: Already taught or programmed beforehand.
- Validation: Checking if something works correctly.
- Reproducibility: Being able to do something again in the same way.
- Few-shot learning: Learning from only a small amount of information.
- Fine-tuning: Making small changes to improve something.
- Error-based evaluation: Checking for mistakes or errors in something.
- Post-editing efforts: Making changes or improvements after something has been done.
- Document-level evaluation: Checking a whole document instead of just part of it.
Exploring GEMBA: A Metric for Assessing Translation Quality Using GPT-Based Models
In recent years, natural language processing (NLP) has advanced rapidly. This progress is largely due to pre-trained large language models such as OpenAI's Transformer-based Generative Pre-trained Transformer (GPT) family. These models can produce high-quality translations and have been used in a variety of applications.
Recently, researchers from Microsoft proposed a metric called GEMBA (GPT Estimation Metric Based Assessment) for assessing translation quality using GPT-based models. In their paper, they evaluate the metric through zero-shot prompting, comparing four prompt variants in two modes: with and without a reference translation. They investigate seven versions of GPT models, including ChatGPT, and find that their method only works effectively with GPT 3.5 and larger models. Comparing their results to the WMT22 Metrics shared task, they achieve state-of-the-art accuracy in both modes when compared to MQM (Multidimensional Quality Metrics)-based human labels for three language pairs: English into German, English into Russian, and Chinese into English.
Methodology
The authors use zero-shot prompting: the model receives a natural-language instruction together with the source segment and the candidate translation, and it returns a quality judgment without seeing any worked examples and without any additional training or fine-tuning. They compare four prompt variants. Two ask for a numeric score (GEMBA-DA, modeled on direct assessment, and GEMBA-SQM, modeled on scalar quality metrics), and two ask for a label (GEMBA-stars, a one-to-five-star rating, and GEMBA-classes, a set of quality classes). Each variant is run in two modes: with a human reference translation, which makes it a reference-based metric, and without one, which makes it a quality-estimation metric. The authors also investigate seven versions of GPT models, including ChatGPT, a model tuned for conversational use.
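The zero-shot setup described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the template wording, the function names `build_prompt` and `parse_score`, and the 0 to 100 scale are assumptions in the style of a direct-assessment prompt. The call to the language model itself is deliberately left out, so the sketch only shows how the two modes differ and how a free-text reply might be turned into a number.

```python
import re
from typing import Optional


def build_prompt(source: str, hypothesis: str, src_lang: str, tgt_lang: str,
                 reference: Optional[str] = None) -> str:
    """Build a zero-shot scoring prompt (hypothetical wording).

    With `reference` set, the prompt is reference-based; without it,
    the prompt asks for a reference-free (quality estimation) judgment.
    """
    with_ref = " with respect to the human reference" if reference is not None else ""
    lines = [
        f"Score the following translation from {src_lang} to {tgt_lang}{with_ref} "
        "on a continuous scale from 0 to 100, where 0 means no meaning preserved "
        "and 100 means perfect meaning and grammar.",
        "",
        f'{src_lang} source: "{source}"',
    ]
    if reference is not None:
        lines.append(f'{tgt_lang} human reference: "{reference}"')
    lines.append(f'{tgt_lang} translation: "{hypothesis}"')
    lines.append("Score:")
    return "\n".join(lines)


def parse_score(reply: str) -> Optional[float]:
    """Extract the first number from the model's reply.

    Returns None when no number is found or when it falls outside 0..100,
    so the caller can decide how to handle unusable answers.
    """
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    score = float(match.group())
    return score if 0.0 <= score <= 100.0 else None


if __name__ == "__main__":
    print(build_prompt("Hello", "Hallo", "English", "German", reference="Hallo!"))
    print(parse_score("Score: 85"))
```

Returning `None` for unparseable replies, instead of a default score, keeps invalid answers visible to whatever aggregation step follows.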
Results
The results show that the method works effectively only with GPT 3.5 and larger models; smaller models do not perform as well. Compared with the results of the WMT22 Metrics shared task, GEMBA achieves state-of-the-art accuracy in both modes when tested against MQM-based human labels for three language pairs: English into German, English into Russian, and Chinese into English.
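To make the notion of accuracy concrete, the sketch below assumes it is a system-level pairwise accuracy: for every pair of translation systems, the metric counts as correct if it ranks the two systems in the same order as the human scores do. The function name `pairwise_accuracy`, the system names, and all numbers are made up for illustration, and both score sets are assumed to be oriented so that higher means better.

```python
from itertools import combinations
from typing import Dict


def pairwise_accuracy(metric_scores: Dict[str, float],
                      human_scores: Dict[str, float]) -> float:
    """Fraction of system pairs that the metric ranks like the humans do.

    Assumes higher is better in both dictionaries. A tie in either
    score set is counted as a disagreement.
    """
    systems = sorted(set(metric_scores) & set(human_scores))
    agree = 0
    total = 0
    for a, b in combinations(systems, 2):
        metric_delta = metric_scores[a] - metric_scores[b]
        human_delta = human_scores[a] - human_scores[b]
        total += 1
        # Same strict sign means both rankings prefer the same system.
        if metric_delta * human_delta > 0:
            agree += 1
    if total == 0:
        raise ValueError("need at least two systems scored by both sides")
    return agree / total


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    metric = {"sysA": 80.0, "sysB": 75.0, "sysC": 60.0}
    human = {"sysA": 0.90, "sysB": 0.95, "sysC": 0.50}
    print(pairwise_accuracy(metric, human))
```

In this toy example the metric agrees with the human ranking on two of the three system pairs (sysA vs. sysC and sysB vs. sysC) and disagrees on sysA vs. sysB.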
Conclusion & Future Work
In conclusion, the authors plan to continue researching the application of GPT-based models to quality assessment by exploring few-shot learning and model fine-tuning, which could improve accuracy further. They also suggest modifying the prompt templates to support error-based evaluation or post-editing efforts. Lastly, they mention the possibility of using GPT-enhanced metrics for document-level evaluation, since these models can handle larger context windows; this area currently lacks sufficient research. The authors release their code, prompt templates, and scoring results publicly to support external validation and reproducibility.