What do Large Language Models Need for Machine Translation Evaluation?

AI-generated keywords: Large language models, Natural language processing, Machine translation, Quality evaluation, Prompting techniques

AI-generated Key Points

  • Large language models (LLMs) have advanced natural language processing tasks such as machine translation, text summarization, and information retrieval.
  • These advances are attributed to the stronger natural language understanding and contextual awareness of LLMs.
  • Traditional approaches for evaluating machine translation quality include metrics like BLEU and fine-tuning multilingual pre-trained language models (PTLMs) on human evaluation data.
  • Recent studies have explored prompting techniques to instruct LLMs for translation quality evaluation without training.
  • Reference translations are crucial for accurate evaluations using LLMs across different languages.
  • Larger LLM variants tend to benefit more from Chain-of-Thought (CoT) prompting than smaller ones, but model size doesn't always correlate with performance.
  • A 7-billion parameter model outperformed other models for most languages, indicating that larger size doesn't always mean better performance.
  • COMET models fine-tuned on multilingual data generally outperformed LLM prompting results in MT quality evaluation.
  • The study emphasizes the importance of reference translations and provides insights into factors influencing LLM-based MT evaluation.

Authors: Shenbin Qian, Archchana Sindhujan, Minnie Kabra, Diptesh Kanojia, Constantin Orăsan, Tharindu Ranasinghe, Frédéric Blain

License: CC BY 4.0

Abstract: Leveraging large language models (LLMs) for various natural language processing tasks has led to superlative claims about their performance. For the evaluation of machine translation (MT), existing research shows that LLMs are able to achieve results comparable to fine-tuned multilingual pre-trained language models. In this paper, we explore what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate MT quality. In addition, we investigate prompting techniques such as zero-shot, Chain of Thought (CoT) and few-shot prompting for eight language pairs covering high-, medium- and low-resource languages, leveraging varying LLM variants. Our findings indicate the importance of reference translations for an LLM-based evaluation. While larger models do not necessarily fare better, they tend to benefit more from CoT prompting than smaller models. We also observe that LLMs do not always provide a numerical score when generating evaluations, which poses a question on their reliability for the task. Our work presents a comprehensive analysis for resource-constrained and training-less LLM-based evaluation of machine translation. We release the accrued prompt templates, code and data publicly for reproducibility.
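
The abstract lists the kinds of information an LLM may be given (source, reference, translation errors, annotation guidelines) and notes that responses do not always contain a numerical score. The Python sketch below illustrates both points under stated assumptions: the function names `build_prompt` and `extract_score`, the prompt wording, and the 0 to 100 scale are hypothetical and are not the authors' released templates or code.

```python
import re
from typing import Optional

def build_prompt(translation: str,
                 source: Optional[str] = None,
                 reference: Optional[str] = None,
                 errors: Optional[str] = None,
                 guidelines: Optional[str] = None,
                 chain_of_thought: bool = False) -> str:
    """Assemble an evaluation prompt from whichever pieces are available.

    The wording is illustrative only; it is not a template from the paper.
    """
    parts = ["Rate the quality of the machine translation on a scale from 0 to 100."]
    if guidelines:
        parts.append(f"Annotation guidelines: {guidelines}")
    if source:
        parts.append(f"Source: {source}")
    if reference:
        parts.append(f"Reference: {reference}")
    parts.append(f"Translation: {translation}")
    if errors:
        parts.append(f"Annotated errors: {errors}")
    if chain_of_thought:
        # A simple CoT-style instruction (assumed phrasing).
        parts.append("Reason step by step, then state the final score.")
    parts.append("Score:")
    return "\n".join(parts)

_NUMBER = re.compile(r"\d+(?:\.\d+)?")

def extract_score(response: str, low: float = 0.0, high: float = 100.0) -> Optional[float]:
    """Return the last number in the response that lies within [low, high].

    Returns None when the model gave no usable numerical score.
    """
    for token in reversed(_NUMBER.findall(response)):
        value = float(token)
        if low <= value <= high:
            return value
    return None
```

Because `extract_score` returns `None` when no in-range number is found, responses without a score can be counted separately rather than being silently treated as zero.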

Submitted to arXiv on 04 Oct. 2024

Results of the summarizing process for the arXiv paper: 2410.03278v1

Large language models (LLMs) have substantially advanced natural language processing tasks such as machine translation (MT), text summarization, and information retrieval. These advances are attributed to the enhanced natural language understanding capabilities and contextual awareness of LLMs. Traditional approaches for evaluating MT quality rely on metrics like BLEU or on fine-tuning multilingual pre-trained language models (PTLMs) on human evaluation data. However, recent studies have explored prompting techniques that instruct LLMs to evaluate translation quality. This study investigates how well LLMs evaluate MT quality without training across eight language pairs and suggests that reference translations play a crucial role in accurate evaluations. Additionally, larger LLM variants tend to benefit more from Chain-of-Thought (CoT) prompting than smaller ones. Surprisingly, a 7-billion parameter model outperformed other models for most language pairs, indicating that model size does not always correlate with performance. Furthermore, COMET models fine-tuned on multilingual data generally outperformed LLM prompting results. Overall, this study highlights the importance of reference translations and provides valuable insights into factors influencing LLM-based MT evaluation. The release of prompt templates, code, and data aims to facilitate reproducibility and further research in this area.

The recent surge in the use of large language models (LLMs) has significantly improved downstream natural language processing (NLP) tasks, including machine translation (MT), text summarization, and information retrieval. These advancements are attributed to the enhanced natural language understanding capabilities and contextual awareness of LLMs, as well as their versatile knowledge base.

Traditional approaches for evaluating MT quality rely on metrics like BLEU, BLEURT, or BERTScore to compare the MT output with a reference translation. In the absence of reference translations, quality estimation methods such as fine-tuning multilingual pre-trained language models (PTLMs) on human evaluation data are often used. However, recent studies have explored prompting techniques that instruct LLMs to output scores for translation quality evaluation. This study investigates the ability of LLMs to evaluate MT quality without training across eight language pairs covering high-, medium-, and low-resource languages. The findings suggest that reference translations play a crucial role in achieving accurate evaluations using LLMs. While larger models do not always outperform smaller ones, they tend to benefit more from Chain-of-Thought (CoT) prompting than smaller model variants. Surprisingly, a 7-billion parameter model surpassed other models for most language pairs, indicating that model size does not always correlate with performance. Additionally, COMET models fine-tuned on multilingual data generally outperformed LLM prompting results. Overall, this study provides valuable insights into the factors influencing LLM-based MT evaluation and highlights the importance of reference translations in achieving accurate evaluations. The release of prompt templates, code, and data aims to facilitate reproducibility and further research in this area. Future directions may involve exploring additional prompting techniques and investigating the performance of larger LLM variants with higher computational costs.
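
To make concrete what "comparing the MT output with a reference translation" means for an overlap metric, the sketch below computes clipped n-gram precision, the building block that BLEU is based on. It is deliberately simplified: it is not full BLEU (no brevity penalty, no combination over several n-gram orders), it uses naive whitespace tokenization, and the function name `clipped_precision` is an illustrative assumption.

```python
from collections import Counter
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis: str, reference: str, n: int = 1) -> float:
    """Fraction of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference ("clipping"), so repeating a correct word does not help.
    """
    hyp_counts = Counter(ngrams(hypothesis.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0  # hypothesis too short to contain any n-gram
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return overlap / total
```

A metric of this kind needs only the MT output and a reference translation, whereas the prompting setups discussed in the summary can also draw on the source sentence, error annotations, and guidelines.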
Created on 18 Nov. 2024
