Large language models effectively leverage document-level context for literary translation, but critical errors persist

AI-generated keywords: Large Language Models, Document-Level Context, Literary Translation, Critical Errors, Human Evaluation

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Large language models (LLMs) are effective in translating paragraphs and documents
  • Evaluating LLMs on larger units of text is costly and difficult, so their paragraph- and document-level translation ability has been little studied
  • The authors conducted a human evaluation of the GPT-3.5 (text-davinci-003) LLM's ability to translate literary paragraphs across 18 linguistically diverse language pairs
  • Results show that discourse-level LLM translators commit fewer errors than sentence-level approaches
  • Critical errors still exist, including occasional content omissions that require human intervention
  • The evaluation took approximately 350 hours of effort for annotation and analysis
  • The authors publicly release their dataset and error annotations to spur future research on evaluation of document-level literary translation

Authors: Marzena Karpinska, Mohit Iyyer

preprint (30 pages)

Abstract: Large language models (LLMs) are competitive with the state of the art on a wide range of sentence-level translation datasets. However, their ability to translate paragraphs and documents remains unexplored because evaluation in these settings is costly and difficult. We show through a rigorous human evaluation that asking the GPT-3.5 (text-davinci-003) LLM to translate an entire literary paragraph (e.g., from a novel) at once results in higher-quality translations than standard sentence-by-sentence translation across 18 linguistically-diverse language pairs (e.g., translating into and out of Japanese, Polish, and English). Our evaluation, which took approximately 350 hours of effort for annotation and analysis, is conducted by hiring translators fluent in both the source and target language and asking them to provide both span-level error annotations as well as preference judgments of which system's translations are better. We observe that discourse-level LLM translators commit fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches. With that said, critical errors still abound, including occasional content omissions, and a human translator's intervention remains necessary to ensure that the author's voice remains intact. We publicly release our dataset and error annotations to spur future research on evaluation of document-level literary translation.

Submitted to arXiv on 06 Apr. 2023

Results of the summarizing process for the arXiv paper: 2304.03245v1

In their paper "Large language models effectively leverage document-level context for literary translation, but critical errors persist," Marzena Karpinska and Mohit Iyyer examine how well large language models (LLMs) translate paragraphs and documents. LLMs are competitive on sentence-level translation datasets, but their performance on larger units of text has gone largely unexamined because such evaluations are costly and difficult to run.

The authors conduct a rigorous human evaluation of the GPT-3.5 (text-davinci-003) LLM's ability to translate entire literary paragraphs from novels across 18 linguistically diverse language pairs (e.g., translating into and out of Japanese, Polish, and English). They hire translators fluent in both the source and target languages to provide span-level error annotations as well as preference judgments of which system's translations are better. The evaluation took approximately 350 hours of effort for annotation and analysis.

The results show that asking the LLM to translate an entire literary paragraph at once yields higher-quality translations than standard sentence-by-sentence translation. Discourse-level LLM translators commit fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches. However, critical errors still abound, including occasional content omissions, and a human translator's intervention remains necessary to ensure that the author's voice remains intact. The authors publicly release their dataset and error annotations to spur future research on evaluation of document-level literary translation. Overall, the study highlights the potential of LLMs for translating larger units of text while underscoring that critical errors remain to be addressed.
Created on 08 Apr. 2023
