Large language models effectively leverage document-level context for literary translation, but critical errors persist

AI-generated keywords: Large Language Models, Document-Level Context, Literary Translation, Critical Errors, Human Evaluation

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

- Large language models (LLMs) are effective at translating paragraphs and documents
- Evaluating LLMs' performance on larger units of text is challenging because it is costly and difficult
- The authors conducted a human evaluation of GPT-3.5's ability to translate literary paragraphs across 18 linguistically diverse language pairs
- Results show that discourse-level LLM translators commit fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches
- Critical errors still exist, including occasional content omissions that require human intervention
- The evaluation took approximately 350 hours of effort for annotation and analysis
- The authors publicly release their dataset and error annotations to spur future research on evaluation of document-level literary translation


Authors: Marzena Karpinska, Mohit Iyyer

arXiv: 2304.03245v1 - DOI (cs.CL)

preprint (30 pages)

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Large language models (LLMs) are competitive with the state of the art on a wide range of sentence-level translation datasets. However, their ability to translate paragraphs and documents remains unexplored because evaluation in these settings is costly and difficult. We show through a rigorous human evaluation that asking the GPT-3.5 (text-davinci-003) LLM to translate an entire literary paragraph (e.g., from a novel) at once results in higher-quality translations than standard sentence-by-sentence translation across 18 linguistically-diverse language pairs (e.g., translating into and out of Japanese, Polish, and English). Our evaluation, which took approximately 350 hours of effort for annotation and analysis, is conducted by hiring translators fluent in both the source and target language and asking them to provide both span-level error annotations as well as preference judgments of which system's translations are better. We observe that discourse-level LLM translators commit fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches. With that said, critical errors still abound, including occasional content omissions, and a human translator's intervention remains necessary to ensure that the author's voice remains intact. We publicly release our dataset and error annotations to spur future research on evaluation of document-level literary translation.

Submitted to arXiv on 06 Apr. 2023


Results of the summarizing process for the arXiv paper: 2304.03245v1


Comprehensive Summary

In their paper titled "Large language models effectively leverage document-level context for literary translation, but critical errors persist," authors Marzena Karpinska and Mohit Iyyer explore the ability of large language models (LLMs) to translate paragraphs and documents. While LLMs have shown competitive performance on sentence-level translation datasets, their performance on larger units of text has been hard to assess because such evaluations are costly and difficult.

The authors conduct a rigorous human evaluation of how well the GPT-3.5 (text-davinci-003) LLM translates entire literary paragraphs from novels across 18 linguistically diverse language pairs, which include translation into and out of Japanese, Polish, and English. They hire translators fluent in both the source and target languages to provide span-level error annotations as well as preference judgments of which system's translations are better. The evaluation took approximately 350 hours of effort for annotation and analysis, and the authors publicly release their dataset and error annotations to spur future research on evaluation of document-level literary translation.

The results show that asking the LLM to translate an entire literary paragraph at once yields higher-quality translations than standard sentence-by-sentence translation. Discourse-level LLM translators commit fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches. However, critical errors still abound, including occasional content omissions, and a human translator's intervention remains necessary to ensure that the author's voice remains intact. Overall, this study highlights the potential of LLMs in translating larger units of text while also emphasizing the need for continued improvement in machine translation technology to minimize critical errors.

Summary: Large language models (LLMs) are good at translating big pieces of text. It is hard to test how well they handle bigger pieces because that takes a lot of time and money. The researchers tested one LLM, GPT-3.5, by having human translators check its translations across 18 language pairs. They found that when the LLM translates a whole paragraph at once instead of one sentence at a time, it makes fewer mistakes. But it still sometimes leaves out important parts and needs a human to fix it. The researchers shared their data so that other people can study how to evaluate this kind of translation.

Definitions:
- Large language models (LLMs): computer programs that can understand and generate human language
- Translate: change words from one language into another
- Paragraphs: groups of sentences that talk about the same topic
- Discourse-level: looking at a whole piece of text and how all the parts relate to each other
- Sentence-level: looking at individual sentences without thinking about how they fit together with others
- Annotations: notes or comments added to a piece of text to explain or highlight something

Large Language Models Effectively Leverage Document-Level Context for Literary Translation, But Critical Errors Persist

Evaluation Methodology

Karpinska and Iyyer hired translators fluent in both the source and target languages to provide span-level error annotations as well as preference judgments of which system's translations are better. The evaluation took approximately 350 hours of effort for annotation and analysis. The authors also publicly released their dataset and error annotations to spur future research on evaluation of document-level literary translation.
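As a rough illustration of how the two kinds of human judgment described above might be tallied, the following sketch counts span-level errors per system and computes a preference win rate. The record layout, the system labels `sent` and `para`, the error category names, and the sample data are all assumptions made for this example; they are not taken from the released dataset or from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorSpan:
    """One annotated error span in a system's translation (hypothetical schema)."""
    system: str     # which system produced the translation, e.g. "sent" or "para"
    start: int      # character offset where the marked span begins
    end: int        # character offset where the marked span ends (exclusive)
    category: str   # annotator-assigned label, e.g. "mistranslation"


def errors_per_system(spans):
    """Count annotated error spans for each (system, category) pair."""
    return Counter((s.system, s.category) for s in spans)


def win_rate(preferences, system):
    """Fraction of preference judgments won by `system`."""
    if not preferences:
        raise ValueError("no preference judgments given")
    return sum(1 for p in preferences if p == system) / len(preferences)


# Toy data, invented for illustration only.
spans = [
    ErrorSpan("sent", 0, 12, "mistranslation"),
    ErrorSpan("sent", 30, 41, "grammar"),
    ErrorSpan("sent", 50, 64, "mistranslation"),
    ErrorSpan("para", 5, 17, "omission"),
]
preferences = ["para", "para", "sent", "para"]

counts = errors_per_system(spans)
print(counts[("sent", "mistranslation")])
print(win_rate(preferences, "para"))
```

Keeping the span offsets alongside the category would allow a later analysis to check where in a paragraph errors cluster, although this sketch only aggregates counts.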

Results

The results show that asking the LLM to translate an entire literary paragraph at once results in higher-quality translations than standard sentence-by-sentence translation. Discourse-level LLM translators commit fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches. However, critical errors still abound, including occasional content omissions, and a human translator's intervention remains necessary to ensure that the author's voice remains intact.
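To make the difference between the two setups concrete, here is a minimal sketch of how translation requests could be prepared in each mode: one prompt per sentence versus a single prompt for the whole paragraph. The prompt wording, the function names, and the naive punctuation-based sentence splitter are assumptions for illustration only; the summary does not state which prompts or segmentation the authors used, and no model is called here.

```python
import re


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def build_prompts(paragraph, src, tgt, mode):
    """Return the list of prompts to send for one paragraph.

    mode="sent" yields one prompt per sentence; mode="para" yields a single
    prompt containing the whole paragraph. The wording is a placeholder.
    """
    template = "Translate the following text from {src} into {tgt}:\n{text}"
    if mode == "sent":
        units = split_sentences(paragraph)
    elif mode == "para":
        units = [paragraph.strip()]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [template.format(src=src, tgt=tgt, text=u) for u in units]


paragraph = "She closed the door. It was late! Would he come back?"
sent_prompts = build_prompts(paragraph, "English", "Polish", "sent")
para_prompts = build_prompts(paragraph, "English", "Polish", "para")
print(len(sent_prompts), len(para_prompts))
```

In the sentence-level mode each request sees only its own sentence, so a pronoun such as "he" in the last sentence arrives without its antecedent; the paragraph-level mode keeps that context in a single request.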

Conclusion

Overall, this study highlights the potential of LLMs in translating larger units of text while also emphasizing the need for continued improvement in machine translation technology to minimize critical errors.

Created on 08 Apr. 2023


