Crosslingual Reasoning through Test-Time Scaling

AI-generated keywords: Crosslingual Reasoning

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Study conducted by Zheng-Xin Yong et al. on reasoning capabilities of large language models
  • Focus on English-centric models used for multilingual tasks
  • Exploration of generalization of English reasoning finetuning with long chain-of-thoughts (CoTs) across different languages
  • Scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning, including in low-resource languages
  • Scaled-up RLMs can outperform models twice their size
  • English-centric RLMs reason mostly in English and follow a quote-and-think pattern, quoting non-English inputs and reasoning about them
  • Effective strategy discovered to control language in long CoT reasoning processes
  • Models reason better and more efficiently in high-resource languages compared to low-resource ones
  • Poor out-of-domain reasoning generalization, particularly from STEM topics to cultural commonsense knowledge, even for English
  • Conclusion highlights the potential and limitations of test-time scaling for English-centric models
  • Recommendation to allow English-centric RLMs to reason in high-resource languages while enhancing capabilities in low-resource languages and out-of-domain contexts

Authors: Zheng-Xin Yong, M. Farid Adilazuarda, Jonibek Mansurov, Ruochen Zhang, Niklas Muennighoff, Carsten Eickhoff, Genta Indra Winata, Julia Kreutzer, Stephen H. Bach, Alham Fikri Aji

Abstract: Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages including low-resource languages, to an extent where they outperform models twice their size. Second, we reveal that while English-centric RLM's CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. Third, we discover an effective strategy to control the language of long CoT reasoning, and we observe that models reason better and more efficiently in high-resource languages. Finally, we observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English. Overall, we demonstrate the potentials, study the mechanisms and outline the limitations of crosslingual generalization of English reasoning test-time scaling. We conclude that practitioners should let English-centric RLMs reason in high-resource languages, while further work is needed to improve reasoning in low-resource languages and out-of-domain contexts.

Submitted to arXiv on 08 May 2025

Results of the summarizing process for the arXiv paper: 2505.05408v1

The paper's license does not allow us to build upon its content, so this summary was generated from the paper's metadata rather than the full article.

In the study "Crosslingual Reasoning through Test-Time Scaling," Zheng-Xin Yong, M. Farid Adilazuarda, Jonibek Mansurov, Ruochen Zhang, Niklas Muennighoff, Carsten Eickhoff, Genta Indra Winata, Julia Kreutzer, Stephen H. Bach, and Alham Fikri Aji examine the reasoning capabilities of large language models, focusing on English-centric models that are often used for multilingual tasks. The research asks to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) generalizes across languages.

The study finds that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages, including low-resource ones. In certain scenarios, these scaled-up RLMs even outperform models twice their size. Although English-centric RLMs reason predominantly in English in their CoTs, they consistently follow a quote-and-think pattern when reasoning about quoted non-English inputs. The researchers also discovered an effective strategy to control the language of long CoT reasoning, and they observed that models reason better and more efficiently in high-resource languages than in low-resource ones. Out-of-domain reasoning generalization, however, was poor, particularly from STEM topics to cultural commonsense knowledge, even for English.

In conclusion, the study highlights both the potential and the limitations of test-time scaling for crosslingual reasoning with English-centric models. The authors recommend letting English-centric RLMs reason in high-resource languages, and they note that further work is needed to improve reasoning in low-resource languages and out-of-domain contexts. These findings are relevant to practitioners working with large language models in multilingual settings.
Created on 20 May 2025
