PolyLM: An Open Source Polyglot Large Language Model

AI-generated keywords: PolyLM

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • PolyLM is a multilingual large language model (LLM) developed to address limitations of existing LLMs focused on high-resource languages like English.
  • It has been trained on a massive dataset of 640 billion tokens and is available in two sizes: 1.7B and 13B.
  • PolyLM integrates bilingual data into its training process and uses curriculum learning to gradually increase non-English data during pre-training.
  • It introduces a novel multilingual self-instruct method that generates 132.7K multilingual instructions for fine-tuning the model.
  • PolyLM outperforms other open-source models like LLaMA and BLOOM in multilingual tasks while maintaining comparable performance in English.
  • Access to the PolyLM models, instruction data, and a multilingual benchmark can be found at [https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation).
  • The research team includes Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, and Jun Xie.

Authors: Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, Jun Xie

Abstract: Large language models (LLMs) demonstrate remarkable ability to comprehend, reason, and generate following natural language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English, thereby limiting their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, including multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation.
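
The curriculum described in the abstract can be illustrated with a minimal sketch. Only the 30% and 60% non-English proportions come from the abstract. The two-stage split, the `switch_fraction` value, and the per-batch sampling scheme are assumptions made for illustration and are not taken from the paper.

```python
import random

# Non-English share per stage, as stated in the abstract.
STAGE_NON_ENGLISH_SHARE = {"first": 0.30, "final": 0.60}


def non_english_share(tokens_seen, total_tokens, switch_fraction=0.8):
    """Return the target non-English share at a given point in pre-training.

    switch_fraction is a hypothetical parameter: the abstract does not say
    when training moves from the first stage to the final stage.
    """
    if not 0 <= tokens_seen <= total_tokens:
        raise ValueError("tokens_seen must lie in [0, total_tokens]")
    stage = "first" if tokens_seen < switch_fraction * total_tokens else "final"
    return STAGE_NON_ENGLISH_SHARE[stage]


def sample_batch(english_docs, non_english_docs, batch_size, share, rng):
    """Draw a batch whose non-English fraction approximates `share`.

    Sampling with replacement per batch is an illustrative choice, not the
    authors' data pipeline.
    """
    n_non_english = round(batch_size * share)
    batch = rng.choices(non_english_docs, k=n_non_english)
    batch += rng.choices(english_docs, k=batch_size - n_non_english)
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    total = 640  # billions of tokens, from the abstract
    for seen in (0, 320, 600):
        share = non_english_share(seen, total)
        batch = sample_batch(["en"], ["non-en"], 10, share, rng)
        print(seen, share, batch.count("non-en"))
```

Under these assumptions, a batch of 10 documents contains 3 non-English documents in the first stage and 6 in the final stage.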

Submitted to arXiv on 12 Jul. 2023

Results of the summarizing process for the arXiv paper: 2307.06018v1

This paper's license does not allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

PolyLM is a multilingual large language model (LLM) developed to address the limitations of existing LLMs, which focus primarily on high-resource languages like English. The model was trained on 640 billion tokens and is available in two sizes: 1.7B and 13B. To improve its multilingual capabilities, PolyLM integrates bilingual data into its training process and adopts a curriculum learning strategy that gradually increases the proportion of non-English data during pre-training. Additionally, PolyLM introduces a novel multilingual self-instruct method that automatically generates 132.7K multilingual instructions for fine-tuning the model. To evaluate PolyLM, the authors collected several existing multilingual tasks covering understanding, question answering, generation, and translation. Extensive experiments demonstrate that PolyLM outperforms other open-source models such as LLaMA and BLOOM in multilingual tasks while maintaining comparable performance in English. The authors provide access to the PolyLM models, along with the instruction data and a multilingual benchmark, at the following URL: [https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation). The team behind this research includes Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, and Jun Xie.
Created on 25 Oct. 2023
