PolyLM: An Open Source Polyglot Large Language Model

AI-generated keywords: PolyLM

AI-generated Key Points

⚠ The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

PolyLM is a multilingual large language model (LLM) developed to address limitations of existing LLMs focused on high-resource languages like English.
It has been trained on a massive dataset of 640 billion tokens and is available in two sizes: 1.7B and 13B.
PolyLM integrates bilingual data into its training data and uses curriculum learning to raise the proportion of non-English data from 30% in the first stage of pre-training to 60% in the final stage.
It introduces a novel multilingual self-instruct method that generates 132.7K multilingual instructions for fine-tuning the model.
PolyLM outperforms other open-source models like LLaMA and BLOOM in multilingual tasks while maintaining comparable performance in English.
Access to the PolyLM models, instruction data, and a multilingual benchmark can be found at [https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation).
The research team includes Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, and Jun Xie.

Authors: Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, Jun Xie

arXiv: 2307.06018v1 (cs.CL)

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Large language models (LLMs) demonstrate remarkable ability to comprehend, reason, and generate following natural language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English, thereby limiting their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, including multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation.

Submitted to arXiv on 12 Jul. 2023

Results of the summarizing process for the arXiv paper: 2307.06018v1

Comprehensive Summary

PolyLM is a multilingual large language model (LLM) developed to address a limitation of existing LLMs, which focus primarily on high-resource languages such as English. It was trained on 640 billion tokens and is available in two sizes: 1.7B and 13B. To improve its multilingual capabilities, PolyLM integrates bilingual data into its training data and adopts a curriculum learning strategy that raises the proportion of non-English data from 30% in the first stage of pre-training to 60% in the final stage. The authors also propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for fine-tuning. To evaluate the model, they collect several existing multilingual tasks covering understanding, question answering, generation, and translation. Their experiments show that PolyLM outperforms other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. The models, instruction data, and multilingual benchmark are available at [https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation). The authors are Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, and Jun Xie.

Layman's Summary

PolyLM is a computer program that can understand and write in many different languages. It learned to do this by reading a very large amount of text, about 640 billion tokens (small pieces of words). It comes in two sizes, 1.7B and 13B, where "B" stands for billions of parameters, the adjustable numbers inside the model. During training it follows a method called curriculum learning: it starts with mostly English text and is gradually shown a larger share of text in other languages. According to its authors, PolyLM handles multiple languages better than comparable open-source programs while remaining about as good as they are in English. To learn more about PolyLM or try it yourself, visit [https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation).

Introducing PolyLM: A Multilingual Large Language Model

In recent years, the development of language models has been a major focus in natural language processing (NLP). However, most existing large language models (LLMs) are primarily focused on high-resource languages like English. To address this limitation, researchers from Alibaba Group have developed PolyLM – a multilingual LLM that is trained on a massive dataset of 640 billion tokens and is available in two sizes: 1.7B and 13B.

Improving Multilingual Capabilities

To improve its multilingual capabilities, PolyLM integrates bilingual data into its training data and adopts a curriculum learning strategy that raises the proportion of non-English data from 30% in the first stage of pre-training to 60% in the final stage. The authors also propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for fine-tuning the model.
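As a rough illustration of the curriculum idea, the sketch below samples training batches with a stage-dependent share of non-English documents. Only the 30% and 60% proportions come from the abstract. The two-stage split, the `switch_fraction` parameter, the function names, and the placeholder documents are all assumptions made for this example, not details of the authors' pipeline.

```python
import random

# Non-English proportions quoted in the abstract.
FIRST_STAGE_NON_ENGLISH = 0.30
FINAL_STAGE_NON_ENGLISH = 0.60


def non_english_ratio(step, total_steps, switch_fraction=0.8):
    """Target share of non-English data at a given training step.

    Assumption: exactly two stages, with the first stage covering
    `switch_fraction` of training. The abstract gives neither detail.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if step < switch_fraction * total_steps:
        return FIRST_STAGE_NON_ENGLISH
    return FINAL_STAGE_NON_ENGLISH


def sample_batch(english_docs, non_english_docs, batch_size, ratio, rng):
    """Draw a batch where each item is non-English with probability `ratio`."""
    batch = []
    for _ in range(batch_size):
        pool = non_english_docs if rng.random() < ratio else english_docs
        batch.append(rng.choice(pool))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    english = ["en-0", "en-1", "en-2"]      # placeholder documents
    non_english = ["zh-0", "es-0", "ru-0"]  # placeholder documents
    for step in (0, 50, 90):
        ratio = non_english_ratio(step, total_steps=100)
        print(step, ratio, sample_batch(english, non_english, 4, ratio, rng))
```

Sampling each item independently keeps the expected mix at the target ratio without fixing the exact count per batch; a real pipeline might instead weight whole corpora or shards.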

Evaluating Performance

To evaluate the performance of PolyLM, several existing multilingual tasks including understanding, question answering, generation, and translation were collected. Extensive experiments demonstrate that PolyLM outperforms other open-source models such as LLaMA and BLOOM in multilingual tasks while maintaining comparable performance in English.

Accessibility

The authors provide access to the PolyLM models along with instruction data and a multilingual benchmark at [https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation). The team behind this research includes Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, and Jun Xie.

Conclusion

PolyLM is a notable step toward language models that serve many languages: according to the authors' experiments, it outperforms open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Because the models, instruction data, and multilingual benchmark are publicly available at the URL above, other researchers can evaluate and build on the work.

Created on 25 Oct. 2023

