A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models
AI-generated Key Points
- Polyglot project aims to enhance non-English language performance of multilingual language models
- Researchers and developers often build monolingual models due to dissatisfaction with current multilingual models' non-English language capabilities
- The project seeks to develop advanced multilingual language models with improved performance in non-English languages
- Introduction of Polyglot Korean models with a specific focus on Korean language
- Collaboration with TUNiB to collect 1.2TB of curated Korean data
- Korean models were prioritized before multilingual models for two reasons: they enable performance comparisons with existing multilingual models, and they serve the specific needs of Korean companies and researchers
- Development of Polyglot-Ko models in four sizes: 1.3B, 3.8B, 5.8B, and 12.8B parameters
- The 12.8 billion parameter model is the largest publicly available Korean language model suitable for commercial applications
- Assessment of zero-shot and few-shot performance using KOBEST benchmark shows competitive results across various datasets
- The open-source, large-scale Polyglot-Ko models are presented as a step toward closing the non-English performance gap in multilingual language models
Authors: Hyunwoong Ko, Kichang Yang, Minho Ryu, Taekyoon Choi, Seungmu Yang, Jiwung Hyun, Sungho Park
Abstract: Polyglot is a pioneering project aimed at enhancing the non-English language performance of multilingual language models. Despite the availability of various multilingual models such as mBERT (Devlin et al., 2019), XGLM (Lin et al., 2022), and BLOOM (Scao et al., 2022), researchers and developers often resort to building monolingual models in their respective languages due to dissatisfaction with current multilingual models' non-English language capabilities. Addressing this gap, we seek to develop advanced multilingual language models that offer improved performance in non-English languages. In this paper, we introduce the Polyglot Korean models, which focus on a single language rather than being multilingual. In collaboration with TUNiB, our team collected 1.2TB of Korean data meticulously curated for our research. We made a deliberate decision to prioritize the development of Korean models before venturing into multilingual models. This choice was motivated by multiple factors: first, the Korean models facilitated performance comparisons with existing multilingual models; second, they catered to the specific needs of Korean companies and researchers. This paper presents our work in developing the Polyglot Korean models, which take some steps towards addressing the non-English language performance gap in multilingual language models.
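The key points mention zero-shot and few-shot evaluation on KOBEST. As a minimal sketch of the difference between the two settings, the snippet below builds a prompt with k solved examples placed before the query (k = 0 is zero-shot). The function name `build_prompt`, the "Question/Answer" template, and the sample items are illustrative assumptions; they are not taken from the paper or from KOBEST.

```python
from typing import List, Tuple


def build_prompt(examples: List[Tuple[str, str]], query: str, k: int = 0) -> str:
    """Build a k-shot prompt: k solved examples followed by the unsolved query.

    The template is a hypothetical one chosen for illustration only.
    """
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and len(examples)")
    # Each "shot" is a question together with its gold answer.
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    # The query comes last, with the answer left blank for the model to fill in.
    blocks.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical yes/no items (assumed for this sketch, not KOBEST data).
demo = [
    ("Is Seoul the capital of South Korea?", "yes"),
    ("Is Busan an inland city?", "no"),
]

zero_shot = build_prompt(demo, "Is Jeju an island?", k=0)
few_shot = build_prompt(demo, "Is Jeju an island?", k=2)
```

In the zero-shot case the model sees only the query; in the few-shot case the same query is preceded by solved examples, which is the only thing that changes between the two settings in this sketch.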