A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models

AI-generated keywords: Polyglot

AI-generated Key Points

Polyglot project aims to enhance non-English language performance of multilingual language models
Researchers and developers often build monolingual models due to dissatisfaction with current multilingual models' non-English language capabilities
The project seeks to develop advanced multilingual language models with improved performance in non-English languages
Introduction of Polyglot Korean models with a specific focus on Korean language
Collaboration with TUNiB to collect 1.2TB of curated Korean data
Prioritization of Korean models before venturing into multilingual models for performance comparisons and catering to specific needs of Korean companies and researchers
Development of Polyglot-Ko model with three different sizes: 400M, 5.8B, and 12.8B parameters
The 12.8 billion parameter model is the largest publicly available Korean language model suitable for commercial applications
Assessment of zero-shot and few-shot performance using KOBEST benchmark shows competitive results across various datasets
Open-source large-scale Korean language model that contributes to closing the non-English performance gap in multilingual language models

Authors: Hyunwoong Ko, Kichang Yang, Minho Ryu, Taekyoon Choi, Seungmu Yang, Jiwung Hyun, Sungho Park

arXiv: 2306.02254v1 - DOI (cs.CL)

License: CC BY-SA 4.0

Abstract: Polyglot is a pioneering project aimed at enhancing the non-English language performance of multilingual language models. Despite the availability of various multilingual models such as mBERT (Devlin et al., 2019), XGLM (Lin et al., 2022), and BLOOM (Scao et al., 2022), researchers and developers often resort to building monolingual models in their respective languages due to dissatisfaction with the current multilingual models' non-English language capabilities. Addressing this gap, we seek to develop advanced multilingual language models that offer improved performance in non-English languages. In this paper, we introduce the Polyglot Korean models, which represent a specific focus rather than being multilingual in nature. In collaboration with TUNiB, our team collected 1.2TB of Korean data meticulously curated for our research journey. We made a deliberate decision to prioritize the development of Korean models before venturing into multilingual models. This choice was motivated by multiple factors: firstly, the Korean models facilitated performance comparisons with existing multilingual models; and finally, they catered to the specific needs of Korean companies and researchers. This paper presents our work in developing the Polyglot Korean models, which propose some steps towards addressing the non-English language performance gap in multilingual language models.

Submitted to arXiv on 04 Jun. 2023

Results of the summarizing process for the arXiv paper: 2306.02254v1

Polyglot is a project aimed at enhancing the non-English language performance of multilingual language models. Despite the availability of various multilingual models, researchers and developers often resort to building monolingual models in their own languages because they are dissatisfied with the non-English capabilities of current multilingual models. To address this gap, the authors seek to develop advanced multilingual language models that perform better in non-English languages. This paper introduces the Polyglot Korean models, which focus on a single language rather than being multilingual. In collaboration with TUNiB, the team collected 1.2TB of curated Korean data. They deliberately prioritized Korean models before venturing into multilingual ones, for two reasons: first, Korean models facilitate performance comparisons with existing multilingual models; second, they cater to the specific needs of Korean companies and researchers.

The Polyglot-Ko models come in three sizes: 400M, 5.8B, and 12.8B parameters. The 12.8 billion parameter model is particularly noteworthy as it is currently the largest publicly available Korean language model suitable for commercial applications. The authors assess the zero-shot and few-shot performance of the Polyglot-Ko models on the KOBEST benchmark and report competitive results across its datasets. Overall, the paper presents an open-source, large-scale Korean language model as a step towards closing the non-English performance gap in multilingual language models, and it provides a valuable resource for researchers and practitioners working on Korean natural language processing.

The Polyglot project wants to make language models better at languages other than English. People who build these models often train a model on a single language because current multilingual models are not good enough at non-English languages. The project's long-term goal is multilingual models that handle those languages better. As a first step, the team built Korean-only models and worked with TUNiB to collect 1.2TB of Korean text. They started with Korean so that they could compare their models with existing multilingual ones and meet the needs of Korean companies and researchers. The largest model has 12.8 billion parameters and can be used for business purposes. They tested the models on a benchmark called KOBEST, where the models did well across its datasets. The models are open source.

Definitions
- Polyglot: Being able to speak or understand many different languages.
- Multilingual: Able to speak or understand more than one language.
- Model: Here, a computer program that has learned patterns of language from large amounts of text.
- Parameters: The numerical values inside a model that are adjusted during training; more parameters generally mean a larger model.
- Benchmark: A standard or reference point used for comparison or evaluation.
- Open-source: Something that is freely available for anyone to use, modify, or distribute.

Polyglot: Enhancing Non-English Language Performance of Multilingual Language Models

Multilingual language models have been around for some time, but researchers and developers often resort to building monolingual models in their own languages because they are dissatisfied with the non-English capabilities of current multilingual models. To address this gap, the Polyglot team set out to develop advanced multilingual language models that perform better in non-English languages. This research paper focuses on the Polyglot Korean models, which were developed as the first step of that project.

Background

The team made a deliberate decision to prioritize Korean models before venturing into multilingual models. Two factors motivated this choice: first, Korean models facilitated performance comparisons with existing multilingual models; second, they catered to the specific needs of Korean companies and researchers. In collaboration with TUNiB, the team collected 1.2TB of curated Korean data for training.

Polyglot-Ko Model

The authors present their work on the Polyglot Korean models as a step towards addressing the non-English language performance gap in multilingual language models. Polyglot-Ko is available in three sizes: 400M, 5.8B, and 12.8B parameters. The 12.8B model is currently the largest publicly available Korean language model suitable for commercial applications.

Performance Evaluation

The authors assess the zero-shot and few-shot performance of the Polyglot-Ko models on KOBEST, a Korean benchmark made up of five tasks (BoolQ, COPA, WiC, HellaSwag, and SentiNeg), and report competitive results across its datasets.
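To make the zero-shot versus few-shot distinction concrete, here is a minimal Python sketch of one common way such evaluations are set up: a prompt is built from k labeled demonstrations (k = 0 for zero-shot), and the model's prediction is the candidate label it scores highest as a continuation. This is an illustration under stated assumptions, not the authors' evaluation code. The names `build_prompt`, `predict`, `accuracy`, and `toy_score` are hypothetical, the "Text:/Label:" template is invented for the example, and `toy_score` is a keyword-based stand-in for a real language model's log-likelihood.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, gold label)
Scorer = Callable[[str, str], float]  # (prompt, continuation) -> score


def build_prompt(shots: Sequence[Example], query: str) -> str:
    """Join k labeled demonstrations and the unlabeled query.

    An empty `shots` sequence yields a zero-shot prompt.
    The "Text:/Label:" template is an assumption made for this sketch.
    """
    blocks = [f"Text: {text}\nLabel: {label}" for text, label in shots]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)


def predict(score: Scorer, prompt: str, labels: Sequence[str]) -> str:
    """Return the candidate label the scorer rates highest as a continuation."""
    return max(labels, key=lambda label: score(prompt, " " + label))


def accuracy(score: Scorer, shots: Sequence[Example],
             test_set: Sequence[Example], labels: Sequence[str]) -> float:
    """Fraction of test examples whose predicted label matches the gold label."""
    correct = sum(
        predict(score, build_prompt(shots, text), labels) == gold
        for text, gold in test_set
    )
    return correct / len(test_set)


def toy_score(prompt: str, continuation: str) -> float:
    """Hypothetical stand-in for a model's log-likelihood of `continuation`.

    It looks only at the final query and prefers "positive" when the
    query contains the word "good"; a real scorer would call a model.
    """
    query = prompt.rsplit("Text: ", 1)[1]
    wants_positive = "good" in query
    is_positive = continuation.strip() == "positive"
    return 0.0 if wants_positive == is_positive else -1.0


shots = [("good film", "positive"), ("bad film", "negative")]
test_set = [("a good day", "positive"), ("a dull day", "negative")]
labels = ["positive", "negative"]

zero_shot_acc = accuracy(toy_score, [], test_set, labels)
few_shot_acc = accuracy(toy_score, shots, test_set, labels)
```

In a real evaluation, `toy_score` would be replaced by a function that returns the model's log-probability of the continuation given the prompt, and the demonstrations would be drawn from the benchmark's training split.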

Conclusion

Overall, this paper presents an open-source, large-scale Korean language model that contributes to improving non-English language capabilities in multilingual language models. It provides a valuable resource for researchers and practitioners working on Korean natural language processing tasks and shows progress in addressing the performance gap for non-English languages.

Created on 15 Sep. 2023
