XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

AI-generated keywords: XLM-V

AI-generated Key Points

  • Large multilingual language models, such as XLM-R, have a single vocabulary shared across more than 100 languages.
  • The vocabulary size has not kept up with the growth in model size and complexity, creating a "vocabulary bottleneck."
  • The authors propose a new approach to overcome this limitation by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language.
  • XLM-V is a multilingual language model with a one million token vocabulary that outperforms XLM-R on every task tested, including natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn).
  • XLM-V is particularly effective on low-resource language tasks, outperforming XLM-R by 11.2% and 5.8% absolute on MasakhaNER and Americas NLI, respectively.
  • The paper introduces the concept of average log probability (ALP) to evaluate the ability of a vocabulary to represent a particular language.
  • A greedy algorithm is proposed to determine the desired vocabulary capacity for individual languages based on ALP.
  • The authors train individual monolingual SentencePiece models for each language using the Unigram Language Model algorithm, build lexical representation vectors from the per-language vocabularies, and cluster those vectors with K-Means to construct multilingual vocabularies.
  • Vocabulary capacities are assigned to each cluster based on ALP, resulting in per-cluster vocabularies.
  • This research presents an innovative approach for scaling multilingual vocabularies and demonstrates improved performance compared to existing models like XLM-R.
  • The proposed methodology for vocabulary allocation provides a systematic way to optimize vocabulary capacity for individual languages.
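
As a rough illustration of the ALP idea in the key points above, the sketch below computes ALP for a toy corpus. The summary does not give the formula, so this assumes one common formulation: the mean, over sentences, of the summed unigram log-probabilities of each sentence's tokens. The function name, the `unk_log_prob` fallback, and the toy values are hypothetical.

```python
import math


def average_log_probability(tokenized_sentences, token_log_probs, unk_log_prob=-20.0):
    """Mean over sentences of the summed unigram log-probabilities of their tokens.

    tokenized_sentences: list of token lists (already segmented by the vocabulary).
    token_log_probs: dict mapping each vocabulary token to its unigram log-probability.
    unk_log_prob: assumed penalty for tokens missing from the vocabulary (hypothetical).
    """
    if not tokenized_sentences:
        raise ValueError("need at least one sentence")
    total = 0.0
    for tokens in tokenized_sentences:
        total += sum(token_log_probs.get(tok, unk_log_prob) for tok in tokens)
    return total / len(tokenized_sentences)


# Toy unigram model and corpus (made-up values, for illustration only).
toy_log_probs = {
    "_hello": math.log(0.5),
    "_world": math.log(0.25),
    "s": math.log(0.25),
}
toy_corpus = [["_hello", "_world"], ["_hello", "_world", "s"]]

alp = average_log_probability(toy_corpus, toy_log_probs)
print(f"ALP = {alp:.4f}")
```

Under this formulation, a higher (less negative) ALP indicates that the vocabulary segments the language's text into fewer, more probable tokens, which is why it can serve as a signal for how much capacity a language needs.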

Authors: Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa

EMNLP 2023
License: CC BY-SA 4.0

Abstract: Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This "vocabulary bottleneck" limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), to named entity recognition (WikiAnn). XLM-V is particularly effective on low-resource language tasks and outperforms XLM-R by 11.2% and 5.8% absolute on MasakhaNER and Americas NLI, respectively.

Submitted to arXiv on 25 Jan. 2023

Results of the summarizing process for the arXiv paper: 2301.10472v2

Large multilingual language models, such as XLM-R, typically rely on a single vocabulary shared across more than 100 languages. However, as these models have grown in size and complexity, the vocabulary size has remained largely unchanged. This creates a "vocabulary bottleneck" that limits the representational capabilities of these models. In this paper, the authors propose a new approach to overcome this limitation by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to ensure sufficient coverage for each individual language. The authors introduce XLM-V, a multilingual language model with a one million token vocabulary. They demonstrate that XLM-V outperforms XLM-R on every task tested, including natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn). XLM-V is particularly effective on low-resource language tasks, achieving absolute improvements of 11.2% and 5.8% over XLM-R on MasakhaNER and Americas NLI, respectively. The paper also discusses vocabulary allocation and introduces the concept of average log probability (ALP) to evaluate the ability of a vocabulary to represent a particular language. The authors propose a greedy algorithm to determine the desired vocabulary capacity for individual languages in the multilingual vocabulary based on ALP. To construct the multilingual vocabularies, the authors train individual monolingual SentencePiece models for each language using the Unigram Language Model algorithm. They then use the per-language vocabularies to construct lexical representation vectors and cluster these vectors using K-Means clustering. Vocabulary capacities are assigned to each cluster based on ALP, resulting in per-cluster vocabularies.
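
The greedy allocation step described above might look roughly like the following. The summary does not spell out the selection criterion, so this sketch assumes ALP has been precomputed at a few candidate vocabulary sizes per language (or cluster) and repeatedly grants the next size step to whichever one gains the most ALP per added token, until a total budget is used up. All names and numbers are hypothetical toy values, not results from the paper.

```python
def greedy_allocate(alp_by_size, budget):
    """Greedily pick a vocabulary size per language under a total token budget.

    alp_by_size: dict mapping a language (or cluster) name to a list of
        (vocab_size, alp) pairs sorted by vocab_size in ascending order.
    budget: total number of vocabulary entries available (assumed constraint).
    Returns a dict mapping each name to its chosen vocabulary size.
    """
    # Every language starts at its smallest candidate size.
    level = {name: 0 for name in alp_by_size}
    used = sum(sizes[0][0] for sizes in alp_by_size.values())
    if used > budget:
        raise ValueError("budget is smaller than the sum of minimum sizes")

    while True:
        best_name, best_gain, best_extra = None, 0.0, 0
        for name, sizes in alp_by_size.items():
            i = level[name]
            if i + 1 >= len(sizes):
                continue  # already at the largest candidate size
            extra = sizes[i + 1][0] - sizes[i][0]
            if used + extra > budget:
                continue  # the next step would exceed the budget
            # Assumed criterion: ALP improvement per added vocabulary entry.
            gain = (sizes[i + 1][1] - sizes[i][1]) / extra
            if gain > best_gain:
                best_name, best_gain, best_extra = name, gain, extra
        if best_name is None:
            break  # no affordable step improves ALP any further
        level[best_name] += 1
        used += best_extra

    return {name: alp_by_size[name][level[name]][0] for name in alp_by_size}


# Toy ALP tables (made-up numbers): "sw" benefits more from extra capacity.
toy_alp = {
    "en": [(1000, -200.0), (2000, -190.0), (3000, -188.0)],
    "sw": [(1000, -260.0), (2000, -230.0), (3000, -215.0)],
}
allocation = greedy_allocate(toy_alp, budget=5000)
print(allocation)
```

In this toy run the language whose ALP improves fastest receives capacity first, which mirrors the intent of giving each language enough vocabulary for adequate coverage rather than splitting the budget evenly.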
Overall, this paper presents an innovative approach for scaling multilingual vocabularies and demonstrates its effectiveness through improved performance on various tasks compared to existing models like XLM-R. The proposed methodology for vocabulary allocation provides a systematic way to optimize vocabulary capacity for individual languages. This research has significant implications for improving the representational capabilities of multilingual language models and enhancing their performance on diverse linguistic tasks.
Created on 20 Nov. 2023
