Large multilingual language models, such as XLM-R, typically rely on a single vocabulary shared across more than 100 languages. However, as these models have grown in size and complexity, the vocabulary size has remained largely unchanged. This creates a "vocabulary bottleneck" that limits the representational capabilities of these models. In this paper, the authors propose a new approach to overcome this limitation by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to ensure sufficient coverage for each individual language. The authors introduce XLM-V, a multilingual language model with a one million token vocabulary. They demonstrate that XLM-V outperforms XLM-R on various tasks including natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn). Notably, XLM-V performs exceptionally well on low-resource language tasks and achieves an absolute improvement of 11.2% and 5.8% on MasakhaNER and Americas NLI respectively compared to XLM-R. The paper also discusses the issue of vocabulary allocation and introduces the concept of average log probability (ALP) to evaluate the ability of a vocabulary to represent a particular language. The authors propose a greedy algorithm to determine the desired vocabulary capacity for individual languages in the multilingual vocabulary based on ALP. To construct the multilingual vocabularies, the authors train individual monolingual sentencepiece models for each language using the Unigram Language Model algorithm. They then use per-language vocabularies to construct lexical representation vectors and cluster them using K-Means clustering. Vocabulary capacities are assigned to each cluster based on ALP, resulting in per-cluster vocabularies. 
Overall, this paper presents an innovative approach for scaling multilingual vocabularies and demonstrates its effectiveness through improved performance on various tasks compared to existing models like XLM-R. The proposed methodology for vocabulary allocation provides a systematic way to optimize vocabulary capacity for individual languages. This research has significant implications for improving the representational capabilities of multilingual language models and enhancing their performance on diverse linguistic tasks.
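As a rough illustration of the ALP idea mentioned above, the sketch below assumes ALP is the per-sentence average of the summed log unigram probabilities of a corpus's subword tokens, with the unigram probabilities estimated from token counts in that same corpus. The summary does not spell out the formula, so this definition, the function name `average_log_probability`, and the toy segmentations are assumptions for illustration only.

```python
import math
from collections import Counter


def average_log_probability(tokenized_sentences):
    """Mean over sentences of the summed log unigram probability of their tokens.

    Assumed definition: unigram probabilities are estimated from the token
    counts of the same tokenized corpus.
    """
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    total = sum(counts.values())
    log_p = {tok: math.log(c / total) for tok, c in counts.items()}
    sentence_scores = [sum(log_p[tok] for tok in sent) for sent in tokenized_sentences]
    return sum(sentence_scores) / len(tokenized_sentences)


# Toy example (made-up data): the same two "sentences" segmented with a
# character-level vocabulary and with a vocabulary that has larger subwords.
fine = [["l", "o", "w"], ["l", "o", "w", "e", "r"]]
coarse = [["low"], ["low", "er"]]

alp_fine = average_log_probability(fine)
alp_coarse = average_log_probability(coarse)
print("fine:", alp_fine, "coarse:", alp_coarse)
```

Under this assumed definition, a vocabulary that covers the text with fewer, more frequent tokens yields a higher (less negative) ALP, which is the sense in which ALP can rank vocabularies for a given language.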
- Large multilingual language models, such as XLM-R, have a single vocabulary shared across more than 100 languages.
- The vocabulary size has not kept up with the growth in model size and complexity, creating a "vocabulary bottleneck."
- The authors propose a new approach to overcome this limitation: de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to ensure sufficient coverage for each individual language.
- XLM-V is a multilingual language model with a one million token vocabulary that outperforms XLM-R on various tasks including natural language inference, question answering, and named entity recognition.
- XLM-V performs exceptionally well on low-resource language tasks, with absolute improvements of 11.2% on MasakhaNER and 5.8% on Americas NLI compared to XLM-R.
- The paper introduces the concept of average log probability (ALP) to evaluate the ability of a vocabulary to represent a particular language.
- A greedy algorithm is proposed to determine the desired vocabulary capacity for individual languages based on ALP.
- The authors train individual monolingual sentencepiece models for each language using the Unigram Language Model algorithm, build lexical representation vectors from the per-language vocabularies, and cluster these vectors using K-Means to construct multilingual vocabularies.
- Vocabulary capacities are assigned to each cluster based on ALP, resulting in per-cluster vocabularies.
- This research presents an innovative approach for scaling multilingual vocabularies and demonstrates improved performance compared to existing models like XLM-R.
- The proposed methodology for vocabulary allocation provides a systematic way to optimize vocabulary capacity for individual languages.
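The greedy capacity allocation described in the bullets above could look roughly like the following sketch. It assumes ALP has already been measured for each language at a grid of candidate vocabulary sizes, and that each greedy step gives a fixed increment of tokens to whichever language gains the most ALP. The summary does not specify these details, so the step rule, the `greedy_allocate` helper, and the toy ALP numbers are all hypothetical.

```python
def greedy_allocate(alp_table, step, budget):
    """Greedily split a token budget across languages.

    alp_table[lang][size] is the (precomputed) ALP of `lang` with a vocabulary
    of `size` tokens; sizes are assumed to be multiples of `step`.
    """
    # Every language starts with the smallest capacity.
    capacity = {lang: step for lang in alp_table}
    used = step * len(alp_table)
    while used + step <= budget:
        best_lang, best_gain = None, 0.0
        for lang, table in alp_table.items():
            next_size = capacity[lang] + step
            if next_size not in table:
                continue  # no measurement for a larger vocabulary
            gain = table[next_size] - table[capacity[lang]]
            if gain > best_gain:
                best_lang, best_gain = lang, gain
        if best_lang is None:
            break  # no language benefits from more capacity
        capacity[best_lang] += step
        used += step
    return capacity


# Made-up ALP values for two hypothetical languages.
toy_alp = {
    "lang_a": {1000: -50.0, 2000: -45.0, 3000: -44.0},
    "lang_b": {1000: -80.0, 2000: -60.0, 3000: -50.0},
}
allocation = greedy_allocate(toy_alp, step=1000, budget=4000)
print(allocation)
```

In this toy setup the language whose ALP improves fastest with extra tokens receives most of the budget, which mirrors the goal of giving each language enough capacity for sufficient coverage.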
Large multilingual language models like XLM-R have a big collection of words that they can understand in more than 100 different languages. But sometimes, the number of words they know is not enough for the size and complexity of the model, which causes a problem called "vocabulary bottleneck." The authors came up with a new idea called XLM-V to solve this problem. XLM-V is also a multilingual language model, but it has one million words in its vocabulary and it performs better than XLM-R on different tasks like understanding sentences, answering questions, and recognizing names. It is especially good at understanding languages that don't have many resources available. The authors used a special way to decide how many words each language should have in the vocabulary by using something called average log probability (ALP). They trained separate models for each language and then combined them together based on similarities using K-Means clustering. This research shows a new way to make multilingual models better and gives us a way to decide how many words each language needs in the model's vocabulary.
Definitions
- Multilingual: Being able to understand or use more than one language.
- Vocabulary: A collection of all the words that someone knows or understands.
- Complexity: Something that is complicated or difficult to understand.
- Bottleneck: A situation where progress or movement is slowed down or blocked.
- Lexical overlap: When two languages share some similar words or phrases.
- Natural Language Inference: Understanding what someone means
Introduction to XLM-V: A Multilingual Language Model with One Million Token Vocabulary
Large multilingual language models, such as XLM-R, have become increasingly popular for their ability to represent multiple languages in a single model. However, these models typically rely on a single vocabulary shared across more than 100 languages, which creates a “vocabulary bottleneck” that limits the representational capabilities of these models. In this paper, the authors propose an innovative approach to overcome this limitation by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to ensure sufficient coverage for each individual language. The authors introduce XLM-V, a multilingual language model with a one million token vocabulary. They demonstrate that XLM-V outperforms XLM-R on various tasks including natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn). Notably, XLM-V performs exceptionally well on low-resource language tasks and achieves an absolute improvement of 11.2% and 5.8% on MasakhaNER and Americas NLI respectively compared to XLM-R.
Overview of Proposed Methodology
The proposed methodology consists of two main components: constructing per-language vocabularies using the Unigram Language Model algorithm, and determining the desired vocabulary capacity for individual languages in the multilingual vocabulary based on average log probability (ALP). The authors train individual monolingual sentencepiece models for each language using the Unigram Language Model algorithm, then use the resulting per-language vocabularies to construct lexical representation vectors. These vectors are clustered using K-Means, so that languages with similar vocabularies are grouped together. Vocabulary capacities are assigned to each cluster based on ALP, resulting in per-cluster vocabularies, which are then merged into one large multilingual vocabulary containing one million tokens.
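The clustering step can be sketched as follows. This is a minimal, dependency-free illustration, not the authors' implementation: it assumes each language is represented by a binary vector marking which tokens of the combined vocabulary appear in its own vocabulary (the summary does not say how the lexical representation vectors are weighted), and it uses a small hand-written K-Means with a deterministic farthest-point initialization. The helper names and the toy vocabularies are hypothetical and are not real sentencepiece output.

```python
def lexical_vectors(vocabs):
    """Binary membership vector per language over the union of all tokens."""
    union = sorted(set().union(*vocabs.values()))
    return {
        lang: [1.0 if tok in vocab else 0.0 for tok in union]
        for lang, vocab in vocabs.items()
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain K-Means; returns one cluster index per point."""
    # Deterministic init: first point, then repeatedly the farthest point.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        assign = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Toy per-language vocabularies (made up for illustration).
toy_vocabs = {
    "es": {"la", "casa", "de", "que"},
    "pt": {"a", "casa", "de", "que"},
    "ru": {"v", "dom", "i"},
    "uk": {"v", "dim", "i"},
}
vectors = lexical_vectors(toy_vocabs)
langs = list(vectors)
labels = kmeans([vectors[lang] for lang in langs], k=2)
clusters = dict(zip(langs, labels))
print(clusters)
```

Languages that share many tokens end up in the same cluster, so a per-cluster vocabulary can share tokens among lexically similar languages while keeping dissimilar ones apart.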
Evaluation Results
The authors evaluate their proposed method by comparing its performance against existing models like XLM-R across various tasks including natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn). The results show that, overall, XLM-V outperforms XLM-R on all tasks, with the largest improvements observed on low-resource tasks such as MasakhaNER and Americas NLI, where it achieves absolute improvements of 11.2% and 5.8%, respectively, compared to XLM-R.
Conclusion
Overall, this paper presents an innovative approach for scaling multilingual vocabularies and demonstrates its effectiveness through improved performance on various tasks compared to existing models like XLM-R. The proposed methodology provides a systematic way to optimize vocabulary capacity for individual languages by introducing average log probability (ALP) as an evaluation metric, along with a greedy algorithm for determining the desired vocabulary capacity of each language. This research has significant implications for improving the representational capabilities of multilingual language models and enhancing their performance on diverse linguistic tasks.