This paper focuses on language identification models for low- and under-resourced languages in the Pacific region, particularly Austronesian languages that earlier language identification models did not cover. The goal is to develop accurate language identification systems as part of building language resources. The study combines 29 Austronesian languages with 171 non-Austronesian languages from eight data sources to create an evaluation set. After evaluating six different approaches, the researchers find that a classifier based on skip-gram embeddings significantly outperforms the other methods. To investigate the impact of a larger inventory of non-Austronesian languages, the researchers systematically increase the number of languages in the model up to a total of 800. Surprisingly, accuracy decreases only minimally as the inventory expands, suggesting that including more non-Austronesian languages does not negatively affect predictions for Austronesian languages. The paper also explores adapting these language identification models for code-switching detection and achieves high accuracy across all 29 languages studied. Additionally, the performance and stability of models after compression are evaluated. In terms of data collection, the researchers prioritize maintaining diversity across various domains rather than reducing sample size. They argue that documents with fewer than 100 characters are less useful for corpus-building purposes. The paper provides a detailed overview of the approach, including how the non-Austronesian languages for the initial model were selected based on genetic classifications and previous work on digital language mapping. The researchers compare their inventory of 200 languages with popular language identification packages such as Google's CLD3 and langid.py.
Overall, this research contributes to improving language identification models for low-resource Pacific languages by demonstrating the effectiveness of skip-gram embeddings and exploring the impact of increasing non-Austronesian language inventories. The findings have implications for developing accurate code-switching detection systems as well.
- Language identification models for low- and under-resourced languages in the Pacific region, particularly Austronesian languages
- Goal is to develop accurate language identification systems as part of building language resources
- Evaluation set includes 29 Austronesian languages and 171 non-Austronesian languages from eight data sources
- Skip-gram embeddings outperform other methods significantly in language identification
- Minimal decrease in accuracy when expanding the inventory of non-Austronesian languages
- Adapting language identification models for code-switching detection with high accuracy across all 29 languages studied
- Performance and stability of models after compression evaluated
- Prioritizing diversity in data collection rather than reducing sample size
- Detailed overview of approach, including selection of non-Austronesian languages based on genetic classifications and previous work on digital language mapping
- Comparison of inventory of 200 languages with popular language identification packages such as Google's CLD3 and langid.py
- Research contributes to improving language identification models for low-resource Pacific languages by demonstrating effectiveness of skip-gram embeddings and exploring impact of increasing non-Austronesian language inventories
Summary: Researchers are working on identifying languages in the Pacific region, especially Austronesian languages. They want to create accurate systems for identifying languages and gather language resources. They tested 29 Austronesian languages and 171 non-Austronesian languages from different sources. Skip-gram embeddings were found to be the best method for language identification. Adding more non-Austronesian languages did not decrease accuracy much. The researchers also adapted the language identification models for detecting code-switching in all 29 languages.
Definitions:
- Language identification: The process of determining which language is being spoken or written.
- Austronesian languages: A family of related languages spoken across Maritime Southeast Asia, Madagascar, and the Pacific, including Indonesian, Tagalog, and Hawaiian.
- Accuracy: How correct or precise something is.
- Embeddings: Representations of words or phrases as numerical vectors.
- Inventory: A list or collection of things.
- Code-switching: Switching between two or more languages within a conversation or text.
Language Identification Models for Low- and Under-Resourced Languages in the Pacific Region
The Pacific region is home to a wide variety of low- and under-resourced languages, particularly Austronesian languages. Developing accurate language identification systems is essential for building language resources, yet there has been limited research on this topic. To address this gap, researchers from the University of Hawai’i at Manoa recently published a paper that explores language identification models for Austronesian languages in the Pacific region.
Data Collection and Evaluation Set
The study combines 29 Austronesian languages with 171 non-Austronesian languages from eight data sources to create an evaluation set. The researchers prioritize maintaining diversity across various domains rather than reducing sample size; documents with fewer than 100 characters are less useful for corpus-building purposes. To select non-Austronesian languages for their initial model, they used genetic classifications and previous work on digital language mapping.
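The length criterion above can be illustrated with a minimal sketch. It assumes the 100-character threshold is applied as a hard filter on whitespace-trimmed text; the source only says that shorter documents are less useful, so the filtering behaviour and the names `MIN_CHARS`, `keep_for_corpus`, and `filter_documents` are hypothetical.

```python
from typing import Iterable, List

# Threshold taken from the text; treating it as a hard cutoff is an assumption.
MIN_CHARS = 100


def keep_for_corpus(doc: str, min_chars: int = MIN_CHARS) -> bool:
    """Return True when the trimmed document has at least min_chars characters."""
    return len(doc.strip()) >= min_chars


def filter_documents(docs: Iterable[str], min_chars: int = MIN_CHARS) -> List[str]:
    """Keep only the documents that are long enough for corpus building."""
    return [doc for doc in docs if keep_for_corpus(doc, min_chars)]


# Usage: one document of 120 characters and one short fragment.
kept = filter_documents(["x" * 120, "too short"])
```

Counting characters rather than tokens keeps the check independent of any tokenizer, which matters when the inventory spans many writing systems.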
Model Development
The researchers evaluated six different approaches: n-grams (N), skip-grams (S), character embeddings (C), convolutional neural networks (CNN), recurrent neural networks (RNN), and bidirectional long short-term memory networks (BLSTM). They found that a classifier based on skip-gram embeddings significantly outperformed the other methods. They also systematically increased the number of languages in the model up to a total of 800; surprisingly, accuracy decreased only minimally as the inventory expanded, suggesting that including more non-Austronesian languages does not negatively affect predictions for Austronesian languages.
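To make the idea of character-level language identification concrete, here is a small sketch of the simplest family of approaches, a character n-gram profile classifier. It is not the paper's skip-gram embedding model: it is a plain count-based stand-in compared by cosine similarity, and the function names and toy training sentences are assumptions made for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Split lowercased, space-padded text into overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profiles(samples: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Build one n-gram count profile per language label."""
    return {
        label: Counter(g for text in texts for g in char_ngrams(text))
        for label, texts in samples.items()
    }


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def identify(text: str, profiles: Dict[str, Counter]) -> str:
    """Return the label whose profile is most similar to the text."""
    query = Counter(char_ngrams(text))
    return max(profiles, key=lambda label: cosine(query, profiles[label]))


# Toy training data (hypothetical, far too small for real use).
profiles = train_profiles({
    "haw": ["aloha kakou e na hoa", "mahalo nui loa ia oe"],
    "eng": ["the quick brown fox jumps", "this is the house that jack built"],
})
```

Calling `identify("aloha nui loa", profiles)` selects the Hawaiian profile on this toy data. A realistic system would need far more text per language and, following the result reported above, learned embeddings rather than raw counts.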
Adaptation and Compression Performance
The paper also explores adapting these language identification models for code-switching detection and achieves high accuracy across all 29 languages studied. It further evaluates model compression, comparing compressed models against their uncompressed counterparts; the results show that compressed models retain similar accuracy while using much less disk space and runtime memory.
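One way a document-level language identifier could be reused for code-switching detection is to classify short windows of tokens and report where the predicted label changes. The sketch below shows that idea only; the source does not describe the adaptation in detail, so the windowing scheme, the `switch_points` function, and the toy classifier are all assumptions.

```python
from typing import Callable, List


def switch_points(
    tokens: List[str], classify: Callable[[str], str], window: int = 3
) -> List[int]:
    """Return token indices where the predicted language changes between windows."""
    # Classify consecutive, non-overlapping windows of tokens.
    labels = [
        classify(" ".join(tokens[i:i + window]))
        for i in range(0, len(tokens), window)
    ]
    # A switch is reported at the first token of a window whose label differs
    # from the label of the previous window.
    return [i * window for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


# Hypothetical stand-in for a real language identification model.
ENGLISH_WORDS = {"the", "house", "is", "big"}


def toy_classify(chunk: str) -> str:
    """Label a chunk as English if it contains any known English word."""
    return "eng" if any(word in ENGLISH_WORDS for word in chunk.split()) else "other"


tokens = "aloha kakou e the house is big".split()
points = switch_points(tokens, toy_classify)
```

Smaller windows locate a switch more precisely but give the classifier less text to work with, which is the same tension behind the 100-character guideline for corpus documents.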
Comparison with Popular Language Identification Packages
Finally, the researchers compare their inventory of 200 languages with popular language identification packages such as Google's CLD3 and langid.py; the results indicate that their approach performs better overall than these packages, owing to its larger inventory and its use of skip-gram embeddings rather than n-grams or character embeddings alone.
Conclusion
Overall, this research contributes to improving language identification models for low-resource Pacific languages by demonstrating the effectiveness of skip-gram embeddings and exploring the impact of increasing non-Austronesian language inventories. The findings have implications for developing accurate code-switching detection systems as well.