Language Identification for Austronesian Languages
AI-generated Key Points
- Language identification models for low- and under-resourced languages in the Pacific region, particularly Austronesian languages
- Goal is to develop accurate language identification systems as part of building language resources
- Evaluation set includes 29 Austronesian languages and 171 non-Austronesian languages from eight data sources
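- Of the six language identification approaches evaluated, a classifier based on skip-gram embeddings performs significantly better than the alternatives
- Expanding the inventory of non-Austronesian languages, up to 800 languages in total, has only a minimal impact on accuracy for the Austronesian languages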
- Adapting language identification models for code-switching detection with high accuracy across all 29 languages studied
- Model performance and stability are also evaluated after compression
- Prioritizing diversity in data collection rather than reducing sample size
- Detailed overview of approach, including selection of non-Austronesian languages based on genetic classifications and previous work on digital language mapping
- Comparison of the 200-language inventory with popular language identification packages such as Google's CLD3 and langid.py
- Research contributes to improving language identification models for low-resource Pacific languages by demonstrating effectiveness of skip-gram embeddings and exploring impact of increasing non-Austronesian language inventories
Authors: Jonathan Dunn, Wikke Nijhof
Abstract: This paper provides language identification models for low- and under-resourced languages in the Pacific region with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of developing language resources. The approach taken in this paper combines 29 Austronesian languages with 171 non-Austronesian languages to create an evaluation set drawn from eight data sources. After evaluating six approaches to language identification, we find that a classifier based on skip-gram embeddings reaches a significantly higher performance than alternate methods. We then systematically increase the number of non-Austronesian languages in the model up to a total of 800 languages to evaluate whether an increased language inventory leads to less precise predictions for the Austronesian languages of interest. This evaluation finds that there is only a minimal impact on accuracy caused by increasing the inventory of non-Austronesian languages. Further experiments adapt these language identification models for code-switching detection, achieving high accuracy across all 29 languages.
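To make the task concrete, the following is a minimal sketch of a language identifier. It is not the skip-gram embedding classifier that the paper finds most accurate: as a deliberately simpler stand-in, it uses naive Bayes over character trigrams with add-one smoothing. The class name, the function names, and the six toy training sentences (English and Māori, labelled with the ISO 639-3 codes `eng` and `mri`) are assumptions chosen for illustration and are not taken from the paper's data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padding with spaces at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Toy naive Bayes classifier over character n-grams (hypothetical helper)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.totals = Counter()             # language -> total n-grams seen
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, samples):
        """Train on an iterable of (text, language_code) pairs."""
        for text, lang in samples:
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.totals[lang] += len(grams)
            self.vocab.update(grams)
        return self

    def scores(self, text):
        """Return the smoothed log-likelihood of the text under each language."""
        vocab_size = len(self.vocab) + 1  # +1 reserves mass for unseen n-grams
        grams = char_ngrams(text, self.n)
        result = {}
        for lang, lang_counts in self.counts.items():
            denom = self.totals[lang] + vocab_size
            result[lang] = sum(
                math.log((lang_counts[g] + 1) / denom) for g in grams
            )
        return result

    def predict(self, text):
        """Return the language code with the highest log-likelihood."""
        s = self.scores(text)
        return max(s, key=s.get)


# Toy training data (assumed for illustration only).
TRAIN = [
    ("The quick brown fox jumps over the lazy dog", "eng"),
    ("What is your name", "eng"),
    ("How are you today", "eng"),
    ("Kia ora koutou katoa", "mri"),
    ("Ko wai tō ingoa", "mri"),
    ("Kei te pēhea koe", "mri"),
]

model = NgramLanguageIdentifier(n=3).fit(TRAIN)
print(model.predict("Kei te pai ahau"))
print(model.predict("What is the name of your dog"))
```

The same interface extends to a larger inventory by adding more `(text, language_code)` pairs. A sketch at this scale says nothing about how accuracy changes as the inventory grows toward hundreds of languages; that is the question the paper's experiments address.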