GlossLM: Multilingual Pretraining for Low-Resource Interlinear Glossing
AI-generated Key Points
- Language documentation relies on annotated text such as interlinear glossed text (IGT), which records fine-grained morphosyntactic analysis morpheme by morpheme.
- Prior work has explored generating IGT automatically to reduce the time cost of language analysis.
- Many languages, especially those needing preservation, lack sufficient IGT data for effective model training.
- Crosslingual transfer has been proposed as a solution to address the lack of IGT data for low-resource languages.
- The authors compile the largest existing corpus of IGT data, drawn from a variety of sources and covering over 450k examples across 1.8k languages, to support research on crosslingual transfer and IGT generation.
- A large multilingual model pretrained on a portion of this corpus and then fine-tuned to specific languages is competitive with state-of-the-art methods on segmented data and large monolingual datasets.
- The model outperforms existing models on unsegmented text and small corpora by up to 6.6% in morpheme accuracy, showcasing the effectiveness of crosslingual transfer for low-resource languages.
- Annotated text supports the preservation of minority languages by serving as a basis for reference materials such as dictionaries and grammars.
- Pretrained models available through platforms like Hugging Face enhance accessibility for researchers and practitioners involved in language documentation efforts.
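
To make the morpheme-by-morpheme format concrete, here is a minimal Python sketch of how a single IGT example might be represented. The `IGTExample` class, its field names, and the Spanish sentence are illustrative assumptions, not the schema or data of the corpus described above.

```python
from dataclasses import dataclass


@dataclass
class IGTExample:
    """One interlinear glossed text example (hypothetical schema)."""
    transcription: str  # surface text in the source language
    segmentation: str   # words split into morphemes with hyphens
    glosses: str        # one gloss label per morpheme, aligned by position
    translation: str    # free translation

    def aligned_morphemes(self):
        """Pair every morpheme with its gloss label, word by word."""
        pairs = []
        for word, gloss in zip(self.segmentation.split(), self.glosses.split()):
            morphemes = word.split("-")
            labels = gloss.split("-")
            if len(morphemes) != len(labels):
                raise ValueError(f"misaligned word: {word!r} vs {gloss!r}")
            pairs.extend(zip(morphemes, labels))
        return pairs


# Illustrative example only; not taken from the corpus.
example = IGTExample(
    transcription="los gatos corren",
    segmentation="los gato-s corr-en",
    glosses="the cat-PL run-3PL",
    translation="The cats run.",
)

for morpheme, label in example.aligned_morphemes():
    print(f"{morpheme}\t{label}")
```

In this sketch, glossing unsegmented text would correspond to predicting the `glosses` line without access to the `segmentation` line, which is presumably why that setting is harder.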
Authors: Michael Ginn (University of Colorado), Lindia Tjuatja (Carnegie Mellon University), Taiqi He (Carnegie Mellon University), Enora Rice (University of Colorado), Graham Neubig (Carnegie Mellon University), Alexis Palmer (University of Colorado), Lori Levin (Carnegie Mellon University)
Abstract: A key aspect of language documentation is the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. Prior work has explored methods to automatically generate IGT in order to reduce the time cost of language analysis. However, many languages (particularly those requiring preservation) lack sufficient IGT data to train effective models, and crosslingual transfer has been proposed as a method to overcome this limitation. We compile the largest existing corpus of IGT data from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. Then, we pretrain a large multilingual model on a portion of this corpus, and further finetune it to specific languages. Our model is competitive with state-of-the-art methods for segmented data and large monolingual datasets. Meanwhile, our model outperforms SOTA models on unsegmented text and small corpora by up to 6.6% morpheme accuracy, demonstrating the effectiveness of crosslingual transfer for low-resource languages.
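
The abstract reports results in morpheme accuracy. The sketch below shows one plausible way to compute such a metric, assuming it means the fraction of gold gloss labels that the prediction matches at the same position. The function names and the splitting on spaces and hyphens are assumptions made for illustration; the paper's actual evaluation may differ.

```python
import re


def split_glosses(line):
    """Split a gloss line into individual labels on whitespace and hyphens."""
    return [label for label in re.split(r"[\s-]+", line.strip()) if label]


def morpheme_accuracy(predicted, gold):
    """Fraction of gold gloss labels matched by the prediction at the same position.

    `predicted` and `gold` are equal-length lists of gloss lines.
    Missing predicted labels count as errors; extra predicted labels are ignored.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of lines")
    correct = 0
    total = 0
    for pred_line, gold_line in zip(predicted, gold):
        pred_labels = split_glosses(pred_line)
        gold_labels = split_glosses(gold_line)
        total += len(gold_labels)
        correct += sum(1 for p, g in zip(pred_labels, gold_labels) if p == g)
    return correct / total if total else 0.0


# Hypothetical gloss lines: one of five labels is wrong (3SG instead of 3PL).
gold = ["the cat-PL run-3PL"]
predicted = ["the cat-PL run-3SG"]
print(morpheme_accuracy(predicted, gold))
```

Position-based matching is a simple design choice: it penalizes a prediction heavily when one morpheme is dropped early in a line, since every later label shifts out of alignment.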