LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
AI-generated Key Points
- Multimodal pre-training has advanced visually-rich document understanding (VrDU)
- Joint learning across different modalities has potential for achieving state-of-the-art performance on VrDU tasks
- Microsoft Research Asia and Microsoft Azure AI have introduced LayoutXLM, a multilingual pre-trained model for VrDU in different languages
- LayoutXLM is evaluated using XFUN, a new multilingual form understanding benchmark dataset with key-value labeled forms in seven languages
- Experiment results show that LayoutXLM outperforms existing cross-lingual pre-trained models on the XFUN dataset
- The pre-trained LayoutXLM model and XFUN dataset are publicly available to advance research in document understanding
- Related work discussed includes mBART, a multilingual sequence-to-sequence denoising auto-encoder built on BART, and mT5, a multilingual variant of T5
- Future research plans include enlarging the multilingual training data to cover more languages, document layouts, and templates, and investigating how to leverage business documents that share similar content across languages to improve accuracy
Authors: Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei
Abstract: Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUN, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUN dataset. The pre-trained LayoutXLM model and the XFUN dataset will be publicly available at https://aka.ms/layoutxlm.
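The abstract describes forms whose key-value pairs are manually labeled. As a rough illustration of what such labels could look like and how predicted pairs might be scored, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Entity` schema, the field names, the sample French form, and the `link_f1` helper are hypothetical. They are not the XFUN release format and not the scoring procedure used in the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Entity:
    """One labeled text segment on a form page (hypothetical schema)."""
    id: int
    text: str
    label: str                      # "key" or "value" in this sketch
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1), assumed pixel coordinates


def link_f1(gold: Iterable[Tuple[int, int]],
            pred: Iterable[Tuple[int, int]]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over sets of (key_id, value_id) links."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# A made-up French form with two key-value pairs.
entities = [
    Entity(0, "Nom", "key", (40, 60, 110, 85)),
    Entity(1, "Dupont", "value", (130, 60, 260, 85)),
    Entity(2, "Date", "key", (40, 100, 110, 125)),
    Entity(3, "12/03/2021", "value", (130, 100, 290, 125)),
]
gold_links = [(0, 1), (2, 3)]  # manually labeled key -> value pairs
pred_links = [(0, 1), (2, 1)]  # a model's output with one wrong link

precision, recall, f1 = link_f1(gold_links, pred_links)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Treating links as a set of ID pairs keeps the comparison independent of language and reading order, which is convenient when the same scoring has to run over forms in several languages.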