LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

AI-generated keywords: Visually-Rich Document Understanding, Multimodal Pre-Training, LayoutXLM, XFUN Dataset, Contrastive Learning

AI-generated Key Points

  • Multimodal pre-training has advanced visually-rich document understanding (VrDU)
  • Joint learning across different modalities has potential for achieving state-of-the-art performance on VrDU tasks
  • Microsoft Research Asia and Microsoft Azure AI have introduced LayoutXLM, a multilingual pre-trained model for VrDU in different languages
  • LayoutXLM is evaluated using XFUN, a new multilingual form understanding benchmark dataset with key-value labeled forms in seven languages
  • Experiment results show that LayoutXLM outperforms existing cross-lingual pre-trained models on the XFUN dataset
  • The pre-trained LayoutXLM model and XFUN dataset are publicly available to advance research in document understanding
  • Other related works discussed include the multilingual pre-trained models mBART, a multilingual denoising sequence-to-sequence model, and mT5, a multilingual variant of T5
  • Future research plans include enlarging the multilingual training data to cover more languages as well as more document layouts and templates, and investigating how to leverage the co-occurrence of business documents with similar content but in different languages to improve accuracy and performance.

Authors: Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei

Work in progress
License: CC BY-NC-SA 4.0

Abstract: Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUN, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUN dataset. The pre-trained LayoutXLM model and the XFUN dataset will be publicly available at https://aka.ms/layoutxlm.
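The abstract describes XFUN as a set of forms in seven languages with manually labeled key-value pairs. As a rough illustration of what such a labeled sample could look like in memory, here is a minimal Python sketch. It is not the actual XFUN schema: the class names `Entity` and `Form`, the field names, the bounding-box convention, and the language codes are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# The seven languages listed in the abstract. The two-letter codes are an
# assumption made for this sketch, not taken from the dataset.
LANGUAGES = ("zh", "ja", "es", "fr", "it", "de", "pt")


@dataclass
class Entity:
    """One labeled text segment on a form (hypothetical structure)."""
    id: int
    text: str
    box: Tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) pixel coordinates
    label: str                      # assumed to be "key" or "value"


@dataclass
class Form:
    """A form in one language with key-to-value links (hypothetical structure)."""
    lang: str
    entities: List[Entity] = field(default_factory=list)
    links: List[Tuple[int, int]] = field(default_factory=list)  # (key_id, value_id)

    def key_value_pairs(self) -> List[Tuple[str, str]]:
        """Resolve each (key_id, value_id) link to the texts of the two entities."""
        by_id = {e.id: e for e in self.entities}
        return [(by_id[k].text, by_id[v].text) for k, v in self.links]


# A made-up French sample, not taken from XFUN.
form = Form(
    lang="fr",
    entities=[
        Entity(0, "Nom", (10, 10, 60, 30), "key"),
        Entity(1, "Dupont", (70, 10, 160, 30), "value"),
    ],
    links=[(0, 1)],
)
print(form.key_value_pairs())
```

Storing the links separately from the entities keeps text, layout (the box), and the key-value relation as distinct pieces of information, which loosely mirrors the text-plus-layout view the abstract describes.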

Submitted to arXiv on 18 Apr. 2021

Results of the summarizing process for the arXiv paper: 2104.08836v1

Visually-rich document understanding (VrDU) has advanced significantly with multimodal pre-training that combines text, layout, and image information. Joint learning across these modalities has shown great potential for achieving state-of-the-art (SOTA) performance on VrDU tasks. To bridge the language barriers in VrDU, Microsoft Research Asia and Microsoft Azure AI introduce LayoutXLM, a multimodal pre-trained model for multilingual document understanding. LayoutXLM is evaluated on XFUN, a new multilingual form understanding benchmark containing forms with manually labeled key-value pairs in seven languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). Experiment results show that LayoutXLM significantly outperforms existing SOTA cross-lingual pre-trained models on the XFUN dataset. The pre-trained LayoutXLM model and the XFUN dataset are publicly available to advance research in document understanding. The paper also discusses related multilingual pre-trained models, including mBART, a multilingual denoising sequence-to-sequence model, and mT5, a multilingual variant of T5. For future research, the authors plan to enlarge the multilingual training data to cover more languages as well as more document layouts and templates. They also aim to investigate how the co-occurrence of business documents with similar content in different languages can be leveraged to improve accuracy.
Created on 24 Jun. 2023

