LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

AI-generated keywords: Visually-Rich Document Understanding, Multimodal Pre-Training, LayoutXLM, XFUN Dataset, Contrastive Learning

AI-generated Key Points

Multimodal pre-training has advanced visually-rich document understanding (VrDU)
Joint learning across different modalities has potential for achieving state-of-the-art performance on VrDU tasks
Microsoft Research Asia and Microsoft Azure AI have introduced LayoutXLM, a multilingual pre-trained model for VrDU in different languages
LayoutXLM is evaluated using XFUN, a new multilingual form understanding benchmark dataset with key-value labeled forms in seven languages
Experiment results show that LayoutXLM outperforms existing cross-lingual pre-trained models on the XFUN dataset
The pre-trained LayoutXLM model and XFUN dataset are publicly available to advance research in document understanding
Other related works discussed include mBART and mT5, which are multilingual variants of BART and T5 respectively
Future research plans include enlarging the multilingual training data to cover more languages as well as more document layouts and templates, and investigating how to leverage the co-occurrence of business documents with similar content but in different languages to improve accuracy and performance.


Authors: Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei

arXiv: 2104.08836v1 - DOI (cs.CL)

Work in progress

License: CC BY-NC-SA 4.0

Abstract: Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUN, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUN dataset. The pre-trained LayoutXLM model and the XFUN dataset will be publicly available at https://aka.ms/layoutxlm.

Submitted to arXiv on 18 Apr. 2021


Results of the summarizing process for the arXiv paper: 2104.08836v1

Comprehensive Summary

Visually-rich document understanding (VrDU) has advanced significantly with multimodal pre-training that combines text, layout, and image information. This joint learning across modalities has achieved state-of-the-art (SOTA) performance on VrDU tasks. To bridge the language barriers for VrDU, Microsoft Research Asia and Microsoft Azure AI have introduced LayoutXLM, a multimodal pre-trained model for understanding visually-rich documents in multiple languages. LayoutXLM is evaluated on XFUN, a new multilingual form understanding benchmark dataset that includes forms in seven languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese) with manually labeled key-value pairs. Experiment results show that LayoutXLM significantly outperforms existing SOTA cross-lingual pre-trained models on the XFUN dataset. The pre-trained LayoutXLM model and the XFUN dataset are publicly available to advance research in document understanding. The paper also discusses related work such as mBART and mT5, which are multilingual variants of BART and T5 respectively. For future research, the authors plan to enlarge the multilingual training data to cover more languages as well as more document layouts and templates. They also aim to investigate how to leverage the co-occurrence of business documents with similar content but in different languages to improve accuracy and performance.


Understanding Visually-Rich Documents with LayoutXLM

The field of visually-rich document understanding (VrDU) has seen significant advancements in recent times. Multimodal pre-training that incorporates text, layout, and image information has demonstrated great potential for achieving state-of-the-art (SOTA) performance on VrDU tasks. To bridge the language barriers for VrDU, Microsoft Research Asia and Microsoft Azure AI have introduced LayoutXLM, a multilingual pre-trained model that aims to accurately understand visually-rich documents in different languages.

LayoutXLM Model

LayoutXLM is evaluated on XFUN, a new multilingual form understanding benchmark dataset that includes key-value labeled forms in seven languages: Chinese, Japanese, Spanish, French, Italian, German, and Portuguese. Experiment results show that LayoutXLM significantly outperforms existing SOTA cross-lingual pre-trained models on the XFUN dataset. The pre-trained LayoutXLM model and the XFUN dataset are publicly available to advance research in document understanding.
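
To make "key-value labeled forms" more concrete, here is a minimal Python sketch of how one such form could be represented and queried. It is only an illustration: the class names, field names, label strings, and sample values are all assumptions made for this example and are not taken from the XFUN release or the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical label names; the summary only says that forms carry key-value labels.
KEY, VALUE = "key", "value"


@dataclass
class Entity:
    """One text segment on a form page (all field names are illustrative)."""
    id: int
    text: str
    box: Tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) position on the page
    label: str                      # KEY or VALUE


@dataclass
class Form:
    """A single labeled form in one language (hypothetical structure)."""
    lang: str
    entities: List[Entity] = field(default_factory=list)
    links: List[Tuple[int, int]] = field(default_factory=list)  # (key_id, value_id)

    def key_value_pairs(self) -> List[Tuple[str, str]]:
        """Return (key text, value text) for every link that joins a key to a value."""
        by_id = {e.id: e for e in self.entities}
        pairs = []
        for key_id, value_id in self.links:
            k, v = by_id[key_id], by_id[value_id]
            if k.label == KEY and v.label == VALUE:
                pairs.append((k.text, v.text))
        return pairs


# Made-up French form with two key-value pairs.
form = Form(
    lang="fr",
    entities=[
        Entity(0, "Nom", (40, 50, 90, 70), KEY),
        Entity(1, "Dupont", (110, 50, 190, 70), VALUE),
        Entity(2, "Ville", (40, 90, 95, 110), KEY),
        Entity(3, "Lyon", (110, 90, 160, 110), VALUE),
    ],
    links=[(0, 1), (2, 3)],
)
print(form.key_value_pairs())  # [('Nom', 'Dupont'), ('Ville', 'Lyon')]
```

Keeping the text, its position on the page, and the key-to-value links together reflects the general idea that form understanding depends on layout as well as language. The actual dataset files may be organized quite differently.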

Related Works

The paper also discusses related work such as mBART and mT5, which are multilingual variants of BART and T5 respectively.

Future Research

For future research, the authors plan to enlarge the multilingual training data to cover more languages as well as more document layouts and templates. They also aim to investigate how to leverage the co-occurrence of business documents with similar content but in different languages to improve accuracy and performance. Overall, the paper advances VrDU by introducing LayoutXLM, a multilingual pre-trained model, along with XFUN, a new benchmark dataset that researchers can use for further work in this field.

Created on 24 Jun. 2023


