DocFormer: End-to-End Transformer for Document Understanding

AI-generated keywords: Visual Document Understanding DocFormer multi-modal transformer pre-training cross-modality feature correlation

AI-generated Key Points

  • DocFormer has significantly advanced Visual Document Understanding (VDU) by introducing innovative pre-training methods and multi-modal interaction.
  • LayoutLM and LayoutLMv2 advanced document understanding by pre-training on large datasets and then fine-tuning on specific document tasks, while BROS pairs a BERT-based encoder with a graph-based classifier to predict entity relations.
  • Recent works like Layout-T5 and TILT have evolved towards multi-modal transformer encoder-decoder architectures based on T5 for addressing cross-modality feature correlation challenges.
  • DocFormer's unique features include incorporating text, vision, and spatial features, leveraging a multi-modal self-attention layer, shared spatial embeddings, and novel pre-training tasks without relying on bulky pre-trained object-detection networks.
  • DocFormer achieves state-of-the-art results on four datasets, sometimes beating models 4x its size in number of parameters, which the summary attributes to its emphasis on multi-modal interaction and its architectural design.

Authors: Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha

License: CC BY 4.0

Abstract: We present DocFormer -- a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in no. of parameters).
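
The shared spatial embedding idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the bucket count, the embedding size, and the choice to encode only the top-left corner of each bounding box are all assumptions made for illustration. It only shows the general principle that a text token and a visual token covering the same region receive the same spatial vector, because both modalities read from one lookup table.

```python
import random

def make_table(num_buckets, dim, seed):
    # Hypothetical stand-in for a learned embedding table: random values
    # take the place of trained weights.
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_buckets)]

def bucket(coord, num_buckets):
    # Map a normalized coordinate in [0, 1] to a table index.
    return min(int(coord * num_buckets), num_buckets - 1)

def spatial_embedding(box, table_x, table_y):
    # Assumption: only the top-left corner (x0, y0) is embedded here,
    # which is a simplification chosen to keep the sketch short.
    x0, y0, _x1, _y1 = box
    ex = table_x[bucket(x0, len(table_x))]
    ey = table_y[bucket(y0, len(table_y))]
    return [a + b for a, b in zip(ex, ey)]

def add_spatial(features, boxes, table_x, table_y):
    # Add the spatial vector of each token's box to that token's feature.
    # The same tables serve text and visual tokens, which is the sharing.
    out = []
    for feat, box in zip(features, boxes):
        spat = spatial_embedding(box, table_x, table_y)
        out.append([f + s for f, s in zip(feat, spat)])
    return out

if __name__ == "__main__":
    table_x = make_table(num_buckets=8, dim=4, seed=1)
    table_y = make_table(num_buckets=8, dim=4, seed=2)
    box = (0.10, 0.20, 0.30, 0.25)   # normalized x0, y0, x1, y1
    text_tokens = add_spatial([[1.0] * 4], [box], table_x, table_y)
    visual_tokens = add_spatial([[0.0] * 4], [box], table_x, table_y)
    # Both tokens carry the same spatial component for the same box.
    print(text_tokens[0])
    print(visual_tokens[0])
```

Because both calls to `add_spatial` use the same `table_x` and `table_y`, the two tokens differ only by their modality-specific features, which is what would let an attention layer relate them by position.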

Submitted to arXiv on 22 Jun. 2021


Results of the summarizing process for the arXiv paper: 2106.11539v1

Visual Document Understanding (VDU) is the task of comprehending documents in varied formats and layouts, such as forms and receipts. DocFormer approaches it with unsupervised pre-training on carefully designed tasks that promote multi-modal interaction. It combines text, vision, and spatial features through a multi-modal self-attention layer, which helps the model correlate text with visual tokens and vice versa.

Earlier models also advanced document understanding. LayoutLM and LayoutLMv2 emphasize pre-training on large datasets followed by fine-tuning on specific document-processing tasks. BROS uses a BERT-based encoder with a graph-based classifier to predict entity relations within documents. More recent works such as Layout-T5 and TILT have moved towards multi-modal transformer encoder-decoder architectures based on T5.

DocFormer's strength lies in its ability to address the challenge of cross-modality feature correlation, where mapping text descriptions to visual content is intricate. By introducing shared spatial embeddings and novel pre-training tasks such as learning-to-reconstruct and multi-modal masked language modeling, DocFormer stands out as a pioneering approach in VDU that does not rely on bulky pre-trained object-detection networks for visual feature extraction. In summary, DocFormer's strength is evident in its state-of-the-art performance on diverse datasets compared to larger models. Its emphasis on multi-modal interaction and its architectural design place it at the forefront of VDU research.
Created on 26 Mar. 2024

