TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

AI-generated keywords: Text recognition, Transformer, Pre-training, Synthetic data, OCR

AI-generated Key Points

  • Text recognition has been a challenge in document digitalization
  • TrOCR is an end-to-end text recognition approach that uses pre-trained image Transformer and text Transformer models
  • To build a dataset for TrOCR's pre-training phase, researchers sampled 2 million document pages from publicly available PDF files and synthesized handwritten textline images using TRDG
  • TrOCR's pipeline involves extracting visual features from input textline images and predicting wordpiece tokens conditioned on the tokens generated so far
  • The TrOCR model outperformed the state-of-the-art models of the time in both printed and handwritten text recognition tasks, according to experiments on the SROIE dataset and the IAM Handwriting Database

Authors: Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei

Work in Progress
License: CC BY-NC-SA 4.0

Abstract: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. The code and models will be publicly available at https://aka.ms/TrOCR.

Submitted to arXiv on 21 Sep. 2021

Results of the summarizing process for the arXiv paper: 2109.10282v1

Text recognition has long been a challenge in document digitalization. To address it, the authors propose TrOCR, an end-to-end text recognition approach that uses a pre-trained image Transformer for image understanding and a pre-trained text Transformer for wordpiece-level text generation.

To build a large-scale, high-quality dataset for pre-training, the researchers sampled two million document pages from publicly available PDF files on the internet, converted them into page images, and extracted clean printed textline images from them. This produced a first-stage pre-training dataset of 684 million textlines. For the second stage, they used 5,427 handwritten fonts to synthesize handwritten textline images with TRDG, an open-source text recognition data generator. The second-stage pre-training data consisted of 17.9 million handwritten textlines and 3.3 million printed ones. For the printed portion, the researchers also collected around 53K real-world receipt images and recognized their text with commercial OCR engines; the textlines were then cropped using the recognized coordinates and rectified into normalized images, and supplemented with printed textlines synthesized by TRDG.

TrOCR's pipeline extracts visual features from the input textline image and predicts wordpiece tokens conditioned on those features and on the tokens generated so far. During training, the ground-truth token sequence is followed by an "[EOS]" token marking the end of the sentence, and the decoder input is the same sequence shifted by one position with a "[BOS]" token at the start. During inference, the decoder starts from "[BOS]" and predicts the output iteratively, feeding each newly generated token back in as its next input.

In experiments on the SROIE (Scanned Receipts OCR and Information Extraction) dataset and the IAM Handwriting Database, the TrOCR model outperformed the state-of-the-art models of the time on both printed and handwritten text recognition.
Created on 15 Jun. 2023
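
The token-by-token decoding described in the summary can be illustrated with a small sketch. It is not the authors' code: it assumes a separate "[BOS]" start token (a common convention for encoder-decoder models), and the names make_training_pair, greedy_decode and toy_next_token are hypothetical. The toy_next_token function stands in for the real Transformer decoder, which would also attend to the visual features of the textline image.

```python
from typing import Callable, List, Sequence, Tuple

BOS, EOS = "[BOS]", "[EOS]"  # assumed special tokens for start and end of sequence


def make_training_pair(tokens: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Teacher forcing: the decoder input is the target shifted by one position."""
    decoder_input = [BOS] + list(tokens)
    target = list(tokens) + [EOS]
    return decoder_input, target


def greedy_decode(next_token: Callable[[List[str]], str], max_len: int = 32) -> List[str]:
    """Start from BOS and feed each predicted token back in until EOS or max_len."""
    generated = [BOS]
    while len(generated) <= max_len:
        token = next_token(generated)
        if token == EOS:
            break
        generated.append(token)
    return generated[1:]  # drop the BOS marker


def toy_next_token(prefix: List[str]) -> str:
    """Hypothetical stand-in for the decoder: replays a fixed wordpiece script."""
    script = ["Tr", "##OC", "##R", EOS]
    return script[len(prefix) - 1]


if __name__ == "__main__":
    print(make_training_pair(["Tr", "##OC", "##R"]))
    print(greedy_decode(toy_next_token))
```

The max_len cap is a design choice in this sketch: it guarantees the loop terminates even if the stand-in model never emits "[EOS]".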
