TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

AI-generated keywords: Text recognition, Transformer, Pre-training, Synthetic data, OCR

AI-generated Key Points

Text recognition is a long-standing challenge in document digitalization
TrOCR is an end-to-end text recognition approach that combines a pre-trained image Transformer with a pre-trained text Transformer
To build the pre-training data, researchers sampled 2 million document pages from publicly available PDF files and synthesized handwritten textline images using TRDG
TrOCR's pipeline extracts visual features from an input textline image and predicts wordpiece tokens one at a time, each conditioned on the image and on the tokens generated before it
In experiments on the SROIE dataset and the IAM Handwriting Database, TrOCR outperformed the state-of-the-art models it was compared against on both printed and handwritten text recognition

Authors: Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei

arXiv: 2109.10282v1 (cs.CL)

Work in Progress

License: CC BY-NC-SA 4.0

Abstract: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. The code and models will be publicly available at https://aka.ms/TrOCR.

Submitted to arXiv on 21 Sep. 2021

Results of the summarizing process for the arXiv paper: 2109.10282v1

Comprehensive Summary

Text recognition is a long-standing challenge in document digitalization. To address it, the authors propose TrOCR, an end-to-end text recognition approach that uses a pre-trained image Transformer for image understanding and a pre-trained text Transformer for wordpiece-level text generation.

To build a large-scale, high-quality dataset for the first stage of pre-training, the researchers sampled two million document pages from publicly available PDF files on the internet and converted them into page images, from which they extracted clean printed textline images. This yielded a first-stage pre-training dataset of 684 million textlines. For the second stage, they used 5,427 handwritten fonts to synthesize handwritten textline images with TRDG, an open-source text image generator. The second-stage pre-training data consisted of 17.9 million handwritten textlines and 3.3 million printed ones. Besides synthetic data, the printed set draws on around 53K real-world receipt images whose text was recognized with commercial OCR engines. Based on those OCR results, the textlines were cropped by their coordinates and rectified into normalized images. Additional printed textlines were synthesized with TRDG using printed and receipt-style fonts.

TrOCR's pipeline extracts visual features from an input textline image and predicts wordpiece tokens one at a time, each conditioned on the image and on the tokens generated before it. During training, the ground-truth tokens are followed by an "[EOS]" token that marks the end of the sentence. During inference, the decoder starts from this same token and generates the output iteratively, feeding each newly predicted token back in as its next input.

In experiments on the SROIE (Scanned Receipts OCR and Information Extraction) dataset and the IAM Handwriting Database, TrOCR outperformed the state-of-the-art models it was compared against on both printed and handwritten text recognition.

Summary: TrOCR is a computer program that can read words from pictures of writing. Earlier programs found this hard to do well, but TrOCR uses special models to do it better. The people who made TrOCR used lots of pictures of writing to teach the program how to read different styles. When someone puts a picture of a line of writing into TrOCR, it looks at the picture and works out the words one piece at a time, using the picture and the pieces it has already written.

Definitions
- Text recognition: The ability of a computer program to "read" words from an image or document.
- Transformer models: A type of artificial intelligence model that helps computers understand sequences such as sentences, for example by predicting the next word from its context.
- Dataset: A collection of data used for training or testing machine learning models.
- Handwritten textline images: Pictures of handwriting, usually one line at a time.
- State-of-the-art models: The best and most advanced computer programs currently available for a particular task.

Exploring the Potential of TrOCR: An End-to-End Text Recognition Approach

Building a Large-Scale, High-Quality Dataset for Pre-Training

To build a large-scale, high-quality dataset for the first stage of pre-training, the researchers sampled two million document pages from publicly available PDF files on the internet and converted them into page images, from which they extracted clean printed textline images. This yielded a first-stage pre-training dataset of 684 million textlines. For the second stage, they used 5,427 handwritten fonts to synthesize handwritten textline images with an open-source data generator called TRDG (TextRecognitionDataGenerator). The second-stage pre-training data consisted of 17.9 million handwritten textlines and 3.3 million printed ones. Besides synthetic data, the printed set draws on around 53K real-world receipt images whose text was recognized with commercial OCR engines. Based on those OCR results, the textlines were cropped by their coordinates and rectified into normalized images. Additional printed textlines were synthesized with TRDG using printed and receipt-style fonts.
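As a rough illustration of the synthesis step, the sketch below samples (text, font) pairs that a renderer such as TRDG could then turn into textline images. It is not the authors' pipeline: the `RenderJob` class, the `plan_render_jobs` function, the sample text lines, and the font file names are all hypothetical, and the rendering itself is left out.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RenderJob:
    """One synthetic textline to render: a line of text and the font to draw it in."""
    text: str
    font: str


def plan_render_jobs(lines: Sequence[str], fonts: Sequence[str],
                     count: int, seed: int = 0) -> List[RenderJob]:
    """Sample `count` (text, font) pairs uniformly at random, reproducibly."""
    if not lines or not fonts:
        raise ValueError("need at least one text line and one font")
    rng = random.Random(seed)  # a fixed seed makes the plan repeatable
    return [RenderJob(rng.choice(lines), rng.choice(fonts)) for _ in range(count)]


# Hypothetical inputs for illustration only.
lines = ["Total amount due", "Thank you for your visit", "Invoice no. 1042"]
fonts = ["handwriting_a.ttf", "handwriting_b.ttf"]

jobs = plan_render_jobs(lines, fonts, count=5, seed=42)
for job in jobs:
    print(f"{job.font}: {job.text}")
```

Separating the sampling plan from the rendering keeps the dataset reproducible: the same seed gives the same list of jobs, whichever renderer draws the images.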

TrOCR Pipeline

TrOCR's pipeline extracts visual features from an input textline image and predicts wordpiece tokens one at a time, each conditioned on the image and on the tokens generated before it. During training, the ground-truth tokens are followed by an "[EOS]" token that marks the end of the sentence. During inference, the decoder starts from this same token and generates the output iteratively, feeding each newly predicted token back in as its next input.

Experimental Results

TrOCR was compared with state-of-the-art models on printed and handwritten text recognition using two datasets. For printed text, the paper uses the text recognition task (Task 2) of the SROIE (Scanned Receipts OCR and Information Extraction) dataset, which has 626 training and 361 test receipt images. For handwritten text, it uses the IAM Handwriting Database with the Aachen partition: 6,161 training lines, 966 validation lines, and 2,915 test lines. SROIE results are reported as word-level precision, recall, and F1, and IAM results as character error rate (CER), which is derived from the Levenshtein edit distance between the prediction and the reference. According to the paper, TrOCR outperformed the compared methods on both benchmarks, which suggests it is an effective end-to-end approach for both printed and handwritten text. TrOCR operates on textline images that have already been cropped, and it is fine-tuned on human-labeled data, so it does not remove the need for labeling or text detection.
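For readers unfamiliar with these metrics, here is a minimal sketch of Levenshtein edit distance and the error rates built on it. It is a generic textbook formulation rather than the evaluation script used in the paper, and the function names and sample strings are illustrative. Word error rate (WER) is included as the word-level analogue of CER.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: the same ratio computed over whitespace-split words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(edit_distance("kitten", "sitting"))
print(cer("total", "tota1"))
print(wer("total amount due", "total amount duo"))
```

Both rates are normalized by the length of the reference, so they can exceed 1.0 when the prediction contains many spurious insertions.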

Created on 15 Jun. 2023
