DeepSeek-OCR: Contexts Optical Compression

AI-generated keywords: DeepSeek-OCR

AI-generated Key Points

DeepSeek-OCR's deep parsing abilities allow it to analyze images within documents through secondary model calls
The model can extract structured information from various types of images such as charts, natural images, chemical formulas, and geometric figures with just one unified prompt
DeepSeek-OCR showcases versatility by performing deep parsing on financial charts, natural images, chemical formulas, and planar geometric figures
It demonstrates impressive multilingual recognition proficiency by handling nearly 100 languages present in PDF documents on the internet
The adaptability of DeepSeek-OCR to different languages emphasizes its utility in processing multilingual data for LLM/VLM pretraining
Its practical performance is highlighted by generating training data at scale for LLMs/VLMs with high OCR accuracy compared to existing models like GOT-OCR2.0 and MinerU2.0
Overall, DeepSeek-OCR's capabilities make it a promising tool for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs


Authors: Haoran Wei, Yaofeng Sun, Yukun Li

arXiv: 2510.18234v1 (cs.CV)

License: CC BY 4.0

Abstract: We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
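The compression ratio used in the abstract is the number of text tokens divided by the number of vision tokens. A minimal Python sketch of that arithmetic follows. The function names and the example token counts are illustrative assumptions, and the regime labels only restate the two data points the abstract reports; they say nothing about accuracy between or beyond those points.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def reported_regime(ratio: float) -> str:
    """Map a ratio onto the two data points quoted in the abstract."""
    if ratio < 10:
        return "below 10x: about 97% decoding precision reported"
    if ratio <= 20:
        return "10x to 20x: accuracy drops, about 60% reported at 20x"
    return "above 20x: not covered by the abstract"


# Hypothetical page: 900 text tokens encoded into 100 vision tokens.
ratio = compression_ratio(900, 100)
print(f"{ratio:.1f}x ->", reported_regime(ratio))
```

Keeping the ratio and the label in separate functions lets the thresholds be changed without touching the arithmetic.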

Submitted to arXiv on 21 Oct. 2025


Results of the summarizing process for the arXiv paper: 2510.18234v1

Comprehensive Summary

A recent study explored the deep parsing abilities of DeepSeek-OCR. This feature, known as "deep parsing," lets the model analyze images within documents through secondary model calls and extract structured information from charts, natural images, chemical formulas, and geometric figures with a single unified prompt. The model's versatility is shown on financial charts, natural images in books and articles, chemical formulas in STEM documents, and planar geometric figures, which points to applications in diverse fields. DeepSeek-OCR also handles nearly 100 languages present in PDF documents on the internet and supports both layout and non-layout OCR formats for languages such as Arabic and Sinhala, which makes it useful for processing multilingual data for LLM/VLM pretraining. In practice, it can generate training data at scale for LLMs/VLMs: on OmniDocBench it achieves high OCR accuracy with fewer vision tokens than existing models such as GOT-OCR2.0 and MinerU2.0. Overall, DeepSeek-OCR's deep parsing capabilities, multilingual recognition proficiency, and practical performance make it a promising tool for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Its publicly available code and model weights further enhance its value for researchers seeking advanced optical mapping solutions.


Layman's Summary

DeepSeek-OCR is a smart tool that can look at pictures in documents and understand them really well. It can find important information in different types of images, like charts, pictures, formulas, and shapes, using just one command. It understands many of the languages found in online documents, and it can help other computer programs get better by giving them lots of examples to learn from.

Definitions

- DeepSeek-OCR: A smart tool that can analyze images in documents.
- Parsing: Understanding and extracting information from data.
- Versatility: Being able to do many different things well.
- Multilingual: Able to work with multiple languages.
- Adaptability: Being able to change or adjust easily.
- OCR (Optical Character Recognition): Technology that recognizes text within images.
- LLM/VLM pretraining: The first, large-scale training stage of large language models (LLMs) and vision-language models (VLMs).

Introduction

Deep learning has revolutionized the field of optical character recognition (OCR) by providing more accurate and efficient solutions for extracting text from images. However, traditional OCR models often struggle with complex documents that contain a mix of text, charts, and other visual elements. This is where DeepSeek-OCR comes in - a model that goes beyond traditional OCR capabilities by incorporating deep parsing to extract structured information from various types of images within documents.

The Capabilities of DeepSeek-OCR

DeepSeek-OCR stands out for its ability to perform deep parsing on different types of images, such as financial charts, natural images, chemical formulas, and geometric figures. This makes it a versatile tool with potential applications in diverse fields.

One area where DeepSeek-OCR shines is financial data. Financial charts can be challenging for traditional OCR models because of their complex layout. DeepSeek-OCR extracts structured data from these charts through secondary model calls, which is particularly useful for researchers working with large amounts of financial data.

DeepSeek-OCR also processes natural images found in books and articles. With just one unified prompt, the model can analyze these images and extract relevant information. This capability is especially valuable for researchers working with historical texts or scientific articles that contain numerous visual elements.

Another notable capability is recognizing chemical formulas in STEM documents. These formulas are often difficult for traditional OCR models to decipher because of their unique symbols and formatting. DeepSeek-OCR's deep parsing handles them with the same unified prompt.

Lastly, DeepSeek-OCR's adaptability extends beyond English-language documents: it supports nearly 100 languages present in PDF files on the internet. This multilingual recognition proficiency makes it well suited to preparing training data for Large Language Models (LLMs), as it can handle both layout and non-layout OCR formats for languages like Arabic and Sinhala.
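The deep parsing flow described in this section (a first pass over the page, then a secondary model call on each embedded image using one shared prompt) can be sketched in Python as below. This is only an illustration of the control flow: the `Region` type, the `deep_parse` function, the prompt string, and the stand-in `fake_parse_image` are all hypothetical and are not the DeepSeek-OCR API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    kind: str     # "text" or "image" (hypothetical labels)
    payload: str  # recognized text, or a reference to the cropped image


def deep_parse(regions: List[Region],
               parse_image: Callable[[str, str], str],
               prompt: str) -> List[str]:
    """Keep text as-is; send every image region to a second model call."""
    results = []
    for region in regions:
        if region.kind == "image":
            # Secondary call: the same prompt is reused for every image type.
            results.append(parse_image(region.payload, prompt))
        else:
            results.append(region.payload)
    return results


def fake_parse_image(image_ref: str, prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return f"[structured output for {image_ref}]"


UNIFIED_PROMPT = "Parse the figure."  # hypothetical prompt text
page = [Region("text", "Quarterly revenue rose."),
        Region("image", "chart_1.png")]
print(deep_parse(page, fake_parse_image, UNIFIED_PROMPT))
```

Passing the model call in as a parameter keeps the sketch independent of any particular inference library.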

Practical Performance of DeepSeek-OCR

In addition to its capabilities, DeepSeek-OCR demonstrates practical performance in generating training data at scale for LLMs/VLMs. It achieves high OCR accuracy with few vision tokens on OmniDocBench, a benchmark for evaluating document image analysis systems: according to the abstract, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens and outperforms MinerU2.0 (6000+ tokens per page on average) while using fewer than 800 vision tokens. In production, it can generate training data at a scale of 200k+ pages per day on a single A100-40G. This matters for researchers working with large language models, which require vast amounts of training data. The underlying compression approach is also relevant to research areas such as historical long-context compression and memory forgetting mechanisms in LLMs.
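To put the efficiency claim in numbers, the following Python sketch works through the figures quoted in the abstract (256 vs. 100 tokens per page, 6000+ vs. fewer than 800 tokens per page, and 200k+ pages per day). The constant and function names are illustrative assumptions. Because the abstract gives "6000+" and "fewer than 800" as bounds, the MinerU2.0 factor and the throughput are lower bounds rather than exact values.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Tokens per page, as quoted in the abstract.
GOT_OCR2_TOKENS = 256
DEEPSEEK_TOKENS_VS_GOT = 100
MINERU2_TOKENS_MIN = 6000            # "6000+" on average
DEEPSEEK_TOKENS_VS_MINERU_MAX = 800  # "fewer than 800"

# Production throughput on a single A100-40G, as quoted in the abstract.
PAGES_PER_DAY_MIN = 200_000          # "200k+"


def reduction_factor(baseline_tokens: int, model_tokens: int) -> float:
    """How many times fewer tokens the model uses than the baseline."""
    return baseline_tokens / model_tokens


got_factor = reduction_factor(GOT_OCR2_TOKENS, DEEPSEEK_TOKENS_VS_GOT)
mineru_factor_min = reduction_factor(MINERU2_TOKENS_MIN,
                                     DEEPSEEK_TOKENS_VS_MINERU_MAX)
pages_per_second_min = PAGES_PER_DAY_MIN / SECONDS_PER_DAY

print(f"vs GOT-OCR2.0: {got_factor:.2f}x fewer tokens")
print(f"vs MinerU2.0: more than {mineru_factor_min:.1f}x fewer tokens")
print(f"throughput: at least {pages_per_second_min:.2f} pages per second")
```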

Conclusion

DeepSeek-OCR's deep parsing capabilities, multilingual recognition proficiency, and practical performance make it a promising tool for research areas that require advanced optical mapping solutions. Its versatility allows it to handle different types of images within documents, making it suitable for use in diverse fields such as finance, literature, and science. Moreover, DeepSeek-OCR's publicly available code and model weights further enhance its value for researchers seeking efficient OCR solutions. As the technology continues to advance, DeepSeek-OCR's approach to OCR may play an important role in shaping the future of document analysis.

Created on 23 Oct. 2025

