Back to Basics: Let Denoising Generative Models Denoise

AI-generated keywords: Denoising Generative Models; Predicting Clean Data; Low-Dimensional Manifold; Just Image Transformers (JiT); High-Dimensional Spaces

AI-generated Key Points

Authors Tianhong Li and Kaiming He address limitations of current denoising diffusion models
Current models focus on predicting noise or a noised quantity, not clean images
Distinguishing between noise and natural data is crucial in generative modeling tasks
Proposed approach by Li and He involves models directly predicting clean data
Simple large-patch Transformers are used as strong generative models without tokenizers or pre-training
Approach, called "Just image Transformers" (JiT), achieves competitive results on ImageNet at resolutions of 256 and 512 with patch sizes of 16 and 32
The networks map noisy inputs back onto the low-dimensional data manifold, giving a self-contained paradigm for Transformer-based diffusion on raw natural data
JiT relies solely on plain Transformers, with no tokenizer, no pre-training, and no extra loss, and the same design is applied at both resolutions


Authors: Tianhong Li, Kaiming He

arXiv: 2511.13720v1 (cs.CV)

Tech report. Code at https://github.com/LTH14/JiT

License: CC BY 4.0

Abstract: Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.

Submitted to arXiv on 17 Nov. 2025



In their paper "Back to Basics: Let Denoising Generative Models Denoise," Tianhong Li and Kaiming He point out that today's denoising diffusion models do not denoise in the classical sense: the networks predict noise or a noised quantity rather than the clean image. The authors argue that predicting clean data and predicting noised quantities are fundamentally different tasks. Under the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not.

On this basis, Li and He advocate models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. They show that simple large-patch Transformers on pixels can be strong generative models with no tokenizer, no pre-training, and no extra loss. This approach, dubbed "Just image Transformers" (JiT), yields competitive results on ImageNet at resolutions of 256 and 512 using patch sizes of 16 and 32, a regime in which predicting high-dimensional noised quantities can fail catastrophically. The paper also provides qualitative results for JiT-H/32 on ImageNet 512×512 images and compares its performance with previous methods.

In conclusion, the study shows that the choice of prediction target matters in diffusion-based generative modeling. By having the network map noisy inputs back to the data manifold, the JiT framework offers a self-contained paradigm for Transformer-based diffusion on raw natural data.


Summary

- Tianhong Li and Kaiming He examine a quirk of today's diffusion models, which create images by gradually removing noise.
- These models usually guess the noise that was added (or a mix of noise and image), not what the clean image should be.
- Natural images are highly structured and noise is not, so the two are very different things to predict.
- Li and He suggest having the model directly predict what the clean image should be.
- They use simple large-patch Transformers on raw pixels as powerful image generators, without extra steps like tokenizers or pre-training.

Definitions

- Denoising diffusion models: Generative models that create new images by starting from noise and removing it step by step.
- Generative modeling: Creating new data, such as images, based on patterns learned from existing examples.
- Transformers: A type of machine learning model that processes sequences of data and uses attention to weigh how the different parts relate to one another.

Introduction

Generative modeling is an active research area in machine learning, with applications ranging from image generation to natural language processing. Denoising diffusion models are among its most prominent tools, yet they do not "denoise" in the classical sense: the network predicts noise or a noised quantity rather than the clean image. In their paper "Back to Basics: Let Denoising Generative Models Denoise," authors Tianhong Li and Kaiming He argue that this choice of prediction target matters and propose that models directly predict clean data. Their approach, called "Just image Transformers" (JiT), reports competitive results for image generation without the need for tokenizers, pre-training, or extra loss functions.

The Importance of Distinguishing Noise and Natural Data

Li and He argue that predicting clean data and predicting noised quantities are fundamentally different tasks. Their reasoning rests on the manifold assumption: natural data should lie on a low-dimensional manifold inside the high-dimensional pixel space, whereas noise and noised quantities do not. A network that predicts clean data therefore only has to capture that low-dimensional structure, which is why apparently under-capacity networks can still operate effectively in very high-dimensional spaces. By contrast, the authors report that predicting high-dimensional noised quantities can fail catastrophically.
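The difference between the prediction targets can be made concrete with a small sketch. It assumes a linear noising schedule, z_t = t·x + (1 − t)·ε, and treats the velocity v = x − ε as an example of a "noised quantity"; both are illustrative assumptions, not details taken from the abstract, and all function names are hypothetical. Under these assumptions the targets are algebraically interchangeable once the noisy sample and the time step are known, so the choice is about what the network is asked to output, not about what information is available.

```python
import random

def make_noisy(x, eps, t):
    """Assumed linear schedule: z_t = t * x + (1 - t) * eps, with 0 <= t < 1."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]

def eps_from_x(z, x_pred, t):
    """Recover the noise implied by a clean-data prediction."""
    return [(zi - t * xi) / (1.0 - t) for zi, xi in zip(z, x_pred)]

def v_from_x(z, x_pred, t):
    """Recover the velocity v = x - eps implied by a clean-data prediction."""
    return [(xi - zi) / (1.0 - t) for zi, xi in zip(z, x_pred)]

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(8)]   # stand-in for clean pixels
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]    # Gaussian noise
t = 0.3
z = make_noisy(x, eps, t)

# With a perfect clean-data prediction, the other targets follow exactly.
eps_rec = eps_from_x(z, x, t)
v_rec = v_from_x(z, x, t)
max_eps_err = max(abs(a - b) for a, b in zip(eps_rec, eps))
max_v_err = max(abs(v - (xi - ei)) for v, xi, ei in zip(v_rec, x, eps))
print(max_eps_err < 1e-9, max_v_err < 1e-9)
```

Note that the noise and velocity targets contain the full-dimensional noise term, while the clean-data target does not, which is where the manifold argument above applies.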

The JiT Framework

The proposed JiT framework applies simple large-patch Transformers directly to pixels, with no tokenizer, no pre-training, and no extra loss functions. Because the patch size grows with the resolution (16 at 256, 32 at 512), the number of tokens need not grow with image size, which keeps the approach scalable across resolutions. Each token, however, becomes a very high-dimensional vector of raw pixel values. Having the network predict clean data, which the manifold assumption places on a low-dimensional manifold, is what lets a plain Transformer handle these high-dimensional inputs, and it gives a self-contained paradigm for Transformer-based diffusion on raw natural data.
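To see how large the patches become, the following sketch computes the token count and per-token dimension for the two settings named above, and splits a tiny image into non-overlapping patches. It assumes square RGB images and a simple row-major flattening of each patch; the helper names `patch_stats` and `patchify` are hypothetical and are not taken from the JiT code base.

```python
def patch_stats(resolution, patch_size, channels=3):
    """Return (number of tokens, values per token) for a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    per_side = resolution // patch_size
    return per_side * per_side, patch_size * patch_size * channels

def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = [
                value
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
                for value in image[row][col]
            ]
            tokens.append(token)
    return tokens

for res, p in [(256, 16), (512, 32)]:
    n_tokens, dim = patch_stats(res, p)
    print(f"{res}x{res}, patch {p}: {n_tokens} tokens of dimension {dim}")
# 256x256, patch 16: 256 tokens of dimension 768
# 512x512, patch 32: 256 tokens of dimension 3072

# Tiny 4x4 single-channel example: pixel value = row * 4 + col.
tiny = [[[r * 4 + c] for c in range(4)] for r in range(4)]
tiny_tokens = patchify(tiny, 2)
print(tiny_tokens[0])  # [0, 1, 4, 5]
```

Both settings yield the same sequence length, but at resolution 512 each token holds 3072 raw pixel values, which is the high-dimensional regime the paper is concerned with.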

Results and Comparison

Li and He report competitive results on ImageNet at resolutions of 256 and 512 using patch sizes of 16 and 32, and they compare JiT with previous methods. In these same settings, predicting high-dimensional noised quantities can fail catastrophically. The paper also provides qualitative results for JiT-H/32 on ImageNet 512×512 images.

Conclusion

In conclusion, Li and He's research shows that the choice of prediction target matters in denoising generative models. By directly predicting clean data instead of noise or noised quantities, their JiT framework lets plain Transformers operate on raw pixels in very high-dimensional spaces, with no tokenizer, no pre-training, and no extra loss. The work goes back to basics and offers a self-contained paradigm for Transformer-based diffusion on raw natural data.

Created on 26 Feb. 2026

