Back to Basics: Let Denoising Generative Models Denoise

AI-generated keywords: Denoising Generative Models; Predicting Clean Data; Low-Dimensional Manifold; Just Image Transformers (JiT); High-Dimensional Spaces

AI-generated Key Points

  • Authors Tianhong Li and Kaiming He address limitations of current denoising diffusion models
  • Current models predict noise or a noised quantity rather than clean images
  • Under the manifold assumption, natural data lie on a low-dimensional manifold, whereas noised quantities do not
  • Li and He advocate models that directly predict clean data, which lets apparently under-capacity networks operate in very high-dimensional spaces
  • Simple large-patch Transformers on pixels serve as strong generative models with no tokenizer, no pre-training, and no extra loss
  • The approach, called "Just image Transformers" (JiT), achieves competitive results on ImageNet at resolutions of 256 and 512 with patch sizes of 16 and 32
  • The research goes back to basics: the networks map back to the data manifold, giving a self-contained paradigm for Transformer-based diffusion on raw natural data
  • The JiT framework is driven solely by plain Transformers without additional losses or pre-training, making it computationally efficient and scalable across different resolutions
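
The patch sizes and resolutions in the key points imply a simple piece of arithmetic, sketched below. It assumes RGB images (3 channels), non-overlapping square patches, and the pairing of patch size 16 with resolution 256 and patch size 32 with resolution 512; the helper name `patch_stats` is hypothetical and does not come from the JiT code.

```python
def patch_stats(resolution: int, patch_size: int, channels: int = 3):
    """Return (number of tokens, raw values per token) for a square image.

    Assumes non-overlapping square patches taken directly from pixels.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    patches_per_side = resolution // patch_size
    num_tokens = patches_per_side * patches_per_side
    token_dim = patch_size * patch_size * channels
    return num_tokens, token_dim


for resolution, patch_size in [(256, 16), (512, 32)]:
    tokens, dim = patch_stats(resolution, patch_size)
    print(f"{resolution}x{resolution}, patch {patch_size}: "
          f"{tokens} tokens, {dim} values per token")
```

Under these assumptions both settings give the same sequence length (256 tokens), while each token grows from 768 to 3072 raw pixel values. That growth is the sense in which larger patches put the network in a higher-dimensional per-token space.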

Authors: Tianhong Li, Kaiming He

Tech report. Code at https://github.com/LTH14/JiT
License: CC BY 4.0

Abstract: Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
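
The difference between predicting clean data and predicting a noised quantity can be made concrete with a small NumPy sketch. It assumes a linear interpolation schedule, z_t = t·x + (1 − t)·ε, with velocity v = x − ε. This convention and all function names are illustrative assumptions that the abstract does not specify. Under this schedule, a clean-data prediction can be converted algebraically into the corresponding noise or velocity prediction.

```python
import numpy as np


def make_noisy(x, eps, t):
    """Assumed schedule: z_t = t * x + (1 - t) * eps, with 0 <= t < 1."""
    return t * x + (1.0 - t) * eps


def eps_from_x(z_t, x_pred, t):
    """Noise implied by a clean-data prediction under the assumed schedule."""
    return (z_t - t * x_pred) / (1.0 - t)


def v_from_x(z_t, x_pred, t):
    """Velocity (x - eps) implied by a clean-data prediction."""
    return (x_pred - z_t) / (1.0 - t)


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(4, 768))   # stand-in for clean patch tokens
eps = rng.standard_normal(x.shape)          # Gaussian noise
t = 0.3
z_t = make_noisy(x, eps, t)

# With a perfect clean-data prediction, the conversions recover eps and v.
print(np.allclose(eps_from_x(z_t, x, t), eps))
print(np.allclose(v_from_x(z_t, x, t), x - eps))
```

The conversions show that the three targets carry the same information once z_t and t are known. The abstract's argument concerns which quantity the network itself is asked to output.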

Submitted to arXiv on 17 Nov. 2025


Results of the summarizing process for the arXiv paper: 2511.13720v1

In their paper "Back to Basics: Let Denoising Generative Models Denoise," Tianhong Li and Kaiming He address a limitation of current denoising diffusion models: the networks do not directly predict clean images but instead predict noise or a noised quantity. The authors argue that predicting clean data and predicting noised quantities are fundamentally different tasks. Under the manifold assumption, natural data lie on a low-dimensional manifold, whereas noised quantities do not.

On this basis, Li and He advocate models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. They show that simple large-patch Transformers on pixels can be strong generative models without tokenizers, pre-training, or extra loss functions. This approach, dubbed "Just image Transformers" (JiT), yields competitive results on ImageNet at resolutions of 256 and 512 using patch sizes of 16 and 32, a setting in which predicting high-dimensional noised quantities can fail catastrophically. The paper also provides qualitative results for JiT-H/32 on ImageNet 512×512 images and compares its performance with previous methods.

By returning to the manifold assumption, the research offers a self-contained paradigm for Transformer-based diffusion on raw natural data. The authors underscore that their approach is driven solely by plain Transformers without additional losses or pre-training, making it computationally efficient and scalable across different resolutions. In conclusion, the study highlights that noised quantities and natural data should be treated differently in generative modeling. By relying on basic principles and avoiding unnecessary complexity, the JiT framework offers a promising direction for advancing diffusion models in high-dimensional spaces.
Created on 26 Feb. 2026
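
The manifold argument in the summary can be illustrated with a toy experiment. The sketch below uses a linear subspace as a stand-in for a low-dimensional manifold. This is a strong simplification chosen for illustration, and the dimensions are arbitrary assumptions. It compares the matrix rank of clean samples, pure Gaussian noise, and a mixture of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
ambient_dim, latent_dim, num_samples = 512, 8, 200

# Toy "clean data": points on an 8-dimensional linear subspace of R^512.
basis = rng.standard_normal((latent_dim, ambient_dim))
clean = rng.standard_normal((num_samples, latent_dim)) @ basis

# Gaussian noise has no such structure and spreads over all directions.
noise = rng.standard_normal((num_samples, ambient_dim))

# A noised quantity (here an equal mix) inherits the spread of the noise.
noised = 0.5 * clean + 0.5 * noise

clean_rank = np.linalg.matrix_rank(clean)
noise_rank = np.linalg.matrix_rank(noise)
noised_rank = np.linalg.matrix_rank(noised)
print(clean_rank, noise_rank, noised_rank)
```

In this toy setting the clean samples span only `latent_dim` directions, while the noise and the noised mixture span as many directions as there are samples. Real image manifolds are nonlinear, so this shows only the intuition and does not reproduce the paper's analysis.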

