How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion

AI-generated keywords: Generative models

AI-generated Key Points

  • Availability of large-scale, high-quality datasets is crucial for the success of generative models like Image Diffusion Models (DMs) and Large Language Models (LLMs)
  • Obtaining clean data can be challenging in fields such as Magnetic Resonance Imaging (MRI) and black-hole imaging due to time limitations or physical impossibilities
  • Scarcity of high-quality data has led to exploration of training generative models using corrupted data like blurry or noisy images
  • Study involving over 80 models showed that a combination of a small set of clean data along with a larger set of highly noisy data can match performance of models trained solely on clean datasets
  • Theoretical evidence suggests that incorporating clean samples into training can significantly reduce sample size requirements for noisy data, leading to enhanced model performance

Authors: Giannis Daras, Yeshwanth Cherapanamjeri, Constantinos Daskalakis

Work in progress
License: CC BY 4.0

Abstract: The quality of generative models depends on the quality of the data they are trained on. Creating large-scale, high-quality datasets is often expensive and sometimes impossible, e.g. in certain scientific applications where there is no access to clean data due to physical or instrumentation constraints. Ambient Diffusion and related frameworks train diffusion models with solely corrupted data (which are usually cheaper to acquire) but ambient models significantly underperform models trained on clean data. We study this phenomenon at scale by training more than $80$ models on data with different corruption levels across three datasets ranging from $30,000$ to $\approx 1.3$M samples. We show that it is impossible, at these sample sizes, to match the performance of models trained on clean data when only training on noisy data. Yet, a combination of a small set of clean data (e.g.~$10\%$ of the total dataset) and a large set of highly noisy data suffices to reach the performance of models trained solely on similar-size datasets of clean data, and in particular to achieve near state-of-the-art performance. We provide theoretical evidence for our findings by developing novel sample complexity bounds for learning from Gaussian Mixtures with heterogeneous variances. Our theoretical model suggests that, for large enough datasets, the effective marginal utility of a noisy sample is exponentially worse than that of a clean sample. Providing a small set of clean samples can significantly reduce the sample size requirements for noisy data, as we also observe in our experiments.
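
To make "heterogeneous variances" concrete, here is a sketch that assumes (this page does not say so) the corruption is additive Gaussian noise with a per-sample level $\sigma_i$, drawn independently of the clean sample. The symbols $w_k$, $\mu_k$, $\Sigma_k$, and $\sigma_i$ are illustrative and need not match the paper's notation. If a clean sample comes from a Gaussian mixture, its noisy version is again a Gaussian mixture, with every component covariance inflated by $\sigma_i^2 I$:

```latex
x_0 \sim \sum_{k=1}^{K} w_k \,\mathcal{N}(\mu_k, \Sigma_k), \qquad
x_i = x_0 + \sigma_i \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)
\;\Longrightarrow\;
x_i \sim \sum_{k=1}^{K} w_k \,\mathcal{N}\!\left(\mu_k,\ \Sigma_k + \sigma_i^2 I\right).
```

Clean samples correspond to $\sigma_i = 0$. Larger $\sigma_i$ blurs the components into one another, which gives one intuition for why a noisy sample may carry less information than a clean one.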

Submitted to arXiv on 05 Nov. 2024


Results of the summarizing process for the arXiv paper: 2411.02780v1

The success of generative models such as image Diffusion Models (DMs) and Large Language Models (LLMs) hinges on the availability of large-scale, high-quality datasets. Clean data can be hard to obtain in fields such as Magnetic Resonance Imaging (MRI) and black-hole imaging, because of time limitations or physical impossibilities. Even for general-domain images, building a large-scale, copyright-free dataset is costly and complex. This scarcity has motivated training generative models on corrupted data, such as blurry or noisy images. Recent frameworks, including Ambient Diffusion, train generative models solely on corrupted data, but the resulting models often underperform models trained on clean data. To study this gap, the authors trained more than 80 models on data with different corruption levels across three datasets ranging from 30,000 to approximately 1.3 million samples. At these sample sizes, models trained only on noisy data could not match the performance of models trained on clean data. However, combining a small set of clean data (e.g., 10% of the total dataset) with a large set of highly noisy data was sufficient to match models trained solely on similar-sized clean datasets, and to reach near state-of-the-art performance. The study supports these findings with novel sample complexity bounds for learning from Gaussian Mixtures with heterogeneous variances. The theoretical model suggests that, for sufficiently large datasets, the effective marginal utility of a noisy sample is exponentially worse than that of a clean sample.
According to the theory, providing a small set of clean samples can significantly reduce the number of noisy samples required, which agrees with the experimental findings. Overall, this research clarifies how much clean and noisy data each contribute to training generative models, and shows that a deliberate mix of the two can recover strong performance when access to pristine data is limited.
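
The mixed clean/noisy setup described above can be illustrated with a minimal sketch. It assumes the corruption is additive Gaussian noise at a single level `sigma`. The function name `make_mixed_dataset`, its parameters, and the default values are hypothetical and are not taken from the paper's code:

```python
import numpy as np


def make_mixed_dataset(clean, clean_fraction=0.1, sigma=0.2, seed=0):
    """Keep a random fraction of samples clean and add Gaussian noise to the rest.

    Assumptions (illustrative only): additive i.i.d. Gaussian noise with a
    single standard deviation `sigma`; the values 0.1 and 0.2 are arbitrary.
    Returns the mixed data and a per-sample noise level (0.0 for clean samples).
    """
    rng = np.random.default_rng(seed)
    n = clean.shape[0]
    n_clean = int(round(clean_fraction * n))

    # Randomly choose which samples stay clean.
    perm = rng.permutation(n)
    noisy_idx = perm[n_clean:]

    data = clean.astype(np.float64).copy()
    noise = rng.normal(0.0, sigma, size=data[noisy_idx].shape)
    data[noisy_idx] += noise

    # Record the noise level of each sample so that a training loop can
    # treat clean and noisy samples differently.
    sigmas = np.zeros(n)
    sigmas[noisy_idx] = sigma
    return data, sigmas
```

Calling `make_mixed_dataset(images, clean_fraction=0.1, sigma=0.2)` on an array of shape `(N, ...)` returns an array of the same shape together with a length-`N` vector of noise levels. How those levels enter the training objective is specific to the Ambient Diffusion framework and is not shown here.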
Created on 07 Nov. 2024


