eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

AI-generated keywords: eDiffi

AI-generated Key Points

  • eDiffi is a text-to-image diffusion model that improves the synthesis process.
  • It uses an ensemble of expert denoisers to improve text alignment without compromising visual quality or increasing inference cost.
  • Specialized models are trained for different synthesis stages to address the issue of changing generation behavior throughout the iterative process.
  • Various embeddings, including T5 text, CLIP text, and CLIP image embeddings, are used for conditioning.
  • eDiffi offers a "paint-with-words" capability that allows users to select words in the input text and paint them on a canvas to control the output image.
  • Experimental results show that eDiffi outperforms previous large-scale text-to-image diffusion models on benchmark datasets.
  • It handles long detailed descriptions better than methods like DALL·E 2 and Stable Diffusion.
  • Adding more experts increases model capacity without significantly affecting feed-forward evaluation time, because only one expert runs at each denoising step.
  • eDiffi provides state-of-the-art performance in high-resolution image synthesis from text prompts while offering controllability and efficiency enhancements.
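
The stage-specific routing described in the points above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Expert` class, the `pick_expert` and `sample` helpers, the noise-level intervals, and the toy denoisers are all hypothetical names and values chosen for the example. It shows why adding experts need not raise inference cost: each denoising step calls exactly one expert.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# A denoiser maps (noisy sample, noise level, text embedding) to a new estimate.
Denoiser = Callable[[List[float], float, Sequence[float]], List[float]]


@dataclass
class Expert:
    sigma_min: float   # inclusive lower bound of the noise interval (assumed)
    sigma_max: float   # exclusive upper bound of the noise interval (assumed)
    denoise: Denoiser  # the specialized model for this interval


def pick_expert(experts: Sequence[Expert], sigma: float) -> Expert:
    """Return the single expert whose noise interval contains sigma."""
    for expert in experts:
        if expert.sigma_min <= sigma < expert.sigma_max:
            return expert
    raise ValueError(f"no expert covers noise level {sigma}")


def make_toy_denoiser(shrink: float) -> Denoiser:
    """Stand-in for a real network: scales the sample and ignores the text."""
    def denoise(x: List[float], sigma: float, text_emb: Sequence[float]) -> List[float]:
        return [shrink * v for v in x]
    return denoise


def sample(experts: Sequence[Expert], x: List[float],
           sigmas: Sequence[float], text_emb: Sequence[float]) -> Tuple[List[float], int]:
    """Run one denoiser call per step, routed by the current noise level."""
    calls = 0
    for sigma in sigmas:
        x = pick_expert(experts, sigma).denoise(x, sigma, text_emb)
        calls += 1
    return x, calls
```

With two experts or twenty, `sample` performs `len(sigmas)` denoiser calls; only the choice of which model handles each step changes.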

Authors: Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu

License: CC BY 4.0

Abstract: Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiffi's "paint-with-words" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at https://deepimagination.cc/eDiffi/
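
The train-then-split strategy mentioned in the abstract can be illustrated with a small sketch. It rests on assumptions the abstract does not state: noise levels are normalized to the range [0, 1], a model is split at a single cut point, and weights are represented as a plain dictionary. `StageModel` and `split` are hypothetical names used only for this example.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StageModel:
    lo: float  # lower end of the noise range this model is trained on (assumed)
    hi: float  # upper end of the noise range this model is trained on (assumed)
    weights: Dict[str, List[float]] = field(default_factory=dict)


def split(parent: StageModel, cut: float) -> Tuple[StageModel, StageModel]:
    """Create two specialized models that both start from the parent's weights.

    Each child covers one side of the cut and would then be trained further
    on noise levels from its own sub-range only.
    """
    if not parent.lo < cut < parent.hi:
        raise ValueError("cut must lie strictly inside the parent's range")
    low = StageModel(parent.lo, cut, copy.deepcopy(parent.weights))
    high = StageModel(cut, parent.hi, copy.deepcopy(parent.weights))
    return low, high
```

Copying the shared weights means each specialized model inherits what the single model already learned, so only the additional stage-specific training has to be paid for after the split.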

Submitted to arXiv on 02 Nov. 2022

Results of the summarizing process for the arXiv paper: 2211.01324v1

The paper presents eDiffi, a text-to-image diffusion model that uses an ensemble of expert denoisers to improve the synthesis process. The authors observe that the generation behavior of text-to-image diffusion models changes throughout the iterative process: early stages rely heavily on the text prompt, while later stages almost entirely ignore the text conditioning. To address this, eDiffi trains specialized models for different synthesis stages, which improves text alignment without compromising visual quality or increasing inference cost. The model is trained with several embeddings for conditioning, including T5 text, CLIP text, and CLIP image embeddings. In addition, eDiffi offers a "paint-with-words" capability that lets users select words in the input text and paint them on a canvas to control the output image. Experimental results show that eDiffi outperforms previous large-scale text-to-image diffusion models on the standard benchmark. Comparisons with DALL·E 2 and Stable Diffusion further indicate that eDiffi handles long, detailed descriptions better than these methods. Inference time is also evaluated as a function of model capacity: because only one expert runs at each denoising step, adding more experts does not significantly affect feed-forward evaluation time. In summary, eDiffi offers state-of-the-art performance in high-resolution image synthesis from text prompts while providing controllability and efficiency enhancements.
Created on 16 Jan. 2024
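
The summary does not say how "paint-with-words" steers the image. One plausible mechanism, assumed here purely for illustration, is to add a bias to the cross-attention scores between image positions and the selected word wherever the user has painted. The function names, the `mask` layout, and the `strength` parameter below are hypothetical.

```python
import math
from typing import List


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def painted_attention(scores: List[List[float]],
                      mask: List[List[float]],
                      strength: float) -> List[List[float]]:
    """Bias attention toward painted (position, token) pairs.

    scores[p][t]: raw attention score of image position p for text token t.
    mask[p][t]:   1.0 where the user painted token t over position p, else 0.0.
    strength:     how strongly painted regions attract their token (assumed knob).
    """
    out = []
    for score_row, mask_row in zip(scores, mask):
        biased = [s + strength * m for s, m in zip(score_row, mask_row)]
        out.append(softmax(biased))
    return out
```

Positions the user did not paint keep their original attention distribution, so the bias only affects the regions marked on the canvas.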

