Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models
AI-generated Key Points
- Text-to-image (T2I) personalization lets users generate images of their own visual concepts by referring to them in natural language prompts.
- Most existing encoder-based techniques are limited to a single-class domain, which restricts their ability to handle diverse concepts.
- The authors propose a domain-agnostic method that overcomes these limitations.
- They introduce a novel contrastive-based regularization technique to maintain high fidelity to target concept characteristics and keep predicted embeddings close to editable regions of the latent space.
- Experimental results show that the proposed method achieves state-of-the-art performance and that its learned tokens are more semantic than tokens predicted by unregularized models.
- The method requires less training time and memory compared to previous methods while achieving high quality and fast personalization across diverse domains.
- Related work in text-driven image generation using diffusion models and text-based image editing is discussed, highlighting progress driven by pre-trained diffusion models and large-scale text-to-image models.
- The proposed method builds upon pre-trained models to extend vocabulary and generate personalized concepts.
Authors: Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, Amit H. Bermano
Abstract: Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts in natural language prompts. Recently, encoder-based techniques have emerged as a new effective approach for T2I personalization, reducing the need for multiple images and long training times. However, most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts. In this work, we propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts. We introduce a novel contrastive-based regularization technique to maintain high fidelity to the target concept characteristics while keeping the predicted embeddings close to editable regions of the latent space, by pushing the predicted tokens toward their nearest existing CLIP tokens. Our experimental results demonstrate the effectiveness of our approach and show how the learned tokens are more semantic than tokens predicted by unregularized models. This leads to a better representation that achieves state-of-the-art performance while being more flexible than previous methods.
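As a rough illustration of the regularization idea in the abstract (pushing predicted tokens toward their nearest existing CLIP tokens), the sketch below computes an InfoNCE-style loss in NumPy. It is not the authors' implementation: the function name, the number of neighbors `k`, the `temperature`, and the random matrix standing in for the CLIP token-embedding table are all assumptions made for this example.

```python
import numpy as np


def nearest_token_contrastive_loss(pred, vocab, k=5, temperature=0.1):
    """Illustrative regularizer (hypothetical, not the paper's exact loss).

    pred:  (B, D) predicted concept embeddings.
    vocab: (V, D) existing token embeddings (a stand-in for CLIP's table).
    The k nearest vocabulary tokens of each prediction act as positives;
    all remaining tokens act as negatives.
    """
    # Cosine similarity between every prediction and every vocabulary token.
    pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    vocab_n = vocab / np.linalg.norm(vocab, axis=1, keepdims=True)
    logits = pred_n @ vocab_n.T / temperature  # shape (B, V)

    # Numerically stable log-softmax over the vocabulary axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Indices of the k most similar vocabulary tokens per prediction.
    nn_idx = np.argsort(-logits, axis=1)[:, :k]

    # Average negative log-probability assigned to the nearest tokens.
    pos_log_prob = np.take_along_axis(log_prob, nn_idx, axis=1)
    return float(-pos_log_prob.mean())


# Usage with random stand-in data (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
vocab_demo = rng.normal(size=(1000, 64))
pred_demo = rng.normal(size=(4, 64))
loss_demo = nearest_token_contrastive_loss(pred_demo, vocab_demo, k=5)
```

In this sketch the loss shrinks as a prediction moves closer to a small set of existing tokens and grows when it sits equally far from all of them, which is the qualitative behavior the abstract attributes to the regularizer. In a training setup it would be added, with some weight, to the main personalization objective.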