Splicing ViT Features for Semantic Appearance Transfer
AI-generated Key Points
- Authors present a method for transferring visual appearance between natural images
- Goal is to generate an image where objects in a source image have the appearance of semantically related objects in a target image
- Method uses a generator trained on a single structure/appearance image pair
- Semantic information is incorporated using a pre-trained Vision Transformer (ViT) model as an external semantic prior
- Novel representations of structure and appearance are derived from deep ViT features
- Objective function splices these representations together in the space of ViT features
- Framework called "Splice" does not involve adversarial training or require additional input information such as semantic segmentation or correspondences
- Can generate high-resolution results and works well on various in-the-wild image pairs with variations in objects, pose, and appearance
- The authors explore DINO-ViT's intermediate representations and show that the global token (CLS token) represents visual appearance: it encodes texture as well as more global information such as object parts
- The original image can be reconstructed from these features, yet they still provide powerful semantic information at high spatial granularity
- Overall, the method produces high-quality results from a single image pair, without adversarial training or additional input data
Authors: Narek Tumanyan, Omer Bar-Tal, Shai Bagon, Tali Dekel
Abstract: We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. Our method works by training a generator given only a single structure/appearance image pair as input. To integrate semantic information into our framework - a pivotal component in tackling this task - our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model which serves as an external semantic prior. Specifically, we derive novel representations of structure and appearance extracted from deep ViT features, untwisting them from the learned self-attention modules. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Our framework, which we term "Splice", does not involve adversarial training, nor does it require any additional input information such as semantic segmentation or correspondences, and can generate high-resolution results, e.g., work in HD. We demonstrate high quality results on a variety of in-the-wild image pairs, under significant variations in the number of objects, their pose and appearance.
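The splicing objective described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the appearance representation is the CLS token and, as an assumption not stated in the text above, that structure is compared through the cosine self-similarity of per-patch features. The function names, the weight `alpha`, and the random arrays standing in for ViT features are all hypothetical.

```python
import numpy as np


def cosine_self_similarity(tokens):
    """Return the (n, n) cosine similarity between all rows of an (n, d) array."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-8, None)  # guard against zero-length rows
    return unit @ unit.T


def splice_loss(out_cls, out_patches, appearance_cls, structure_patches, alpha=0.1):
    """Hypothetical two-term objective computed in ViT feature space.

    out_cls, out_patches: features of the generated image.
    appearance_cls: CLS token of the target appearance image.
    structure_patches: per-patch features of the source structure image.
    alpha: illustrative weight for the structure term (an assumed value).
    """
    # Appearance term: pull the output's global token toward the appearance image.
    appearance_term = np.sum((out_cls - appearance_cls) ** 2)
    # Structure term: match how patches relate to each other, not their raw values
    # (np.linalg.norm on a 2-D array defaults to the Frobenius norm).
    structure_term = np.linalg.norm(
        cosine_self_similarity(out_patches)
        - cosine_self_similarity(structure_patches)
    )
    return appearance_term + alpha * structure_term


# Random stand-ins for features that a frozen ViT would normally provide.
rng = np.random.default_rng(0)
structure_patches = rng.normal(size=(16, 32))  # 16 patches, 32-dim features
appearance_cls = rng.normal(size=32)
out_patches = rng.normal(size=(16, 32))
out_cls = rng.normal(size=32)
print(splice_loss(out_cls, out_patches, appearance_cls, structure_patches))
```

In a training loop, a generator's output would be passed through the fixed ViT to obtain `out_cls` and `out_patches`, and only the generator's weights would be updated to reduce this loss. Comparing self-similarity rather than raw patch features is what lets the structure term ignore appearance in this sketch: rescaling every patch feature leaves the similarity matrix unchanged.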