Splicing ViT Features for Semantic Appearance Transfer

AI-generated keywords: Semantic Appearance Transfer, Splicing ViT Features, DINO-ViT, Feature Inversion Visualization, High-Resolution Results

AI-generated Key Points

  • Authors present a method for transferring visual appearance between natural images
  • Goal is to generate an image where objects in a source image have the appearance of semantically related objects in a target image
  • Method uses a generator trained on a single structure/appearance image pair
  • Semantic information is incorporated using a pre-trained Vision Transformer (ViT) model as an external semantic prior
  • Novel representations of structure and appearance are derived from deep ViT features
  • Objective function splices these representations together in the space of ViT features
  • Framework called "Splice" does not involve adversarial training or require additional input information such as semantic segmentation or correspondences
  • Can generate high-resolution results and works well on various in-the-wild image pairs with variations in objects, pose, and appearance
  • DINO-ViT's intermediate representations are explored through feature inversion, showing that the global token (CLS token) encodes not only texture information but also more global aspects such as object parts
  • Deep ViT features provide semantic information at high spatial granularity and can be used for reconstructing the original image
  • Overall, the method produces high-quality results without adversarial training or additional input data

Authors: Narek Tumanyan, Omer Bar-Tal, Shai Bagon, Tali Dekel

License: CC BY 4.0

Abstract: We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. Our method works by training a generator given only a single structure/appearance image pair as input. To integrate semantic information into our framework - a pivotal component in tackling this task - our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model which serves as an external semantic prior. Specifically, we derive novel representations of structure and appearance extracted from deep ViT features, untwisting them from the learned self-attention modules. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Our framework, which we term "Splice", does not involve adversarial training, nor does it require any additional input information such as semantic segmentation or correspondences, and can generate high-resolution results, e.g., work in HD. We demonstrate high quality results on a variety of in-the-wild image pairs, under significant variations in the number of objects, their pose and appearance.

Submitted to arXiv on 02 Jan. 2022

Results of the summarizing process for the arXiv paper: 2201.00424v1

In their paper titled "Splicing ViT Features for Semantic Appearance Transfer," authors Narek Tumanyan, Omer Bar-Tal, Shai Bagon, and Tali Dekel present a method for transferring the visual appearance of one natural image to another. The goal is to generate an image where objects in a source structure image are painted with the visual appearance of their semantically related objects in a target appearance image. The proposed method works by training a generator using only a single structure/appearance image pair as input. To incorporate semantic information into the framework, the authors leverage a pre-trained and fixed Vision Transformer (ViT) model as an external semantic prior. They derive novel representations of structure and appearance from deep ViT features and establish an objective function that splices these representations together in the space of ViT features. Notably, their framework, called "Splice," does not involve adversarial training or require additional input information such as semantic segmentation or correspondences. It can generate high-resolution results and has been demonstrated to produce high-quality outputs on various in-the-wild image pairs, even under significant variations in the number of objects, their pose, and appearance.

The authors also explore the intermediate representations learned by DINO-ViT, which is known for its powerful visual representation capabilities. Through feature inversion visualization techniques, they observe that the global token (CLS token) in ViT encodes not only texture information but also captures more global aspects such as object parts. These features provide semantic information at high spatial granularity and can be used for reconstructing the original image. Overall, this paper introduces a novel approach for semantically transferring visual appearance between images using ViT features. The method produces high-quality results from a single image pair, without adversarial training or additional input data.
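
The idea of splicing two representations in feature space can be illustrated with a small sketch. This summary does not spell out the objective, so the details below are assumptions: appearance is taken to be the CLS token, structure is taken to be the cosine self-similarity between spatial tokens, and the weight `alpha` is an arbitrary placeholder. The function names (`self_similarity`, `splice_loss`) are hypothetical, and hand-written lists stand in for features that would in practice come from a frozen ViT.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def self_similarity(tokens):
    """Pairwise cosine similarity between spatial tokens (assumed structure descriptor)."""
    return [[cosine(u, v) for v in tokens] for u in tokens]

def splice_loss(out_cls, out_tokens, app_cls, struct_tokens, alpha=0.1):
    """Toy spliced objective: appearance term plus alpha times structure term.

    out_*         : features of the generated image
    app_cls       : CLS token of the appearance image (assumed appearance descriptor)
    struct_tokens : spatial tokens of the structure image
    alpha         : placeholder weight, not a value taken from the paper
    """
    # Appearance term: squared distance between global (CLS) tokens.
    appearance = sum((a - b) ** 2 for a, b in zip(out_cls, app_cls))
    # Structure term: Frobenius norm of the difference of self-similarity matrices.
    s_out = self_similarity(out_tokens)
    s_ref = self_similarity(struct_tokens)
    structure = math.sqrt(sum((x - y) ** 2
                              for row_out, row_ref in zip(s_out, s_ref)
                              for x, y in zip(row_out, row_ref)))
    return appearance + alpha * structure

# Stand-in features; a real pipeline would extract these from a pre-trained ViT.
struct_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
app_cls = [0.5, -0.5]

# Output that matches both targets.
perfect = splice_loss(app_cls, struct_tokens, app_cls, struct_tokens)
# Rescaling every token leaves the cosine self-similarity unchanged.
scaled_tokens = [[2 * x for x in t] for t in struct_tokens]
scaled = splice_loss(app_cls, scaled_tokens, app_cls, struct_tokens)
# Output with the right structure but the wrong global token.
wrong_appearance = splice_loss([0.0, 0.0], struct_tokens, app_cls, struct_tokens)
```

In this toy setup the structure term ignores a uniform rescaling of the tokens, which hints at why a self-similarity descriptor can be less sensitive to appearance than raw features, while the appearance term depends only on the global token.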
Created on 06 Sep. 2023
