DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data
AI-generated Key Points
- Introduction of a new holistic perceptual metric called DreamSim to address limitations of current perceptual similarity metrics
- Motivation: humans judge image similarity along many dimensions, including image layout, object pose, and semantic content, which pixel- and patch-level metrics fail to capture
- Development of the NIGHTS dataset containing human similarity judgments over image triplets focusing on mid-level similarities
- Use of features from large pre-trained vision models to outperform standard perceptual metrics on the NIGHTS dataset
- Creation of DreamSim by tuning these pre-trained models on the NIGHTS data to align better with human perception
- Analysis showing that DreamSim focuses heavily on foreground objects and semantic content while remaining sensitive to color and layout, unlike previous metrics or modern image embeddings
- Extension of perceptual image similarity measurement beyond low-level colors and textures
- Contribution of the NIGHTS dataset and DreamSim to advancing understanding of human visual perception for tasks like image retrieval and synthesis
Authors: Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, Phillip Isola
Abstract: Current perceptual similarity metrics operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that assesses images holistically. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks.
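The evaluation described above, checking whether a metric agrees with human judgments over image triplets, can be illustrated with a minimal sketch. It assumes each image is already represented by an embedding vector and that distance is measured as cosine distance; neither choice is stated in the summary above, and the function names and toy vectors are hypothetical placeholders rather than the authors' code or data.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_agreement(triplets):
    """Fraction of triplets where the metric picks the same image as the human.

    Each triplet is (ref, img_a, img_b, human_choice). The first three items
    are embedding vectors; human_choice is 0 if annotators judged img_a closer
    to the reference, and 1 if they judged img_b closer.
    """
    correct = 0
    for ref, img_a, img_b, human_choice in triplets:
        dist_a = cosine_distance(ref, img_a)
        dist_b = cosine_distance(ref, img_b)
        metric_choice = 0 if dist_a <= dist_b else 1
        correct += int(metric_choice == human_choice)
    return correct / len(triplets)


# Toy, made-up embeddings (assumption: real ones would come from a vision model).
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], 0),
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9], 1),
    ([1.0, 1.0], [1.0, 0.9], [-1.0, 0.5], 1),
]

if __name__ == "__main__":
    print(f"agreement with human choices: {triplet_agreement(toy_triplets):.3f}")
```

A metric that is better aligned with human perception would score higher on such an agreement measure; tuning a model on the judgments, as the abstract describes for DreamSim, is one way to raise it.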