GeneCIS: A Benchmark for General Conditional Image Similarity

AI-generated keywords: General Conditional Image Similarity, Zero-Shot Evaluation, CLIP Models, Image-Caption Datasets, ViT-B/16 Model

AI-generated Key Points

  • Models should be able to adapt to different notions of similarity dynamically
  • The authors propose the GeneCIS benchmark for General Conditional Image Similarity to measure models' ability to adapt to a range of similarity conditions
  • The benchmark is designed for zero-shot evaluation only and considers an open-set of similarity conditions
  • Baselines from powerful CLIP models struggle on GeneCIS and performance on the benchmark is weakly correlated with ImageNet accuracy
  • The authors propose a simple, scalable solution based on automatically mining information from existing image-caption datasets to address this issue
  • Their method offers a substantial boost over the baselines on GeneCIS and further improves zero-shot performance on related image retrieval benchmarks
  • Their model surpasses state-of-the-art supervised models on MIT-States even though evaluated zero-shot
  • Table 1 reports evaluation statistics, including the number of retrieval templates and gallery images; the benchmarks are constructed so that each gallery of 10 to 15 images contains exactly one 'positive' target.
  • Figure 3 shows the distribution of objects and attributes specified in the conditions; the space of conditions spans a long tail of over 400 attributes and 100 objects.
  • Table 2 reports results for the strongest ViT-B/16 model, and Figure 5 shows that scaling up the number of mined triplets used for training improves performance.
  • Figure 6 explores how different CLIP backbones affect the model's performance.
  • This paper proposes an important but understudied problem in computer vision: General Conditional Image Similarity.
  • The proposed benchmark evaluates an open set of similarity conditions and is designed for zero-shot testing only.
  • The authors propose a way forward for scalably training conditional similarity models by mining information from widely available image-caption datasets.
  • Their method not only boosts performance over all baselines on GeneCIS but also provides substantial zero-shot gains on related image retrieval tasks.
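
The evaluation setup described in the key points (a query built from a reference image and a condition, scored against a small gallery containing exactly one positive) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the embeddings are random vectors rather than CLIP features, the `combine` function (averaging normalized embeddings) is a hypothetical baseline and not the authors' model, and Recall@K is assumed as the retrieval metric.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def combine(ref_emb, cond_emb):
    # Hypothetical combiner (an assumption, not the paper's model):
    # average the normalized image and condition embeddings.
    ref, cond = normalize(ref_emb), normalize(cond_emb)
    return normalize([r + c for r, c in zip(ref, cond)])


def recall_at_k(query, gallery, positive_index, k=1):
    """Return 1.0 if the single positive is among the top-k gallery items."""
    ranked = sorted(
        range(len(gallery)),
        key=lambda i: cosine(query, gallery[i]),
        reverse=True,
    )
    return 1.0 if positive_index in ranked[:k] else 0.0


# Toy data: a gallery of 12 items (within the 10 to 15 range mentioned above).
random.seed(0)
dim, gallery_size, positive_index = 8, 12, 3
ref_emb = [random.gauss(0, 1) for _ in range(dim)]
cond_emb = [random.gauss(0, 1) for _ in range(dim)]
query = combine(ref_emb, cond_emb)

gallery = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(gallery_size)]
# Place the positive close to the query so the toy example is retrievable.
gallery[positive_index] = [q + random.gauss(0, 0.01) for q in query]

print(recall_at_k(query, gallery, positive_index, k=1))
```

In a real evaluation the random vectors would be replaced by embeddings from an image encoder and a text encoder, and the score would be averaged over all retrieval templates.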

Authors: Sagar Vaze, Nicolas Carion, Ishan Misra

CVPR 2023 (Highlighted Paper). Project page at https://sgvaze.github.io/genecis/
License: CC BY-NC-SA 4.0

Abstract: We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning methods, supervised or self-supervised, which learn a fixed embedding function and hence implicitly assume a single notion of similarity. For instance, models trained on ImageNet are biased towards object categories, while a user might prefer the model to focus on colors, textures or specific elements in the scene. In this paper, we propose the GeneCIS ('genesis') benchmark, which measures models' ability to adapt to a range of similarity conditions. Extending prior work, our benchmark is designed for zero-shot evaluation only, and hence considers an open-set of similarity conditions. We find that baselines from powerful CLIP models struggle on GeneCIS and that performance on the benchmark is only weakly correlated with ImageNet accuracy, suggesting that simply scaling existing methods is not fruitful. We further propose a simple, scalable solution based on automatically mining information from existing image-caption datasets. We find our method offers a substantial boost over the baselines on GeneCIS, and further improves zero-shot performance on related image retrieval benchmarks. In fact, though evaluated zero-shot, our model surpasses state-of-the-art supervised models on MIT-States. Project page at https://sgvaze.github.io/genecis/.

Submitted to arXiv on 13 Jun. 2023

Results of the summarizing process for the arXiv paper: 2306.07969v1

In this paper, the authors argue that models should be able to adapt dynamically to different notions of similarity. To measure this ability, they propose GeneCIS, a benchmark for General Conditional Image Similarity that covers a range of similarity conditions. The benchmark is designed for zero-shot evaluation only and considers an open set of similarity conditions. The authors find that baselines from powerful CLIP models struggle on GeneCIS and that performance on the benchmark is only weakly correlated with ImageNet accuracy.

To address this, the authors propose a simple, scalable solution based on automatically mining information from existing image-caption datasets. Their method offers a substantial boost over the baselines on GeneCIS and further improves zero-shot performance on related image retrieval benchmarks. In fact, though evaluated zero-shot, their model surpasses state-of-the-art supervised models on MIT-States.

Table 1 gives statistics of the evaluations, including the number of retrieval templates and the number of gallery images; the benchmarks are constructed so that there is only one ‘positive’ image among the targets, with gallery sizes between 10 and 15 images. Figure 3 shows the distribution of objects and attributes specified in the conditions, which span a long tail of over 400 attributes and 100 objects. Table 2 reports results for the strongest ViT-B/16 model, Figure 5 shows that scaling up the number of mined triplets used for training improves performance, and Figure 6 explores the impact of different CLIP backbones.

In conclusion, the paper frames General Conditional Image Similarity as an important but understudied problem in computer vision, and contributes both a zero-shot benchmark over an open set of similarity conditions and a scalable way to train conditional similarity models from widely available image-caption datasets.
Created on 14 Jun. 2023
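
The summary mentions training on triplets mined automatically from image-caption datasets but does not describe the mining procedure. The sketch below shows only one simple way such triplets could be formed, and it is an assumption rather than the authors' method: two images whose captions share a content word become a (reference, condition, target) triplet with that word as the condition. The image IDs, captions, stopword list, and the `mine_triplets` helper are all hypothetical.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical stopword list; a real pipeline would likely use a proper parser.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "with", "and", "to"}


def content_words(caption):
    """Lowercase the caption and keep words that are not stopwords."""
    words = caption.lower().replace(".", "").split()
    return {w for w in words if w not in STOPWORDS}


def mine_triplets(captions):
    """Form (reference_id, condition_word, target_id) triplets.

    captions: dict mapping image ID to caption text.
    Two images are paired when their captions share a content word.
    """
    images_by_word = defaultdict(list)
    for image_id, caption in captions.items():
        for word in content_words(caption):
            images_by_word[word].append(image_id)

    triplets = []
    for word in sorted(images_by_word):
        # Ordered pairs: each image can act as reference or as target.
        for ref_id, tgt_id in permutations(sorted(images_by_word[word]), 2):
            triplets.append((ref_id, word, tgt_id))
    return triplets


# Toy captions (made up for illustration).
captions = {
    "img_001": "A dog on a red sofa",
    "img_002": "A cat sleeping on a sofa",
    "img_003": "A dog in the park",
}
triplets = mine_triplets(captions)
for triplet in triplets:
    print(triplet)
```

Because the pairing needs only captions, this kind of mining scales with the size of the caption dataset, which is consistent with the scaling trend the summary attributes to Figure 5.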
