Modeling Caption Diversity in Contrastive Vision-Language Pretraining
AI-generated Key Points
- Llip is a new method that focuses on capturing the diversity of captions for images by conditioning visual features on textual information.
- Llip surpasses non-contextualized baselines like CLIP and SigLIP across various tasks, even with large-scale encoders.
- Llip improves zero-shot classification by an average of 2.9% across benchmark datasets when using a ViT-G/14 encoder.
- On ImageNet, Llip achieves a zero-shot top-1 accuracy of 83.5%, outperforming a similarly sized CLIP by 1.4%.
- Llip improves zero-shot retrieval on MS-COCO by 6.0%.
- Llip is competitive with other contrastive pretraining baselines, including OpenCLIP, CLIPA-v2, MetaCLIP, and DFN.
- Llip is effective in scenarios requiring diverse captioning and image understanding tasks within contrastive vision-language frameworks.
Authors: Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mahmoud Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas
Abstract: There are a thousand ways to caption an image. Contrastive Language-Image Pretraining (CLIP), on the other hand, works by mapping an image and its caption to a single vector -- limiting how well CLIP-like models can represent the diverse ways to describe an image. In this work, we introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip's vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text. We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks, even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% on zero-shot classification benchmarks with a ViT-G/14 encoder. Specifically, Llip attains a zero-shot top-1 accuracy of 83.5% on ImageNet, outperforming a similarly sized CLIP by 1.4%. We also demonstrate an improvement in zero-shot retrieval on MS-COCO of 6.0%. We provide a comprehensive analysis of the components introduced by the method and demonstrate that Llip leads to richer visual representations.
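The mixing step described in the abstract (a set of visual features combined into one representation, conditioned on the text) can be illustrated with a small sketch. This is not the authors' implementation: the function name `contextualize`, the use of scaled dot-product attention with a softmax, the `temperature` parameter, and the toy dimensions are all assumptions made for illustration. The abstract only states that the visual features are mixed by conditioning on information derived from the text.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def l2_normalize(x, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    return x / (np.linalg.norm(x) + eps)

def contextualize(visual_tokens, text_embedding, temperature=1.0):
    """Mix K visual tokens into one vector, weighted by the text.

    Hypothetical helper: attention-style mixing is an assumption here,
    not a detail confirmed by the abstract.
    visual_tokens: array of shape (K, D)
    text_embedding: array of shape (D,)
    """
    dim = visual_tokens.shape[1]
    # Similarity of each visual token to the caption embedding.
    scores = visual_tokens @ text_embedding / (np.sqrt(dim) * temperature)
    weights = softmax(scores)            # shape (K,), sums to 1
    mixed = weights @ visual_tokens      # shape (D,)
    return l2_normalize(mixed), weights

# Toy data: 4 visual tokens in 8 dimensions, each along its own axis.
tokens = np.eye(4, 8) * 4.0
caption_a = np.eye(8)[0] * 4.0   # caption aligned with token 0
caption_b = np.eye(8)[1] * 4.0   # caption aligned with token 1

z_a, w_a = contextualize(tokens, caption_a)
z_b, w_b = contextualize(tokens, caption_b)

# The same image yields a different representation for each caption.
print("weights for caption A:", np.round(w_a, 3))
print("weights for caption B:", np.round(w_b, 3))
print("cosine(z_a, z_b):", float(z_a @ z_b))
```

In a CLIP-style model the image would map to one fixed vector regardless of the caption. In this sketch the image vector changes with the caption, which is the property the abstract refers to as modeling caption diversity.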