Learning to Prompt with Text Only Supervision for Vision-Language Models
AI-generated Key Points
- Adapting foundational vision-language models such as CLIP to downstream tasks while preserving their generalization ability remains a challenge
- Learning prompts from visual information usually requires labeled data and often generalizes poorly to new datasets because of overfitting on the source data
- Training-free methods that generate class descriptions with large language models (LLMs) and ensemble the resulting prompts tend to produce class-specific prompts that cannot be transferred to other classes, so descriptions must be generated for each class separately at higher cost
- Proposed method combines the strengths of both approaches by learning prompts using only text data derived from LLMs
- Training approach allows prompts to extract rich contextual knowledge from LLM data, enabling zero-shot transfer of prompts to new classes and datasets
- Claimed as the first work to learn generalized prompts using text-only data
- Evaluation on four benchmarks shows improvements over prior ensembling works while remaining competitive with methods that use labeled images
- Ablations analyze the learned ProText prompts and report that the average confidence scores of ProText logits, trained on ImageNet-1k text data and applied to cross-dataset evaluation, improve over CLIP
- Overall, learning prompts from text data alone is presented as a promising way to adapt vision-language models, potentially improving generalization while reducing reliance on labeled images
Authors: Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, Federico Tombari
Abstract: Foundational vision-language models such as CLIP are becoming a new paradigm in vision, due to their excellent generalization abilities. However, adapting these models for downstream tasks while maintaining their generalization remains a challenge. In the literature, one branch of methods adapts CLIP by learning prompts using visual information. While effective, most of these works require labeled data, which is not practical, and often struggle to generalize towards new datasets due to over-fitting on the source data. An alternative approach resorts to training-free methods by generating class descriptions from large language models (LLMs) and performing prompt ensembling. However, these methods often generate class-specific prompts that cannot be transferred to other classes, which incurs higher costs because LLM descriptions must be generated for each class separately. In this work, we propose to combine the strengths of both streams of methods by learning prompts using only text data derived from LLMs. As supervised training of prompts is not trivial due to the absence of images, we develop a training approach that allows prompts to extract rich contextual knowledge from LLM data. Moreover, with LLM contextual data mapped within the learned prompts, it enables zero-shot transfer of prompts to new classes and datasets, potentially cutting the LLM prompt engineering cost. To the best of our knowledge, this is the first work that learns generalized prompts using text-only data. We perform extensive evaluations on 4 benchmarks where our method improves over prior ensembling works while being competitive with those utilizing labeled images. Our code and pre-trained models are available at https://github.com/muzairkhattak/ProText.
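The abstract does not spell out the training objective, so the following is only a toy sketch of the general idea of learning a shared prompt from text alone, not the authors' implementation. Everything in it is an assumption made for illustration: a fixed linear map stands in for CLIP's frozen text encoder, random vectors stand in for class-name embeddings and LLM description features, the prompt is a single additive vector, and the loss is plain squared error. All names (`encode`, `train_prompt`, `shared_context`) are hypothetical.

```python
import random

random.seed(0)
DIM = 4

def encode(weights, vec):
    """Frozen toy 'text encoder': a fixed linear map (stand-in for CLIP's text tower)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sq_err(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_prompt(weights, class_embs, targets, steps=200, lr=0.1):
    """Learn one shared prompt vector by gradient descent on mean squared error."""
    prompt = [0.0] * DIM
    losses = []
    for _ in range(steps):
        grad = [0.0] * DIM
        loss = 0.0
        for c, t in zip(class_embs, targets):
            out = encode(weights, add(c, prompt))
            err = [o - ti for o, ti in zip(out, t)]
            loss += sum(e * e for e in err)
            # Gradient of ||W(c + p) - t||^2 with respect to p is 2 * W^T * err.
            for j in range(DIM):
                grad[j] += 2.0 * sum(weights[i][j] * err[i] for i in range(DIM))
        n = len(class_embs)
        losses.append(loss / n)
        prompt = [p - lr * g / n for p, g in zip(prompt, grad)]
    return prompt, losses

# Toy data (all synthetic): a near-identity encoder, random class-name embeddings,
# and "LLM description" features that differ from class names by a shared shift.
W = [[(1.0 if i == j else 0.0) + 0.1 * random.gauss(0, 1) for j in range(DIM)]
     for i in range(DIM)]
shared_context = [0.5, -0.3, 0.8, 0.2]

def make_class():
    name_emb = [random.gauss(0, 1) for _ in range(DIM)]
    clean = encode(W, add(name_emb, shared_context))
    desc_feat = [f + random.gauss(0, 0.01) for f in clean]
    return name_emb, desc_feat

train = [make_class() for _ in range(8)]
prompt, losses = train_prompt(W, [c for c, _ in train], [t for _, t in train])

# Transfer check on a class never seen during training (no images involved anywhere).
new_name, new_desc = make_class()
err_plain = sq_err(encode(W, new_name), new_desc)
err_prompted = sq_err(encode(W, add(new_name, prompt)), new_desc)
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
print(f"held-out error without prompt {err_plain:.4f}, with prompt {err_prompted:.4f}")
```

Because the prompt is shared across classes rather than tied to one class, it can be reused for a class that was not part of training, which is the property the abstract describes as zero-shot transfer of prompts. The real method operates on CLIP's text encoder and actual LLM-generated descriptions; see the linked repository for the authors' code.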