TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification
AI-generated Key Points
- Proposal of a simple yet effective approach to improve alignment between image and text features in vision-language models
- Targets the problem of coarse alignment, in which the vision encoder struggles to localize attribute-specified objects
- Objects and attributes parsed automatically from the text of each image-text pair serve as supervision signals
- Introduction of a multi-tag classification loss in addition to the commonly used image-text contrastive loss
- Experimental results demonstrating an average improvement of 3.65% over existing alternatives on semantic segmentation datasets
- Ablation study comparing different loss functions for mitigating tag imbalance, with balanced softmax loss outperforming other alternatives
- Evaluation of the proposed method's versatility across different visual encoders, including a GroupViT-based encoder
- Significant performance improvements observed on both ViT-based CLIP and GroupViT encoders, highlighting generalization ability
- Visualization results showing accurate localization of attribute-specified objects with attribute supervision
- Overall, the added tag supervision requires no data beyond image-text pairs and yields better semantic segmentation performance
Authors: Qinying Liu, Kecheng Zheng, Wu Wei, Zhan Tong, Yu Liu, Wei Chen, Zilei Wang, Yujun Shen
Abstract: The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing attempts usually face the problem of coarse alignment, e.g., the vision encoder struggles in localizing an attribute-specified object. In this work, we propose an embarrassingly simple approach to better align image and text features with no need of additional data formats other than image-text pairs. Concretely, given an image and its paired text, we manage to parse objects (e.g., cat) and attributes (e.g., black) from the description, which are highly likely to exist in the image. It is noteworthy that the parsing pipeline is fully automatic and thus enjoys good scalability. With these parsed semantics as supervision signals, we can complement the commonly used image-text contrastive loss with the multi-tag classification loss. Extensive experimental results on a broad suite of semantic segmentation datasets substantiate the average 3.65% improvement of our framework over existing alternatives. Furthermore, the visualization results indicate that attribute supervision makes vision-language models accurately localize attribute-specified objects. Project page can be found at https://qinying-liu.github.io/Tag-Align/
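The abstract describes three steps: parsing tags from the caption, scoring the image against those tags, and adding the resulting classification loss to the contrastive loss. The Python sketch below illustrates that flow under stated assumptions and is not the authors' implementation. The function names (`parse_tags`, `balanced_softmax_tag_loss`, `total_loss`), the vocabulary-matching parser, the `tag_counts` frequencies, and the `tag_weight` parameter are all hypothetical. In particular, the summary does not describe the actual parsing pipeline or the exact loss formula, so simple word matching stands in for the parser, and the loss is one common form of balanced softmax (logits shifted by the log of each tag's frequency) with the positive tags averaged.

```python
import math


def parse_tags(caption, object_vocab, attribute_vocab):
    """Toy stand-in for the parsing step: keep caption words found in a vocabulary."""
    words = [w.strip(".,;:!?").lower() for w in caption.split()]
    objects = [w for w in words if w in object_vocab]
    attributes = [w for w in words if w in attribute_vocab]
    return objects, attributes


def balanced_softmax_tag_loss(logits, positive_ids, tag_counts):
    """Multi-tag loss with a balanced-softmax style prior adjustment (assumed form).

    logits:       image-to-tag similarity scores, one per tag in the vocabulary
    positive_ids: indices of the tags parsed from the paired caption
    tag_counts:   how often each tag occurs in the training set
    """
    if not positive_ids:
        raise ValueError("at least one positive tag is required")
    total = sum(tag_counts)
    # Shift each logit by the log prior so frequent tags do not dominate.
    adjusted = [z + math.log(c / total) for z, c in zip(logits, tag_counts)]
    # Numerically stable log-softmax over the adjusted logits.
    peak = max(adjusted)
    log_norm = peak + math.log(sum(math.exp(a - peak) for a in adjusted))
    log_probs = [a - log_norm for a in adjusted]
    # Average negative log-likelihood over the positive tags.
    return -sum(log_probs[i] for i in positive_ids) / len(positive_ids)


def total_loss(contrastive_loss, tag_loss, tag_weight=1.0):
    """Combine the image-text contrastive loss with the multi-tag loss."""
    return contrastive_loss + tag_weight * tag_loss


# Usage: parse a caption, then score the image against a three-tag vocabulary.
vocab = ["cat", "sofa", "black"]
objects, attributes = parse_tags("A black cat sits on a sofa.", {"cat", "sofa"}, {"black"})
positives = [vocab.index(t) for t in objects + attributes]
loss = balanced_softmax_tag_loss([2.0, 0.5, 1.0], positives, tag_counts=[900, 80, 20])
```

In a real training loop the logits would come from the vision and text encoders, and the contrastive term would be computed over a batch; both are left out here to keep the sketch focused on the tag supervision.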