InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
AI-generated Key Points
- InstructBLIP is a novel instruction tuning framework for building generalized vision-language models.
- Building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by additional visual input.
- Vision-language pre-training has been widely studied, but vision-language instruction tuning remains relatively less explored.
- InstructBLIP conducts a comprehensive study on vision-language instruction tuning based on the pre-trained BLIP-2 models.
- The authors gather 26 publicly available datasets, transform them into instruction-tuning format, and split them into two clusters: one for held-in instruction tuning and one for held-out zero-shot evaluation.
- InstructBLIP introduces instruction-aware visual feature extraction, which lets the model extract informative features tailored to the given instruction.
- The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo.
- The models also reach state-of-the-art performance when fine-tuned on individual downstream tasks, for example 90.7% accuracy on ScienceQA IMG.
- Qualitative examples demonstrate InstructBLIP's capabilities in complex visual reasoning, knowledge-grounded image description, and multi-turn conversations.
- InstructBLIP can therefore serve as an enhanced model initialization for downstream task fine-tuning.
- The paper concludes by encouraging new research in general-purpose multimodal AI and its applications, building on InstructBLIP's progress toward generalized vision-language models.
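The dataset preparation described in the key points (converting existing datasets into instruction-tuning format and splitting them into held-in and held-out clusters) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the templates, the `InstructionSample` fields, the record keys, and the placeholder dataset names are hypothetical and are not taken from the paper or the LAVIS code.

```python
import random
from dataclasses import dataclass

# Hypothetical instruction templates for a VQA-style dataset.
# The templates actually used by the authors are not reproduced here.
VQA_TEMPLATES = [
    "{question}",
    "Question: {question} Short answer:",
    "Answer the following question about the image briefly. {question}",
]


@dataclass
class InstructionSample:
    image_path: str   # path to the image the instruction refers to
    instruction: str  # natural-language instruction fed to the model
    target: str       # text the model should generate


def to_instruction_sample(record, templates, rng):
    """Wrap one raw (image, question, answer) record in a random template."""
    template = rng.choice(templates)
    return InstructionSample(
        image_path=record["image"],
        instruction=template.format(question=record["question"]),
        target=record["answer"],
    )


def split_datasets(names, held_out):
    """Partition dataset names into held-in (training) and held-out (zero-shot)."""
    held_out_set = set(held_out)
    held_in = [n for n in names if n not in held_out_set]
    held_out_list = [n for n in names if n in held_out_set]
    return held_in, held_out_list


# Usage with placeholder data.
rng = random.Random(0)
record = {"image": "img_001.jpg", "question": "What color is the bus?", "answer": "red"}
sample = to_instruction_sample(record, VQA_TEMPLATES, rng)
held_in, held_out = split_datasets(
    ["dataset_a", "dataset_b", "dataset_c"], held_out=["dataset_b"]
)
```

Sampling a template per example, rather than fixing one, is one plausible way to expose the model to varied phrasings of the same task; held-out datasets are never seen during tuning, so scores on them measure zero-shot generalization.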
Authors: Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
Abstract: General-purpose language models that can solve various language-domain tasks have emerged driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains relatively less explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather a wide variety of 26 publicly available datasets, transform them into instruction tuning format and categorize them into two clusters for held-in instruction tuning and held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models have been open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
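To make the idea of instruction-aware visual feature extraction more concrete, here is a toy sketch in plain Python. It assumes the general Q-Former-style design inherited from BLIP-2: a set of query embeddings first mixes with the instruction tokens through self-attention, and the resulting queries then cross-attend to the image features. This is a heavily simplified illustration, not the authors' implementation: there are no learned projections, only a single layer, and all function names and vectors are made up.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """Scaled dot-product attention without learned projections (toy)."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return out


def extract_features(query_embeds, instruction_embeds, image_feats):
    """Return one visual feature vector per query, conditioned on the instruction."""
    # 1) Self-attention over [queries; instruction tokens] with a residual add,
    #    so each query absorbs information about the instruction.
    tokens = query_embeds + instruction_embeds
    attended = attend(tokens, tokens, tokens)
    mixed = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    # 2) Only the query positions cross-attend to the image features.
    conditioned_queries = mixed[:len(query_embeds)]
    return attend(conditioned_queries, image_feats, image_feats)


# Same image and queries, two different (made-up) instruction embeddings.
queries = [[1.0, 0.0], [0.0, 1.0]]
image = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
feats_a = extract_features(queries, [[3.0, 0.0]], image)
feats_b = extract_features(queries, [[0.0, 3.0]], image)
```

In this sketch the same image yields different query outputs depending on the instruction, which is the property the abstract describes as extracting features "tailored to the given instruction".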