Improved Baselines with Visual Instruction Tuning

AI-generated keywords: Multimodal models, Visual Instruction Tuning, LLaVA framework, Cross-modal connector, Reproducibility

AI-generated Key Points

  • The paper "Improved Baselines with Visual Instruction Tuning" presents improvements to large multimodal models (LMMs) within the LLaVA framework.
  • Using CLIP-ViT-L-336px with an MLP projection as the vision-language cross-modal connector, and adding academic-task-oriented Visual Question Answering (VQA) data with simple response-formatting prompts, leads to state-of-the-art performance across 11 benchmarks.
  • These modifications to the LLaVA framework are data-efficient, yielding stronger baselines from comparatively few training samples.
  • LLaVA's simplicity stands out compared to other approaches like InstructBLIP or Qwen-VL, which rely on complex visual resamplers trained on massive amounts of image-text paired data.
  • A simple fully-connected projection layer is trained on a relatively small dataset of 600K image-text pairs; the final 13B checkpoint uses about 1.2M publicly available samples and finishes training in approximately one day on a single 8-A100 machine.
  • LLaVA's approach differs from Qwen-VL by exclusively using publicly available data for training purposes, enhancing reproducibility and accessibility in state-of-the-art LMM research.
  • Overall, the improvements show that a simple, openly reproducible design can reach state-of-the-art multimodal results at modest training cost.

Authors: Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Tech report, 4 pages. LLaVA project page: https://llava-vl.github.io
License: CC BY 4.0

Abstract: Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
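To give a rough sense of scale for the connector described in the abstract, the sketch below counts the visual tokens produced for a 336px image and the parameters of an MLP projector. Only the 336px image size comes from the abstract. The patch size (14), the feature widths (1024 for the vision encoder, 5120 for the 13B language model), and the two-layer shape of the MLP are assumptions chosen for illustration.

```python
# Back-of-the-envelope sizing for an MLP vision-language connector.
# Except for the 336px image size, every constant here is an
# illustrative assumption, not a value stated in the abstract.


def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def mlp_param_count(in_dim: int, hidden_dim: int, out_dim: int) -> int:
    """Parameter count of a two-layer MLP (weights plus biases)."""
    first_layer = in_dim * hidden_dim + hidden_dim
    second_layer = hidden_dim * out_dim + out_dim
    return first_layer + second_layer


IMAGE_SIZE = 336   # from "CLIP-ViT-L-336px"
PATCH_SIZE = 14    # assumed patch size
VISION_DIM = 1024  # assumed vision feature width
LLM_DIM = 5120     # assumed language-model hidden width

tokens = num_patch_tokens(IMAGE_SIZE, PATCH_SIZE)
params = mlp_param_count(VISION_DIM, LLM_DIM, LLM_DIM)
print(f"visual tokens per image: {tokens}")
print(f"projector parameters: {params:,}")
```

Under these assumptions the projector is tiny relative to a 13B-parameter language model, which is consistent with the abstract's emphasis on a lightweight connector.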

Submitted to arXiv on 05 Oct. 2023


Results of the summarizing process for the arXiv paper: 2310.03744v1

The paper "Improved Baselines with Visual Instruction Tuning" presents improvements to large multimodal models (LMMs) within the LLaVA framework. The authors show that LLaVA's fully-connected vision-language cross-modal connector is powerful and data-efficient. With two simple modifications, namely using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented Visual Question Answering (VQA) data with simple response-formatting prompts, they establish stronger baselines that achieve state-of-the-art performance across 11 benchmarks.

Compared to approaches like InstructBLIP or Qwen-VL, which rely on complex visual resamplers trained on massive amounts of image-text paired data, LLaVA stands out for its simplicity: it trains a simple fully-connected projection layer on a relatively small dataset of 600K image-text pairs. The final 13B checkpoint uses about 1.2M publicly available samples and can be trained in approximately one day on a single 8-A100 machine. Unlike Qwen-VL, LLaVA uses only publicly available data for training, which improves reproducibility and accessibility in state-of-the-art LMM research. Overall, the study shows that simple modifications can substantially improve multimodal model performance while keeping cutting-edge research accessible and reproducible.
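The idea of a response-formatting prompt mentioned in the summary can be sketched as follows: an instruction is appended to a VQA question so the model knows a short answer is expected. The function name and the exact wording of the instruction are hypothetical and chosen for illustration; this page does not state the prompt text used by the authors.

```python
# Minimal sketch of a response-formatting prompt for VQA-style data.
# The hint wording and function name are hypothetical, for illustration only.

SHORT_ANSWER_HINT = "Answer the question using a single word or phrase."


def format_vqa_prompt(question: str, short_answer: bool = True) -> str:
    """Return the question, optionally followed by a short-answer instruction."""
    question = question.strip()
    if not short_answer:
        # Open-ended questions are passed through unchanged.
        return question
    return f"{question}\n{SHORT_ANSWER_HINT}"


print(format_vqa_prompt("What color is the bus?"))
```

Keeping the format instruction explicit in the prompt, rather than implicit in the data, lets the same model give short answers on benchmark-style questions and longer answers in open conversation.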
Created on 13 Jun. 2024

