An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

AI-generated keywords: Large Vision-Language Models (LVLMs)

AI-generated Key Points

  • Large Vision-Language Models (LVLMs) have gained traction in computer vision and natural language processing
  • FastV is a new method introduced to optimize computational efficiency within LVLMs
  • FastV works by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent layers
  • FastV can significantly reduce computational costs without compromising performance across various image and video understanding tasks
  • FastV offers customizable trade-offs between computational efficiency and performance based on specific requirements
  • Practical implications of FastV include deployment on edge devices and commercial models with limited computational resources

Authors: Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang

21 pages, 8 figures, code is released at https://github.com/pkunlp-icler/FastV
License: CC BY 4.0

Abstract: In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV are highly customizable and Pareto-efficient. It can compress the FLOPs of a 13B-parameter model to achieve a lower budget than that of a 7B-parameter model, while still maintaining superior performance. We believe FastV has practical value for deployment of LVLMs in edge devices and commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
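
The pruning step the abstract describes can be illustrated with a minimal sketch. It is not the released FastV code: the function name, the inputs, and the ranking criterion (drop the visual tokens that receive the least attention at the chosen layer) are assumptions made for illustration. The default ratio of 0.5 mirrors the "1/2 tokens" in the title.

```python
from typing import List, Sequence


def select_tokens_to_keep(
    attention_received: Sequence[float],
    is_visual: Sequence[bool],
    prune_ratio: float = 0.5,
) -> List[int]:
    """Return the sorted indices of tokens kept after pruning.

    attention_received[i] is a hypothetical per-token score, e.g. the
    average attention token i receives at the pruning layer.
    Text tokens are always kept; only visual tokens can be dropped.
    """
    if len(attention_received) != len(is_visual):
        raise ValueError("score and mask lengths differ")
    if not 0.0 <= prune_ratio <= 1.0:
        raise ValueError("prune_ratio must be in [0, 1]")

    visual = [i for i, flag in enumerate(is_visual) if flag]
    n_drop = int(len(visual) * prune_ratio)

    # Rank visual tokens from least to most attended (ties broken by index)
    # and drop the lowest-scoring n_drop of them.
    ranked = sorted(visual, key=lambda i: (attention_received[i], i))
    dropped = set(ranked[:n_drop])

    return [i for i in range(len(is_visual)) if i not in dropped]


# Toy sequence: one text token, four visual tokens, one text token.
scores = [0.30, 0.05, 0.20, 0.01, 0.10, 0.34]
mask = [False, True, True, True, True, False]
kept = select_tokens_to_keep(scores, mask, prune_ratio=0.5)
print(kept)  # [0, 2, 4, 5]
```

In a real model the kept indices would be used to slice the hidden states passed to every later layer, so all layers after the pruning point process a shorter sequence.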

Submitted to arXiv on 11 Mar. 2024

Results of the summarizing process for the arXiv paper: 2403.06764v1

In recent years, Large Vision-Language Models (LVLMs) have gained significant traction in the fields of computer vision and natural language processing. These models, such as LLaVA-1.5, QwenVL-Chat, and Video-LLaVA, have been used in a wide range of applications, from image description to internet navigation and decision-making in real-world scenarios. However, a key problem identified in these LVLMs is that attention computation over visual tokens is highly inefficient in the deep layers, which suggests that visual tokens call for sparser handling than textual data.

To address this issue, a new method called FastV has been introduced as a versatile plug-and-play solution for optimizing computational efficiency within LVLMs. FastV works by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent layers. Evaluations show that FastV can significantly reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without compromising performance across various image and video understanding tasks. One of the key strengths of FastV is that it is customizable: users can tailor the trade-off between computational efficiency and performance to their specific requirements. Notably, FastV can compress the FLOPs of a 13B-parameter model to a budget lower than that of a 7B-parameter model while still maintaining superior performance.

The practical implications of FastV are substantial, particularly for deployment on edge devices and in commercial models where computational resources are limited. By pruning visual tokens that contribute little to attention in deep layers, FastV offers a promising way to improve the scalability and applicability of large vision-language models in real-world settings. The code for implementing FastV is openly available at https://github.com/pkunlp-icler/FastV. In conclusion, FastV addresses an inefficiency observed in current LVLMs and makes it possible to run them across diverse applications at lower computational cost without a loss in performance.
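
The trade-off between pruning settings and compute can be made concrete with a rough estimate. The sketch below assumes a common per-layer approximation for a transformer layer (projection, attention, and feed-forward terms); this cost model, the function names, and all numeric values are illustrative assumptions, not figures taken from the summary above, and the output is not meant to reproduce the reported 45% reduction.

```python
def layer_flops(n: int, d: int, m: int) -> int:
    """Rough cost of one transformer layer for n tokens (assumed model).

    4*n*d*d : query/key/value/output projections
    2*n*n*d : attention scores and weighted sum
    2*n*d*m : feed-forward network with hidden size m
    """
    return 4 * n * d * d + 2 * n * n * d + 2 * n * d * m


def flops_ratio(
    n_tokens: int,
    n_visual: int,
    num_layers: int,
    prune_layer: int,
    prune_ratio: float,
    d: int,
    m: int,
) -> float:
    """Estimated FLOPs with pruning divided by FLOPs without pruning.

    Layers before prune_layer see the full sequence; the remaining
    layers see the sequence with a fraction of visual tokens removed.
    """
    n_after = n_tokens - int(n_visual * prune_ratio)
    full = num_layers * layer_flops(n_tokens, d, m)
    pruned = (
        prune_layer * layer_flops(n_tokens, d, m)
        + (num_layers - prune_layer) * layer_flops(n_after, d, m)
    )
    return pruned / full


# Illustrative values only: 640 tokens of which 576 are visual,
# 40 layers, hidden size 5120, feed-forward size 13824.
ratio = flops_ratio(640, 576, 40, prune_layer=2, prune_ratio=0.5,
                    d=5120, m=13824)
print(f"estimated fraction of original FLOPs: {ratio:.2f}")
```

Under this model, pruning earlier or pruning a larger share of visual tokens lowers the estimated cost, which is the kind of adjustable knob the summary refers to.
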
Created on 13 Mar. 2024

