HawkEye: Training Video-Text LLMs for Grounding Text in Videos

AI-generated keywords: Video-text LLMs

AI-generated Key Points

  • Video-text Large Language Models (LLMs) struggle with understanding and grounding text queries in longer and more complex videos
  • Long-form videos like movies, tutorials, and documentaries are crucial for conveying information, knowledge, opinions, and emotions
  • The paper focuses on enhancing the temporal video grounding abilities of existing video-text LLMs
  • HawkEye is a proposed solution that improves multi-modal understanding abilities through targeted training on long-form videos
  • HawkEye outperforms existing video-text LLMs on temporal video grounding and is comparable on other video-text tasks; the tasks considered include temporal video grounding, question grounding, and video question answering

Authors: Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, Dongyan Zhao

License: CC BY-NC-SA 4.0

Abstract: Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos. However, they perform almost the same as random on grounding text queries in long and complicated videos, having little ability to understand and reason about temporal information, which is the most fundamental difference between videos and images. In this paper, we propose HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner. To collect training data that is applicable for temporal video grounding, we construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans, with which we introduce two new time-aware training objectives to video-text LLMs. We also propose a coarse-grained method of representing segments in videos, which is more robust and easier for LLMs to learn and follow than other alternatives. Extensive experiments show that HawkEye is better at temporal video grounding and comparable on other video-text tasks with existing video-text LLMs, which verifies its superior video-text multi-modal understanding abilities.
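The abstract mentions a coarse-grained way of representing video segments but does not spell out the format. The following is a minimal sketch of what such a representation could look like, not the paper's actual scheme: the label set (`beginning`, `middle`, `end`, `throughout`), the 0.5 coverage threshold, and the function name `coarse_label` are all assumptions made for illustration.

```python
def coarse_label(start: float, end: float, duration: float) -> str:
    """Map a (start, end) span in seconds to a coarse position label.

    Hypothetical scheme: the label names and thresholds are assumptions,
    not taken from the paper.
    """
    if duration <= 0 or not (0 <= start < end <= duration):
        raise ValueError("expected 0 <= start < end <= duration")

    # A span covering at least half of the video gets its own label.
    if (end - start) / duration >= 0.5:
        return "throughout"

    # Otherwise classify by where the span's midpoint falls in the video.
    center = (start + end) / 2 / duration
    if center < 1 / 3:
        return "beginning"
    if center < 2 / 3:
        return "middle"
    return "end"


if __name__ == "__main__":
    print(coarse_label(0, 10, 100))
    print(coarse_label(45, 55, 100))
    print(coarse_label(10, 90, 100))
```

A small vocabulary of labels like this is plausibly easier for a text-to-text model to emit consistently than exact timestamps, which matches the robustness argument made in the abstract.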

Submitted to arXiv on 15 Mar. 2024


Results of the summarizing process for the arXiv paper: 2403.10228v1

While video-text Large Language Models (LLMs) have shown impressive capabilities in answering questions and holding conversations about simple videos, they struggle to understand and ground text queries in longer and more complex videos. This is a significant limitation, since long-form videos such as movies, tutorials, and documentaries play a crucial role in conveying information, knowledge, opinions, and emotions. To address this challenge, the paper focuses on enhancing the temporal video grounding abilities of existing video-text LLMs.

<kw>Video-text LLMs:</kw> In recent years, video-text LLMs have gained attention for their ability to answer questions and hold conversations about simple videos. However, they have little ability to understand and reason about temporal information in longer and more complex videos.

<kw>Temporal video grounding:</kw> The main focus of the paper is to improve the temporal video grounding abilities of existing video-text LLMs. The task involves identifying the specific segments related to a given text query within a longer video that contains multiple actions and events.

<kw>Long-form videos:</kw> Long-form videos such as movies, tutorials, and documentaries are essential for conveying a wide range of information, knowledge, opinions, and emotions. However, current video-text LLMs struggle to understand these types of videos.

<kw>HawkEye:</kw> The proposed solution involves targeted training on long-form videos without extensively modifying the pre-training or visual-text alignment stages. The result is HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner.

<kw>Multi-modal understanding abilities:</kw> Through an effective representation format for video segments and a large-scale time-aware training corpus (InternVid-G), HawkEye performs better at temporal video grounding than existing video-text LLMs and is comparable on other video-text tasks. The tasks considered include temporal video grounding, question grounding, and video question answering.
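To make the temporal video grounding task concrete, here is a small sketch of temporal intersection-over-union (IoU), a standard way to compare a predicted segment with a ground-truth segment. The summary does not state which metric the paper reports, so treat the choice of metric as an assumption; the function name `temporal_iou` is hypothetical.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Return the IoU of two time spans, a value in [0, 1]."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt

    # Overlap is zero when the spans do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection

    # Guard against degenerate zero-length spans.
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Predicted segment 0-10 s vs. annotated segment 5-15 s.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

A prediction is commonly counted as correct when its IoU with the annotated segment exceeds a chosen threshold; the threshold itself is a design choice and is not given in this summary.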
Created on 20 Sep. 2024

