Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

AI-generated keywords: Multimodal; Open-Vocabulary; Visual Recognition; VLMs; MOV; Zero-shot

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Vision and language models (VLMs) pre-trained on image-text pairs have shown promise for open-vocabulary visual recognition
  • Researchers including Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, and Yin Cui expanded the paradigm by incorporating motion and audio elements in video data
  • Introduction of MOV method for Multimodal Open-Vocabulary video classification
  • Utilization of vision encoder from pre-trained VLMs to encode video frames, optical flow, and audio spectrograms
  • Cross-modal fusion mechanism to effectively fuse multimodal inputs
  • Adding the optical flow or audio modality brings large performance gains on Kinetics-700 and VGGSound over existing methods and over the pre-trained VLM alone
  • Enhanced accuracy on base classes and superior generalization capabilities on novel classes with MOV
  • State-of-the-art results on zero-shot video classification benchmarks UCF and HMDB surpassing traditional zero-shot approaches and recent methods based on VLMs
  • Commitment to sharing code and models with research community for further exploration in multimodal open-vocabulary video classification techniques
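
The open-vocabulary idea in the key points above can be illustrated with a minimal sketch. It assumes the common VLM recipe of scoring a video embedding against text embeddings of class names by cosine similarity; this page does not say how MOV computes its scores, so the function names, the toy embeddings, and the temperature value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(video_embedding, class_text_embeddings, temperature=0.07):
    """Score one video embedding against every class-name embedding.

    Returns the best class name and a softmax distribution over classes.
    The temperature is an assumed value, not taken from the paper.
    """
    names = list(class_text_embeddings)
    logits = [
        cosine(video_embedding, class_text_embeddings[name]) / temperature
        for name in names
    ]
    # Subtract the max logit for numerical stability before exponentiating.
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(names, exps)}
    return max(probs, key=probs.get), probs


# Toy embeddings standing in for the outputs of a text encoder and a video
# encoder (hypothetical values; real embeddings have hundreds of dimensions).
text_embeddings = {
    "playing guitar": [0.9, 0.1, 0.0],
    "riding a bike": [0.0, 1.0, 0.2],
    "chopping wood": [0.1, 0.0, 1.0],
}
video_embedding = [0.8, 0.2, 0.1]

label, probs = classify(video_embedding, text_embeddings)
print(label)
```

Because classes are represented only by their text embeddings, a new class can be added at inference time by inserting another entry into `text_embeddings`, without retraining the classifier.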

Authors: Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui

Abstract: Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves the accuracy on base classes, while generalizing better on novel classes. MOV achieves state-of-the-art results on UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.
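
The abstract reports accuracy separately for base classes (seen during training) and novel classes (unseen). A minimal sketch of such a split evaluation follows; the helper name `split_accuracy` and the toy labels are hypothetical, and the paper's actual evaluation protocol is not described on this page.

```python
def split_accuracy(predictions, labels, base_classes):
    """Return top-1 accuracy separately for base and novel classes.

    A sample counts as "base" if its ground-truth label is in base_classes,
    otherwise as "novel". An empty split yields NaN.
    """
    base_set = set(base_classes)
    counts = {"base": [0, 0], "novel": [0, 0]}  # [correct, total] per split
    for pred, label in zip(predictions, labels):
        split = "base" if label in base_set else "novel"
        counts[split][0] += int(pred == label)
        counts[split][1] += 1
    return {
        split: (correct / total if total else float("nan"))
        for split, (correct, total) in counts.items()
    }


# Toy data: "run" and "swim" are base classes, "juggle" is a novel class.
labels = ["run", "run", "swim", "juggle", "juggle"]
predictions = ["run", "swim", "swim", "juggle", "run"]

accuracy = split_accuracy(predictions, labels, base_classes=["run", "swim"])
print(accuracy)
```

Reporting the two numbers side by side makes it visible when a method gains on base classes at the cost of novel-class generalization, which is the trade-off the abstract says MOV avoids.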

Submitted to arXiv on 15 Jul. 2022


Results of the summarizing process for the arXiv paper: 2207.07646v1


In open-vocabulary visual recognition, vision and language models (VLMs) pre-trained on large-scale image-text pairs have shown great promise. Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, and Yin Cui extend this paradigm by exploiting the motion and audio that naturally exist in video. They introduce MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification. MOV reuses the vision encoder from a pre-trained VLM, with minimal modifications, to encode not only video frames but also optical flow and audio spectrograms. A cross-modal fusion mechanism then aggregates the complementary information carried by the different modalities.

Experiments on Kinetics-700 and VGGSound show that adding the optical flow or audio modality brings large performance gains over both the pre-trained VLM alone and existing methods. In particular, MOV improves accuracy on base classes while also generalizing better to novel classes. MOV also achieves state-of-the-art results on the UCF and HMDB zero-shot video classification benchmarks, outperforming both traditional zero-shot approaches and recent VLM-based methods. The authors state that code and models will be released.
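
To make the notion of cross-modal fusion concrete, here is a minimal sketch assuming a generic scaled dot-product cross-attention in which video tokens attend over tokens from a second modality (optical flow or audio). This is an illustration only: the page does not describe MOV's actual fusion mechanism, the sketch omits the learned projections a real model would have, and all function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attend(queries, context):
    """Each query token takes a weighted average of the context tokens.

    Weights come from scaled dot-product similarity; no learned projections.
    """
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in context])
        attended.append(
            [sum(w * k[i] for w, k in zip(weights, context)) for i in range(len(q))]
        )
    return attended


def fuse(video_tokens, other_tokens):
    """Add attended features to the video tokens, then mean-pool to one vector."""
    attended = cross_attend(video_tokens, other_tokens)
    fused = [
        [v + a for v, a in zip(v_tok, a_tok)]
        for v_tok, a_tok in zip(video_tokens, attended)
    ]
    n = len(fused)
    return [sum(tok[i] for tok in fused) / n for i in range(len(fused[0]))]


# Toy token features (hypothetical): two video tokens, three audio tokens.
video_tokens = [[1.0, 0.0], [0.0, 1.0]]
audio_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

fused_embedding = fuse(video_tokens, audio_tokens)
print(fused_embedding)
```

The residual addition keeps the original video features intact, so the second modality can only add information on top of them; whether MOV makes the same design choice is not stated in the available metadata.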
Created on 28 Apr. 2025
