MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition

AI-generated keywords: Multi-Modal Video Transformer, Compressed Video, Action Recognition, Spatiotemporal Tokens, Cross-Modal Attention

AI-generated Key Points

The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

  • Authors Jiawei Chen and Chiu Man Ho introduce MM-ViT, a transformer-based approach for compressed video action recognition
  • MM-ViT operates exclusively in the compressed video domain and uses the modalities readily available there: I-frames, motion vectors, residuals, and the audio waveform
  • Scalable model variants are proposed to handle spatiotemporal tokens from multiple modalities by factorizing self-attention across space, time, and modality dimensions
  • Three distinct cross-modal attention mechanisms are developed to explore rich inter-modal interactions within the transformer building block
  • MM-ViT outperforms state-of-the-art video transformers in both efficiency and accuracy on the UCF-101, Something-Something-v2, and Kinetics-600 benchmarks
  • It performs as well as or better than state-of-the-art CNN counterparts that rely on computationally heavy optical flow
  • MM-ViT efficiently utilizes multiple modalities within a transformer framework to enhance video action recognition tasks
  • This research demonstrates the potential for further advancements in multi-modal deep learning models for complex visual tasks
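
To see why factorizing self-attention matters at this token count, the sketch below counts query-key pairs per layer for joint attention versus attention factorized along the modality, time, and space axes. The function name `attention_pairs` and the grid sizes (4 modalities, 8 frames, 196 patches per frame) are illustrative assumptions, not figures from the paper, and the count ignores projection and feed-forward costs.

```python
def attention_pairs(num_modalities, num_frames, num_patches):
    """Count query-key pairs per layer for joint vs. factorized self-attention.

    Assumes every modality contributes num_frames * num_patches tokens,
    which is a simplification (an audio waveform need not share that grid).
    """
    total_tokens = num_modalities * num_frames * num_patches
    # Joint attention: every token attends to every token.
    joint = total_tokens * total_tokens
    # Factorized attention: each token attends along one axis at a time.
    factorized = total_tokens * (num_modalities + num_frames + num_patches)
    return joint, factorized


# Hypothetical sizes chosen only for illustration.
joint, factorized = attention_pairs(num_modalities=4, num_frames=8, num_patches=196)
print(joint, factorized)  # 39337984 1304576
```

Under these assumed sizes, the factorized variant evaluates roughly 30 times fewer pairs per layer than joint attention.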

Authors: Jiawei Chen, Chiu Man Ho

Abstract: This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Different from other schemes which solely utilize the decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals and audio waveform. In order to handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants which factorize self-attention across the space, time and modality dimensions. In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the state-of-the-art video transformers in both efficiency and accuracy, and performs better or equally well to the state-of-the-art CNN counterparts with computationally-heavy optical flow.
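
As a rough illustration of factorizing self-attention across the space, time, and modality dimensions, here is a minimal pure-Python sketch. It is not the authors' implementation: it omits learned projections, multiple heads, residual connections, and normalization, and the axis order, the helper names (`self_attention`, `attend_along`, `factorized_attention`), and the toy token grid are all assumptions made for this example.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(len(query))])
    return outputs


def attend_along(tokens, axis):
    """Apply self-attention among tokens that differ only in `axis`.

    `tokens` maps (modality, time, space) index tuples to feature vectors.
    """
    groups = {}
    for index in tokens:
        rest = index[:axis] + index[axis + 1:]
        groups.setdefault(rest, []).append(index)
    updated = {}
    for indices in groups.values():
        indices.sort()
        for index, out in zip(indices, self_attention([tokens[i] for i in indices])):
            updated[index] = out
    return updated


def factorized_attention(tokens):
    # Order (space, then time, then modality) is an arbitrary choice here.
    for axis in (2, 1, 0):
        tokens = attend_along(tokens, axis)
    return tokens


# Toy grid: 2 modalities x 3 frames x 4 patches, 2-dimensional features.
tokens = {(m, t, s): [float(m + t), float(s)]
          for m in range(2) for t in range(3) for s in range(4)}
mixed = factorized_attention(tokens)
print(len(mixed), len(mixed[(0, 0, 0)]))  # 24 2
```

Each pass only mixes tokens that share the other two indices, so no single attention call ever sees the full set of modality-time-space tokens at once.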

Submitted to arXiv on 20 Aug. 2021

Results of the summarizing process for the arXiv paper: 2108.09322v1

In their paper "MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition," Jiawei Chen and Chiu Man Ho introduce a pure transformer-based approach to video action recognition. Unlike schemes that rely solely on decoded RGB frames, the Multi-Modal Video Transformer (MM-ViT) operates exclusively in the compressed video domain and exploits the modalities readily available there: I-frames, motion vectors, residuals, and the audio waveform. To handle the large number of spatiotemporal tokens extracted from these modalities, the authors propose several scalable model variants that factorize self-attention across the space, time, and modality dimensions. To explore inter-modal interactions, they also develop and compare three distinct cross-modal attention mechanisms that integrate into the transformer building block. In experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, and Kinetics-600), MM-ViT outperforms state-of-the-art video transformers in both efficiency and accuracy, and performs as well as or better than state-of-the-art CNN counterparts that use computationally heavy optical flow. These results indicate that compressed-domain modalities can be combined efficiently within a single transformer framework for action recognition.
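
The metadata does not describe what the three cross-modal attention mechanisms are, so the sketch below shows only the generic idea of cross-attention between two modalities: tokens from one modality act as queries over tokens from another. The function name `cross_attention`, the pairing of I-frame queries with motion-vector context, and the toy feature values are hypothetical, and learned projections are left out.

```python
import math


def cross_attention(queries, context):
    """Each query vector attends over the context vectors (used as keys and values)."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for query in queries:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in context]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, context))
                        for i in range(len(context[0]))])
    return outputs


# Hypothetical toy features: 2 I-frame tokens query 3 motion-vector tokens.
iframe_tokens = [[1.0, 0.0], [0.0, 1.0]]
motion_tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused = cross_attention(iframe_tokens, motion_tokens)
print(len(fused), len(fused[0]))  # 2 2
```

Each output row is a weighted average of the context tokens, so the first I-frame token, which aligns with the first motion token, ends up with a larger first component than second.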
Created on 11 Mar. 2025
