MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
AI-generated Key Points
⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.
- Authors Jiawei Chen and Chiu Man Ho introduce MM-ViT, a transformer-based approach for compressed video action recognition
- MM-ViT operates exclusively in the compressed video domain and uses all readily available modalities: I-frames, motion vectors, residuals, and the audio waveform
- Scalable model variants are proposed to handle spatiotemporal tokens from multiple modalities by factorizing self-attention across space, time, and modality dimensions
- Three distinct cross-modal attention mechanisms are developed to explore rich inter-modal interactions within the transformer building block
- MM-ViT outperforms state-of-the-art video transformers in both efficiency and accuracy on the UCF-101, Something-Something-v2, and Kinetics-600 benchmarks
- It performs as well as or better than state-of-the-art CNN counterparts that rely on computationally heavy optical flow
- MM-ViT efficiently utilizes multiple modalities within a transformer framework to enhance video action recognition tasks
- This research demonstrates the potential for further advancements in multi-modal deep learning models for complex visual tasks
Authors: Jiawei Chen, Chiu Man Ho
Abstract: This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Different from other schemes which solely utilize the decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals and audio waveform. In order to handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants which factorize self-attention across the space, time and modality dimensions. In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the state-of-the-art video transformers in both efficiency and accuracy, and performs better or equally well to the state-of-the-art CNN counterparts with computationally-heavy optical flow.
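The idea of factorizing self-attention across the space, time and modality dimensions can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the abstract does not specify the layer design, so the token layout `(M, T, S, D)`, the order space, then time, then modality, the residual connections, and the use of identity query/key/value projections are all assumptions made here for brevity. The sketch also assumes every modality shares the same token grid, which may not hold for audio. All function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    tokens has shape (..., N, D); attention runs over the N axis.
    Assumption: identity Q/K/V projections (no learned weights).
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def attend_along(tokens, axis):
    """Run self-attention along one leading axis of an (M, T, S, D) tensor."""
    moved = np.moveaxis(tokens, axis, -2)  # place the chosen axis next to D
    out = self_attention(moved)
    return np.moveaxis(out, -2, axis)      # restore the original layout


def factorized_attention(tokens):
    """Attend over space, then time, then modality (assumed order)."""
    x = tokens
    for axis in (2, 1, 0):                 # S, T, M in an (M, T, S, D) tensor
        x = x + attend_along(x, axis)      # residual connection (assumption)
    return x


def pairwise_scores(M, T, S):
    """Count attention scores: joint attention vs. the factorized scheme."""
    n = M * T * S
    joint = n * n                # every token attends to every other token
    factorized = n * (S + T + M)  # each token attends along one axis at a time
    return joint, factorized


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8, 196, 16))   # hypothetical sizes
    y = factorized_attention(x)
    print(y.shape)
    print(pairwise_scores(4, 8, 196))
```

With the hypothetical sizes above (4 modalities, 8 time steps, 196 spatial patches), joint attention would compute 39,337,984 pairwise scores per head, while the factorized scheme computes 1,304,576, which is the kind of saving that motivates factorization when the token count is large.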