mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

AI-generated keywords: Multi-modal pretraining, Transformer architecture, mPLUG-2, Modality collaboration, Modality entanglement

AI-generated Key Points

  • Significant convergence of language, vision, and multi-modal pretraining in the AI field
  • Development of large-scale pre-trained foundation models spanning various domains
  • Transformer-based models (e.g., T5, OFA, Flamingo) crucial in unifying tasks and modalities
  • Challenges with modality entanglement when dealing with multiple modalities in a single network
  • Introduction of mPLUG-2 paradigm for multi-modal pretraining to mitigate effects of modality entanglement
  • mPLUG-2 features a multi-module composition network for effective collaboration and disentanglement of different modalities
  • Achieves state-of-the-art or competitive results on over 30 downstream tasks encompassing multi-modal and uni-modal tasks
  • Demonstrates new state-of-the-art results on video QA and video caption tasks with smaller model size and data scale compared to existing models
  • Strong zero-shot transferability on vision-language and video-language tasks
  • Modularized multi-modal foundation model across text, image, and video domains with enhanced collaboration between modalities

Authors: Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou

ICML2023
License: CC BY 4.0

Abstract: Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement. It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 shows new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video caption tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released in https://github.com/alibaba/AliceMind.

Submitted to arXiv on 01 Feb. 2023


Results of the summarizing process for the arXiv paper: 2302.00402v1

In recent years, there has been a significant convergence of language, vision, and multi-modal pretraining in the field of artificial intelligence. This trend has led to the development of large-scale pre-trained foundation models that span language, vision, and multi-modal domains. The Transformer architecture has played a crucial role in this convergence, with models like T5, OFA, and Flamingo unifying tasks and modalities within a sequence-to-sequence generation framework. However, challenges arise when dealing with multiple modalities within a single network due to modality entanglement: different modalities may interfere with each other, especially when numerous modalities and tasks are involved. To address this issue, a new paradigm called mPLUG-2 has been introduced.

mPLUG-2 is a paradigm for multi-modal pretraining that aims to benefit from modality collaboration while mitigating the effects of modality entanglement. Unlike predominant paradigms that rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network. This network shares common universal modules for modality collaboration and disentangles different modality modules to deal with modality entanglement. This design makes it possible to select different modules for different understanding and generation tasks across all modalities, including text, image, and video.

Empirical studies show that mPLUG-2 achieves state-of-the-art or competitive results on over 30 downstream tasks, spanning multi-modal tasks such as image-text and video-text understanding and generation, as well as uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 reports new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video caption tasks, with a far smaller model size and data scale than existing models. It also exhibits strong zero-shot transferability on vision-language and video-language tasks.

Overall, mPLUG-2 presents a modularized multi-modal foundation model across text, image, and video. Its design allows for collaboration between different modalities while addressing the complexities associated with modality entanglement. The authors state that code and models will be released on GitHub at https://github.com/alibaba/AliceMind.
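
The module-composition idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy "modules" are simple numeric functions, and the names `MODALITY_MODULES`, `universal_module`, `TASK_MODALITIES`, and `run_task` are hypothetical, chosen only to show how separate per-modality modules might be combined with one shared module and selected per task.

```python
from typing import Callable, Dict, List

Vector = List[float]
Module = Callable[[Vector], Vector]


def make_scaler(factor: float) -> Module:
    """Toy stand-in for a learned module: scales every feature."""
    def module(x: Vector) -> Vector:
        return [factor * v for v in x]
    return module


# Assumption: one separate module per modality (the "disentangled" part).
MODALITY_MODULES: Dict[str, Module] = {
    "text": make_scaler(1.0),
    "image": make_scaler(2.0),
    "video": make_scaler(3.0),
}


def universal_module(x: Vector) -> Vector:
    """Assumption: a single module shared by all modalities (the "collaboration" part)."""
    return [v + 0.5 for v in x]


# Hypothetical task table: which modality modules each task composes.
TASK_MODALITIES: Dict[str, List[str]] = {
    "text_classification": ["text"],
    "image_captioning": ["image", "text"],
    "video_qa": ["video", "text"],
}


def run_task(task: str, inputs: Dict[str, Vector]) -> Dict[str, Vector]:
    """Apply each required modality's own module, then the shared module."""
    outputs: Dict[str, Vector] = {}
    for modality in TASK_MODALITIES[task]:
        features = MODALITY_MODULES[modality](inputs[modality])
        outputs[modality] = universal_module(features)
    return outputs


if __name__ == "__main__":
    print(run_task("video_qa", {"video": [1.0], "text": [2.0]}))
```

In this sketch, adding a new task only means adding a row to the task table; the per-modality modules stay independent of each other, while the shared module is reused by every task.
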
Created on 20 Mar. 2024

