Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

AI-generated keywords: Multimodal learning

AI-generated Key Points

  • Significant surge in multimodal learning in image-to-text and text-to-image generation
  • Limitations of progress to English language due to lack of large-scale image-text data in other languages
  • Introduction of MPM training paradigm for training large multimodal models in low-resource languages
  • Leveraging multilingual language models for zero-shot multimodal learning across languages
  • Success of MPM demonstrated through VisCPM models achieving state-of-the-art performance in Chinese
  • Contributions of team members crucial in designing, collecting data, and implementing training codebases for MPM and VisCPM
  • Importance of initiatives like MPM in bridging the gap between English-centric advancements and non-English languages

Authors: Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang, Xu Han, Yankai Lin, Jiao Xue, Dahai Li, Zhiyuan Liu, Maosong Sun

https://github.com/OpenBMB/VisCPM.git
License: CC BY 4.0

Abstract: Recently there has been a significant surge in multimodal learning in terms of both image-to-text and text-to-image generation. However, the success is typically limited to English, leaving other languages largely behind. Building a competitive counterpart in other languages is highly challenging due to the low-resource nature of non-English multimodal data (i.e., lack of large-scale, high-quality image-text data). In this work, we propose MPM, an effective training paradigm for training large multimodal models in low-resource languages. MPM demonstrates that Multilingual language models can Pivot zero-shot Multimodal learning across languages. Specifically, based on a strong multilingual large language model, multimodal models pretrained on English-only image-text data can well generalize to other languages in a zero-shot manner for both image-to-text and text-to-image generation, even surpassing models trained on image-text data in native languages. Taking Chinese as a practice of MPM, we build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese. To facilitate future research, we open-source codes and model weights at https://github.com/OpenBMB/VisCPM.git.

Submitted to arXiv on 23 Aug. 2023

Results of the summarizing process for the arXiv paper: 2308.12038v1

In recent years, multimodal learning has surged, particularly in image-to-text and text-to-image generation. This progress has largely been limited to English; other languages lag behind because they lack large-scale, high-quality image-text data. To address this, the authors propose MPM, a training paradigm for building large multimodal models in low-resource languages. MPM uses a multilingual language model as a pivot for zero-shot multimodal learning across languages: built on a strong multilingual large language model, multimodal models pretrained on English-only image-text data generalize to other languages in a zero-shot manner for both image-to-text and text-to-image generation, and even surpass models trained on native-language image-text data.

To demonstrate MPM, the authors take Chinese as a case study and build VisCPM, a set of large multimodal models for image-to-text and text-to-image generation. These models achieve state-of-the-art performance among open-source models in Chinese, and the code and model weights are released as open source for future research.

Team members contributed across the project, from designing the model architecture to collecting multimodal datasets for pretraining and implementing the training codebases. The models were evaluated with both automatic and human evaluation.
Overall, with advancements in powerful multimodal models like GPT-4 and Stable Diffusion reshaping the landscape of AI towards achieving Artificial General Intelligence (AGI), initiatives like MPM are vital in bridging the gap between English-centric advancements and non-English languages. By enabling zero-shot transfer learning across languages for multimodal tasks, MPM opens up new possibilities for advancing AI capabilities globally.
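
The pivoting idea described in the summary can be illustrated with a deliberately tiny sketch. This is not the VisCPM architecture or training code: the concept vectors, the vocabularies, the image features, and the function names `project`, `train`, and `caption` are all hypothetical. The sketch assumes a shared text space in which an English word and its Chinese translation have the same embedding (standing in for the alignment a multilingual language model provides), fits a linear image-to-text projection on English captions only, and then retrieves a Chinese caption without any Chinese image-text pairs.

```python
# Toy illustration only; NOT the VisCPM architecture or training code.
# All names, vectors, and vocabularies below are hypothetical.

# Stand-in for a multilingual LLM's aligned representation space: an English
# word and its Chinese translation share one concept vector.
CONCEPTS = {"cat": (1.0, 0.0, 0.0), "dog": (0.0, 1.0, 0.0), "car": (0.0, 0.0, 1.0)}
EN = {"cat": "cat", "dog": "dog", "car": "car"}
ZH = {"猫": "cat", "狗": "dog", "汽车": "car"}

# Made-up image features paired with ENGLISH captions only.
TRAIN = [((0.0, 2.0, 0.0), "cat"), ((0.0, 0.0, 2.0), "dog"), ((2.0, 0.0, 0.0), "car")]


def project(W, x):
    """Map an image feature vector into the shared text space (W @ x)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def train(pairs, lexicon, epochs=200, lr=0.1):
    """Fit a linear image-to-text projection with plain SGD on squared error."""
    W = [[0.0] * 3 for _ in range(3)]
    for _ in range(epochs):
        for x, word in pairs:
            target = CONCEPTS[lexicon[word]]
            err = [p - t for p, t in zip(project(W, x), target)]
            for i in range(3):
                for j in range(3):
                    W[i][j] -= lr * err[i] * x[j]
    return W


def caption(W, x, lexicon):
    """Return the word in `lexicon` whose embedding best matches the image."""
    z = project(W, x)
    return max(lexicon, key=lambda w: sum(a * b for a, b in zip(z, CONCEPTS[lexicon[w]])))


W = train(TRAIN, EN)                    # English-only supervision
print(caption(W, (0.1, 1.9, 0.2), EN))  # caption retrieved from the English vocabulary
print(caption(W, (0.1, 1.9, 0.2), ZH))  # same image, Chinese vocabulary, no Chinese training
```

In this toy setting the transfer works only because the two vocabularies are perfectly aligned by construction; how well a real multilingual language model supports such transfer is what the paper evaluates.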
Created on 12 Apr. 2024
