Generative Pretraining in Multimodality

AI-generated keywords: Emu, Transformer-based, Multimodal, Image-Text Generation, Autoregressive Training

AI-generated Key Points

  • Emu is a powerful Transformer-based model for generating images and texts in a multimodal context.
  • It can process single-modality or multimodal data inputs, such as interleaved image, text, and video.
  • Emu encodes visual signals into embeddings and combines them with text tokens to form an interleaved input sequence.
  • It is trained end-to-end to classify the next text token or regress the next visual embedding in the multimodal sequence.
  • Emu can explore diverse pretraining data sources at scale, including videos, webpages, image-text pairs, and video-text pairs.
  • It serves as a generalist multimodal interface for both image-to-text and text-to-image tasks.
  • Emu performs strongly against state-of-the-art large multimodal models across various zero-shot/few-shot tasks such as image captioning and visual question answering.
  • It can also serve as a multimodal assistant via instruction tuning with impressive performance.
  • Qualitative evaluations demonstrate capabilities of Emu that quantitative benchmarks alone do not capture.
  • Emu proves to be an advanced solution for multimodal generation tasks.

Authors: Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang

Code and Demo: https://github.com/baaivision/Emu
License: CC BY 4.0

Abstract: We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.
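
The interleaved input sequence described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the Emu codebase: `toy_tokenize` is a whitespace tokenizer, `toy_encode_image` is a stand-in encoder that averages chunks of a pixel list, and the choice of two embeddings per image is arbitrary. The sketch only shows how text tokens and visual embeddings can share one ordered sequence.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple, Union


@dataclass
class TextToken:
    token_id: int


@dataclass
class VisualEmbedding:
    vector: List[float]


Element = Union[TextToken, VisualEmbedding]


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[TextToken]:
    # Hypothetical whitespace tokenizer; unseen words get the next free id.
    return [TextToken(vocab.setdefault(word, len(vocab))) for word in text.split()]


def toy_encode_image(pixels: Sequence[float], n_embeddings: int = 2) -> List[VisualEmbedding]:
    # Hypothetical encoder: average each chunk of pixels into a 1-D embedding.
    chunk = max(1, len(pixels) // n_embeddings)
    embeddings = []
    for i in range(n_embeddings):
        part = list(pixels[i * chunk:(i + 1) * chunk]) or [0.0]
        embeddings.append(VisualEmbedding([sum(part) / len(part)]))
    return embeddings


def build_interleaved_sequence(
    segments: Sequence[Tuple[str, object]], vocab: Dict[str, int]
) -> List[Element]:
    # Keep the original document order so text and images stay interleaved.
    sequence: List[Element] = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(toy_tokenize(payload, vocab))
        elif kind == "image":
            sequence.extend(toy_encode_image(payload))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


vocab: Dict[str, int] = {}
segments = [
    ("text", "a photo of"),
    ("image", [0.0, 1.0, 2.0, 3.0]),
    ("text", "an emu"),
]
sequence = build_interleaved_sequence(segments, vocab)
kinds = ["T" if isinstance(e, TextToken) else "V" for e in sequence]
print("".join(kinds))  # prints TTTVVTT
```

In a real model the visual embeddings would come from a learned image encoder and the text tokens from a trained tokenizer; the point of the sketch is only that both kinds of element occupy positions in a single sequence.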

Submitted to arXiv on 11 Jul. 2023


Results of the summarizing process for the arXiv paper: 2307.05222v1

Emu is a Transformer-based model that generates images and text within a multimodal context. It can process any single-modality or multimodal data input, such as interleaved image, text, and video, through a one-model-for-all autoregressive training process. The model encodes visual signals into embeddings and combines them with text tokens to form an interleaved input sequence. Emu is then trained end-to-end with the objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality allows Emu to explore diverse pretraining data sources at scale, including videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. As a result, Emu serves as a generalist multimodal interface for both image-to-text and text-to-image tasks, enabling in-context image and text generation. In terms of performance, Emu performs strongly against state-of-the-art large multimodal models across various zero-shot/few-shot tasks such as image captioning, visual question answering, video question answering, and text-to-image generation. Its extended capabilities include serving as a multimodal assistant via instruction tuning. The authors also conducted qualitative evaluations to showcase capabilities that quantitative benchmarks alone do not capture. These real-world examples demonstrate Emu's effectiveness in generating high-quality outputs. In summary, Emu is presented as a unified model for multimodal generation tasks.
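
The unified objective in the summary (classify the next text token, or regress the next visual embedding) can be sketched as a single per-position loss. The summary does not specify the loss functions or their weighting, so this sketch assumes cross-entropy for text positions, mean squared error for visual positions, and an unweighted average over positions. The function names `cross_entropy`, `l2_loss`, and `unified_loss` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple, Union

Target = Tuple[str, Union[int, Sequence[float]]]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    # Numerically stable log-sum-exp, then negative log-likelihood of the target.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def l2_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    # Mean squared error between predicted and target embedding (assumed choice).
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def unified_loss(predictions: List[Sequence[float]], targets: List[Target]) -> float:
    # predictions[i] holds logits when targets[i] is text,
    # or a predicted embedding when targets[i] is visual.
    total = 0.0
    for pred, (kind, target) in zip(predictions, targets):
        if kind == "text":
            total += cross_entropy(pred, target)
        elif kind == "visual":
            total += l2_loss(pred, target)
        else:
            raise ValueError(f"unknown target kind: {kind}")
    return total / len(targets)  # unweighted average (assumption)


predictions = [[0.0, 0.0], [1.0, 3.0]]
targets = [("text", 1), ("visual", [1.0, 1.0])]
loss = unified_loss(predictions, targets)
print(loss)
```

With uniform logits over two classes the text term equals ln 2, and the visual term is the mean of the squared differences, so the printed value is the average of those two terms.
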
Created on 12 Jul. 2023


Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.