AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

AI-generated keywords: Multimodal Language Model; AnyGPT; Discrete Representations; Data Synthesis; LLaMA-2

AI-generated Key Points

  • Developed AnyGPT: an any-to-any multimodal language model integrating speech, text, images, and music using discrete representations
  • Does not require alterations to architecture or training paradigms; relies on data-level preprocessing for new modalities
  • Curated a multimodal text-centric dataset and synthesized a large-scale any-to-any multimodal instruction dataset
  • Two-stage data synthesis: (1) generating text-based conversations with multimodal elements; (2) constructing scenarios based on user inputs related to games and interactive media
  • Fine-tuned from LLaMA-2 to respond across all modalities
  • Experimental results show comparable performance to specialized models in facilitating any-to-any multimodal conversation
  • Demonstrated effectiveness of discrete representations in unifying multiple modalities within the language model

Authors: Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu

25 pages, 16 figures, under review, work in progress
License: CC BY-NC-SA 4.0

Abstract: We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown in https://junzhan2000.github.io/AnyGPT.github.io/

Submitted to arXiv on 19 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.12226v1

To address the limitations of existing multimodal language models, we developed AnyGPT, an any-to-any multimodal language model that integrates speech, text, images, and music using discrete representations. AnyGPT requires no changes to the existing large language model (LLM) architecture or training paradigm. Instead, it relies on data-level preprocessing, so a new modality can be incorporated much as a new language would be.

To support this integration, we curated a multimodal text-centric dataset for alignment pre-training and synthesized a large-scale any-to-any multimodal instruction dataset of 108k multi-turn conversations. The synthesis proceeds in two stages. First, GPT-4 generates text-based conversations that contain multimodal elements, expanding meta topics into specific scenarios and demonstrating diverse modality combinations. Second, scenarios are constructed from user inputs related to games and interactive media, with images and music incorporated through detailed textual representations. This process yields high-quality data at scale and guides the model toward contextually appropriate conversational scenarios.

AnyGPT is fine-tuned from LLaMA-2 to respond across all modalities. Experimental results show that it supports any-to-any multimodal conversation while achieving performance comparable to specialized models. The demos on the project website illustrate how discrete representations can unify multiple modalities within a language model. For details on the methodology and results, see https://junzhan2000.github.io/AnyGPT.github.io/.
Created on 21 Feb. 2024
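
The data-level preprocessing described in the summary can be pictured as mapping every modality into one shared token vocabulary, so an unmodified LLM sees a single sequence of integer ids. The sketch below is illustrative only and is not the authors' implementation. The names `Modality`, `build_layout`, `encode_segment`, and `interleave` are hypothetical, as are the codebook sizes and the convention of wrapping each non-text segment in a start and end marker. The per-modality tokenizers that would produce the discrete codes are assumed to exist and are not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modality:
    name: str
    codebook_size: int  # number of discrete codes its tokenizer emits (hypothetical value)


def build_layout(text_vocab_size, modalities):
    """Give each modality a contiguous id range placed after the text vocabulary.

    Each modality receives two marker ids (start, end) followed by one id per code.
    Returns the layout and the total size of the extended vocabulary.
    """
    layout = {}
    next_id = text_vocab_size
    for m in modalities:
        layout[m.name] = {"start": next_id, "end": next_id + 1, "offset": next_id + 2}
        next_id += 2 + m.codebook_size
    return layout, next_id


def encode_segment(layout, modality, codes):
    """Shift raw codebook indices into the shared vocabulary and wrap them in markers."""
    slot = layout[modality.name]
    for c in codes:
        if not 0 <= c < modality.codebook_size:
            raise ValueError(f"code {c} is outside the {modality.name} codebook")
    return [slot["start"], *(slot["offset"] + c for c in codes), slot["end"]]


def interleave(layout, parts):
    """Flatten (modality, ids) parts into one sequence; modality None means plain text ids."""
    sequence = []
    for modality, ids in parts:
        if modality is None:
            sequence.extend(ids)
        else:
            sequence.extend(encode_segment(layout, modality, ids))
    return sequence


if __name__ == "__main__":
    # Illustrative sizes only; they are assumptions, not values taken from the paper.
    image = Modality("image", 8192)
    speech = Modality("speech", 1024)
    layout, vocab_size = build_layout(32000, [image, speech])
    print(vocab_size)
    print(interleave(layout, [(None, [1, 15, 27]), (image, [0, 5, 8191])]))
```

Under this scheme, adding a modality only extends the vocabulary and changes how the training data is serialized, which matches the summary's comparison to adding a new language. The model's embedding table would still need to grow to cover the new ids.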


