Any-to-Any Generation via Composable Diffusion

AI-generated keywords: Composable Diffusion, Generative Model, Multimodal Space, Real-world Applications, State-of-the-art

AI-generated Key Points

  • Composable Diffusion (CoDi) is a generative model that can generate any combination of output modalities from any combination of input modalities.
  • CoDi can generate multiple modalities in parallel and is highly customizable and flexible.
  • The model achieves strong joint-modality generation quality and outperforms or is on par with the unimodal state-of-the-art for single-modality synthesis.
  • CoDi's ability to align modalities in both the input and output space allows it to freely condition on any input combination and generate any group of modalities even if they are not present in the training data.
  • CoDi employs a novel composable generation strategy that involves building a shared multimodal space by bridging alignment in the diffusion process, enabling synchronized generation of intertwined modalities such as temporally aligned video and audio.
  • CoDi's ability to generate multiple modalities simultaneously makes it ideal for real-world applications where multiple modalities coexist and interact.
  • CoDi was evaluated on tasks such as image captioning, text-to-image synthesis, and audio captioning, where it outperformed or was on par with other state-of-the-art models.
  • The project page with demonstrations and code is available at https://codi-gen.github.io/.

Authors: Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal

Project Page: https://codi-gen.github.io
License: CC BY 4.0

Abstract: We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel and its input is not limited to a subset of modalities like text or image. Despite the absence of training datasets for many combinations of modalities, we propose to align modalities in both the input and output space. This allows CoDi to freely condition on any input combination and generate any group of modalities, even if they are not present in the training data. CoDi employs a novel composable generation strategy which involves building a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio. Highly customizable and flexible, CoDi achieves strong joint-modality generation quality, and outperforms or is on par with the unimodal state-of-the-art for single-modality synthesis. The project page with demonstrations and code is at https://codi-gen.github.io

Submitted to arXiv on 19 May 2023


Results of the summarizing process for the arXiv paper: 2305.11846v1

Composable Diffusion (CoDi) is a generative model that can generate any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities. Unlike existing generative AI systems, which are limited to a subset of modalities such as text or image, CoDi can generate multiple modalities in parallel and is highly customizable and flexible. The model achieves strong joint-modality generation quality and outperforms or is on par with the unimodal state-of-the-art for single-modality synthesis.

A key feature of CoDi is its ability to align modalities in both the input and output space. This allows the model to condition freely on any input combination and generate any group of modalities, even if that combination is not present in the training data. To achieve this alignment, CoDi employs a composable generation strategy that builds a shared multimodal space by bridging alignment in the diffusion process. This enables the synchronized generation of intertwined modalities, such as temporally aligned video and audio.

Models restricted to a single modality are of limited use in real-world settings where multiple modalities coexist and interact, and CoDi's ability to generate multiple modalities simultaneously makes it well suited to such applications. One could instead chain together modality-specific generative models in a multi-step generation setting, but that approach is inherently limited by each step's generation power, and a serial multi-step process can also be cumbersome and slow.

CoDi was evaluated on tasks such as image captioning, text-to-image synthesis, and audio captioning, where it outperformed or was on par with other state-of-the-art models. Overall, Composable Diffusion is a promising advance in generative modeling with potential for real-world applications across various industries.
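To make the idea of conditioning on "any input combination" more concrete, here is a minimal sketch of one way inputs could be combined once every modality has been mapped into a shared embedding space. It is an illustrative assumption, not the authors' implementation: the summary does not say how CoDi merges its inputs, and the function name `compose_condition`, the weighted-average rule, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so no modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def compose_condition(embeddings, weights=None):
    """Blend per-modality embeddings that already live in one shared space.

    embeddings: dict mapping a modality name (hypothetical, e.g. "text")
                to a list of floats, all of the same length.
    weights:    optional dict of non-negative weights per modality;
                defaults to equal weighting.
    Returns a single conditioning vector (a weighted average).
    """
    if not embeddings:
        raise ValueError("need at least one input modality")
    names = sorted(embeddings)
    dim = len(embeddings[names[0]])
    if any(len(embeddings[name]) != dim for name in names):
        raise ValueError("all embeddings must share the same dimension")
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    mixed = [0.0] * dim
    for name in names:
        unit = normalize(embeddings[name])
        share = weights[name] / total
        for i, value in enumerate(unit):
            mixed[i] += share * value
    return mixed


if __name__ == "__main__":
    # Toy embeddings standing in for the outputs of aligned encoders.
    condition = compose_condition({"text": [1.0, 0.0], "audio": [0.0, 2.0]})
    print(condition)
```

Because the blend only assumes that the inputs share a space, the same function accepts one modality, two, or more, which is the property the summary attributes to aligning modalities in the input space.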
Created on 22 May 2023

