Any-to-Any Generation via Composable Diffusion
AI-generated Key Points
- Composable Diffusion (CoDi) is a generative model that can generate any combination of output modalities from any combination of input modalities.
- CoDi can generate multiple modalities in parallel and is highly customizable and flexible.
- The model achieves strong joint-modality generation quality and outperforms or is on par with the unimodal state-of-the-art for single-modality synthesis.
- CoDi's ability to align modalities in both the input and output space allows it to freely condition on any input combination and generate any group of modalities even if they are not present in the training data.
- CoDi employs a novel composable generation strategy that involves building a shared multimodal space by bridging alignment in the diffusion process, enabling synchronized generation of intertwined modalities such as temporally aligned video and audio.
- CoDi's ability to generate multiple modalities simultaneously makes it ideal for real-world applications where multiple modalities coexist and interact.
- CoDi consistently outperforms other state-of-the-art models across tasks such as image captioning, text-to-image synthesis, and audio captioning.
- The project page with demonstrations and code is available at https://codi-gen.github.io/.
Authors: Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal
Abstract: We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel and its input is not limited to a subset of modalities like text or image. Despite the absence of training datasets for many combinations of modalities, we propose to align modalities in both the input and output space. This allows CoDi to freely condition on any input combination and generate any group of modalities, even if they are not present in the training data. CoDi employs a novel composable generation strategy which involves building a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio. Highly customizable and flexible, CoDi achieves strong joint-modality generation quality, and outperforms or is on par with the unimodal state-of-the-art for single-modality synthesis. The project page with demonstrations and code is at https://codi-gen.github.io
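The abstract says CoDi aligns modalities in a shared space so that it can condition on any combination of inputs, but it gives no implementation details. The following is only a minimal sketch of one way such composition could work, assuming each modality already has an encoder that maps into the same vector space and that several conditions are merged by weighted interpolation. The function names, the toy vectors, and the weighting scheme are all hypothetical and are not taken from the CoDi code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so modalities are comparable."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def compose_conditions(embeddings, weights=None):
    """Merge per-modality embeddings into one conditioning vector.

    embeddings: dict mapping a modality name (e.g. "text", "audio") to a
        vector assumed to live in a shared, already-aligned space.
    weights: optional dict of non-negative weights per modality; defaults
        to equal weighting. Weights are rescaled to sum to 1.
    """
    names = list(embeddings)
    if not names:
        raise ValueError("at least one modality is required")
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    dim = len(embeddings[names[0]])
    combined = [0.0] * dim
    for name in names:
        if len(embeddings[name]) != dim:
            raise ValueError("all embeddings must share one dimension")
        unit = l2_normalize(embeddings[name])
        share = weights[name] / total
        for i, value in enumerate(unit):
            combined[i] += share * value
    return combined


# Toy stand-ins for encoder outputs (hypothetical values).
text_embedding = [1.0, 0.0]
audio_embedding = [0.0, 2.0]
condition = compose_conditions({"text": text_embedding, "audio": audio_embedding})
```

Because every input ends up as a vector in the same space, the same function accepts one modality or several, which is the property the abstract relies on when it describes conditioning on input combinations that never co-occur in the training data.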