Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

AI-generated keywords: Open-Vocabulary Recognition, Panoptic Segmentation, Text-Image Discriminative Models, Text-to-Image Generation, ODISE

AI-generated Key Points

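  • Open-vocabulary recognition, which aims to recognize an unlimited set of categories in a scene, has gained significant attention in computer vision.
  • Panoptic segmentation parses object instances and scene semantics simultaneously; very few open-vocabulary approaches handle both in a unified framework.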
  • Current methods for open-vocabulary recognition rely on pre-trained text-image discriminative models.
  • These models often lack the spatial and relational understanding necessary for scene-level comprehension.
  • Text-to-image generation using diffusion models has revolutionized image synthesis.
  • Diffusion models compute cross attention between text and visual representation during image generation.
  • ODISE (Open-vocabulary DIffusion-based panoptic SEgmentation) combines pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation.
  • By using the frozen representations of both models, ODISE aims to overcome the limited spatial and relational understanding of existing methods.
  • ODISE outperforms previous state-of-the-art approaches on both open-vocabulary panoptic segmentation and semantic segmentation.
  • Trained only on COCO, ODISE achieves 23.4 PQ (Panoptic Quality) and 30.0 mIoU (mean Intersection over Union) on ADE20K, an absolute improvement of 8.3 PQ and 7.9 mIoU over the previous state of the art.

Authors: Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, Shalini De Mello

CVPR 2023. Project page: https://jerryxu.net/ODISE
License: CC BY 4.0

Abstract: We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have shown the remarkable capability of generating high-quality images with diverse open-vocabulary language descriptions. This demonstrates that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models like CLIP, on the other hand, are good at classifying images into open-vocabulary labels. We propose to leverage the frozen representation of both these models to perform panoptic segmentation of any category in the wild. Our approach outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, our method achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, with 8.3 PQ and 7.9 mIoU absolute improvement over the previous state-of-the-art. Project page is available at https://jerryxu.net/ODISE.
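
The abstract notes that text-image discriminative models such as CLIP can classify images into open-vocabulary labels. As a rough illustration of the general idea (not the authors' code), the sketch below scores an image embedding against text embeddings of arbitrary label names by cosine similarity and picks the best match. The embeddings are hand-made toy vectors and the function names are hypothetical; a real system would obtain the vectors from trained image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_open_vocab(image_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the image.

    label_embeddings maps any label name to a vector, so the label set
    can be changed at test time without retraining anything.
    """
    scores = {name: cosine(image_embedding, emb)
              for name, emb in label_embeddings.items()}
    return max(scores, key=scores.get), scores

# Toy vectors standing in for encoder outputs (assumed, not real data).
label_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "zebra": [0.0, 0.1, 0.9],
}
image_embedding = [0.1, 0.2, 0.8]

best_label, scores = classify_open_vocab(image_embedding, label_embeddings)
print(best_label)
```

Because labels enter only through their text embeddings, adding a new category amounts to adding one more entry to the dictionary.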

Submitted to arXiv on 08 Mar. 2023

Results of the summarizing process for the arXiv paper: 2303.04803v1

In computer vision, open-vocabulary recognition has gained significant attention. It aims to replicate human-like understanding by recognizing an unlimited set of categories in a scene. However, very few approaches provide a unified framework for parsing object instances and scene semantics simultaneously, a task known as panoptic segmentation. Most current methods for open-vocabulary recognition rely on text-image discriminative models pre-trained on large-scale data. While these models excel at classifying individual objects and pixels, they often lack the spatial and relational understanding necessary for scene-level structural comprehension. For example, CLIP, a popular text-image discriminative model, has been shown to struggle with accurately identifying spatial relations between objects.
On the other hand, text-to-image generation using diffusion models trained on Internet-scale data has revolutionized image synthesis. These diffusion models can generate high-quality images from diverse open-vocabulary language descriptions. Notably, they compute cross-attention between the provided text and their internal visual representation during image generation, which suggests that this internal representation may be well differentiated and correlated with the high- and mid-level semantic concepts that language describes.
Motivated by this observation, the authors propose ODISE (Open-vocabulary DIffusion-based panoptic SEgmentation), an approach that combines pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. By leveraging the frozen representations of both types of models, ODISE aims to overcome the limitations of existing methods in spatial and relational understanding.
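
The cross-attention between text and visual features mentioned above can be illustrated with a minimal scaled dot-product sketch. This is a generic textbook formulation, not the diffusion model's actual implementation: the learned query, key, and value projections are omitted, and the two-dimensional feature vectors are toy values chosen for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: one vector per visual location.
    keys, values: one vector per text token.
    Returns the attended outputs and the attention weights, where
    weights[i][j] says how strongly location i attends to token j.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy features (assumed): two visual locations, two text tokens.
visual = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[4.0, 0.0], [0.0, 4.0]]
text_values = [[1.0, 0.0], [0.0, 1.0]]

outputs, weights = cross_attention(visual, text_keys, text_values)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of the weights sums to 1, so every visual location distributes its attention across the text tokens; this coupling between locations and words is what the summary points to as evidence of semantically structured internal features.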
The results show that ODISE, trained only on COCO data, outperforms previous state-of-the-art approaches on both open-vocabulary panoptic segmentation and semantic segmentation, with significant improvements in PQ (Panoptic Quality) and mIoU (mean Intersection over Union). For instance, on the ADE20K dataset, ODISE achieves a PQ of 23.4 and an mIoU of 30.0, surpassing the previous state of the art by 8.3 PQ and 7.9 mIoU, respectively.
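
For readers unfamiliar with the two metrics, the sketch below computes them on toy data using their standard definitions. PQ matches predicted and ground-truth segments whose IoU exceeds 0.5, then divides the summed IoU of the matches by TP + 0.5 FP + 0.5 FN; mIoU averages per-class IoU over pixel labels. The segments and labels are made-up examples, the helper names are hypothetical, and the PQ function handles a single class only, whereas benchmark figures are typically averaged over categories and reported as percentages.

```python
def iou(a, b):
    """Intersection over union of two sets of pixel ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def panoptic_quality(pred_segments, gt_segments):
    """PQ for one class; each segment is a set of pixel ids."""
    matched_ious = []
    used = set()
    for g in gt_segments:
        for i, p in enumerate(pred_segments):
            overlap = iou(p, g)
            # A pair counts as a true positive only if IoU > 0.5.
            if i not in used and overlap > 0.5:
                matched_ious.append(overlap)
                used.add(i)
                break
    tp = len(matched_ious)
    fp = len(pred_segments) - tp
    fn = len(gt_segments) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0

def mean_iou(pred_labels, gt_labels, num_classes):
    """Mean of per-class IoU over flat per-pixel label lists."""
    ious = []
    for c in range(num_classes):
        pred_c = {i for i, l in enumerate(pred_labels) if l == c}
        gt_c = {i for i, l in enumerate(gt_labels) if l == c}
        if pred_c or gt_c:  # skip classes absent from both
            ious.append(iou(pred_c, gt_c))
    return sum(ious) / len(ious) if ious else 0.0

# Toy example: one matched segment, one missed, one spurious.
gt_segments = [{0, 1, 2, 3}, {10, 11}]
pred_segments = [{0, 1, 2}, {20, 21}]
pq = panoptic_quality(pred_segments, gt_segments)

gt_labels = [0, 0, 1, 1]
pred_labels = [0, 1, 1, 1]
miou = mean_iou(pred_labels, gt_labels, num_classes=2)

print(round(pq, 3), round(miou, 3))
```

In the toy case the single match has IoU 0.75, and the missed and spurious segments each add 0.5 to the denominator, which shows how PQ penalizes both segmentation and recognition errors in one number.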
Created on 12 Jul. 2023
