Beyond Appearance: a Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks

AI-generated keywords: Human-centric Visual Tasks, SOLIDER, Semantic Controller, Self-supervised Learning, Pseudo Semantic Labels

AI-generated Key Points

Human-centric visual tasks are important and have many applications
The paper proposes a framework called SOLIDER for creating a general human representation from unlabeled images
SOLIDER uses prior knowledge from human images to build pseudo semantic labels and import more semantic information into the learned representation
Different downstream tasks require varying ratios of semantic and appearance information, which SOLIDER addresses through a conditional network with a semantic controller
SOLIDER outperforms state-of-the-art methods on six downstream human-centric visual tasks and builds new baselines for these tasks
The six downstream tasks used for evaluation are person re-identification, attribute recognition, person search, pedestrian detection, human parsing, and pose estimation
During training, pseudo semantic labels produced for every token supply semantic supervision on top of the self-supervised objective


Authors: Weihua Chen, Xianzhe Xu, Jian Jia, Hao Luo, Yaohua Wang, Fan Wang, Rong Jin, Xiuyu Sun

arXiv: 2303.17602v1 (cs.CV)

Accepted by CVPR 2023

License: CC ZERO 1.0

Abstract: Human-centric visual tasks have attracted increasing research attention due to their widespread applications. In this paper, we aim to learn a general human representation from massive unlabeled human images which can benefit downstream human-centric tasks to the maximum extent. We call this method SOLIDER, a Semantic cOntrollable seLf-supervIseD lEaRning framework. Unlike the existing self-supervised learning methods, prior knowledge from human images is utilized in SOLIDER to build pseudo semantic labels and import more semantic information into the learned representation. Meanwhile, we note that different downstream tasks always require different ratios of semantic information and appearance information. For example, human parsing requires more semantic information, while person re-identification needs more appearance information for identification purpose. So a single learned representation cannot fit for all requirements. To solve this problem, SOLIDER introduces a conditional network with a semantic controller. After the model is trained, users can send values to the controller to produce representations with different ratios of semantic information, which can fit different needs of downstream tasks. Finally, SOLIDER is verified on six downstream human-centric visual tasks. It outperforms state of the arts and builds new baselines for these tasks. The code is released in https://github.com/tinyvision/SOLIDER.

Submitted to arXiv on 30 Mar. 2023


Results of the summarizing process for the arXiv paper: 2303.17602v1

Comprehensive Summary

Human-centric visual tasks have become increasingly important due to their wide-ranging applications. To learn a general human representation from massive unlabeled human images that benefits downstream human-centric tasks, this paper proposes SOLIDER, a Semantic cOntrollable seLf-supervIseD lEaRning framework. Unlike existing self-supervised learning methods, SOLIDER uses prior knowledge from human images to discover semantic information, build pseudo semantic labels for every token, and import more semantic information into the learned representation.

The framework also recognizes that different downstream tasks require different ratios of semantic and appearance information. For instance, human parsing requires more semantic information, while person re-identification needs more appearance information to tell identities apart. A single learned representation therefore cannot fit every requirement. To solve this problem, SOLIDER introduces a conditional network with a semantic controller. After the model is trained, users can send values to the controller to produce representations with different ratios of semantic information, matching the needs of each downstream task.

SOLIDER is verified on six downstream human-centric visual tasks (person re-identification, attribute recognition, person search, pedestrian detection, human parsing, and pose estimation), where it outperforms state-of-the-art methods and builds new baselines. The code is available at https://github.com/tinyvision/SOLIDER.

Summary: There is a way to make computers understand pictures of people better. It is called SOLIDER, and it helps with jobs like recognizing the same person in different pictures, labeling body parts, and understanding poses. SOLIDER uses what is already known about how people look in pictures to learn more from pictures that nobody has labeled. It also has a dial, called a semantic controller, that lets users adjust what the learned description focuses on for each job.

Definitions
- Human-centric visual tasks: Tasks that involve understanding or working with images of people.
- Framework: A set of rules or tools for doing something.
- Representation: A way of showing or describing something; here, the numbers a model uses to describe an image.
- Semantic information: Information that relates to the meaning of something, such as which body part a region shows.
- Downstream tasks: Tasks that reuse what a model has already learned instead of starting from scratch.
- State-of-the-art methods: The best-known ways of doing something at a given time.
- Baselines: Starting points or standards for comparison.
- Person re-identification: Recognizing the same person across different pictures.
- Human parsing: Labeling which part of a person each region of an image shows.
- Pose estimation: Understanding how people are positioned in an image.
- Pretext task: A task that is used during training to help improve performance later on.

Introducing SOLIDER: A Semantic cOntrollable seLf-supervIseD lEaRning Framework for Human-Centric Visual Tasks

Human-centric visual tasks have become increasingly important due to their wide range of applications. To address the need for a general human representation, learned from massive unlabeled human images, that can benefit downstream human-centric tasks, this paper proposes a Semantic cOntrollable seLf-supervIseD lEaRning framework (SOLIDER). The framework is designed to carry one representation over to different downstream human-centric visual tasks by importing more semantic information into it and letting users control how much of that information is used.

What is SOLIDER?

SOLIDER is a self-supervised learning method. Unlike existing self-supervised learning methods, it uses prior knowledge from human images to build pseudo semantic labels and import more semantic information into the learned representation. SOLIDER also recognizes that different downstream tasks require different ratios of semantic and appearance information. For instance, human parsing requires more semantic information, while person re-identification needs more appearance information for identification purposes. To solve this problem, SOLIDER introduces a conditional network with a semantic controller. After the model is trained, users can send values to the controller to produce representations with different ratios of semantic information that fit various downstream task requirements.
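To make the idea of a controller more concrete, here is a minimal sketch of how a single scalar could condition a feature vector. It is an illustration under stated assumptions, not the authors' implementation: the class name `SemanticController`, the per-channel scale-and-shift form, the `[0, 1]` input range, and all parameter values are hypothetical. In a real model the parameters would be learned.

```python
import math


def softplus(x):
    """Smooth, always-positive function; keeps the scale factor above zero."""
    return math.log1p(math.exp(x))


class SemanticController:
    """Hypothetical controller: maps a scalar `lam` in [0, 1] to a
    per-channel scale and shift that modulate a feature vector."""

    def __init__(self, w_scale, b_scale, w_shift, b_shift):
        # One (weight, bias) pair per channel for each of the two branches.
        self.w_scale, self.b_scale = w_scale, b_scale
        self.w_shift, self.b_shift = w_shift, b_shift

    def __call__(self, features, lam):
        if not 0.0 <= lam <= 1.0:
            raise ValueError("lam must lie in [0, 1]")
        out = []
        for i, x in enumerate(features):
            scale = softplus(self.w_scale[i] * lam + self.b_scale[i])
            shift = self.w_shift[i] * lam + self.b_shift[i]
            out.append(scale * x + shift)
        return out


# Toy parameters chosen by hand for illustration (assumed, not learned).
controller = SemanticController(
    w_scale=[1.0, -1.0], b_scale=[0.0, 0.0],
    w_shift=[0.5, 0.0], b_shift=[0.0, 0.0],
)
features = [2.0, 2.0]

# The same features yield different representations for different values.
appearance_leaning = controller(features, lam=0.0)
semantic_leaning = controller(features, lam=1.0)
print(appearance_leaning, semantic_leaning)
```

The point of the sketch is only that one trained set of weights can serve several tasks: the downstream user changes the input value, not the model.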

How Does It Work?

SOLIDER learns from unlabeled human images in a self-supervised way. Each input image is represented as a set of tokens, and prior knowledge from human images, such as the typical layout of body parts, is used to discover semantic information and assign a pseudo semantic label to every token. These pseudo labels act as an extra training signal that imports semantic information into the representation, alongside the appearance information that self-supervised learning already captures. The model itself is a conditional network: a semantic controller takes a value from the user and adjusts the representation accordingly. After training, users send different values to the controller to set the ratio of semantic to appearance information before passing the representation to a downstream task such as human parsing or person re-identification.
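One way to picture per-token pseudo semantic labels is the sketch below. It is an assumption-laden illustration rather than the paper's exact procedure: it clusters token features with plain k-means and then renames the clusters by their average vertical position, using the prior that body parts in a person image appear in a consistent top-to-bottom order. The function names, the farthest-point initialization, and the toy data are all hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic init: start at the first point, then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return [list(c) for c in centers]


def kmeans(points, k, iters=20):
    """Plain k-means; returns one cluster index per point."""
    centers = farthest_point_init(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: sq_dist(p, centers[c]))
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def pseudo_labels(token_features, token_rows, k):
    """Cluster tokens, then relabel clusters 0..k-1 from top to bottom
    according to the mean image row of their tokens (the assumed prior)."""
    assign = kmeans(token_features, k)
    mean_row = {}
    for c in set(assign):
        rows = [token_rows[i] for i in range(len(assign)) if assign[i] == c]
        mean_row[c] = sum(rows) / len(rows)
    rank = {c: r for r, c in enumerate(sorted(mean_row, key=mean_row.get))}
    return [rank[c] for c in assign]


# Toy example: six tokens, one per image row, forming three feature groups.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0], [10.1, 0.0]]
rows = [0, 1, 2, 3, 4, 5]
print(pseudo_labels(features, rows, k=3))  # → [0, 0, 1, 1, 2, 2]
```

Because the labels come from cluster order rather than from annotation, they cost nothing to produce, which is what makes them usable on massive unlabeled data.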

Results & Applications

In addition to outperforming state-of-the-art methods on six downstream human-centric visual tasks, including human parsing and person re-identification, SOLIDER builds new baselines for these tasks. The code is available at https://github.com/tinyvision/SOLIDER.

Conclusion

This paper presents SOLIDER, a Semantic cOntrollable seLf-supervIseD lEaRning framework for building a general human representation from massive unlabeled human images that can serve various downstream human-centric tasks, such as human parsing or person re-identification. By using prior knowledge from human images and introducing a conditional network with a semantic controller, SOLIDER gives users greater flexibility in selecting a representation suited to their specific application.

Created on 11 Jun. 2023


