Beyond Appearance: a Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks
AI-generated Key Points
- Human-centric visual tasks are important and have many applications
- The paper proposes a framework called SOLIDER for creating a general human representation from unlabeled images
- SOLIDER uses prior knowledge from human images to build pseudo semantic labels and import more semantic information into the learned representation
- Different downstream tasks require varying ratios of semantic and appearance information, which SOLIDER addresses through a conditional network with a semantic controller
- SOLIDER outperforms state-of-the-art methods on six downstream human-centric visual tasks and builds new baselines for these tasks
- The framework can be used in several applications such as image captioning, action recognition, object detection and segmentation, human parsing and pose estimation
- A pretext task based on predicting token order is utilized during training to improve performance further
Authors: Weihua Chen, Xianzhe Xu, Jian Jia, Hao luo, Yaohua Wang, Fan Wang, Rong Jin, Xiuyu Sun
Abstract: Human-centric visual tasks have attracted increasing research attention due to their widespread applications. In this paper, we aim to learn a general human representation from massive unlabeled human images which can benefit downstream human-centric tasks to the maximum extent. We call this method SOLIDER, a Semantic cOntrollable seLf-supervIseD lEaRning framework. Unlike the existing self-supervised learning methods, prior knowledge from human images is utilized in SOLIDER to build pseudo semantic labels and import more semantic information into the learned representation. Meanwhile, we note that different downstream tasks always require different ratios of semantic information and appearance information. For example, human parsing requires more semantic information, while person re-identification needs more appearance information for identification purpose. So a single learned representation cannot fit for all requirements. To solve this problem, SOLIDER introduces a conditional network with a semantic controller. After the model is trained, users can send values to the controller to produce representations with different ratios of semantic information, which can fit different needs of downstream tasks. Finally, SOLIDER is verified on six downstream human-centric visual tasks. It outperforms state of the arts and builds new baselines for these tasks. The code is released in https://github.com/tinyvision/SOLIDER.
Ask questions about this paper to our AI assistant
You can also chat with multiple papers at once here.
Assess the quality of the AI-generated content by voting
Why do we need votes?
Votes are used to determine whether we need to re-run our summarizing tools. If the count reaches -10, our tools can be restarted.
Similar papers summarized with our AI tools
Navigate through even more similar papers through atree representation
Look for similar papers (in beta version)
By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.
Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.