Learning Human Motion Representations: A Unified Perspective

AI-generated keywords: Human-centric video tasks, Unified Representation, Dual-stream Spatio-temporal Transformer, Heterogeneous Data Resources, Fine-tuning

AI-generated Key Points

  • There is a need for a unified perspective in human-centric video tasks
  • The proposed framework includes a pretraining stage that trains a motion encoder to recover 3D motion from noisy partial 2D observations
  • Motion representations incorporate geometric, kinematic, and physical knowledge about human motion, making them easily transferable to multiple downstream tasks
  • Dual-stream Spatio-temporal Transformer (DSTformer) neural network was used to implement the motion encoder
  • Heterogeneity of available data resources is a significant challenge in developing a unified representation
  • The proposed framework achieves state-of-the-art performance on all three downstream tasks by fine-tuning the pretrained motion encoder with a simple regression head
  • The framework has potential for future applications because it mines and exploits commonalities across tasks: models designed for different problems nonetheless learn typical human motion patterns

Authors: Wentao Zhu, Xiaoxuan Ma, Zhaoyang Liu, Libin Liu, Wayne Wu, Yizhou Wang

Project page: https://motionbert.github.io/
License: CC BY-SA 4.0

Abstract: We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources. Specifically, we propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations. The motion representations acquired in this way incorporate geometric, kinematic, and physical knowledge about human motion, which can be easily transferred to multiple downstream tasks. We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network. It could capture long-range spatio-temporal relationships among the skeletal joints comprehensively and adaptively, exemplified by the lowest 3D pose estimation error so far when trained from scratch. Furthermore, our proposed framework achieves state-of-the-art performance on all three downstream tasks by simply finetuning the pretrained motion encoder with a simple regression head (1-2 layers), which demonstrates the versatility of the learned motion representations.
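
The pretraining objective described in the abstract (recovering 3D motion from noisy, partial 2D observations) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the orthographic projection that simply drops depth, the Gaussian noise level, the joint-level and frame-level masking probabilities, and the function names `corrupt_2d` and `recovery_loss`.

```python
import numpy as np

def corrupt_2d(pose_3d, noise_std=0.02, joint_mask_prob=0.15,
               frame_mask_prob=0.10, seed=0):
    """Build a noisy, partial 2D observation from a 3D clip of shape (T, J, 3).

    Assumptions (hypothetical, for illustration only): orthographic
    projection, Gaussian jitter, and independent joint/frame masking.
    """
    rng = np.random.default_rng(seed)
    # Assumed projection: keep x and y, drop depth.
    pose_2d = pose_3d[..., :2].copy()
    # Simulate detector jitter with additive Gaussian noise.
    pose_2d += rng.normal(0.0, noise_std, size=pose_2d.shape)
    num_frames, num_joints, _ = pose_2d.shape
    # Randomly hide individual joints and whole frames.
    joint_keep = rng.random((num_frames, num_joints)) >= joint_mask_prob
    frame_keep = rng.random((num_frames, 1)) >= frame_mask_prob
    visible = joint_keep & frame_keep  # broadcasts to (T, J)
    pose_2d[~visible] = 0.0            # hidden joints are zeroed out
    return pose_2d, visible

def recovery_loss(pred_3d, target_3d):
    """Mean per-joint Euclidean distance between predicted and true 3D motion."""
    return float(np.mean(np.linalg.norm(pred_3d - target_3d, axis=-1)))

# Toy clip: 16 frames, 17 joints (both numbers are arbitrary choices).
clip = np.random.default_rng(1).normal(size=(16, 17, 3))
noisy_2d, visible = corrupt_2d(clip)
```

In this sketch a motion encoder would take `noisy_2d` as input and be trained to minimize `recovery_loss` against `clip`, so that it has to infer depth and fill in the hidden joints from the surrounding motion.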

Submitted to arXiv on 12 Oct. 2022


Results of the summarizing process for the arXiv paper: 2210.06551v2

Human-centric video tasks need a unified perspective that tackles their varied challenges by learning human motion representations from large-scale, heterogeneous data resources. The proposed framework provides one through a pretraining stage that trains a motion encoder to recover 3D motion from noisy partial 2D observations. The resulting motion representations incorporate geometric, kinematic, and physical knowledge about human motion, making them easily transferable to multiple downstream tasks. The motion encoder is implemented with the Dual-stream Spatio-temporal Transformer (DSTformer) neural network. It captures long-range spatio-temporal relationships among skeletal joints comprehensively and adaptively, leading to the lowest 3D pose estimation error reported so far when trained from scratch.

One significant challenge in developing a unified representation is the heterogeneity of available data resources. Motion capture systems provide high-fidelity 3D motion data but cover only a limited range of video appearances. Action recognition datasets offer annotations of action semantics but either lack human pose labels or cover only a limited set of daily-activity motions. In contrast, in-the-wild human videos offer a diverse range of appearances and motions but require considerable effort to obtain precise 2D pose annotations.

The proposed framework addresses this challenge by providing a new perspective on learning human motion representations that can be shared across all relevant tasks. It achieves state-of-the-art performance on all three downstream tasks by simply fine-tuning the pretrained motion encoder with a simple regression head (1–2 layers), demonstrating the versatility of the learned motion representations. The framework has potential for future applications because it mines and exploits commonalities across tasks: models designed for different problems nonetheless learn typical human motion patterns.
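
The "simple regression head (1–2 layers)" mentioned above can be pictured with a rough NumPy sketch of the forward pass only. This is a sketch under stated assumptions rather than the paper's code: `toy_encoder` is a hypothetical stand-in for the pretrained motion encoder, the feature size of 32 and the 10 output classes are arbitrary, and the summary does not name the three downstream tasks, so the per-joint and clip-level outputs below are illustrative guesses. Actual fine-tuning would update the encoder and the head by gradient descent in a deep learning framework.

```python
import numpy as np

def toy_encoder(pose_2d, feat_dim=32, seed=0):
    """Hypothetical stand-in for a pretrained motion encoder.

    Maps 2D keypoints of shape (T, J, 2) to features of shape (T, J, feat_dim).
    """
    rng = np.random.default_rng(seed)
    proj = rng.normal(0.0, 0.1, (pose_2d.shape[-1], feat_dim))
    return np.tanh(pose_2d @ proj)

class RegressionHead:
    """A 1-layer (linear) or 2-layer (linear, ReLU, linear) head."""

    def __init__(self, in_dim, out_dim, hidden_dim=None, seed=0):
        rng = np.random.default_rng(seed)
        dims = [in_dim, out_dim] if hidden_dim is None else [in_dim, hidden_dim, out_dim]
        self.weights = [rng.normal(0.0, 0.02, (a, b))
                        for a, b in zip(dims[:-1], dims[1:])]
        self.biases = [np.zeros(b) for b in dims[1:]]

    def __call__(self, x):
        last = len(self.weights) - 1
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            x = x @ w + b
            if i < last:
                x = np.maximum(x, 0.0)  # ReLU between layers, not after the last
        return x

# Toy input: 16 frames, 17 joints of 2D keypoints (arbitrary sizes).
clip_2d = np.random.default_rng(2).normal(size=(16, 17, 2))
feats = toy_encoder(clip_2d)                        # (16, 17, 32)

# Per-joint output, e.g. a 3D coordinate for every joint in every frame.
pose_head = RegressionHead(in_dim=32, out_dim=3)
pose_3d = pose_head(feats)                          # (16, 17, 3)

# Clip-level output: pool features over time and joints, then a 2-layer head.
clip_head = RegressionHead(in_dim=32, out_dim=10, hidden_dim=64)
clip_scores = clip_head(feats.mean(axis=(0, 1)))    # (10,)
```

The point of the sketch is that the same encoder features feed both heads, and only the small head differs from task to task.
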
In conclusion, this work provides an innovative approach to tackling various human-centric video tasks: it learns human motion representations from large-scale, heterogeneous data resources through pretraining, then fine-tunes them with a simple regression head. The proposed framework is versatile and achieves state-of-the-art performance on all three downstream tasks, demonstrating its potential for future applications.
Created on 02 May. 2023

