MotionCLIP: Exposing Human Motion Generation to CLIP Space

AI-generated keywords: MotionCLIP, CLIP, Text-to-Motion, Disentanglement, Semantic Interpolation

AI-generated Key Points

  • MotionCLIP is a 3D human motion auto-encoder with a disentangled and well-behaved latent embedding.
  • The model aligns with the Contrastive Language-Image Pre-training (CLIP) model's latent space, infusing the manifold with CLIP's rich semantic knowledge.
  • The transformer-based motion auto-encoder reconstructs motion while being aligned to its text label's position in CLIP-space.
  • MotionCLIP leverages CLIP's visual understanding and aligns motion to rendered frames in a self-supervised manner for unprecedented text-to-motion abilities.
  • The introduced latent space can be leveraged for motion interpolation, editing, and recognition.
  • Limitations include difficulty understanding directions and capturing certain styles accurately.
  • MotionCLIP opens up several novel research opportunities such as generating signature motions of cultural figures and phrases.

Authors: Guy Tevet, Brian Gordon, Amir Hertz, Amit H. Bermano, Daniel Cohen-Or

License: CC BY-SA 4.0

Abstract: We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports highly semantic textual descriptions. MotionCLIP gains its unique power by aligning its latent space with that of the Contrastive Language-Image Pre-training (CLIP) model. Aligning the human motion manifold to CLIP space implicitly infuses the extremely rich semantic knowledge of CLIP into the manifold. In particular, it helps continuity by placing semantically similar motions close to one another, and disentanglement, which is inherited from the CLIP-space structure. MotionCLIP comprises a transformer-based motion auto-encoder, trained to reconstruct motion while being aligned to its text label's position in CLIP-space. We further leverage CLIP's unique visual understanding and inject an even stronger signal through aligning motion to rendered frames in a self-supervised manner. We show that although CLIP has never seen the motion domain, MotionCLIP offers unprecedented text-to-motion abilities, allowing out-of-domain actions, disentangled editing, and abstract language specification. For example, the text prompt "couch" is decoded into a sitting down motion, due to lingual similarity, and the prompt "Spiderman" results in a web-swinging-like solution that is far from seen during training. In addition, we show how the introduced latent space can be leveraged for motion interpolation, editing and recognition.
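The training objective described in the abstract (reconstruct the motion while pulling its latent code toward the CLIP embedding of its text label and of a rendered frame) can be sketched as a sum of three terms. This is a minimal illustration, not the authors' implementation: the function names, the plain mean-squared-error reconstruction term, the cosine-distance alignment terms, and the weights `lambda_text` and `lambda_image` are all assumptions made for the example, and embeddings are represented as plain Python lists rather than real CLIP outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_squared_error(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def motionclip_style_loss(motion, reconstruction, latent, text_emb, image_emb,
                          lambda_text=1.0, lambda_image=1.0):
    """Reconstruction loss plus two alignment terms (hypothetical weighting).

    motion / reconstruction: flattened motion features (assumed layout).
    latent: the auto-encoder's latent code for this motion.
    text_emb / image_emb: stand-ins for CLIP embeddings of the text label
    and of a rendered frame of the motion.
    """
    recon_term = mean_squared_error(motion, reconstruction)
    # Alignment terms are zero when the latent points in the same direction
    # as the target embedding and grow as the angle between them widens.
    text_term = 1.0 - cosine_similarity(latent, text_emb)
    image_term = 1.0 - cosine_similarity(latent, image_emb)
    return recon_term + lambda_text * text_term + lambda_image * image_term
```

With a perfect reconstruction and a latent code parallel to both target embeddings, every term vanishes; a latent orthogonal to the text embedding contributes a text term of 1.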

Submitted to arXiv on 15 Mar. 2022


Results of the summarizing process for the arXiv paper: 2203.08063v1

MotionCLIP is a novel 3D human motion auto-encoder that features a disentangled and well-behaved latent embedding which supports highly semantic textual descriptions. The model's unique power lies in the alignment of its latent space with that of the Contrastive Language-Image Pre-training (CLIP) model. This alignment infuses the motion manifold with CLIP's rich semantic knowledge: it promotes continuity by placing semantically similar motions close to one another, and it inherits disentanglement from the CLIP-space structure. The transformer-based motion auto-encoder is trained to reconstruct motion while being aligned to its text label's position in CLIP-space. Additionally, MotionCLIP leverages CLIP's visual understanding and injects an even stronger signal by aligning motion to rendered frames in a self-supervised manner. This enables unprecedented text-to-motion abilities such as out-of-domain actions, disentangled editing, abstract language specification, and semantic interpolation between two motions. For instance, when prompted with "couch," MotionCLIP generates a sitting-down motion due to lingual similarity; similarly, when prompted with "Spiderman," it produces a web-swinging-like motion that is far from anything seen during training. The introduced latent space can also be leveraged for motion interpolation, editing, and recognition. However, MotionCLIP has limitations, such as difficulty understanding directions and capturing certain styles accurately. Nonetheless, it opens up several novel research opportunities, such as generating signature motions of cultural figures and phrases. Overall, MotionCLIP represents a significant advancement in human motion generation by exposing it to CLIP space and leveraging CLIP's rich semantic knowledge.
Created on 04 Jun. 2023
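
Two of the latent-space uses mentioned in the summary, interpolation between two motions and recognition, can be illustrated with a few lines of vector arithmetic. The sketch below is an assumption-laden toy, not the paper's code: it assumes interpolation is a straight linear blend of two latent codes and that recognition picks the class whose text embedding has the highest cosine similarity with the motion's latent code. The helper names (`interpolate_latents`, `recognize`) and the two-dimensional embeddings are hypothetical; in a real system the blended latent would still have to be passed through the motion decoder, which is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def interpolate_latents(z_a, z_b, t):
    """Linear blend of two latent codes; t=0 gives z_a, t=1 gives z_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def recognize(latent, class_text_embeddings):
    """Return the label whose (assumed) text embedding is closest in angle."""
    return max(
        class_text_embeddings,
        key=lambda label: cosine_similarity(latent, class_text_embeddings[label]),
    )


# Toy 2-D vectors standing in for CLIP-space embeddings (made up for illustration).
classes = {"walk": [1.0, 0.0], "jump": [0.0, 1.0]}
halfway = interpolate_latents([0.0, 0.0], [2.0, 4.0], 0.5)
predicted = recognize([0.9, 0.1], classes)
```

The design relies on the property the summary emphasizes: if semantically similar motions sit close together in the shared space, then both a straight-line blend and a nearest-text lookup are meaningful operations.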

