Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

AI-generated keywords: Cosmos-Reason1, Physical AI reasoning, physical common sense, embodied reasoning, multimodal large language models

AI-generated Key Points

  • Models aim to understand the physical world and make decisions through long chain-of-thought reasoning processes
  • Focus on key capabilities for physical common sense and embodied reasoning
  • A hierarchical ontology represents fundamental knowledge about space, time, and physics
  • A two-dimensional ontology supports embodied reasoning across different physical embodiments
  • Development of Cosmos-Reason1 models involves data curation and training in four stages: vision pre-training, general supervised fine-tuning (SFT), Physical AI SFT, and Physical AI reinforcement learning (RL)
  • On evaluation benchmarks built according to these ontologies, Physical AI SFT and reinforcement learning yield significant improvements
  • A rule-based cleaning and rewriting stage produces valid SFT samples for reasoning annotations
  • Curated datasets include free-form questions from high-quality video clips with human-annotated captions, as well as multiple-choice questions to test model capabilities
  • Detailed descriptions used to construct understanding MCQs and reasoning MCQs
  • Curation pipeline applied across various datasets such as BridgeData V2 for robotic manipulation behaviors and RoboVQA for robotics-focused visual question answering
  • Each dataset presents unique challenges related to physical common sense and embodied reasoning
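
The multiple-choice benchmarks mentioned above can be scored with plain accuracy. The following is a minimal sketch, not the authors' evaluation code: the `MCQ` record layout and the `accuracy` helper are hypothetical names introduced here for illustration, and the sketch assumes each prediction is a single option letter.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MCQ:
    """Hypothetical record for one multiple-choice question."""
    question: str
    options: Dict[str, str]  # option letter -> option text
    answer: str              # letter of the correct option


def accuracy(questions: List[MCQ], predictions: List[str]) -> float:
    """Return the fraction of predictions that match the correct option letter."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    if not questions:
        return 0.0
    correct = sum(
        1
        for q, p in zip(questions, predictions)
        # Compare letters case-insensitively and ignore surrounding whitespace.
        if p.strip().upper() == q.answer.strip().upper()
    )
    return correct / len(questions)
```

Comparing only the option letter keeps the metric independent of how a model phrases its chain-of-thought; extracting that letter from a long free-text response is a separate step that this sketch does not cover.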

Authors: NVIDIA: Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin, Yen-Chen Lin, Ming-Yu Liu, Andrew Mathau, Yun Ni, Lindsey Pavao, Wei Ping, David W. Romero, Misha Smelyanskiy, Shuran Song, Lyne Tchapmi, Andrew Z. Wang, Boxin Wang, Haoxiang Wang, Fangyin Wei, Jiashu Xu, Yao Xu, Xiaodong Yang, Zhuolin Yang, Xiaohui Zeng, Zhe Zhang

License: CC BY 4.0

Abstract: Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-8B and Cosmos-Reason1-56B. We curate data and train our models in four stages: vision pre-training, general supervised fine-tuning (SFT), Physical AI SFT, and Physical AI reinforcement learning (RL) as the post-training. To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and reinforcement learning bring significant improvements. To facilitate the development of Physical AI, we will make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.
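
The four training stages named in the abstract run in a fixed order, each starting from the result of the previous one. The sketch below only illustrates that ordering; it is not the authors' training code. The `Stage` enum and `run_pipeline` function are hypothetical names, and the per-stage callables are placeholders that the caller must supply.

```python
from enum import IntEnum
from typing import Any, Callable, Dict


class Stage(IntEnum):
    """The four stages listed in the abstract, in training order."""
    VISION_PRETRAINING = 1
    GENERAL_SFT = 2
    PHYSICAL_AI_SFT = 3
    PHYSICAL_AI_RL = 4


def run_pipeline(state: Any, stage_fns: Dict[Stage, Callable[[Any], Any]]) -> Any:
    """Apply one caller-supplied function per stage, in order.

    `state` stands in for a model checkpoint (an assumption of this sketch);
    each stage function receives the previous stage's output.
    """
    missing = [stage.name for stage in Stage if stage not in stage_fns]
    if missing:
        raise ValueError(f"no function supplied for stages: {missing}")
    for stage in Stage:  # Enum iteration follows definition order
        state = stage_fns[stage](state)
    return state
```

Requiring a function for every stage makes it explicit that Physical AI RL is applied after Physical AI SFT rather than in place of it.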

Submitted to arXiv on 18 Mar. 2025

Results of the summarizing process for the arXiv paper: 2503.15558v1

The models aim to understand the physical world and generate appropriate embodied decisions through long chain-of-thought reasoning processes. The focus is on key capabilities for Physical AI reasoning, including physical common sense and embodied reasoning. To represent physical common sense, the authors use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, they use a two-dimensional ontology that generalizes across different physical embodiments. Two multimodal large language models, Cosmos-Reason1-8B and Cosmos-Reason1-56B, are developed through data curation and training in four stages: vision pre-training, general supervised fine-tuning (SFT), Physical AI SFT, and Physical AI reinforcement learning (RL) as post-training. Evaluation benchmarks for physical common sense and embodied reasoning are built according to the ontologies, and results on them show significant improvements from Physical AI SFT and reinforcement learning. A rule-based cleaning and rewriting stage is implemented to produce valid SFT samples for reasoning annotations. The curated datasets include free-form questions from high-quality video clips with human-annotated captions, as well as multiple-choice questions (MCQs) to test model capabilities. Detailed descriptions are used to construct both understanding MCQs and reasoning MCQs. The curation pipeline is applied across various datasets, such as BridgeData V2 for robotic manipulation behaviors and RoboVQA for robotics-focused visual question answering. Each dataset presents unique challenges and tasks related to physical common sense and embodied reasoning. Overall, the work takes a comprehensive approach to developing the Cosmos-Reason1 models for advancing Physical AI research.
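
The summary does not state which rules the cleaning and rewriting stage applies, so the following is only a sketch of what such a rule-based filter could look like. The field names (`question`, `reasoning`, `answer`) and both rules (normalize whitespace, drop samples with a missing or empty field) are assumptions made for illustration and are not taken from the paper.

```python
import re
from typing import Dict, List, Optional

# Hypothetical schema for one SFT sample; the real schema is not given in the summary.
REQUIRED_FIELDS = ("question", "reasoning", "answer")


def clean_sample(sample: Dict[str, object]) -> Optional[Dict[str, str]]:
    """Rewrite one sample, or return None if it cannot be made valid."""
    cleaned: Dict[str, str] = {}
    for field in REQUIRED_FIELDS:
        value = sample.get(field)
        if not isinstance(value, str):
            return None  # rule (assumed): every required field must be text
        # Rewrite step (assumed): collapse runs of whitespace and trim the ends.
        value = re.sub(r"\s+", " ", value).strip()
        if not value:
            return None  # rule (assumed): no empty fields
        cleaned[field] = value
    return cleaned


def clean_dataset(samples: List[Dict[str, object]]) -> List[Dict[str, str]]:
    """Keep only the samples that pass every rule, in their rewritten form."""
    return [c for c in map(clean_sample, samples) if c is not None]
```

Returning `None` for rejected samples keeps filtering and rewriting in a single pass, so every sample that survives is already in its normalized form.
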
Created on 22 Mar. 2025
