Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving

AI-generated keywords: autonomous driving, closed-set 3D perception models, multi-modal auto labeling pipeline, open-vocabulary semantic labels, unsupervised 3D perception

AI-generated Key Points

  • Accurate and reliable models are crucial in the field of autonomous driving
  • Existing models may not be sufficient for safety critical applications as they are trained on a pre-defined set of object categories
  • Proposed method can generate amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels
  • Pipeline leverages motion cues in point cloud sequences and 2D image-text pairs to identify and track traffic participants
  • Proposed method can handle both static and moving objects in an unsupervised manner
  • Capable of outputting open-vocabulary features through a vision-language knowledge distillation technique
  • Experiments conducted on Waymo Open Dataset show superior performance compared to prior work on unsupervised 3D perception tasks
  • Pipeline extracts vision-language and motion features; proposes, tracks, and completes bounding boxes; uses images from all available cameras; transfers 2D features to 3D LiDAR points via known sensor calibrations; and incorporates LiDAR point clouds captured over time
  • Novel approach for unsupervised multi-modal auto labeling in autonomous driving

Authors: Mahyar Najibi, Jingwei Ji, Yin Zhou, Charles R. Qi, Xinchen Yan, Scott Ettinger, Dragomir Anguelov

ICCV 2023
License: CC BY 4.0

Abstract: Closed-set 3D perception models trained on only a pre-defined set of object categories can be inadequate for safety critical applications such as autonomous driving where new object types can be encountered after deployment. In this paper, we present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels. Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants. Compared to the recent studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, our method can handle both static and moving objects in the unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language knowledge distillation. Experiments on the Waymo Open Dataset show that our approach outperforms the prior work by significant margins on various unsupervised 3D perception tasks.

Submitted to arXiv on 25 Sep. 2023

Results of the summarizing process for the arXiv paper: 2309.14491v1

In autonomous driving, accurate and reliable 3D perception models are crucial. However, closed-set 3D perception models may not be sufficient for safety-critical applications because they are trained on a pre-defined set of object categories. When new object types are encountered after deployment, such models may fail to detect and track them accurately. To address this issue, the authors propose a multi-modal auto labeling pipeline that generates amodal 3D bounding boxes and tracklets for training models on open-set categories without the need for 3D human labels. The pipeline leverages motion cues inherent in point cloud sequences along with freely available 2D image-text pairs to identify and track all traffic participants.

Unlike previous studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, the proposed method handles both static and moving objects in an unsupervised manner. It also outputs open-vocabulary semantic labels thanks to the vision-language knowledge distillation technique. The authors evaluate their approach on the Waymo Open Dataset, where it outperforms prior work by significant margins on various unsupervised 3D perception tasks.

Figure 3 of the paper provides an overview of the unsupervised multi-modal auto labeling approach. The pipeline begins by extracting vision-language and motion features from multiple modalities. These features are then used to propose, track, and complete bounding boxes of objects. The resulting pointwise vision-language features, 3D bounding boxes, and tracklets serve as automatic supervision to train the perception model. To extract open-vocabulary features, the authors use images from all available cameras at each time step. The resulting 2D features are transferred to 3D LiDAR points using known sensor calibrations. Point cloud data captured over time by the LiDAR sensors is also incorporated into the feature extraction process.
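The transfer of 2D image features to 3D LiDAR points through known sensor calibrations can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simple pinhole camera with the z axis pointing forward, a single camera, nearest-pixel lookup, and no lens distortion or occlusion handling. The function name `lift_image_features_to_points` and all parameter names are hypothetical.

```python
import numpy as np


def lift_image_features_to_points(points_vehicle, feature_map, intrinsics, cam_from_vehicle):
    """Give each LiDAR point the image feature at the pixel it projects to.

    Assumptions (illustrative only, not taken from the paper):
      points_vehicle:   (N, 3) LiDAR points in the vehicle frame.
      feature_map:      (H, W, C) per-pixel features from a 2D model.
      intrinsics:       (3, 3) pinhole camera matrix K.
      cam_from_vehicle: (4, 4) rigid transform from vehicle to camera frame,
                        with the camera z axis pointing forward.
    Returns (features, valid): an (N, C) array and an (N,) boolean mask.
    Points that do not land inside the image keep a zero feature.
    """
    n = points_vehicle.shape[0]
    h, w, c = feature_map.shape

    # Move the points into the camera frame using homogeneous coordinates.
    homogeneous = np.concatenate([points_vehicle, np.ones((n, 1))], axis=1)
    points_cam = (cam_from_vehicle @ homogeneous.T).T[:, :3]

    # Only points in front of the camera can be seen by it.
    in_front = points_cam[:, 2] > 1e-6
    depth = np.where(in_front, points_cam[:, 2], 1.0)  # avoid dividing by <= 0

    # Pinhole projection to pixel coordinates (u = column, v = row).
    projected = (intrinsics @ points_cam.T).T
    col = np.floor(projected[:, 0] / depth).astype(np.int64)
    row = np.floor(projected[:, 1] / depth).astype(np.int64)

    valid = in_front & (col >= 0) & (col < w) & (row >= 0) & (row < h)

    features = np.zeros((n, c), dtype=feature_map.dtype)
    features[valid] = feature_map[row[valid], col[valid]]
    return features, valid
```

With several cameras, one option would be to call this function once per camera and keep, for each point, the feature from a camera where `valid` is true. How overlapping views are merged is a design choice this sketch leaves open.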
Overall, this paper presents a novel approach for unsupervised multi-modal auto labeling in autonomous driving. By combining motion cues and vision-language knowledge distillation, the proposed method can effectively handle both static and moving objects without the need for human-labeled 3D data. The experimental results show that it outperforms previous work on various unsupervised 3D perception tasks.
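One common way to turn open-vocabulary features into semantic labels is to compare them with text embeddings of category names. The sketch below assumes that per-point features and text embeddings live in the same embedding space, as with CLIP-style models, and that cosine similarity is used for matching. The summary does not state that the authors query labels this way, and the function name `assign_open_vocab_labels` and the `min_similarity` threshold are hypothetical.

```python
import numpy as np


def assign_open_vocab_labels(point_features, text_embeddings, class_names, min_similarity=0.0):
    """Label each point with the category whose text embedding is most similar.

    Assumptions (illustrative only, not taken from the paper):
      point_features:  (N, D) vision-language features, one per LiDAR point.
      text_embeddings: (K, D) embeddings of the K category prompts.
      class_names:     list of K category names, in the same order.
      min_similarity:  points whose best cosine similarity is below this
                       value are left unlabeled (None).
    Returns (labels, best_similarity).
    """
    # Normalize rows so that the dot product equals cosine similarity.
    point_norm = np.linalg.norm(point_features, axis=1, keepdims=True)
    text_norm = np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    points_unit = point_features / np.maximum(point_norm, 1e-12)
    texts_unit = text_embeddings / np.maximum(text_norm, 1e-12)

    similarity = points_unit @ texts_unit.T  # shape (N, K)
    best_index = similarity.argmax(axis=1)
    best_similarity = similarity[np.arange(len(best_index)), best_index]

    labels = [
        class_names[i] if s >= min_similarity else None
        for i, s in zip(best_index, best_similarity)
    ]
    return labels, best_similarity
```

Because the category list is only an input to this function, new object types can be queried by adding prompts, without retraining anything in the sketch.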
Created on 07 Feb. 2024
