BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
AI-generated Key Points
- Autonomous driving systems rely on 3D visual perception tasks, such as 3D detection and map segmentation, based on multi-camera images.
- A new framework called BEVFormer has been proposed to learn unified Bird's-Eye-View (BEV) representations that support multiple autonomous driving perception tasks.
- The model uses spatial cross-attention, in which each BEV query extracts spatial features from its regions of interest across camera views, and temporal self-attention, which recurrently fuses historical BEV information.
- The approach achieves a state-of-the-art NDS of 56.9% on the nuScenes test set, which is 9.0 points higher than the previous best methods and comparable to LiDAR-based baselines.
- It significantly improves velocity estimation accuracy and object recall under low-visibility conditions, although the visualization results show some errors on small and distant objects.
- Camera-based methods still have a gap compared to LiDAR-based methods in terms of effectiveness and efficiency.
- The proposed framework has tremendous potential for further exploration in improving perception accuracy in 3D space critical for autonomous driving systems' safe operation.
- The code is available at https://github.com/zhiqi-li/BEVFormer.
Authors: Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, Jifeng Dai
Abstract: 3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts the spatial features from the regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the historical BEV information. Our approach achieves a new state-of-the-art 56.9% in terms of the NDS metric on the nuScenes test set, which is 9.0 points higher than the previous best methods and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
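
The geometric lookup behind spatial cross-attention can be illustrated with a small NumPy sketch. It is not the authors' implementation: it only shows how each cell of a grid-shaped BEV query map could be lifted to a few 3D points and projected into a camera to find the image regions that query would attend to. The function names, the pinhole camera model, the pillar heights, and all numeric values below are assumptions made for illustration.

```python
import numpy as np

def bev_reference_points(bev_h, bev_w, cell_size, pillar_heights):
    """Return ego-frame 3D points of shape (bev_h * bev_w, n_heights, 3).

    Each BEV query (grid cell) gets a vertical "pillar" of points.
    Hypothetical helper; the grid is centred on the ego vehicle.
    """
    xs = (np.arange(bev_w) + 0.5 - bev_w / 2) * cell_size  # cell centres along x
    ys = (np.arange(bev_h) + 0.5 - bev_h / 2) * cell_size  # cell centres along y
    gx, gy = np.meshgrid(xs, ys)                           # each (bev_h, bev_w)
    xy = np.stack([gx.ravel(), gy.ravel()], axis=-1)       # (Q, 2)
    heights = np.asarray(pillar_heights, dtype=float)
    pts = np.empty((xy.shape[0], heights.size, 3))
    pts[:, :, :2] = xy[:, None, :]                         # same (x, y) along a pillar
    pts[:, :, 2] = heights[None, :]                        # varying z
    return pts

def project_to_camera(points, ego_to_cam, intrinsics, image_size):
    """Project ego-frame points (..., 3) into one pinhole camera.

    Returns pixel coordinates (..., 2) and a boolean mask (...) that is True
    where the point lies in front of the camera and inside the image.
    """
    ones = np.ones(points.shape[:-1] + (1,))
    homo = np.concatenate([points, ones], axis=-1)         # homogeneous coordinates
    cam = homo @ ego_to_cam.T                              # points in the camera frame
    depth = cam[..., 2]
    in_front = depth > 1e-5
    safe_depth = np.where(in_front, depth, 1.0)            # avoid dividing by <= 0
    pix = (cam[..., :3] @ intrinsics.T)[..., :2] / safe_depth[..., None]
    width, height = image_size
    inside = (
        (pix[..., 0] >= 0) & (pix[..., 0] < width)
        & (pix[..., 1] >= 0) & (pix[..., 1] < height)
    )
    return pix, in_front & inside

# Made-up front camera at the ego origin: camera z points along ego x (forward),
# camera x points to the right, camera y points down.
ego_to_cam = np.eye(4)
ego_to_cam[:3, :3] = np.array([[0.0, -1.0, 0.0],
                               [0.0, 0.0, -1.0],
                               [1.0, 0.0, 0.0]])
intrinsics = np.array([[500.0, 0.0, 400.0],
                       [0.0, 500.0, 225.0],
                       [0.0, 0.0, 1.0]])

points = bev_reference_points(bev_h=4, bev_w=4, cell_size=5.0,
                              pillar_heights=[-1.0, 0.0, 1.0])
pixels, valid = project_to_camera(points, ego_to_cam, intrinsics, (800, 450))

# A query "hits" this camera if any point of its pillar lands in the image.
query_hits_camera = valid.any(axis=1)
print("queries visible in the front camera:", int(query_hits_camera.sum()))
```

In a full model, the image features sampled around the valid pixel locations would be aggregated by a learned attention module, and the resulting BEV features would be fused with the previous frame's BEV features by temporal self-attention. Both of those learned steps are left out of this sketch.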