BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

AI-generated keywords: Autonomous Driving, 3D Visual Perception, BEVFormer, Spatiotemporal Transformers, NDS Metric

AI-generated Key Points

  • Autonomous driving systems require 3D visual perception tasks such as 3D detection and map segmentation based on multi-camera images.
  • A new framework called BEVFormer has been proposed to learn unified Bird's-Eye-View (BEV) representations that support multiple autonomous driving perception tasks.
  • The model uses spatial cross-attention, in which each BEV query extracts spatial features from its regions of interest across camera views, and temporal self-attention, which recurrently fuses historical BEV information.
  • The approach reaches 56.9% NDS on the nuScenes test set, 9.0 points above the previous best methods and on par with LiDAR-based baselines.
  • It markedly improves velocity estimation accuracy and object recall under low-visibility conditions, although the visualization results show some errors on small and distant objects.
  • Camera-based methods still have a gap compared to LiDAR-based methods in terms of effectiveness and efficiency.
  • The framework leaves room for further work on improving 3D perception accuracy, which is critical to the safe operation of autonomous driving systems.
  • The code is available at https://github.com/zhiqi-li/BEVFormer.
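
The NDS figure quoted in the key points is the nuScenes Detection Score, which combines mean average precision (mAP) with five mean true-positive error terms: translation (mATE), scale (mASE), orientation (mAOE), velocity (mAVE), and attribute (mAAE). The sketch below shows how the score is assembled. The function name `nds` and the input values are illustrative assumptions and are not results from the paper.

```python
def nds(mean_ap, tp_errors):
    """nuScenes Detection Score from mAP and the five mean TP errors.

    NDS = (5 * mAP + sum(1 - min(1, err) for each TP error)) / 10
    """
    expected = {"mATE", "mASE", "mAOE", "mAVE", "mAAE"}
    if set(tp_errors) != expected:
        raise ValueError(f"tp_errors must have exactly these keys: {sorted(expected)}")
    # Each error is clipped at 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Made-up inputs for illustration only (NOT numbers from the paper):
# (5 * 0.5 + 5 * 0.6) / 10 = 0.55
example_score = nds(
    0.5,
    {"mATE": 0.4, "mASE": 0.4, "mAOE": 0.4, "mAVE": 0.4, "mAAE": 0.4},
)
```

Because mAP carries half of the total weight and each error term carries one tenth, a lower velocity error (mAVE) raises NDS even when mAP is unchanged.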

Authors: Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, Jifeng Dai

Accepted to ECCV 2022
License: CC BY 4.0

Abstract: 3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention that each BEV query extracts the spatial features from the regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the history BEV information. Our approach achieves the new state-of-the-art 56.9% in terms of NDS metric on the nuScenes test set, which is 9.0 points higher than previous best arts and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and recall of objects under low visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
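
The idea of grid-shaped BEV queries gathering features from camera views can be sketched roughly as follows. This is a heavily simplified illustration and not the authors' implementation: it replaces learned attention with a plain average of nearest-pixel samples, and every name here (`make_bev_reference_points`, `project_to_image`, `spatial_cross_attention_sketch`, the toy camera matrices) is a hypothetical choice made for the example.

```python
import numpy as np


def make_bev_reference_points(bev_h, bev_w, extent_m, heights_m):
    """One vertical pillar of 3D points per BEV grid cell, in the ego frame.

    Returns an array of shape (bev_h * bev_w, len(heights_m), 3).
    """
    xs = (np.arange(bev_w) + 0.5) / bev_w * 2.0 * extent_m - extent_m
    ys = (np.arange(bev_h) + 0.5) / bev_h * 2.0 * extent_m - extent_m
    gx, gy = np.meshgrid(xs, ys)  # each (bev_h, bev_w)
    pts = np.zeros((bev_h * bev_w, len(heights_m), 3))
    pts[..., 0] = gx.reshape(-1, 1)
    pts[..., 1] = gy.reshape(-1, 1)
    pts[..., 2] = np.asarray(heights_m, dtype=float)
    return pts


def project_to_image(points, ego2img, img_h, img_w):
    """Project (..., 3) ego-frame points with a 3x4 matrix.

    Returns integer pixel indices (u, v) and a mask of points that land
    in front of the camera and inside the feature map.
    """
    ones = np.ones(points.shape[:-1] + (1,))
    uvw = np.concatenate([points, ones], axis=-1) @ ego2img.T
    depth = uvw[..., 2]
    safe_depth = np.where(depth > 1e-5, depth, 1.0)  # avoid division by zero
    u = uvw[..., 0] / safe_depth
    v = uvw[..., 1] / safe_depth
    valid = (depth > 1e-5) & (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    u_idx = np.clip(u, 0, img_w - 1).astype(int)
    v_idx = np.clip(v, 0, img_h - 1).astype(int)
    return u_idx, v_idx, valid


def spatial_cross_attention_sketch(ref_points, cam_feats, ego2img_list):
    """Average the camera features hit by each BEV cell's pillar.

    ref_points: (N, n_z, 3); cam_feats: (n_cams, H, W, C). Returns (N, C).
    Simplification: uniform averaging stands in for learned attention.
    """
    n_cams, img_h, img_w, channels = cam_feats.shape
    acc = np.zeros((ref_points.shape[0], channels))
    hits = np.zeros((ref_points.shape[0], 1))
    for cam in range(n_cams):
        u, v, valid = project_to_image(ref_points, ego2img_list[cam], img_h, img_w)
        sampled = cam_feats[cam, v, u] * valid[..., None]  # (N, n_z, C)
        acc += sampled.sum(axis=1)
        hits += valid.sum(axis=1, keepdims=True)
    return acc / np.maximum(hits, 1)  # cells seen by no camera stay zero


# Toy setup (assumed values): one forward-facing pinhole camera at the origin.
ego_to_cam = np.array([[0.0, -1.0, 0.0, 0.0],
                       [0.0, 0.0, -1.0, 0.0],
                       [1.0, 0.0, 0.0, 0.0]])
intrinsics = np.array([[4.0, 0.0, 8.0],
                       [0.0, 4.0, 8.0],
                       [0.0, 0.0, 1.0]])
ego2img = intrinsics @ ego_to_cam
ref_points = make_bev_reference_points(4, 4, 10.0, [0.0, 1.0])
cam_feats = np.ones((1, 16, 16, 8))
bev_feats = spatial_cross_attention_sketch(ref_points, cam_feats, [ego2img])
```

In this toy setup, BEV cells behind the camera or outside its field of view receive no image features, which illustrates why each query only needs to interact with the views its reference points actually project into.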

Submitted to arXiv on 31 Mar. 2022

Results of the summarizing process for the arXiv paper: 2203.17270v2

Autonomous driving systems must perform 3D visual perception tasks, such as 3D detection and map segmentation, from multi-camera images. BEVFormer is a framework that uses spatiotemporal transformers to learn unified Bird's-Eye-View (BEV) representations supporting multiple autonomous driving perception tasks. Spatial cross-attention lets each BEV query extract spatial features from its regions of interest across camera views, and temporal self-attention recurrently fuses historical BEV information. The approach reaches 56.9% NDS on the nuScenes test set, 9.0 points above the previous best methods and on par with LiDAR-based baselines. It also markedly improves velocity estimation accuracy and object recall under low-visibility conditions. Visualization results are strong overall but show some errors on small and distant objects. Camera-based methods still trail LiDAR-based methods in both effectiveness and efficiency. The framework leaves room for further work on improving 3D perception accuracy, which is critical to the safe operation of autonomous driving systems. The code is available at https://github.com/zhiqi-li/BEVFormer.
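
The recurrent fusion of historical BEV information can be illustrated with a minimal sketch. It is not the paper's temporal self-attention: here each BEV cell simply attends over two candidates, its current feature and the previous frame's feature at the same cell, and the ego-motion alignment is reduced to an integer grid shift. The names `temporal_fuse`, `align_prev_bev`, and `run_sequence` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_fuse(current, prev_bev):
    """Blend current BEV features with the previous BEV map, cell by cell.

    current, prev_bev: (H, W, C). With no history (prev_bev is None) the
    cell attends over two copies of its own feature, so the output equals
    the input.
    """
    if prev_bev is None:
        prev_bev = current
    candidates = np.stack([current, prev_bev], axis=2)  # (H, W, 2, C)
    scale = np.sqrt(current.shape[-1])
    scores = np.einsum("hwc,hwkc->hwk", current, candidates) / scale
    weights = softmax(scores, axis=-1)  # (H, W, 2), sums to 1 per cell
    return np.einsum("hwk,hwkc->hwc", weights, candidates)


def align_prev_bev(prev_bev, shift_cells):
    """Assumed alignment: shift the old map by whole cells (row, col).

    np.roll wraps around at the borders, which a real system would not do.
    """
    return np.roll(prev_bev, shift=shift_cells, axis=(0, 1))


def run_sequence(frame_feats, shifts):
    """Carry one BEV map through time, fusing each frame with the last output."""
    bev = None
    outputs = []
    for current, shift in zip(frame_feats, shifts):
        prev = None if bev is None else align_prev_bev(bev, shift)
        bev = temporal_fuse(current, prev)
        outputs.append(bev)
    return outputs


# Toy sequence of three frames with random features (illustrative only).
rng = np.random.default_rng(0)
frames = [rng.normal(size=(4, 4, 8)) for _ in range(3)]
bev_seq = run_sequence(frames, shifts=[(0, 0), (1, 0), (0, -1)])
```

Only the most recent BEV map is carried forward, so information from older frames reaches the current frame indirectly through that single recurrent state rather than by stacking many past frames.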
Created on 20 May. 2023
