Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction
Paper: https://arxiv.org/pdf/2302.07817.pdf
GitHub: https://github.com/wzzheng/TPVFormer
Modern methods for vision-centric autonomous driving perception widely adopt the bird’s-eye-view (BEV) representation to describe a 3D scene. Although BEV is more efficient than a voxel representation, a single plane struggles to capture the fine-grained 3D structure of a scene. This talk focuses on a recent investigation of 3D scene representation and its application to 3D semantic occupancy prediction. It is based on the CVPR 2023 paper “Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction” from Tsinghua University.
Current vision-centric autonomous driving perception mainly focuses on 3D object detection; however, the predicted 3D boxes cannot represent objects of arbitrary shape well. Motivated by this, we study a new task of vision-based 3D semantic occupancy prediction: the input is a set of surround-camera images, and the goal is to predict, for each point in 3D space, whether it is occupied by an object and, if so, the category of that object.
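To make the task interface concrete, here is a minimal sketch of its inputs and outputs. The six surround cameras, the 200×200×16 voxel grid, the 17 classes (including an "empty" class), and the `predict_occupancy` placeholder are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of the task's input/output interface, not the paper's exact
# configuration. The 6 surround cameras, the 200x200x16 voxel grid, and the
# 17 classes (including an "empty" class) are illustrative assumptions.
import torch

NUM_CAMS, IMG_H, IMG_W = 6, 900, 1600
GRID_X, GRID_Y, GRID_Z, NUM_CLASSES = 200, 200, 16, 17

def predict_occupancy(images: torch.Tensor) -> torch.Tensor:
    """Placeholder occupancy network: (B, N_cam, 3, H, W) -> (B, X, Y, Z, C) logits."""
    batch = images.shape[0]
    return torch.zeros(batch, GRID_X, GRID_Y, GRID_Z, NUM_CLASSES)

images = torch.rand(1, NUM_CAMS, 3, IMG_H, IMG_W)   # surround-camera images
logits = predict_occupancy(images)                   # per-voxel class logits
occupancy = logits.argmax(dim=-1)                    # (1, X, Y, Z) semantic occupancy
```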
For this task, we propose a tri-perspective view (TPV) representation that accompanies BEV with two additional perpendicular planes to describe the fine-grained structure of a 3D scene. To lift image features into the 3D TPV space, we further propose a transformer-based TPV encoder (TPVFormer) that obtains TPV features effectively.
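The sketch below illustrates how a TPV representation can describe an arbitrary 3D point: the point is projected onto the three perpendicular feature planes, each plane is bilinearly sampled, and the three samples are summed. The plane names (`tpv_hw`, `tpv_dh`, `tpv_wd`), their resolutions, and the axis conventions are illustrative assumptions rather than the repository's API.

```python
# A minimal sketch of how TPV describes a 3D point: project the point onto
# three perpendicular feature planes, bilinearly sample each plane, and sum
# the three samples. Plane resolutions and axis conventions are illustrative.
import torch
import torch.nn.functional as F

C, H, W, D = 64, 200, 200, 16
tpv_hw = torch.rand(1, C, H, W)   # top plane   (spans the x-y ground plane)
tpv_dh = torch.rand(1, C, D, H)   # side plane  (spans the x-z plane)
tpv_wd = torch.rand(1, C, W, D)   # front plane (spans the y-z plane)

def sample_plane(plane: torch.Tensor, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample plane features at normalized (u, v) in [-1, 1].

    u indexes the plane's last (width-like) axis, v its second-to-last axis.
    Returns features of shape (1, N, C) for N query locations.
    """
    grid = torch.stack([u, v], dim=-1).view(1, 1, -1, 2)    # (1, 1, N, 2)
    feat = F.grid_sample(plane, grid, align_corners=False)  # (1, C, 1, N)
    return feat.squeeze(2).transpose(1, 2)                  # (1, N, C)

def tpv_point_feature(points: torch.Tensor) -> torch.Tensor:
    """points: (N, 3) normalized (x, y, z) in [-1, 1] -> (1, N, C) TPV features."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    f_hw = sample_plane(tpv_hw, y, x)   # drop z: projection onto the top plane
    f_dh = sample_plane(tpv_dh, x, z)   # drop y: projection onto the side plane
    f_wd = sample_plane(tpv_wd, z, y)   # drop x: projection onto the front plane
    return f_hw + f_dh + f_wd           # aggregate by summation

point_feats = tpv_point_feature(torch.rand(1000, 3) * 2 - 1)   # (1, 1000, 64)
```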
Taking only camera images as input, TPVFormer is trained with sparse LiDAR semantic labels yet can effectively predict the semantic occupancy of all voxels. TPVFormer is also the first to demonstrate the potential of vision-based methods for LiDAR segmentation.
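A rough sketch of how such sparse supervision might look in practice: per-point class logits are predicted from features sampled at the labeled LiDAR locations, and a cross-entropy loss is applied only at those points. It reuses the hypothetical `tpv_point_feature` from the sketch above; the MLP head and the class count are illustrative assumptions, not the repository's implementation.

```python
# A minimal sketch of sparse point supervision, reusing the hypothetical
# tpv_point_feature() from the sketch above. The MLP head and the 17 classes
# are illustrative assumptions, not the repository's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

C, NUM_CLASSES = 64, 17
mlp_head = nn.Sequential(nn.Linear(C, C), nn.ReLU(), nn.Linear(C, NUM_CLASSES))

# Sparse LiDAR points with semantic labels (normalized coordinates in [-1, 1]).
lidar_points = torch.rand(4096, 3) * 2 - 1
lidar_labels = torch.randint(0, NUM_CLASSES, (4096,))

point_feats = tpv_point_feature(lidar_points).squeeze(0)  # (N, C) TPV features
logits = mlp_head(point_feats)                             # (N, NUM_CLASSES)
loss = F.cross_entropy(logits, lidar_labels)               # supervise labeled points only
loss.backward()

# At inference time, the same head can be queried at every voxel center for
# dense semantic occupancy, or at LiDAR points for LiDAR segmentation.
```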
The code has been released on GitHub (https://github.com/wzzheng/TPVFormer).
We will keep it updated to support more 3D semantic occupancy prediction models, methods, and data! Thanks to MMDetection3D (https://github.com/open-mmlab/mmdetection3d).