paper:
https://arxiv.org/abs/2304.09854
code:
https://github.com/lxtGH/Awesome-Segmentation-With-Transformer
Segmentation is a fundamental vision task that aims to partition inputs such as images, videos, and point clouds into semantically meaningful regions. As a core scene-understanding task, segmentation has a wide range of applications, including autonomous driving, robot navigation, and short-video analysis.
In the deep learning era, segmentation has achieved significant breakthroughs across its subdomains, built primarily on convolutional neural networks (CNNs), and in particular on fully convolutional networks (FCNs).
Recently, Transformer-based methods have also achieved breakthroughs across natural language processing (NLP) and computer vision (CV). Compared with CNN models, Transformers have a more flexible structure and are better suited to multimodal inputs and multi-task settings.
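To make the "flexible structure" claim concrete, below is a minimal sketch (not from the survey or its repository) showing that self-attention operates on a generic set of token embeddings, so the same layer can process image-patch tokens, text tokens, or point features of varying length. The variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 8
self_attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Tokens from different "modalities" (here just random tensors of different
# lengths) pass through the exact same attention layer.
image_patch_tokens = torch.randn(1, 196, embed_dim)   # e.g. 14x14 image patches
point_cloud_tokens = torch.randn(1, 1024, embed_dim)  # e.g. 1024 sampled points

for tokens in (image_patch_tokens, point_cloud_tokens):
    out, attn_weights = self_attention(tokens, tokens, tokens)
    print(out.shape)  # output keeps the shape of the input token sequence
```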
In segmentation and object detection, Transformer-based models have likewise achieved leading results on various benchmarks. Since the emergence of Vision Transformer (ViT) and DETR (DEtection TRansformer), recent methods across many subdomains have been built on Transformers serving as both the backbone network and the decoder.
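The following is a simplified, hypothetical sketch of that "Transformer backbone + query-based decoder" pattern, in the spirit of ViT/DETR-style segmentation models. It is not code from the survey or its repository; all module and variable names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class QueryBasedSegmenter(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_classes=80):
        super().__init__()
        # ViT-style backbone: non-overlapping 16x16 patch embedding + encoder.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # DETR-style decoder: learned queries cross-attend to image tokens.
        self.queries = nn.Embedding(num_queries, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"

    def forward(self, images):
        feats = self.patch_embed(images)              # (B, C, H/16, W/16)
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)     # (B, HW, C) patch tokens
        tokens = self.encoder(tokens)
        queries = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        queries = self.decoder(queries, tokens)       # (B, Q, C) per-query embeddings
        class_logits = self.class_head(queries)       # one class prediction per query
        pixel_feats = tokens.transpose(1, 2).reshape(b, c, h, w)
        # Each query's mask is a dot product with the per-pixel features.
        mask_logits = torch.einsum("bqc,bchw->bqhw", queries, pixel_feats)
        return class_logits, mask_logits

model = QueryBasedSegmenter()
cls_out, mask_out = model(torch.randn(1, 3, 224, 224))
print(cls_out.shape, mask_out.shape)  # (1, 100, 81), (1, 100, 14, 14)
```

In DETR-style training, each query's prediction is matched to a ground-truth instance via bipartite (Hungarian) matching; the sketch above omits that and all other training machinery.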
Given the rapid recent progress in this field, our group has conducted a systematic review and summary of the entire domain. The survey covers an introduction to background knowledge and task settings; an overview of the core concepts of Transformers; a comprehensive review of CNN-based segmentation models; a review of Transformer-based segmentation models; evaluation and testing on relevant benchmark datasets; and a discussion of potential future research directions.