[CVPR 2023] Aligning Bag of Regions for Open-Vocabulary Object Detection
paper:
https://arxiv.org/abs/2302.13996
code:
Open-vocabulary object detection (OVD) aims to detect objects of categories that are not annotated during training. A common approach to this task is to distill pre-trained vision-language models (VLMs): by learning the VLMs’ representation, the detector becomes able to recognize novel object categories. Existing methods mainly operate on individual regions, so the detector learns the VLMs’ representation of individual objects only. However, during pre-training the VLMs have learned to represent groups of semantic concepts, not just isolated objects.
This paper introduces a novel approach to learning the VLMs’ representation: distillation on a bag of regions. To exploit the VLMs’ representation of a group of semantic concepts, the paper proposes to sample regions in the neighborhood of each region proposal, which yields a bag of spatially and contextually related regions. To represent the bag, the paper aligns the region features to the word embedding space. The individual regions in the bag are then treated as words in a sentence, which allows the text encoder of the VLM to represent the whole bag. Finally, the paper uses contrastive learning to align the text encoder's representation of the bag with the image encoder's representation, so the representation of individual regions is learned indirectly.
The method achieves state-of-the-art performance on the OV-COCO and OV-LVIS benchmarks. The code will be open-sourced on top of MMDet3.x, and we expect more OVD methods to be supported in this repository (ovdet).