19 Jul 2018 Reading Note

(1) DetNet not only maintains high spatial resolution in deeper layers but also keeps large receptive field for object detection task (not like traditional backbone network).

(2) first to analyze the inherent drawbacks of traditional ImageNet pre-trained model for fine-tunning recent object detectors.

(3) DetNet achieves new state-of-the-art results on MSCOCO object detection and instance segmentation track based on a low complexity DetNet59 backbone.

Background

The performance of object detection has been significantly improved due to the rapid progress of deep convolutional neural networks In the Paper: [8,9,10,11,12,13,14,15,16,17].

CNN based object detectors can be categorized into:

one-stage detector	two-stage detectors
YOLO Reading Note	Faster R-CNN Reading Note
SSD Reading Note	R-FCN Reading Note
RetinaNet Reading Note	FPN Reading Note

Motivation 1

Due to the GAP between the image classification and object detection:

Image Classification	Object Detection
higher spatial resolution -> higher receptive field -> more beneficial to the visual classification	higher spatial resolution -> compromises the large objects location -> fail to small objects recognision

However: high resolution feature maps bring more challenges to build a deep neural network due to the computational and memory cost.

To keep the efficiency of our DetNet, they employ a low complexity dilated bottleneck structure.

Motivation 2

Recent two-stage object detectors RELY ON a backbone network which is pretrained on the ImageNet classification dataset. As

The design principles for the image classification is not good for the localization task (spatially localize the bounding-boxes)

Because: the spatial resolution of the feature maps is gradually decreased for the standard networks like VGG16 Reading Note and Resnet Reading Note.

Feature Pyramid Network (FPN) Reading Note and dilation are applied to these networks to maintain the spatial resolution.

However: still exists the following three problems when trained with these backbone networks.

1. The number of network stages is different

typical classification network involves 5 stages, with each stage down-sampling feature maps by pooling 2x or stride 2 convolution. Thus the output feature map spatial size is “32x” sub-sampled.

2. Weak visibility of large objects

The feature map with strong semantic information has strides of 32 respect to input image, which brings large valid receptive field and leads the success of ImageNet classification task.

However, large stride is harmful for the object localization. In Feature Pyramid Networks, large object is generated and predicted within deeper layers, the boundary of these object may be too blurry to get an accurate regression. This case is even worse when more stages are involved into classification network, since more down-sampling brings more strides to object.

3. nvisibility of small objects

Another drawback of large stride is the missing of small objects. The information from the small objects will be easily weaken as the spatial resolution of the feature maps is decreased and the large context information is integrated. Therefore, Feature Pyramid Network predicts small object in shallower layers.

However, shallow layers usually only have low semantic information which may be not sufficient to recognize the category of the object instances. Therefore detectors must enhance their classification capability by involving context cues of high-level representations from the deeper layers. Feature Pyramid Networks relieve it by adopting bottom-up pathway. However, if the small objects is missing in deeper layers, these context cues will miss simultaneously.

Network Structure

To address these top problems, DetNet has following characteristics:

(i) The number of stages is directly designed for Object Detection. ——> DetNet has exactly the same number of stages as the detector used, therefore extra stages like P6 can be pre-trained in ImageNet dataset.)

(ii) Even though involving more stages (such as 6 stages or 7 stages) than traditional classification network, we maintain high spatial resolution of the feature maps, while keeping large receptive field ——> Det- Net is more powerful in locating the boundary of large objects and finding the missing small objects.

ResNet-50 as our baseline.

Performance

There are two key-points in object detection evaluation, the one is average precision (AP) and the other is average recall (AR).

AR means how much objects we can find out, AP means how much objects is correctly predicted (right label for classification).

AP and AR are usually evaluated on different IoU threshold to validate the regression capability for object location. The larger IoU is, the more accurate regression needs.

AP and AR are also evaluated on different range of bounding box areas (small, middle, and large) to find the detail influences on the scale objects.

Other Useful Info.

Training strategies provided by facebook Research Detectron repository: Github Link