Title: Focal Loss for Dense Object Detection Paper Link
Authors: Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollar
Association: Facebook AI Research (FAIR)
Submission: Feb 2018
Contributions
(1) RetinaNet is essentially a Feature Pyramid Network (FPN) Reading Note with the cross-entropy loss replaced by Focal loss.
(2) Since class imbalance during training is the main obstacle impeding one-stage detectors, the paper proposes a new loss function, ==Focal loss==, which significantly improves accuracy.
In other words, the focal loss performs the opposite role of a robust loss: it focuses training on a sparse set of hard examples.
Background
Current state-of-the-art object detectors are based on a ==two-stage, proposal-driven mechanism.==
Q: What are two-stage object detectors? The first stage generates a sparse set of candidate object locations. The second stage classifies each candidate location as one of the foreground classes or as background using a convolutional neural network.
Q: What are one-stage object detectors? They are applied over a regular, dense sampling of candidate object locations, scales, and aspect ratios, without a separate proposal stage.
However: could a simple one-stage detector achieve similar accuracy?
CNN based object detectors can be categorized into:
| one-stage detectors | two-stage detectors |
| --- | --- |
| YOLO Reading Note | Faster R-CNN Reading Note |
| SSD Reading Note | R-CNN Reading Note |
| RetinaNet Reading Note | FPN Reading Note |
| OverFeat Reading Note | |
Motivation
Class Imbalance Problem:
- Training is inefficient as most locations are easy negatives that ==contribute no useful== learning signal;
- the easy negatives can overwhelm training and lead to degenerate models.
Common Solutions: perform some form of hard negative mining that samples hard examples during training, use more complex sampling/reweighting schemes, or introduce a weighting factor α (treated as a hyperparameter) ==to balance the importance of positive/negative examples; however, α does not differentiate between easy/hard examples.==
Their Solution: the focal loss naturally handles the class imbalance, allowing efficient training on all examples without sampling and without easy negatives overwhelming the loss and gradients, ==by reshaping the loss function to down-weight easy examples and thus focus training on hard negatives.==
Focal Loss
The focal loss addresses the extreme imbalance between foreground and background classes during training.
Focal loss: add a modulating factor (1 − pt)^γ to the cross-entropy loss, with a tunable focusing parameter γ ≥ 0, where pt is the model's estimated probability for the ground-truth class (pt = p if y = 1, and pt = 1 − p otherwise).
In practice, an α-balanced variant of the focal loss is used:
FL(pt) = −αt (1 − pt)^γ log(pt)
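A minimal sketch of this α-balanced focal loss, assuming PyTorch and the paper's defaults γ = 2 and α = 0.25; the function and variable names are illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits: raw scores; targets: float 0/1 labels with the same shape."""
    # Standard binary cross entropy per element equals -log(pt)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # pt is the model's estimated probability of the ground-truth class
    pt = torch.where(targets == 1, p, 1 - p)
    # alpha_t balances positives/negatives; (1 - pt)^gamma down-weights easy examples
    alpha_t = torch.where(targets == 1, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    return (alpha_t * (1 - pt) ** gamma * ce).sum()
```

In the paper the summed loss is normalized by the number of anchors assigned to ground-truth boxes; the reduction above is left as a plain sum.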
In practice, binary classification models are by default initialized so that the probability of outputting y = −1 or y = 1 is equal; in the presence of class imbalance, the loss due to the frequent class can then dominate the total loss and cause instability in early training.
To counter this, the authors introduce a ‘prior’ for the value of p estimated by the model for the rare class (foreground) at the start of training. They denote the prior by π and set it so that the model’s estimated p for examples of the rare class is low, e.g. 0.01. This prior initialization improves training stability for both the cross-entropy and focal loss in the case of heavy class imbalance.
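A minimal sketch of this prior initialization, assuming PyTorch; the layer name and the values A = 9, K = 80, C = 256 are illustrative. Setting the final classification layer's bias to b = −log((1 − π)/π) makes the initial sigmoid output ≈ π:

```python
import math
import torch.nn as nn

pi = 0.01                      # prior probability for the rare (foreground) class
A, K, C = 9, 80, 256           # anchors, classes, channels (illustrative values)

cls_logits = nn.Conv2d(C, A * K, kernel_size=3, padding=1)
nn.init.normal_(cls_logits.weight, std=0.01)
# sigmoid(bias) = pi  =>  bias = -log((1 - pi) / pi)
nn.init.constant_(cls_logits.bias, -math.log((1 - pi) / pi))
```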
RetinaNet
Backbone Network: Feature Pyramid Network (FPN)
” Feature Pyramid Network augments a standard convolutional network with a top-down pathway and lateral connections so the network efficiently constructs a rich, multi-scale feature pyramid from a single resolution input image. Each level of the pyramid can be used for detecting objects at a different scale. FPN improves multi-scale predictions from fully convolutional networks (FCN).”
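A minimal sketch of one top-down merge step with a lateral connection, assuming PyTorch; the module name and channel counts are illustrative, and the top-down input is assumed to already have `out_channels` channels:

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(in_channels, out_channels, kernel_size=1)            # lateral connection
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)  # reduce aliasing

    def forward(self, bottom_up_feat, top_down_feat):
        # Upsample the coarser top-down map to the finer bottom-up resolution, then add
        top_down = F.interpolate(top_down_feat, size=bottom_up_feat.shape[-2:], mode="nearest")
        merged = self.lateral(bottom_up_feat) + top_down
        return self.smooth(merged)
```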
The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) backbone on top of a feedforward ResNet architecture (a) to generate a rich, multi-scale convolutional feature pyramid (b). To this backbone RetinaNet attaches two subnetworks:
Classification Subnet (for classifying anchor boxes (c))
“The classification subnet predicts the probability of object presence at each spatial position for each of the A anchors and K object classes. This subnet is a small FCN attached to each FPN level; parameters of this subnet are shared across all pyramid levels. Its design is simple. Taking an input feature map with C channels from a given pyramid level, the subnet applies four 3×3 conv layers, each with C filters and each followed by ReLU activations, followed by a 3×3 conv layer with KA filters. Finally sigmoid activations are attached.”
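A minimal sketch of the classification subnet described above, assuming PyTorch; C = 256, A = 9, K = 80 are the paper's typical values, written here as plain arguments:

```python
import torch.nn as nn

def make_cls_subnet(C=256, A=9, K=80):
    layers = []
    for _ in range(4):                                    # four 3x3 convs with C filters, each + ReLU
        layers += [nn.Conv2d(C, C, 3, padding=1), nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(C, K * A, 3, padding=1))      # final 3x3 conv with K*A filters
    layers.append(nn.Sigmoid())                           # sigmoid activations per anchor/class
    return nn.Sequential(*layers)
```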
We found higher-level design decisions to be more important than specific values of hyperparameters.
Box Regression Subnet (for regressing from anchor boxes to ground-truth object boxes (d))
“In parallel with the object classification subnet, we attach another small FCN to each pyramid level for the purpose of regressing the offset from each anchor box to a nearby ground-truth object, if one exists. The design of the box regression subnet is identical to the classification subnet except that it terminates in 4A linear outputs per spatial location, see Figure 3 (d). For each of the A anchors per spatial location, these 4 outputs predict the relative offset between the anchor and the ground-truth box (we use the standard box parameterization from R-CNN). We note that unlike most recent work, we use a class-agnostic bounding box regressor which uses fewer parameters and we found to be equally effective.”
The object classification subnet and the box regression subnet share a common structure but use separate parameters.
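A minimal sketch of the box regression subnet under the same assumptions; it mirrors the classification trunk but ends in 4A outputs per spatial location and keeps its own parameters:

```python
import torch.nn as nn

def make_box_subnet(C=256, A=9):
    layers = []
    for _ in range(4):                                    # same 4 x (3x3 conv + ReLU) trunk
        layers += [nn.Conv2d(C, C, 3, padding=1), nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(C, 4 * A, 3, padding=1))      # 4 box offsets per anchor
    return nn.Sequential(*layers)
```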
Performance
There are two key metrics in object detection evaluation: one is average precision (AP) and the other is average recall (AR).
AR measures how many of the ground-truth objects are found, while AP measures how many of the predictions are correct (the right label for classification).
AP and AR are usually evaluated at different IoU thresholds to validate the localization (regression) capability: the larger the IoU threshold, the more accurate the regression needs to be.
AP and AR are also evaluated over different ranges of bounding box areas (small, medium, and large) to examine the influence of object scale in detail.
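A minimal sketch of the IoU computation that these thresholds are applied to, assuming corner-format boxes (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```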