Title: Spatial Transformer Networks [Paper Link]

Authors: Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu

Association: Google DeepMind, London, UK

Submission: Feb 2016

…………………………………………………………………………………………………………………

»> Background of Deep Learning in Medical Image Analysis

[NEED A Reading Notes]

»> A Review: Robot-Assisted Endovascular Catheterization Technologies:

[NEED A Reading Notes]

…………………………………………………………………………………………………………………

»> Image Registration Basic Knowledge

[Basic Knowledge]

»> Image Registration Literature Review

[Literature Review]

»> Slice-To-Volume Medical Image Registration Background

[NEED A Reading Notes]

…………………………………………………………………………………………………………………

Contributions

Spatial Transformer Networks (STNs)

The Spatial Transformer mechanism addresses the issues above by providing Convolutional Neural Networks with explicit spatial transformation capabilities. It has three defining properties that make it appealing:

modular: STNs can be inserted anywhere into existing architectures with only minor adjustments.

differentiable: STNs can be trained with backpropagation, allowing end-to-end training of the models they are inserted into.

dynamic: STNs actively spatially transform a feature map for each input sample, whereas a pooling layer acts identically on all input samples.

The Spatial Transformer module consists of three components:

1. localisation network

[goal] output the parameters θ of the affine transformation to be applied to the input feature map.

[input] feature map U of shape (H, W, C)

[output] transformation matrix θ of shape (6,)

[architecture] fully-connected network or ConvNet

2. grid generator

[goal] to output a parametrised sampling grid, which is a set of points where the input map should be sampled to produce the desired transformed output.

[architecture]

3. sampler

[goal] to perform the spatial transformation of the input feature map: the sampler takes the set of sampling points Tθ(G) together with the input feature map U and produces the sampled output feature map V.

[architecture]
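The three components above can be sketched end to end in plain NumPy. This is a minimal illustration under my own assumptions, not the paper's implementation: the localisation network is reduced to a single fully-connected layer whose weights are zero and whose bias is initialised to the identity transform (the paper recommends identity initialisation), and the sampler uses the bilinear kernel with border clamping.

```python
import numpy as np

def localisation_net(U, W_loc, b_loc):
    """Toy localisation network: one fully-connected layer.
    In practice this is a small FC net or ConvNet; it must end in a
    regression layer producing theta of shape (6,)."""
    return W_loc @ U.ravel() + b_loc

def grid_generator(theta, H_out, W_out):
    """Build the parametrised sampling grid T_theta(G).
    theta: (6,) affine parameters, reshaped row-major to a 2x3 matrix.
    Returns source coords (x_s, y_s), each (H_out, W_out), in [-1, 1]."""
    A = theta.reshape(2, 3)
    # Regular target grid G in normalised coordinates.
    ys, xs = np.meshgrid(np.linspace(-1, 1, H_out),
                         np.linspace(-1, 1, W_out), indexing="ij")
    G = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)
    src = A @ G                       # (2, H_out*W_out) source coordinates
    return src[0].reshape(H_out, W_out), src[1].reshape(H_out, W_out)

def bilinear_sampler(U, x_s, y_s):
    """Sample feature map U (H, W) at normalised source coordinates."""
    H, W = U.shape
    # Map normalised coords [-1, 1] to pixel coords [0, W-1] / [0, H-1].
    x = (x_s + 1) * (W - 1) / 2
    y = (y_s + 1) * (H - 1) / 2
    x0 = np.floor(x).astype(int); x1 = x0 + 1
    y0 = np.floor(y).astype(int); y1 = y0 + 1
    # Bilinear weights (computed before clipping so they stay correct
    # at exact integer coordinates on the boundary).
    wa = (x1 - x) * (y1 - y)          # weight for (y0, x0)
    wb = (x1 - x) * (y - y0)          # weight for (y1, x0)
    wc = (x - x0) * (y1 - y)          # weight for (y0, x1)
    wd = (x - x0) * (y - y0)          # weight for (y1, x1)
    # Clamp indices to the border for out-of-range corners.
    x0 = np.clip(x0, 0, W - 1); x1 = np.clip(x1, 0, W - 1)
    y0 = np.clip(y0, 0, H - 1); y1 = np.clip(y1, 0, H - 1)
    return (wa * U[y0, x0] + wb * U[y1, x0] +
            wc * U[y0, x1] + wd * U[y1, x1])

# Forward pass of the whole module on a single-channel feature map.
U = np.arange(16, dtype=float).reshape(4, 4)
W_loc = np.zeros((6, U.size))                       # toy identity init
b_loc = np.array([1., 0., 0., 0., 1., 0.])
theta = localisation_net(U, W_loc, b_loc)
V = bilinear_sampler(U, *grid_generator(theta, 4, 4))
```

With the identity-initialised θ, `V` reproduces `U` exactly, which is a convenient sanity check before training the localisation network.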

It is also possible to use spatial transformers to downsample or oversample a feature map, since the output dimensions H' and W' can be defined to be different from the input dimensions H and W. However, with sampling kernels of fixed, small spatial support (such as the bilinear kernel), downsampling with a spatial transformer can cause aliasing effects.
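The aliasing point can be made concrete with a small sketch (my own illustration, not from the paper): downsampling a 4×4 map to H' = W' = 2 with an identity transform places sampling points only at the corner pixels, so the bilinear kernel's 2×2 support never reads the interior rows and columns.

```python
import numpy as np

# 4x4 input downsampled to H' = W' = 2 with an identity transform.
H, W, H_out, W_out = 4, 4, 2, 2
ys, xs = np.meshgrid(np.linspace(-1, 1, H_out),
                     np.linspace(-1, 1, W_out), indexing="ij")
# Map the normalised sampling grid back to input pixel coordinates.
px = (xs + 1) * (W - 1) / 2   # columns sampled: only 0.0 and 3.0
py = (ys + 1) * (H - 1) / 2   # rows sampled:    only 0.0 and 3.0
# A bilinear kernel reads at most a 2x2 neighbourhood per output pixel,
# so rows/columns 1 and 2 are never touched: high-frequency content
# in those locations is simply dropped, i.e. aliasing.
```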

Performance

Reference

Spatial Transformer Networks SlideShare link

Deep Learning Paper Implementations: Spatial Transformer Networks:

link1

link2