Region Proposal Networks — Faster R-CNN Explanation

Nguyễn Thành Công
5 min read · Jan 23, 2024


When I studied Faster R-CNN, I found most existing articles and explanations very hard to understand. Because of that, I am writing down the explanation as I have understood it myself.

Unlike in a classification problem, in object detection you not only need to figure out what kind of object is present, you also need to point out where it is located in the image. For this reason, before the classification stage you need to find all locations that may contain an object; this is called the Region Proposal stage.

In the earlier architectures, R-CNN and Fast R-CNN, the Region Proposal stage is done by the selective search algorithm, which is a classical image-processing algorithm.

The problem with the selective search algorithm is that it produces a lot of output boxes and is slow to run, which makes R-CNN very slow. That's why the Region Proposal Network (RPN) came in. It is a deep learning network that can check whether a bounding box contains an object or background, and also refine the size of the bounding box.

Architecture of Faster R-CNN

Region Proposal Networks (RPN)

The original paper describes the RPN's input and output as follows:

RPN takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score

RPN uses a fully convolutional network (FCN) to extract features from the image. The parameters of this network are shared with the detection network, which later uses the same features to predict what each object is.

In the paper, they used ZF and VGG-16 for the experiments. In this story, I focus only on VGG-16.

The architecture of VGG-16 is described below. The RPN uses a pre-trained VGG-16 from the first layer up to the last convolutional layer, whose output is a 14x14x512 feature map (for a 224x224 input).
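To make this concrete, here is a minimal sketch of truncating a pre-trained VGG-16 at its last convolutional layer, assuming PyTorch and a recent torchvision (these are not the frameworks used in the original paper):

```python
import torch
from torchvision.models import vgg16, VGG16_Weights

# Keep everything in VGG-16's convolutional part except the final max-pool,
# so the output is the conv5_3 feature map.
backbone = vgg16(weights=VGG16_Weights.DEFAULT).features[:-1]

x = torch.randn(1, 3, 224, 224)   # a dummy 224x224 RGB image
feature_map = backbone(x)
print(feature_map.shape)          # torch.Size([1, 512, 14, 14])
```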

https://qiita.com/yukitaka13-1110/items/f4d4b38a24ca07045791

To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map

This is easy to understand. After getting the feature map output by the pre-trained VGG described above, they apply a small neural network to each n x n window slid over the feature map. In the paper, the authors use n = 3.

This small network maps each sliding window to a lower-dimensional feature (256-d for ZF and 512-d for VGG, followed by ReLU). This intermediate feature is then fed into two sibling fully connected layers: a box-regression layer (reg) and a box-classification layer (cls). The classification layer outputs 2k scores, and the regression layer outputs 4k coordinates.
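Here is a minimal sketch of this head, assuming PyTorch. Because the small network is shared across all positions, the 3x3 sliding window and the two sibling layers are naturally implemented as convolutions; the module names below are illustrative:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Slide a 3x3 window over the feature map, then apply two sibling 1x1 convs."""
    def __init__(self, in_channels=512, mid_channels=512, k=9):
        super().__init__()
        # the n x n (n = 3) sliding window, shared across all positions
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)  # 2k objectness scores
        self.reg = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)  # 4k box coordinates

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.cls(h), self.reg(h)

scores, deltas = RPNHead()(torch.randn(1, 512, 14, 14))
print(scores.shape, deltas.shape)   # [1, 18, 14, 14] and [1, 36, 14, 14]
```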

Anchor boxes

You might wonder why there are 2k scores and 4k coordinates :)) Anchor boxes are the answer.

An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio

So an anchor is a box centered at the corresponding sliding-window position, with a given scale and aspect ratio. In the paper, they use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding-window position. So for a feature map of size W x H, there are WHk anchors in total.
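As a rough illustration, assuming NumPy and using the scales and ratios from the paper (the exact centering and coordinate conventions differ between implementations), the full set of anchors can be generated like this:

```python
import numpy as np

scales = [128, 256, 512]   # anchor sizes in pixels (box areas 128^2, 256^2, 512^2), as in the paper
ratios = [0.5, 1.0, 2.0]   # height/width aspect ratios
stride = 16                # VGG-16 downsamples the input by 16

def anchors_for_feature_map(W, H):
    """Return all W*H*k anchors as (x1, y1, x2, y2) boxes in input-image coordinates."""
    base = []
    for s in scales:
        for r in ratios:
            w, h = s / np.sqrt(r), s * np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])   # k = 9 boxes around (0, 0)
    base = np.array(base)

    # shift the 9 base anchors to every sliding-window position
    xs, ys = np.meshgrid(np.arange(W) * stride, np.arange(H) * stride)
    shifts = np.stack([xs.ravel(), ys.ravel(), xs.ravel(), ys.ravel()], axis=1)
    return (shifts[:, None, :] + base[None, :, :]).reshape(-1, 4)

print(anchors_for_feature_map(14, 14).shape)   # (1764, 4) = 14 * 14 * 9
```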

So in the end, for each sliding-window location, the network simultaneously predicts multiple region proposals, where the maximum number of possible proposals per location is k. Multiple scales and aspect ratios are used because objects come in many different shapes and sizes. This is explained further below. :))

The reg layer has 4k outputs encoding the coordinates of the k boxes (parameterized as the box center x, y and its width and height, relative to the anchor), and the cls layer outputs 2k scores that estimate the probability of object vs. not-object for each proposal.
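The paper parameterizes these 4 values as offsets of the box center and log-scale factors of the width and height, relative to the anchor: t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a). A small sketch of decoding one such prediction back into a box (the function name here is illustrative):

```python
import numpy as np

def decode(anchor, t):
    """Turn one reg output t = (tx, ty, tw, th) into a box, following the
    paper's parameterization: center offsets plus log-scale size factors."""
    xa, ya, wa, ha = anchor      # anchor center (xa, ya), width wa, height ha
    tx, ty, tw, th = t
    x = tx * wa + xa             # predicted box center x
    y = ty * ha + ya             # predicted box center y
    w = wa * np.exp(tw)          # predicted box width
    h = ha * np.exp(th)          # predicted box height
    return x, y, w, h

# e.g. a 128x128 anchor centered at (100, 100), with a small predicted shift
print(decode((100, 100, 128, 128), (0.1, -0.05, 0.2, 0.0)))
```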

Why do we need 2 scores for classification? Why not just one?

It's okay. In the paper, they say:

For simplicity we implement the cls layer as a two-class softmax layer. Alternatively, one may use logistic regression to produce k scores.

So it's no problem if you use a sigmoid activation here. But in the paper, they output two class scores for simplicity, which are then passed to a softmax function.
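A quick numerical check (assuming PyTorch) that the two choices agree: the two-class softmax score for "object" equals the sigmoid of the difference of the two logits.

```python
import torch

logits = torch.tensor([2.0, -1.0])                 # [object, background] logits for one anchor
p_softmax = torch.softmax(logits, dim=0)[0]        # two-class softmax, as in the paper
p_sigmoid = torch.sigmoid(logits[0] - logits[1])   # single logistic score on the difference
print(p_softmax.item(), p_sigmoid.item())          # both ~0.9526
```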

Translation-Invariant Anchors

An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors. If one translates an object in an image, the proposal should translate and the same function should be able to predict the proposal in either location

These sentences mean that the approach achieves translation invariance: if an object is translated in the image, the same function can still predict the corresponding proposal at the new location.

Translation Invariance

Multi-Scale Anchors as Regression References

There have been two popular ways to make multi-scale predictions.

The first way is based on image/feature pyramids: the deep CNN is run on the image at many scales (a). But this is often time-consuming.

The second way is to use sliding windows of multiple scales on the feature maps (b).

The anchor-based method is built on a pyramid of anchors (c).
