Research Summary
Object Detection up to Fast-RCNN

Published in Nerd For Tech

Jun 12, 2021 (11 min read)

Hello everyone, I am Aditya Raj, a 2nd-year student at IIITA, working remotely as a Machine Learning Engineer at yellowbacks.com.

Here I explain object detection from scratch up to Fast-RCNN. There are no prerequisites; everything is explained from the basics. This story should give you a clear insight into the architectures and workings of object detection in deep learning.

I will cover the YOLO versions and some other advanced techniques like Detectron and Mask RCNN in upcoming blogs.

Let’s Start:

Introduction

Pre-required sections:-

Images in Computer Vision/Deep Learning
Convnets
Image Classification vs Object Detection
Object detection with classification
RCNN
Image feature extraction and Spatial Pyramid Pooling(SPP)
SPPNet

Main section:- Research Summary for Fast-RCNN

Note:- Active deep-learning researchers can skip the pre-required sections and read the main section directly to stay updated.

Pre-required Section

Images in computer vision/Deep Learning

Every image in computer vision or deep learning can be treated as a matrix of some dimension (e.g. 224x224). Each value in the matrix is a pixel value that represents the color intensity at that point of the image.

If we look closely at the black-and-white image alongside, its matrix representation has pixel values near 255 in the white regions and 0 in the black regions.

Source of image:- miro.medium.com

The B&W image above is a 2-D matrix in which each value is the pixel value, i.e. the color intensity of that part of the image. Color images are treated as combinations of varying intensities of three colors, Red, Green and Blue (RGB), so we have one 2-D matrix per color, with each pixel value representing the intensity of that color.

Figure: the three color matrices stacked into a 3-dimensional matrix

Source of image:- researchgate.net

What we get is a 3-D matrix for the RGB (normal color) images we use in daily life. These are the matrices on which we apply deep learning methods for image classification, object detection and other operations.
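As a small illustration of the two representations, here is a minimal NumPy sketch. The 4x4 size and the pixel values are made up for the example; real images are simply larger arrays of the same kind.

```python
import numpy as np

# A tiny 4x4 grayscale image: 0 is black, 255 is white.
gray = np.array(
    [[0, 0, 255, 255],
     [0, 0, 255, 255],
     [255, 255, 0, 0],
     [255, 255, 0, 0]],
    dtype=np.uint8,
)

# A color image stacks three such 2-D matrices (R, G, B) along a third axis.
red = np.full((4, 4), 255, dtype=np.uint8)
green = np.zeros((4, 4), dtype=np.uint8)
blue = np.zeros((4, 4), dtype=np.uint8)
rgb = np.stack([red, green, blue], axis=-1)

print(gray.shape)  # (4, 4)
print(rgb.shape)   # (4, 4, 3)
print(rgb[0, 0])   # the top-left pixel: full red, no green, no blue
```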

Convnets

A convnet, or convolutional neural network (CNN), is a neural network architecture that extracts features from images by applying the convolution operation (not ordinary matrix multiplication), pooling and related techniques to the image matrix.

convolution operation

Convolution Kernels

Image source:- analytics Vidhya

A kernel is a small 2-D matrix (for example 3x3) whose size is chosen as part of the architecture rather than fixed by the input image. The kernel is placed over a patch of the input image matrix, the overlapping values are multiplied element-wise and the products are summed, giving one entry of a smaller output matrix. The kernel then moves across the matrix in a sliding-window fashion.

See the image below for a better understanding of the convolution operation with the kernel.

Image source:- Analytics Vidhya

The first entry of the convolution result above can be explained as:-

Image source:- Analytics Vidhya

45*0 + 12*(-1) + 5*0 + 22*(-1) + 10*5 + 35*(-1) + 88*0 + 26*(-1) + 51*0 = -45
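The same sum can be checked in a couple of lines. The patch and kernel values below are read off the arithmetic above in row-major order, which is an assumption about how the figure lays them out.

```python
import numpy as np

# 3x3 image patch and 3x3 kernel, taken from the worked sum above.
patch = np.array([[45, 12, 5],
                  [22, 10, 35],
                  [88, 26, 51]])
kernel = np.array([[0, -1, 0],
                   [-1, 5, -1],
                   [0, -1, 0]])

# Element-wise multiplication followed by a sum gives one output entry.
first_entry = int((patch * kernel).sum())
print(first_entry)  # -45
```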

The shape and values of the output matrix also depend on the number of rows/columns the kernel slides at each step, called the stride.

The output shape can also be controlled by adding new rows and columns around the border of the input matrix before convolving; this is called padding.

If all the values in the newly added rows/columns are zero, it is called zero padding.

Kernel convolution, striding and padding together make up the convolution operation on the image matrix.
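To see how stride and padding change the output shape, here is a minimal NumPy sketch of the whole operation. The function names `conv_output_size` and `conv2d` are illustrative, not from any library, and the loop-based implementation is written for clarity, not speed.

```python
import numpy as np

def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over a zero-padded 2-D `image` (no kernel flip, as in CNNs)."""
    padded = np.pad(image, padding, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    out_h = conv_output_size(image.shape[0], kh, stride, padding)
    out_w = conv_output_size(image.shape[1], kw, stride, padding)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Element-wise product of the covered patch and the kernel, then sum.
            out[i, j] = np.sum(padded[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3))
print(conv2d(image, kernel).shape)                       # (4, 4)
print(conv2d(image, kernel, padding=1).shape)            # (6, 6)
print(conv2d(image, kernel, stride=2, padding=1).shape)  # (3, 3)
```

With a 3x3 kernel, one ring of zero padding keeps the 6x6 shape, while a stride of 2 roughly halves it.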

Pooling

The pooling operation summarizes the feature map of the image matrix by sliding a 2-D filter (or kernel) over it.

Types of pooling:-

i.) max-pooling:- It selects the maximum element from the region of the feature map covered by the filter.

ii.) average-pooling:- It takes the average of all elements in the region of the feature map covered by the filter.

iii.) global-pooling:- It reduces the entire feature map to a single value (it may be global max or global average pooling).

Image source for the images above:- geeks for geeks
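The three pooling types can be compared on one small feature map. This is a minimal sketch; `pool2d` is an illustrative helper, and the 2x2 filter with stride 2 is just a common choice.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Slide a size x size window over a 2-D feature map and reduce each window."""
    reduce_fn = np.max if mode == "max" else np.mean
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = reduce_fn(feature_map[r:r + size, c:c + size])
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 5, 6]], dtype=float)

max_pooled = pool2d(fmap, mode="max")  # [[6, 4], [7, 9]]
avg_pooled = pool2d(fmap, mode="avg")  # [[3.75, 2.25], [4.0, 5.25]]
global_max = fmap.max()                # 9.0
global_avg = fmap.mean()               # 3.8125
```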

Convolution-architectures

A convolutional architecture consists of a series of convolution and pooling operations arranged as layers, called convolution layers and pooling layers.

There are many famous CNN architectures with a predefined number of layers and predefined kernel and filter sizes for every convolution or pooling layer.

Example :- VGG16

Image source:- geeks for geeks

We can now easily understand this famous architecture: a stack of convolution layers with ReLU activations, interleaved with max-pooling layers, followed by fully connected layers and a softmax as the last layer to output class probabilities.

Image Classification vs Object Detection

Image classification:- Assigning a class label to a whole image, using a classifier trained on features extracted by a CNN architecture.

Image localization:- Locating the object inside the image, i.e. separating the object from the background, using computer vision techniques such as sliding window, bag of visual words and selective search.

Object detection:- When multiple objects are present in an image, all the objects are first localized, and then a CNN architecture and a classification algorithm predict the class of each localized object.

Image source:- geeksforgeeks

Object detection with classification

Sliding window:- In this algorithm we choose a window of some small fixed size and slide it over the input image with a given stride. The part of the input image matrix covered by the window is passed through a CNN to predict its class, which may be any object class or background. The window slides over the whole input image so that every part is classified.

Image:- cogneethi.com

Problem:- We need to apply the CNN at a huge number of locations and scales, which is computationally very expensive and very memory inefficient.
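A quick count shows how fast the number of windows grows. This is a minimal sketch; the 224x224 image, the window sizes and the stride of 16 are arbitrary values chosen for illustration.

```python
def sliding_windows(img_h, img_w, win, stride):
    """Yield (top, left, bottom, right) for every window position."""
    for top in range(0, img_h - win + 1, stride):
        for left in range(0, img_w - win + 1, stride):
            yield (top, left, top + win, left + win)

# One scale: a 64x64 window moved 16 pixels at a time over a 224x224 image.
windows = list(sliding_windows(224, 224, win=64, stride=16))
print(len(windows))  # 121

# Three scales: every window would need its own CNN forward pass.
total = sum(len(list(sliding_windows(224, 224, win, 16))) for win in (32, 64, 128))
print(total)  # 339
```

Even this coarse setup needs hundreds of CNN evaluations for one small image; finer strides, more scales and different aspect ratios multiply the count further.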

Region proposal:- Here, segments of the input image matrix that are likely to contain an object are localized by a region proposal algorithm, and the CNN is then run only on them.

We find ‘blobby’ image regions that are likely to contain objects with the help of the ‘selective search’ algorithm. Typically 1000–2000 such regions (region proposals) are generated, and the CNN with classification is then run on all of them.

Selective search:-

i.) We generate initial segments of the input image using the method from the paper “Efficient Graph-Based Image Segmentation” by Felzenszwalb et al.

ii.) We recursively combine smaller segments into larger ones with the greedy algorithm given below.

Greedy algorithm:
1. Choose the two most similar regions from the set of regions.
2. Combine those two smaller regions into one larger region.
3. Repeat these steps for multiple iterations.

Image source:- geeksforgeeks
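The greedy merging loop can be sketched on a toy example. This is a heavy simplification and not the real selective search: the regions here are hypothetical dictionaries with a bounding box, a pixel count and a mean intensity, similarity is just the closeness of mean intensity, and any two regions may be merged. The actual algorithm only merges neighbouring regions and combines several similarity measures.

```python
from itertools import combinations

def similarity(a, b):
    """Toy similarity: regions with closer mean intensity are more similar."""
    return -abs(a["mean"] - b["mean"])

def merge(a, b):
    """Union of the two boxes, with a size-weighted mean intensity."""
    size = a["size"] + b["size"]
    return {
        "box": (min(a["box"][0], b["box"][0]), min(a["box"][1], b["box"][1]),
                max(a["box"][2], b["box"][2]), max(a["box"][3], b["box"][3])),
        "size": size,
        "mean": (a["mean"] * a["size"] + b["mean"] * b["size"]) / size,
    }

def greedy_merge(regions):
    """Repeatedly merge the two most similar regions; every region that ever
    exists contributes its box as a region proposal."""
    regions = list(regions)
    proposals = [r["box"] for r in regions]
    while len(regions) > 1:
        i, j = max(combinations(range(len(regions)), 2),
                   key=lambda p: similarity(regions[p[0]], regions[p[1]]))
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged["box"])
    return proposals

# Three initial segments as (x1, y1, x2, y2) boxes: two dark, one bright.
segments = [
    {"box": (0, 0, 10, 10), "size": 100, "mean": 20.0},
    {"box": (10, 0, 20, 10), "size": 100, "mean": 25.0},
    {"box": (0, 10, 20, 20), "size": 200, "mean": 200.0},
]
proposals = greedy_merge(segments)
print(len(proposals))  # 5: three segments plus two merged regions
```

The two dark segments are merged first, and the result is then merged with the bright one, so the proposals cover several scales from small segments up to the whole image.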

R-CNN(Region CNN)

Region CNN is a CNN-based image region classification algorithm trained to predict both the class of a region or object and its position. Let’s understand how it works in steps:-

1. About 2000 proposed regions in the form of matrices (image parts likely to contain objects) are extracted from the input image matrix using a region proposal method based on the selective search algorithm. Image:- FAIR (Facebook AI Research)

2. All these Regions of Interest (RoI) matrices are warped or resized to the fixed dimensions that the CNN expects as input for feature extraction. Image:- FAIR (Facebook AI Research)

3. These warped image regions, in the form of matrices, are given as input to the convnet for feature extraction.

4. The features extracted by the CNN are passed on to the classifier (an SVM in this case) to predict the class of the detected objects. Image:- FAIR (Facebook AI Research)

5. Along with predicting the class of the object, we also need to specify the region where the object is present in the image, i.e. predict the position of the ROI in the input image matrix. This is done by adding a bounding box regressor alongside the SVM after the CNN layers. It outputs the coordinates of a box (a rectangle denoting the ROI in the input matrix), i.e. output = (x1, y1, x2, y2). It was trained by minimizing a least squares loss (L2 loss).

Image source :- FAIR(facebook AI research)
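Step 2 above, cropping each proposal and warping it to a fixed size, can be sketched as follows. This is only an illustration: the image is random data, the two boxes are made-up proposals, the 224x224 target size is an assumption, and nearest-neighbour sampling stands in for whatever interpolation a real pipeline would use.

```python
import numpy as np

def crop_and_warp(image, box, out_size=(224, 224)):
    """Crop box=(x1, y1, x2, y2) from an H x W x C image and resize the crop
    to out_size with nearest-neighbour sampling (aspect ratio is not kept)."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    out_h, out_w = out_size
    # For each output row/column, pick the source row/column it maps back to.
    rows = np.arange(out_h) * crop.shape[0] // out_h
    cols = np.arange(out_w) * crop.shape[1] // out_w
    return crop[rows][:, cols]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
proposals = [(10, 20, 110, 220), (200, 50, 380, 140)]  # hypothetical RoIs

# Every RoI becomes a separate fixed-size CNN input.
warped = np.stack([crop_and_warp(image, box) for box in proposals])
print(warped.shape)  # (2, 224, 224, 3)
```

With about 2000 proposals per image, this batch has about 2000 entries, and each one goes through the full CNN.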

In R-CNN, several steps were taken to improve accuracy:-

i.) The CNN architecture was fine-tuned on region proposals rather than only on the images it was originally trained on.

ii.) During training, the final extracted features were also fine-tuned with log loss (a softmax layer), in addition to training the SVM and the bounding box regressor.

iii.) The softmax layer was used only during training; during testing (object detection), only the SVM and the Bbox regressor are present.

Drawbacks of RCNN

1.) It is a three-stage training pipeline, which makes it very slow:-

Fine-tuning the network with a softmax classifier (log loss).
Training linear SVMs (hinge loss).
Training the bounding box regressor (least squares loss).

2.) The CNN is applied to every single ROI, making it very inefficient in memory and time.

3.) With VGG16, training takes 84 hrs and inference (detection) takes 47 sec per image, making it very slow to use.

Spatial Pyramid Pooling(SPP)

Spatial pyramid pooling is a method that maintains spatial information by pooling in local spatial bins. The number of bins is fixed, and their size scales with the input. The responses of all filters are pooled in each spatial bin. In the image given below, pooling is done at three levels. SPPNet uses a similar three-level pooling.

The first level is global max pooling over the whole input feature map, giving a single 1x1x256 output (256 is the depth of the feature map).
The second level applies the same pooling after dividing the feature map into four quadrants, giving a 4x1x256 output.
The third level is similar, but the feature map is divided into 16 cells, giving a 16x1x256 output.
The three outputs are concatenated to give a fixed 21x1x256 output.
In SPP the output therefore always has a fixed length of 21 bins, irrespective of the size and dimensions of the input matrix.

Another important benefit of SPP concerns image resizing: we no longer need to warp the ROI (which decreases accuracy), because SPP handles the varying sizes automatically, which improves accuracy further.

Image source:- cogneethi
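The three-level pooling described above can be sketched in NumPy. This is a minimal illustration, not the SPPNet implementation: `spatial_pyramid_pool` is a made-up helper, the feature maps are random, and the bin edges are computed with a simple rounding scheme that assumes the map is at least 4x4.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool an H x W x C feature map into n x n bins for each level and
    concatenate the results into a (sum of n*n, C) array."""
    h, w, _ = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges that cover the whole map even when h or w is not divisible by n.
        row_edges = np.linspace(0, h, n + 1).astype(int)
        col_edges = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[row_edges[i]:row_edges[i + 1],
                                   col_edges[j]:col_edges[j + 1]]
                pooled.append(cell.max(axis=(0, 1)))  # one value per channel
    return np.stack(pooled)

rng = np.random.default_rng(0)
small = rng.standard_normal((13, 13, 256))
large = rng.standard_normal((20, 31, 256))

print(spatial_pyramid_pool(small).shape)  # (21, 256)
print(spatial_pyramid_pool(large).shape)  # (21, 256)
```

Both inputs produce 1 + 4 + 16 = 21 bins of depth 256, which is why no warping to a fixed size is needed before the dense layers.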

SPPNet

SPPNet was introduced to counter the drawbacks of RCNN. The main reason for RCNN’s inefficiency was that the CNN had to operate on each of the 2000 ROIs. SPPNet solved this problem.

Only the single input image matrix is sent through the CNN layers to get the feature map. Image source:- FAIR

ROIs are still generated by the selective search algorithm on the input image, and each ROI is then projected onto the feature map. Image source:- FAIR

A spatial pyramid pooling layer is applied to these ROIs to give a fixed-dimensional output irrespective of the input size. Next three images source:- FAIR

The fixed-dimensional output of the SPP layer is then passed on to dense layers, and the resulting final extracted feature is classified with the SVM.

Along with the class of the object in a region, we also need to find the location of the object, so a bounding box regressor is added as in RCNN.

Along with the SVM and Bbox regressor, a softmax layer is also used, during training only.

Image source:- cogneethi

Main Section:: Research topic

FAST RCNN

SPPNet was better and faster than RCNN, but it still had several drawbacks:-

It was still a three-stage training process with a softmax layer (log loss), SVM (hinge loss) and Bbox regressor (L2 loss), which made it slow.
Its training time was still 25 hrs, and it was also memory inefficient.
The convnet layers before the SPP layer cannot be updated during fine-tuning.

To remove these problems, several changes and upgrades were made to SPPNet, resulting in Fast RCNN.

Speed compared with RCNN:

SPPNet = training:- 3 times faster, test:- 10 to 100 times faster

Fast RCNN = training:- 9 times faster, test:- 0.3 sec per image

Therefore Fast RCNN was proposed: a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.

Let us understand the changes made to SPPNet that result in Fast-RCNN:-

1. Instead of the multi-level SPP with levels L0 (1x1), L1 (2x2), L2 (4x4),

a single SPP level with a 7x7 grid is used here:

Next 3 Images source :- cogneethi

2. Instead of using softmax only during training and an SVM as the final classifier, a single softmax classifier is used both during training and for classification at test time.

Figures: earlier in SPPNet; now in Fast-RCNN

3. For the bounding box regressor:-

i. Its architecture was changed: two FC layers were added after the pooled feature map, and from there two separate FC layers branch out, one leading to the softmax (for classification) and the other to the Bbox regressor (for predicting the position of the object in the form [x, y, w, h]). Here x, y = the change in the center position of the rectangle around the object, and w, h = the change in the width and height of that rectangle, as learned during training.

ii. A smooth L1 loss was used in place of the L2 loss for training the bounding box regression.

Figures: earlier Bbox architecture; current Bbox architecture

4. The number of training stages was reduced.

First, SVM + softmax was reduced to just softmax.

Then,

classfn (log loss) + Bbox reg. (smooth L1 loss) => joined into a single loss function that is minimized by backpropagation.

So the network went from a three-stage loss fn. to a single-stage loss fn.

5. We know that the first few layers of a CNN are basically shape and edge detectors that are common across tasks, so for further optimization and time reduction we fine-tune the CNN layers in Fast-RCNN only from the 3rd or 4th layer onward, keeping the earlier layers fixed.
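Changes 3 and 4 above come together in one joint loss per RoI. Here is a minimal NumPy sketch, under these assumptions: the box term uses the smooth variant of the L1 loss from the Fast R-CNN paper, class index 0 stands for background and gets no box loss, and the weight `lam` and all the numbers are illustrative.

```python
import numpy as np

def smooth_l1(x):
    """0.5 * x^2 where |x| < 1, and |x| - 0.5 elsewhere (less sensitive to outliers than L2)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def fast_rcnn_loss(class_scores, box_deltas, true_class, true_deltas, lam=1.0):
    """Joint loss for one RoI.

    class_scores: (K+1,) raw scores, index 0 is background.
    box_deltas:   (K+1, 4) predicted [x, y, w, h] offsets, one row per class.
    """
    # Log loss on the softmax output (numerically stable log-softmax).
    shifted = class_scores - class_scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    cls_loss = -log_probs[true_class]
    # Box regression loss only for real objects, not for background RoIs.
    if true_class == 0:
        loc_loss = 0.0
    else:
        loc_loss = smooth_l1(box_deltas[true_class] - true_deltas).sum()
    return cls_loss + lam * loc_loss

scores = np.array([0.0, 2.0, 0.0])       # background, class 1, class 2
pred_deltas = np.zeros((3, 4))
pred_deltas[1] = [0.1, 0.1, 0.0, 0.0]    # predicted offsets for class 1
target = np.array([0.1, 0.1, 0.5, 2.0])  # made-up ground-truth offsets

loss_fg = fast_rcnn_loss(scores, pred_deltas, true_class=1, true_deltas=target)
loss_bg = fast_rcnn_loss(scores, pred_deltas, true_class=0, true_deltas=target)
print(loss_fg, loss_bg)
```

Because both terms are added into a single number, one backward pass updates the classifier, the box regressor and the shared layers together.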

Fast-RCNN architecture
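The single 7x7 pooling grid described above can be sketched in NumPy. This is a minimal illustration, not the reference implementation: `roi_pool` is a made-up helper, the RoI is assumed to be given already in feature-map coordinates, and the bin edges use a simple floor/ceil scheme so that no bin is empty even for small RoIs.

```python
import numpy as np

def roi_pool(feature_map, roi, grid=7):
    """Max-pool the region roi=(x1, y1, x2, y2) of an H x W x C feature map
    into a fixed grid x grid x C output."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w, depth = region.shape
    row_edges = np.linspace(0, h, grid + 1)
    col_edges = np.linspace(0, w, grid + 1)
    out = np.zeros((grid, grid, depth), dtype=feature_map.dtype)
    for i in range(grid):
        r0 = int(np.floor(row_edges[i]))
        r1 = max(int(np.ceil(row_edges[i + 1])), r0 + 1)
        for j in range(grid):
            c0 = int(np.floor(col_edges[j]))
            c1 = max(int(np.ceil(col_edges[j + 1])), c0 + 1)
            out[i, j] = region[r0:r1, c0:c1].max(axis=(0, 1))
    return out

rng = np.random.default_rng(0)
fmap = rng.standard_normal((40, 60, 256))  # feature map of one whole image

# RoIs of very different sizes all map to the same fixed output shape.
big = roi_pool(fmap, (5, 5, 30, 25))
small = roi_pool(fmap, (0, 0, 9, 4))
print(big.shape, small.shape)  # (7, 7, 256) (7, 7, 256)
```

The CNN runs once per image to produce the feature map, and only this cheap pooling step is repeated for each RoI.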