ECE 417 MP5: Face Detection using Faster RCNN

In this MP, you will implement the Faster RCNN object detector and use it to perform face detection on a small dataset extracted from the much larger WIDER FACE dataset.

0. Browsing the Data

First, we'll create a function that plots face rectangles on top of the image. Then we'll use it to look at some of the data.

The dataset is provided to you as a set of 50 images, together with their extracted features and their labels.

These things are available so that you can look at the data. They are not actually going to be used by the neural net:

These are the things the neural net will actually use. They have been computed for you from the data above:

The two layers of the Faster RCNN are

  1. A $3\times 3$ convolutional layer, with $N_C=512$ input channels, thus
$$\xi^{(1)}[n_1,n_2,d] = \sum_{c=0}^{N_C-1}\sum_{m_1=-1}^{1}\sum_{m_2=-1}^{1}w^{(1)}[m_1,m_2,c,d]x[n_1-m_1,n_2-m_2,c]$$
$$h[n_1,n_2,d]=\mbox{ReLU}\left(\xi^{(1)}[n_1,n_2,d]\right),$$

and W1.shape is $(M_1,M_2,N_C,N_D)=(3,3,512,128)$.

  2. A $1\times 1$ convolution, with $N_D=128$ input channels, and $N_A\times N_Y$ outputs. The $N_A\times N_Y$ outputs match the last two dimensions of the target array: they correspond to $N_A$ anchor rectangles ($a$) per position, with $N_Y$ outputs per anchor rectangle, thus:
$$\xi^{(2)}[n_1,n_2,a,k]=\sum_d w^{(2)}[0,0,d,a,k]h[n_1,n_2,d]$$

and W2.shape is $(1,1,N_D,N_A,N_Y)=(1,1,128,9,5)$. Notice that this layer is described as a "$1\times 1$ convolution," and you could implement it that way, where the tensor $h[:,:,:]$ is convolved with the filter $w^{(2)}[:,:,:,:,k]$. Alternatively, you could just implement it as a matrix multiplication, where each vector $h[n_1,n_2,:]$ is multiplied by the matrix $w^{(2)}[0,0,:,:,k]$. Either implementation should work (the ECE 417 staff have tried both).
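For example, the matrix-multiplication view can be written in one line with np.einsum. Here is a small, self-contained sketch; the array names and random contents are just for illustration:

```python
import numpy as np

N1, N2, ND, NA, NY = 8, 8, 128, 9, 5
h = np.random.randn(N1, N2, ND)          # hidden-layer activations (illustrative)
w2 = np.random.randn(1, 1, ND, NA, NY)   # second-layer weights (illustrative)

# xi2[n1, n2, a, k] = sum_d w2[0, 0, d, a, k] * h[n1, n2, d]
xi2 = np.einsum('ijd,dak->ijak', h, w2[0, 0])
print(xi2.shape)   # (8, 8, 9, 5)
```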

The W1 and W2 tensors have been initialized to some small random values in the file weights_initial.hdf5:

Neural network weights are never initialized to all zeros, because if all of the columns of W1 are the same to begin with, then they will also receive the same gradients. If they receive the same gradients, then they remain identical to one another after training, and so we wind up with a W1 matrix whose columns are all the same -- a result that's guaranteed to have suboptimal performance. Instead, neural network weights are initialized to small random values.

Next, let's plot an image, and overlay its reference rectangles on top:

The anchors are the set of anchor rectangles. They are the same for every image. There are $N_A=9$ anchor rectangles, for each of $N_1\times N_2$ pixel locations. Each rectangle has 4 numbers: x, y, width, and height. Thus, anchors is an array of size $(N_1,N_2,N_A,4)$.

There are $N_A$ anchors associated with each of $N_1\times N_2$ positions. The smallest anchor associated with each position is the $0^{\textrm{th}}$ rectangle; let's plot that one first, for all $N_1\times N_2$ positions.

The 9 anchors associated with any given position are 9 different rectangles with 3 different sizes (small, medium, large) $\times$ 3 different aspect ratios (horizontal, square, vertical). Let's plot all 9 of them, for the position in the middle of the image.

Finally: the classification targets and regression targets are encoded into the array called targets. Specifically, the binary classification target for the $a^{\textrm{th}}$ anchor at the $(x,y)^{\textrm{th}}$ position in the $i^{\textrm{th}}$ image is encoded as targets[i,x,y,a,4], and if the classification target is 1, then the regression target is encoded as targets[i,x,y,a,0:4].

The way in which they're encoded is given in the paper https://proceedings.neurips.cc/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf, and also in the function submitted.regression_rects, which is provided to you as a utility function. The following code inverts that encoding, to get the original rectangles back again as [x,y,w,h] vectors.
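Here is a minimal sketch of such a decoder, assuming the standard Faster RCNN parameterization from the paper above; the provided submitted.regression_rects may differ in its details:

```python
import numpy as np

def target2rect(regression_target, anchor):
    '''
    Decode a length-4 regression target back into an [x, y, w, h] rectangle,
    relative to its length-4 anchor rectangle.  This is a sketch, assuming the
    parameterization of the Faster RCNN paper; the provided utilities may differ.
    '''
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = regression_target
    x = xa + wa * tx
    y = ya + ha * ty
    # min(np.log(2), regression_target) keeps the decoded rectangle from ever
    # being more than twice as large as the anchor rectangle
    w = wa * np.exp(min(np.log(2), tw))
    h = ha * np.exp(min(np.log(2), th))
    return np.array([x, y, w, h])
```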

Notice the min(np.log(2),regression_target) here. That keeps the computed rectangle from ever being more than twice as large as the anchor rectangle. I recommend that you use such a limit in your code, because sometimes the neural net outputs get very large.

The Faster-RCNN coding scheme creates a target whenever a reference rectangle and an anchor rectangle have an IoU (intersection-over-union) greater than 0.7. For that reason, there are a lot more target rectangles than there were reference rectangles. When we decode them using target2rect, though, it turns out that they are multiple copies of the same rectangle:
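For reference, the IoU of two rectangles stored as [x, y, w, h] can be computed as in the sketch below, which assumes that (x, y) is the upper-left corner (the course utilities might use a different convention, such as the rectangle center):

```python
def iou(rect1, rect2):
    '''Intersection-over-union of two [x, y, w, h] rectangles (corner convention assumed).'''
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    # length of the overlap along each axis (zero if the rectangles don't overlap)
    ix = max(0.0, min(x1 + w1, x2 + w2) - max(x1, x2))
    iy = max(0.0, min(y1 + h1, y2 + h2) - max(y1, y2))
    intersection = ix * iy
    union = w1 * h1 + w2 * h2 - intersection
    return intersection / union if union > 0 else 0.0
```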

If we plot the target rectangles, though, we should see that the number of distinct target rectangles is just the same as the number of reference rectangles. All of the extra targets are just duplicates of the same original references:

Now that we've seen all of those things for one of the images, go up to the top of this section and choose image #1, or #2, or #49 (any number less than 50), in order to look at a different image.

1. Provided Utility Functions: conv2, conv_layer, sigmoid, and safe_log

In an attempt to make your life a little easier, you are provided with some utility functions.

The first one, conv2, just performs a 2d convolution. It is unusual in just one respect: if its inputs have the sizes (N1,N2) and (M1,M2), its output always has the size (N1-M1+1+2$\times$padding,N2-M2+1+2$\times$padding). This is the same as zero-padding the first input by padding rows and columns on every side prior to convolution, and then convolving in 'valid' mode.
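If you want to see roughly what conv2 does, here's a minimal sketch built on scipy (this is only for intuition; use the provided conv2, which may be implemented differently):

```python
import numpy as np
from scipy.signal import convolve2d

def conv2_sketch(X, W, padding=0):
    '''
    Zero-pad X by `padding` rows/columns on every side, then convolve with W
    in 'valid' mode.  If X is (N1, N2) and W is (M1, M2), the output has size
    (N1 - M1 + 1 + 2*padding, N2 - M2 + 1 + 2*padding).
    '''
    Xp = np.pad(X, padding, mode='constant')
    return convolve2d(Xp, W, mode='valid')
```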

Let's test it by convolving a zero-padded square and an unpadded square; the output should be a square pyramid.

Suppose the two inputs have the same size, and we want the output to be $3\times 3$. We can get this effect by setting padding=1. For example, if we convolve the zero-padded square with itself, we get a $3\times 3$ small pyramid.

The second utility function that you're provided is a full conv_layer, which goes from $N_C$ input channels to $N_D$ output channels. Let's create an input image with squares of three colors in the three corners, and then convolve with the triangle function in order to blur each color, without mixing the colors.
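A multi-channel conv_layer built on top of conv2 might look roughly like the sketch below (again, only for intuition; the provided conv_layer is what your code should actually use):

```python
import numpy as np

def conv_layer_sketch(X, W, padding=1):
    '''
    X: input tensor of shape (N1, N2, NC)
    W: weight tensor of shape (M1, M2, NC, ND)
    Returns a tensor of shape (N1-M1+1+2*padding, N2-M2+1+2*padding, ND),
    using the conv2_sketch defined above.
    '''
    N1, N2, NC = X.shape
    M1, M2, _, ND = W.shape
    out = np.zeros((N1 - M1 + 1 + 2 * padding, N2 - M2 + 1 + 2 * padding, ND))
    for d in range(ND):
        for c in range(NC):
            out[:, :, d] += conv2_sketch(X[:, :, c], W[:, :, c, d], padding)
    return out
```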

You are also provided with a sigmoid function. This is provided for you mostly because computing 1/(1+np.exp(-x)) directly can overflow and produce warnings or non-finite values when x is very negative. In order to simplify this, the provided function just sets sigmoid(x)==0 for x<-100.

You are also provided with a safe_log function. This is provided for you mostly because np.log(x) blows up (toward -inf, with warnings) as x approaches zero. In order to simplify this, the provided function just sets safe_log(x)==0 for x<np.exp(-100).
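Sketches consistent with the behavior described above (the provided versions may differ in their exact thresholds or implementation):

```python
import numpy as np

def sigmoid(x):
    '''Elementwise logistic sigmoid, with the output forced to 0 wherever x < -100.'''
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    mask = x >= -100
    out[mask] = 1.0 / (1.0 + np.exp(-x[mask]))
    return out

def safe_log(x):
    '''Elementwise natural log, with the output forced to 0 wherever x < np.exp(-100).'''
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    mask = x >= np.exp(-100)
    out[mask] = np.log(x[mask])
    return out
```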

2. forwardprop

forwardprop should compute forward-propagation through the Faster-RCNN network.

The input is one feature tensor per input image. The feature tensor is the last conv layer, just before the fifth max pooling layer, from a pre-trained image classifier called VGG16 (https://neurohive.io/en/popular-networks/vgg16/). VGG16 assumes that the input image is $224\times 224\times 3$. There are 4 downsampling layers before that point, so the feature tensor is $\frac{224}{2^4}\times\frac{224}{2^4}=14\times 14$ pixels, from which we've chosen the center $N_1=8$ rows and $N_2=8$ columns. It has $N_C=512$ channels per pixel. So let's say that the input is $x[i,n_1,n_2,c]$ in the $i^{\textrm{th}}$ image, $n_1^{\textrm{th}}$ row, $n_2^{\textrm{th}}$ column, and $c^{\textrm{th}}$ channel.

The classification part of the output is the last element ($k=4$); it uses a logistic sigmoid output nonlinearity:

$$\hat{y}[n_1,n_2,a,4] = \sigma\left(\xi^{(2)}[n_1,n_2,a,4]\right)$$

The regression part is all of the rest; it just uses a linear output layer:

$$\hat{y}[n_1,n_2,a,0:4] = \xi^{(2)}[n_1,n_2,a,0:4]$$
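Putting the pieces together, a forwardprop for a single image might look roughly like this, reusing the conv_layer and sigmoid sketches from Section 1 above (the names and signatures here are illustrative, not the ones your submitted.py is required to use):

```python
def forwardprop_sketch(x, W1, W2):
    '''
    x:  (N1, N2, NC) feature tensor for one image
    W1: (3, 3, NC, ND) first-layer weights
    W2: (1, 1, ND, NA, NY) second-layer weights
    Returns the hidden activations h and the network outputs yhat.
    '''
    xi1 = conv_layer_sketch(x, W1, padding=1)        # (N1, N2, ND)
    h = np.maximum(0.0, xi1)                         # ReLU
    # the 1x1 convolution, implemented as a per-position matrix multiply
    xi2 = np.einsum('ijd,dak->ijak', h, W2[0, 0])    # (N1, N2, NA, NY)
    yhat = np.copy(xi2)
    yhat[:, :, :, 4] = sigmoid(xi2[:, :, :, 4])      # classification output
    # the regression outputs yhat[:, :, :, 0:4] stay linear
    return h, yhat
```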

3. detect

The detect function takes the output of forwardprop, and uses it to detect candidate face rectangles in the image.

Your function detect should find the (n1,n2,a) tuples that the neural net thinks are most probable (highest classification probability). For each one, it should convert the regression output back into an image rectangle, and append it to the best_rects output. Then we can plot those best_rects on the original image.
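A rough sketch of that logic, reusing the target2rect sketch from Section 0 (treat the function name and arguments here as placeholders; your detect must follow the signature given in the MP code):

```python
def detect_sketch(yhat, anchors, num_rects=10):
    '''
    yhat:    (N1, N2, NA, NY) network outputs for one image
    anchors: (N1, N2, NA, 4) anchor rectangles
    Returns the [x, y, w, h] rectangles for the num_rects most probable
    (n1, n2, a) tuples.
    '''
    probs = yhat[:, :, :, 4]
    # flattened indices of the highest classification probabilities
    best_indices = np.argsort(probs, axis=None)[::-1][:num_rects]
    best_rects = []
    for flat_index in best_indices:
        n1, n2, a = np.unravel_index(flat_index, probs.shape)
        best_rects.append(target2rect(yhat[n1, n2, a, 0:4], anchors[n1, n2, a]))
    return best_rects
```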

You should not expect it to give very accurate results, yet, since the network weights have been initialized randomly, and not trained yet!

4. loss

Now we need to compute the loss. For Faster RCNN, the loss has two parts: $${\mathcal L} = {\mathcal L}_{MSE} + {\mathcal L}_{BCE}$$

The MSE loss is the average squared difference between the regression target and the regression output, averaged over only those rectangles where the ground truth says that the rectangle contains a face ($y[n_1,n_2,a,4]=1$):

$${\mathcal L}_{MSE} = \frac{1}{2} \frac{\sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1}\sum_{a=0}^{N_A-1} y[n_1,n_2,a,4]\times\Vert y[n_1,n_2,a,0:4]-\hat{y}[n_1,n_2,a,0:4]\Vert^2} {\sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1}\sum_{a=0}^{N_A-1}y[n_1,n_2,a,4]} $$

The other term in the loss, ${\mathcal L}_{BCE}$, is the usual binary cross entropy loss for the classification output of the network ($\hat{y}[n_1,n_2,a,4]$), summed over position ($n_1,n_2$) and anchor ($a$):

$${\mathcal L}_{BCE}=-\frac{1}{N_1N_2N_A}\sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1}\sum_{a=0}^{N_A-1}\left(y[n_1,n_2,a,4]\ln\hat{y}[n_1,n_2,a,4]+(1-y[n_1,n_2,a,4])\ln(1-\hat{y}[n_1,n_2,a,4])\right)$$
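A numpy sketch of these two terms for one image, using the safe_log sketch above to guard against log(0); it assumes that at least one anchor in the image contains a face, so the MSE denominator is nonzero:

```python
def loss_sketch(y, yhat):
    '''
    y, yhat: (N1, N2, NA, NY) target and output tensors for one image.
    Returns (mse_loss, bce_loss).
    '''
    face = y[:, :, :, 4]                              # 1 where the anchor contains a face
    diff = y[:, :, :, 0:4] - yhat[:, :, :, 0:4]
    # MSE averaged over only the anchors that contain a face
    mse = 0.5 * np.sum(face[:, :, :, None] * diff ** 2) / np.sum(face)
    # BCE averaged over all positions and anchors
    bce = -np.mean(face * safe_log(yhat[:, :, :, 4])
                   + (1 - face) * safe_log(1 - yhat[:, :, :, 4]))
    return mse, bce
```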

Well, that's a little dissatisfying -- are those numbers large, or small? The answer is, there's no way to know, really, until we try training the network for a while, to see if we can make those numbers smaller.

5. backprop

The loss function (loss) is really important for debugging (as we'll see later), but it's actually not necessary to train the network. To train the network, what we really need is the derivative of the loss, which we can compute without ever computing the loss itself. The derivative of the loss is

$$\nabla_\xi{\mathcal L}[n_1,n_2,a,k]=\frac{d{\mathcal L}}{d\xi[n_1,n_2,a,k]}$$

where $\xi[n_1,n_2,a,k]$ is the excitation (before the sigmoid nonlinearity) for the $a^{\textrm{th}}$ anchor at the $(n_1,n_2)^{\textrm{th}}$ position, for the $k^{\textrm{th}}$ network output.

BUG ALERT!

At the time of this writing (12/1/2021), I just learned that there is a bug in the solutions: the backprop function doesn't include the normalizing constants (the denominators) from the loss. If you are still working on this MP, please write the function backprop so that it calculates the derivative of the following loss, instead of the loss from your loss function:

$${\mathcal L}_{MSE} = \frac{1}{2}\sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1}\sum_{a=0}^{N_A-1} y[n_1,n_2,a,4]\times\Vert y[n_1,n_2,a,0:4]-\hat{y}[n_1,n_2,a,0:4]\Vert^2 $$
$${\mathcal L}_{BCE}=-\sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1}\sum_{a=0}^{N_A-1}\left(y[n_1,n_2,a,4]\ln\hat{y}[n_1,n_2,a,4]+(1-y[n_1,n_2,a,4])\ln(1-\hat{y}[n_1,n_2,a,4])\right)$$
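For this un-normalized loss, the output-layer gradient has a simple closed form: the sigmoid-plus-BCE combination gives $\hat{y}-y$ for the classification output, and the linear-plus-masked-MSE combination gives $y[n_1,n_2,a,4]\left(\hat{y}-y\right)$ for the regression outputs. A sketch:

```python
def backprop_sketch(y, yhat):
    '''
    Gradient of the un-normalized loss above with respect to the second-layer
    excitations xi2.  y and yhat have shape (N1, N2, NA, NY).
    '''
    grad_xi2 = np.zeros_like(yhat)
    face = y[:, :, :, 4]
    # sigmoid + BCE:  dL/dxi2[..., 4] = yhat - y
    grad_xi2[:, :, :, 4] = yhat[:, :, :, 4] - y[:, :, :, 4]
    # linear + masked MSE:  dL/dxi2[..., 0:4] = y[..., 4] * (yhat - y)
    grad_xi2[:, :, :, 0:4] = face[:, :, :, None] * (yhat[:, :, :, 0:4] - y[:, :, :, 0:4])
    return grad_xi2
```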

6. weight_gradient

The weight gradient is computed by multiplying the loss gradient with respect to a layer's excitation by the inputs of that same layer. Thus, for example,

$$\nabla_{W^{(1)}}{\mathcal L}[m_1,m_2,c,d]=\sum_{n_1}\sum_{n_2}\nabla_{\xi^{(1)}}{\mathcal L}[n_1-m_1,n_2-m_2,d]x[n_1,n_2,c],~~-1\le m_1\le 1,-1\le m_2\le 1$$
$$\nabla_{W^{(2)}}{\mathcal L}[0,0,d,a,k]=\sum_{n_1}\sum_{n_2}\nabla_{\xi^{(2)}}{\mathcal L}[n_1,n_2,a,k]h[n_1,n_2,d]$$
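A slow but readable translation of these two formulas into numpy is sketched below. It assumes you already have the loss gradients with respect to both excitations (the second comes from backprop; the first comes from back-propagating that result through W2 and the ReLU), that kernel offsets $m_1,m_2\in\{-1,0,1\}$ map to array indices 0..2, and that out-of-range terms are treated as zero:

```python
def weight_gradient_sketch(x, h, grad_xi1, grad_xi2):
    '''
    x:        (N1, N2, NC) input features
    h:        (N1, N2, ND) hidden activations
    grad_xi1: (N1, N2, ND) loss gradient w.r.t. the first-layer excitations
    grad_xi2: (N1, N2, NA, NY) loss gradient w.r.t. the second-layer excitations
    '''
    N1, N2, NC = x.shape
    ND = h.shape[2]
    NA, NY = grad_xi2.shape[2:]
    grad_W1 = np.zeros((3, 3, NC, ND))
    for m1 in range(-1, 2):
        for m2 in range(-1, 2):
            for n1 in range(N1):
                for n2 in range(N2):
                    if 0 <= n1 - m1 < N1 and 0 <= n2 - m2 < N2:
                        # grad_W1[m1, m2, c, d] += grad_xi1[n1-m1, n2-m2, d] * x[n1, n2, c]
                        grad_W1[m1 + 1, m2 + 1] += np.outer(x[n1, n2], grad_xi1[n1 - m1, n2 - m2])
    # grad_W2[0, 0, d, a, k] = sum_{n1, n2} grad_xi2[n1, n2, a, k] * h[n1, n2, d]
    grad_W2 = np.einsum('ijak,ijd->dak', grad_xi2, h).reshape(1, 1, ND, NA, NY)
    return grad_W1, grad_W2
```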

Here's the same thing for your submitted.py code:

7. weight_update

The weight update is just computed by subtracting the gradient, multiplied by a learning rate $\eta$:

$$W^{(1)}[m_1,m_2,c,d] = W^{(1)}[m_1,m_2,c,d]-\eta\nabla_{W^{(1)}}{\mathcal L}[m_1,m_2,c,d]$$
$$W^{(2)}[0,0,d,a,k]=W^{(2)}[0,0,d,a,k]-\eta\nabla_{W^{(2)}}{\mathcal L}[0,0,d,a,k]$$
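In other words, this is one step of plain gradient descent. In numpy it's just (a sketch, with an illustrative default learning rate):

```python
def weight_update_sketch(W1, W2, grad_W1, grad_W2, learning_rate=1e-3):
    '''One gradient-descent step on both weight tensors.'''
    new_W1 = W1 - learning_rate * grad_W1
    new_W2 = W2 - learning_rate * grad_W2
    return new_W1, new_W2
```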

8. Debugging: Check to make sure that the loss is decreasing

Back-propagation is really difficult to debug. If you made a mistake in any of those previous steps (and if you didn't have this MP's reference solutions to compare to), how would you know?

One useful method is to try several steps forward and backward along the gradient, and measure the loss at each step. If the gradient was computed correctly, you should see that loss increases in the direction of the gradient, and decreases in the direction of the negative gradient.

We can do this by trying several different values of the "learning rate" (which is a sort of normalized step size), and then plotting the loss as a function of the step size.
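A sketch of that check, using the hypothetical sketches from the earlier sections and assuming x, y, W1, W2, grad_W1, and grad_W2 are already in memory from the cells above. Since the update subtracts the gradient, the loss should decrease for small positive learning rates and increase for negative ones:

```python
import matplotlib.pyplot as plt

learning_rates = [-0.1, -0.03, -0.01, 0.0, 0.01, 0.03, 0.1]
losses = []
for eta in learning_rates:
    # take one step of size eta along the negative gradient, then re-measure the loss
    W1_step, W2_step = weight_update_sketch(W1, W2, grad_W1, grad_W2, learning_rate=eta)
    _, yhat_step = forwardprop_sketch(x, W1_step, W2_step)
    mse, bce = loss_sketch(y, yhat_step)
    losses.append(mse + bce)

plt.plot(learning_rates, losses, 'o-')
plt.xlabel('signed step size along the negative gradient (learning rate)')
plt.ylabel('loss')
plt.show()
```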

9. Conclusion

Congratulations --- you've learned how to train a Faster-RCNN object detector!

BTW, you probably noticed that this whole Jupyter notebook has only carried out training on one image. The so-called training set has only 50 images, which is still not enough to train such a large neural net. If you really wanted to train a successful face-detection neural net, you'd probably want to use the whole WIDER FACE dataset, and you'd want to run many epochs of training, probably on a GPU, using an adaptive or second-order optimizer like Adam or L-BFGS, probably using a package like PyTorch.

The main steps, however, would still be the same as those we've done here:

  1. forwardprop
  2. backprop
  3. weight_gradient
  4. weight_update
  5. measure the loss, to make sure it's still decreasing. If it isn't, stop training.