MP3: Eigenfaces

In this lab, you'll find the principal components transformation of a face dataset, then test a nearest-neighbor classifier.

To make sure everything works, you might want to go to the command line and run

pip install -r requirements.txt

This will install the modules that are used on the autograder, including numpy, h5py, and the gradescope utilities.


Part 0: Loading the image files

First, let's load the images. The images are examples drawn from the "Labeled Faces in the Wild" dataset: http://vis-www.cs.umass.edu/lfw/. We'll start out by defining some utility functions.

Now let's load the images, and then plot one from each person, with the person names as titles.
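
Here is a minimal sketch of what such a loading step might look like, assuming a hypothetical directory layout of data/<person_name>/<image>.jpg. The actual MP distributes its own utility functions, so treat the names here (load_images, data) as illustrative only.

```python
import os
from matplotlib import pyplot as plt
from matplotlib import image as mpimg

def load_images(datadir="data"):
    """Load every image under datadir into a dict: person name -> list of arrays."""
    images = {}
    for person in sorted(os.listdir(datadir)):
        persondir = os.path.join(datadir, person)
        images[person] = [mpimg.imread(os.path.join(persondir, f))
                          for f in sorted(os.listdir(persondir))]
    return images

images = load_images()

# Plot the first image of each person, with the person's name as the title.
fig, axes = plt.subplots(1, len(images), figsize=(3 * len(images), 3))
for ax, (person, imgs) in zip(axes, images.items()):
    ax.imshow(imgs[0])
    ax.set_title(person)
    ax.axis("off")
plt.show()
```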

Now let's view them as vectors. We know that they're 187,500-dimensional vectors, but let's just plot scatter plots of the first few dimensions (the first few pixels, in the red color plane).
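
A sketch of such a scatter plot, assuming the raveled images are stacked into a numpy array Xtrain with one 187,500-dimensional row per training token (the variable name is an assumption, not the MP's required API):

```python
from matplotlib import pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for k, ax in enumerate(axes):
    # Scatter plot of pixel k versus pixel k+1 (the first few red-plane pixels)
    ax.scatter(Xtrain[:, k], Xtrain[:, k + 1], s=4)
    ax.set_xlabel('pixel %d' % k)
    ax.set_ylabel('pixel %d' % (k + 1))
plt.show()
```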

It looks like the pixels are strongly correlated. What the scatter plots don't show is the fraction of all pixels that are zero (black): that seems to be the majority, but it's hard to tell from the scatter plots. To check, let's plot histograms.
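
For example, histograms of the same first few pixel dimensions would make the spike at zero visible (again assuming the illustrative Xtrain array from the sketch above):

```python
from matplotlib import pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for k, ax in enumerate(axes):
    ax.hist(Xtrain[:, k], bins=50)       # most of the mass should pile up at 0
    ax.set_xlabel('pixel %d value' % k)
    ax.set_ylabel('count')
plt.show()
```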


Part 1: Compute a Mean Vector

Principal Components Analysis can be computed either with or without mean subtraction. Subtracting the mean is usually a little more useful, so let's do it that way.
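
A minimal sketch, assuming Xtrain is a numpy array with one raveled image per row (the variable names are illustrative):

```python
import numpy as np

mu = np.mean(Xtrain, axis=0)    # 187,500-dimensional mean of the training images
```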


Part 2: Center all three datasets (train, dev, and test)

We will "center" the datasets by subtracting the training dataset mean.

Notice that we subtract the same $\vec\mu$ from all three datasets. That's because only the training dataset is used to estimate model parameters, like $\vec\mu$. The dev dataset is used to choose between different trained models, and the test dataset is used to find out how well our chosen model will perform on data from outside the training set.
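
A sketch of the centering step, reusing the illustrative names from above, with mu estimated from the training set only:

```python
ctrain = Xtrain - mu    # centered training data
cdev = Xdev - mu        # centered dev data (same training mean!)
ctest = Xtest - mu      # centered test data (same training mean!)
```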

Notice that these plots are exactly the same as before, but shifted downward slightly: the minimum pixel values are now around -50, while the maximum is around 200. The shift is small because so many of the pixels were originally close to 0. The data now has zero mean.


Part 3: Compute the Principal Component Axes

The Principal Component Axes, $V$, are the eigenvectors of the data covariance matrix. Since the vector dimension (187500) is much, much larger than the number of training tokens (165), you should instead compute the eigenvectors, $U$, of the gram matrix, and then form $V=X^TU$, where $X$ is the centered training matrix. Then be sure to normalize the columns of $V$ so that each column has unit Euclidean norm and a non-negative first element.

The vector $D$ lists the eigenvalues of the gram matrix, which are also the eigenvalues of the sum-of-squares matrix (the scaled covariance). If you add them, you get the total variance of all samples in the dataset.
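
Here is one way this computation might look, as a sketch rather than the required implementation: ctrain is assumed to be the centered (165 x 187500) training matrix, and the function name is illustrative.

```python
import numpy as np

def find_principal_components(ctrain):
    gram = ctrain @ ctrain.T                          # (165 x 165) gram matrix
    eigvals, U = np.linalg.eigh(gram)                 # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]                 # re-sort, largest eigenvalue first
    D, U = eigvals[order], U[:, order]
    V = ctrain.T @ U                                  # (187500 x 165) principal component axes
    V = V / np.linalg.norm(V, axis=0, keepdims=True)  # unit Euclidean norm per column
    V = V * np.where(V[0, :] < 0, -1.0, 1.0)          # make each first element non-negative
    return V, D
```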


Part 4: Transform all three centered datasets

The purpose of using PCA is that we can usually get better accuracy by

  1. transforming the dataset into its principal components, so that each of the components is uncorrelated with the others,
  2. keeping only the most important principal components.

Since we computed the first 165 principal components, the transformed datasets will only have 165 dimensions, instead of 187500.
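
A sketch of the transform, continuing with the illustrative names used above:

```python
ttrain = ctrain @ V    # (165 x 165): one 165-dimensional feature vector per training token
tdev = cdev @ V        # dev and test tokens are projected onto the same axes
ttest = ctest @ V
```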


Part 5: Find the nearest neighbors

A nearest-neighbor classifier first calculates the distance between each dev vector, $\vec{x}_{\mbox{dev},i}$, and each training vector, $\vec{x}_{\mbox{train},j}$. Then it figures out which of the training vectors is closest to each dev vector:

$$j^* = \arg\min_j\Vert\vec{x}_{\mbox{dev},i}-\vec{x}_{\mbox{train},j}\Vert$$

Once we know which training vector is closest, we just copy its label to the dev vector:

$$f(\vec{x}_{\mbox{dev},i})=y_{\mbox{train},j^*}$$

The confusion matrix is a way of showing the following information:

$$C[i,j]=\mbox{Number of tokens from class}~i~\mbox{that were classified as class}~j$$

The accuracy is the fraction of tokens correctly classified, i.e.,

$$A = \frac{\sum_{i}C[i,i]}{\sum_i\sum_j C[i,j]}$$
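
Here is a sketch of how these steps might be implemented; ttrain, tdev, ytrain, and ydev are assumed to be numpy arrays of transformed features and integer labels, and the function names are illustrative rather than the MP's required API.

```python
import numpy as np

def nearest_neighbor_classify(ttrain, ytrain, tdev):
    # Euclidean distance from every dev token (rows) to every training token (columns)
    dists = np.linalg.norm(tdev[:, None, :] - ttrain[None, :, :], axis=2)
    return ytrain[np.argmin(dists, axis=1)]   # copy the label of the nearest training token

def confusion_and_accuracy(yref, yhyp, nclasses):
    C = np.zeros((nclasses, nclasses), dtype=int)
    for i, j in zip(yref, yhyp):
        C[i, j] += 1                          # row = true class, column = hypothesized class
    return C, np.trace(C) / np.sum(C)

hyps = nearest_neighbor_classify(ttrain, ytrain, tdev)
C, accuracy = confusion_and_accuracy(ydev, hyps, len(np.unique(ytrain)))
```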

Part 6: Find the best feature size using the dev set

Here's where the difference between the dev and test sets comes in.

So far, we've been keeping all 165 of the principal components. But that doesn't really make sense: if we only have 165 training tokens, then a classifier with 165 feature dimensions is probably overtrained (i.e., it probably gets much better accuracy on the training set than on the test set).

We can use the dev set to decide how many of the principal components we should keep. This is done by exhaustively testing every possible setting: for every possible vector size, we test the resulting nearest-neighbor classifier on the dev data. Whichever vector size gives the best result on the dev data is the one we will then test on the real test data.

The reason for keeping a separate test set is that, by choosing the vector size on the dev data, we might be overtraining again: maybe the classifier that gets the best performance on the dev data doesn't really work all that well in practice. If that's true, then we want to know about it. We can find out by tuning the vector size on the dev data, then measuring performance on the held-out test data.

Rather than testing every possible vector size, we're only going to test vector sizes, $K$, that are reasonable according to the percent variance explained criterion. Remember that the percentage of variance explained by the first $K$ principal components is $$PV = 100\times \frac{\sum_{k=1}^K\lambda_k}{\sum_{k=1}^n\lambda_k}$$ where $n$ is the total number of principal components (in our case, 165), and $\lambda_k$ are the eigenvalues. From lots of published experiments, we know that good values of $K$ usually have roughly $$92.5\% < PV < 97.5\%$$ so our "exhaustive test" will actually only test values of $K$ in that range.
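
A sketch of this search, reusing the illustrative functions and arrays from the earlier sketches (D is assumed to hold the eigenvalues in descending order):

```python
import numpy as np

PV = 100 * np.cumsum(D) / np.sum(D)   # PV[K-1] = percent variance explained by the first K components
candidates = [K for K in range(1, len(D) + 1) if 92.5 < PV[K - 1] < 97.5]

nclasses = len(np.unique(ytrain))
best_K, best_accuracy = None, -1.0
for K in candidates:
    hyps = nearest_neighbor_classify(ttrain[:, :K], ytrain, tdev[:, :K])
    _, accuracy = confusion_and_accuracy(ydev, hyps, nclasses)
    if accuracy > best_accuracy:
        best_K, best_accuracy = K, accuracy
```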


Part 7: Show the Test Confusion Matrix

Now that we know we should use 91 principal components, we can use that setting to find the accuracy on the real test data.
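
For example, reusing the sketch functions above with the chosen size K = 91:

```python
hyps = nearest_neighbor_classify(ttrain[:, :91], ytrain, ttest[:, :91])
Ctest, test_accuracy = confusion_and_accuracy(ytest, hyps, len(np.unique(ytrain)))
print('test accuracy:', test_accuracy)
```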


Part 8: How to Debug!!

If you've reached this point in the notebook, then your code is probably working well. But before you run the autograder on the server, you should first run it on your own machine.

You can do that by going to a terminal, and running the following command line:

python grade.py

If you get any error messages, we recommend that you use the provided solutions.hdf5 in order to debug. That can be done as follows:
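
For example, one way to compare your arrays against the reference file with h5py looks roughly like this; the key name 'mu' is an assumption, so list the file's keys first to see what it actually contains.

```python
import h5py
import numpy as np

with h5py.File('solutions.hdf5', 'r') as f:
    print(list(f.keys()))                       # see which reference arrays are provided
    ref_mu = f['mu'][:]                         # load one reference array (key name assumed)
    print(np.max(np.abs(ref_mu - mu)))          # largest absolute difference from your result
```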


Extra Credit

You can earn up to 10% extra credit on this MP by finishing the file called extra.py, and submitting it to the autograder.

When you unpack the file mp2_extra.zip, it will give you the following files:

The extra credit assignment this time is to get the best classification accuracy you can, on both visible and hidden test cases.