Principal Component Analysis

Learning Objectives

What is PCA?

PCA, or Principal Component Analysis, is an algorithm for reducing the dimension of a large data set while keeping most of its important information. It finds the directions of maximum variance and projects the original data set onto a lower-dimensional subspace (up to a change of basis) that still contains most of the important information.
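As a small illustration of "direction of maximum variance", the sketch below builds made-up 2-D points scattered around the line y = x and recovers that line as the dominant direction. The toy data and the names X, Xc, first_direction and projected are assumptions for this example only; the SVD call is NumPy's numpy.linalg.svd.

```python
import numpy as np

# Hypothetical toy data: 21 points lying close to the line y = x.
t = np.linspace(-1.0, 1.0, 21)
noise = 0.05 * np.cos(7.0 * t)               # small deterministic offset from the line
X = np.column_stack([t + noise, t - noise])  # shape (21, 2), features in columns

Xc = X - X.mean(axis=0)                      # center each column
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

first_direction = Vt[0]                      # unit vector along the direction of largest variance
projected = Xc @ first_direction             # 1-D coordinates of the points along that direction

print(first_direction)                       # close to [0.707, 0.707], up to an overall sign
print(s**2 / np.sum(s**2))                   # share of the total variance carried by each direction
```

Almost all of the variance sits in the first direction, so replacing each 2-D point by its single coordinate in projected loses very little.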

PCA Algorithm

Suppose we are given a large data set $\mathbf{A}$ of dimension $m \times n$, and we want to reduce it to a smaller data set $\mathbf{A}^*$ of dimension $m \times k$ without loss of important information. We can achieve this by carrying out the PCA algorithm with the following steps:

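  1. Shift the data set $\mathbf{A}$ so that each column has zero mean: $\mathbf{A} \leftarrow \mathbf{A} - \mathbf{1}\boldsymbol{\mu}^T$, where $\boldsymbol{\mu}$ is the vector of column means.
  2. Compute the SVD of the shifted data set: $\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$.
  3. Note that the variances of the data set are determined by the singular values $\sigma_1 \ge \sigma_2 \ge \cdots$ of $\mathbf{A}$: the variance along the $i$-th principal direction is proportional to $\sigma_i^2$.
  4. Note that the columns of $\mathbf{V}$ represent the principal directions of the data set.
  5. Our new data set is $\mathbf{A}^* := \mathbf{A}\mathbf{V} = \mathbf{U}\mathbf{\Sigma}$.
  6. Since we want to reduce the dimension of the data set, we use only the $k$ most important principal directions, i.e. the first $k$ columns of $\mathbf{V}$. Thus in the equation $\mathbf{A}^* = \mathbf{A}\mathbf{V}$ above, $\mathbf{V}$ has dimension $n \times k$ and $\mathbf{A}^*$ has the desired dimension $m \times k$.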
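The steps above can be sketched as a single function. This is a minimal sketch rather than a reference implementation: the helper name pca_reduce, its return values, and the random test matrix are all hypothetical, and features are assumed to be stored in columns.

```python
import numpy as np

def pca_reduce(A, k):
    """Hypothetical helper: reduce an (m, n) data matrix to (m, k) with PCA."""
    A_centered = A - A.mean(axis=0)                              # step 1: zero-mean columns
    U, S, Vt = np.linalg.svd(A_centered, full_matrices=False)   # step 2: SVD of the shifted data
    V_k = Vt[:k].T                                               # steps 4 and 6: first k columns of V, shape (n, k)
    A_star = A_centered @ V_k                                    # step 5: new data set, shape (m, k)
    return A_star, V_k, S

# Made-up data: 100 samples with 5 features each.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))

A_star, V_k, S = pca_reduce(A, k=2)
print(A_star.shape)   # (100, 2)
```

Returning the singular values S alongside the reduced data makes it easy to inspect how much variance the discarded directions carried before settling on a value of k.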

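Note that the variance of the new data set corresponds to the squared singular values: $(\mathbf{A}^*)^T \mathbf{A}^* = \mathbf{V}^T \mathbf{A}^T \mathbf{A} \mathbf{V} = \mathbf{\Sigma}^T \mathbf{\Sigma}$, as indicated in step 3.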
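The relation between the variance of the new data set and the singular values can be derived in a few lines. The notation here is an assumption chosen to match the Python code in the next section: $\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ is the SVD of the shifted data and $\mathbf{A}^* = \mathbf{A}\mathbf{V}$ is the new data set. The derivation uses only the orthogonality relations $\mathbf{U}^T\mathbf{U} = \mathbf{I}$ and $\mathbf{V}^T\mathbf{V} = \mathbf{I}$:

```latex
\begin{aligned}
(\mathbf{A}^*)^T \mathbf{A}^*
  &= (\mathbf{A}\mathbf{V})^T (\mathbf{A}\mathbf{V})
   = \mathbf{V}^T \mathbf{A}^T \mathbf{A} \mathbf{V} \\
  &= \mathbf{V}^T (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T)^T (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T) \mathbf{V} \\
  &= (\mathbf{V}^T \mathbf{V})\, \mathbf{\Sigma}^T (\mathbf{U}^T \mathbf{U})\, \mathbf{\Sigma}\, (\mathbf{V}^T \mathbf{V}) \\
  &= \mathbf{\Sigma}^T \mathbf{\Sigma}
   = \operatorname{diag}(\sigma_1^2, \sigma_2^2, \dots)
\end{aligned}
```

When only the first $k$ columns of $\mathbf{V}$ are kept, the same steps give the leading $k \times k$ block of $\mathbf{\Sigma}^T\mathbf{\Sigma}$, i.e. $\operatorname{diag}(\sigma_1^2, \dots, \sigma_k^2)$.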

Python code for PCA

Here we assume features are stored in columns.

import numpy as np
import numpy.linalg as la

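# A is an (m, n) NumPy array with one feature per column (assumed to be defined already).
A = A - np.mean(A, axis=0)                    # subtract each column's mean (broadcasts over rows)
U, S, Vt = la.svd(A, full_matrices=False)     # reduced SVD: A = U @ diag(S) @ Vt
A_new = A @ Vt.T                              # data expressed in the principal directions
var = A_new.T @ A_new                         # diagonal matrix whose entries are S**2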
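To see the code above in action, one option is to run it on a small synthetic matrix and check the two identities from the algorithm section numerically. The data below is made up for illustration (three features with deliberately different spreads); it is not part of the original notes.

```python
import numpy as np
import numpy.linalg as la

# Made-up data: 50 samples, 3 features with different scales, features in columns.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3)) * np.array([3.0, 1.0, 0.2])

A = A - np.mean(A, axis=0)                    # zero-mean columns
U, S, Vt = la.svd(A, full_matrices=False)     # A = U @ diag(S) @ Vt
A_new = A @ Vt.T                              # new data set A V
var = A_new.T @ A_new                         # should equal Sigma^T Sigma

print(np.allclose(var, np.diag(S**2)))        # True: off-diagonal entries vanish up to rounding
print(np.allclose(A_new, U * S))              # True: A V equals U Sigma (U * S scales the columns of U)
```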