Principal Component Analysis

Learning Objectives

What is PCA?

PCA, or Principal Component Analysis, is an algorithm for reducing the dimension of a large data set without losing important information. It finds the directions of maximum variance and projects the original data set onto a lower-dimensional subspace (up to a change of basis) that still contains most of the important information.
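
As a small illustration of what "direction of maximum variance" means, the sketch below builds a synthetic 2-D point cloud stretched along the diagonal and recovers that direction from the SVD of the centered data (the procedure is described in the next section). The data, the random seed, and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 500 points spread widely along the
# direction (1, 1)/sqrt(2) and only slightly along the orthogonal direction.
t = rng.normal(scale=3.0, size=500)       # large spread along the diagonal
noise = rng.normal(scale=0.1, size=500)   # small spread across it
diag = np.array([1.0, 1.0]) / np.sqrt(2)
anti = np.array([1.0, -1.0]) / np.sqrt(2)
X = np.outer(t, diag) + np.outer(noise, anti)

# Center the columns, then take the SVD; the first row of Vt is the
# direction along which the centered data varies the most.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
first_direction = Vt[0]

# The sign of a singular vector is arbitrary, so compare up to sign.
alignment = abs(first_direction @ diag)
print(first_direction, alignment)
```

With this setup, `alignment` should be very close to 1, meaning the first principal direction matches the diagonal along which the points were stretched.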

PCA Algorithm

Suppose we are given a large data set \(\bf A\) of dimension \(m \times n\), and we want to reduce it to a smaller one \({\bf A}^*\) of dimension \(m \times k\), with \(k \le n\), without loss of important information. We can achieve this by carrying out the PCA algorithm with the following steps:

  1. Shift the data set \(\bf A\) so that each column has zero mean: \({\bf A} = {\bf A} - {\bf A}.mean()\), where the mean is taken column by column.
  2. Compute the SVD of the centered data set: \({\bf A}= {\bf U \Sigma V}^T\).
  3. Note that the variances of the data set along the principal directions are determined by the singular values of \(\bf A\), i.e. \(\sigma_1, \dots , \sigma_n\): the variance along the \(i\)-th direction is proportional to \(\sigma_i^2\).
  4. Note that the columns of \(\bf V\) represent the principal directions of the data set.
  5. Our new data set is \({\bf A}^* := {\bf AV} ={\bf U\Sigma}\).
  6. Since we want to reduce the dimension of the data set, we only use the most important \(k\) principal directions, i.e. the first \(k\) columns of \(\bf V\). Thus, with \(\bf V\) restricted to its first \(k\) columns in the above equation \({\bf A}^* = {\bf AV}\), \({\bf A}^*\) has the desired dimension \(m \times k\).
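
The steps above can be sketched in NumPy as follows. This is a minimal sketch, assuming samples are stored in rows and features in columns; the function name `pca_reduce`, its return values, and the random test matrix are hypothetical choices made for illustration, not part of the original notes.

```python
import numpy as np

def pca_reduce(A, k):
    """Project the rows of A (m x n, features in columns) onto the
    first k principal directions. Hypothetical helper for illustration."""
    A = np.asarray(A, dtype=float)
    # Step 1: shift every column to zero mean.
    A_centered = A - A.mean(axis=0)
    # Step 2: thin SVD of the centered data, A_centered = U @ diag(S) @ Vt.
    U, S, Vt = np.linalg.svd(A_centered, full_matrices=False)
    # Steps 4 and 6: the principal directions are the columns of V,
    # i.e. the rows of Vt; keep only the first k of them.
    V_k = Vt[:k].T
    # Step 5: coordinates of the data along those k directions.
    A_star = A_centered @ V_k
    return A_star, V_k, S

# Made-up example data: 100 samples with 5 features, reduced to k = 2.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
A_star, V_k, S = pca_reduce(A, k=2)
print(A_star.shape)  # (100, 2)
```

Returning `V_k` and `S` alongside `A_star` is a design choice here: the directions allow new samples to be projected later, and the singular values indicate how much variance each direction carries.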

Note that the variance of the data set corresponds to the squared singular values: \(({\bf A}^*)^T {\bf A}^*= {\bf V}^T{\bf A}^T{\bf AV}={\bf \Sigma}^T{\bf \Sigma}\), a diagonal matrix with entries \(\sigma_i^2\), as indicated in step 3.
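
The identity above can be checked numerically. The sketch below uses a made-up random matrix (the seed, shape, and column scales are arbitrary) and also compares the sample variance of each projected column with \(\sigma_i^2/(m-1)\), which is the proportionality constant that applies when the usual unbiased variance estimate is used.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: 200 samples, 3 features with deliberately different scales.
A = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])
A = A - A.mean(axis=0)                       # center the columns

U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_star = A @ Vt.T                            # A* = A V

# (A*)^T A* should equal Sigma^T Sigma = diag(sigma_i^2) up to round-off.
gram = A_star.T @ A_star
print(np.allclose(gram, np.diag(S**2)))      # True

# Sample variance of each projected column versus sigma_i^2 / (m - 1).
m = A.shape[0]
sample_var = A_star.var(axis=0, ddof=1)
print(np.allclose(sample_var, S**2 / (m - 1)))  # True
```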

Python code for PCA

Here we assume features are stored in columns.

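import numpy as np
import numpy.linalg as la

# A is assumed to be an (m, n) array: one row per sample, one column per feature.
# Center each column; the (n,) mean vector broadcasts across the rows.
A = A - np.mean(A, axis=0)
# Thin SVD of the centered data; the rows of Vt are the principal directions.
U, S, Vt = la.svd(A, full_matrices=False)
# Coordinates of the data in the principal-direction basis (equals U * S).
A_new = A @ Vt.T
# Diagonal matrix (up to round-off) holding the squared singular values.
var = A_new.T @ A_new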
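
One common way to judge how many principal directions \(k\) to keep is to look at the fraction of the total variance carried by the first \(k\) squared singular values. The sketch below is only one such heuristic; the helper name `retained_variance` and the example singular values are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np

def retained_variance(S, k):
    """Fraction of the total variance captured by the first k principal
    directions, given the singular values S in descending order.
    Hypothetical helper for illustration."""
    S = np.asarray(S, dtype=float)
    return np.sum(S[:k] ** 2) / np.sum(S ** 2)

# Made-up singular values: squares are 16, 4, 1, 1, which sum to 22.
S_example = np.array([4.0, 2.0, 1.0, 1.0])
print(retained_variance(S_example, 1))  # 16/22, about 0.727
print(retained_variance(S_example, 2))  # 20/22, about 0.909
```

In practice `S` would be the array returned by `la.svd` on the centered data, and \(k\) could be chosen as the smallest value for which this fraction exceeds a threshold picked for the application.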