Principal Component Analysis
Learning Objectives
- Understand why Principal Component Analysis is an important tool in analyzing data sets
- Know the pros and cons of PCA
- Be able to implement the PCA algorithm
What is PCA?
PCA, or Principal Component Analysis, is an algorithm for reducing a large data set without losing important information. It detects the directions of maximum variance and projects the original data set onto a lower-dimensional subspace (up to a change of basis) that still contains most of the important information.
- Pros: Only the “least important” directions are discarded; the most informative ones are kept. Moreover, the new variables are mutually uncorrelated, which is valuable for linear models (see the sketch after this list).
- Cons: The new variables are linear combinations of the original ones, so they have different meanings than the original variables. (Loss of interpretability)
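As a quick check of the uncorrelatedness claim, here is a minimal sketch using the SVD-based procedure described below. The synthetic data set is a hypothetical example, not from the original notes:

import numpy as np
import numpy.linalg as la

rng = np.random.default_rng(0)
# Hypothetical 2-D data set whose two features are strongly correlated
x = rng.normal(size=500)
A = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=500)])

A = A - np.mean(A, axis=0)            # center each feature
U, S, Vt = la.svd(A, full_matrices=False)
A_new = A @ Vt.T                      # rotate onto the principal directions

print(np.corrcoef(A.T))               # off-diagonal entries near 1 (correlated)
print(np.corrcoef(A_new.T))           # off-diagonal entries near 0 (uncorrelated)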
PCA Algorithm
Suppose we are given a large data set A of dimension m×n, and we want to reduce it to a smaller data set A∗ of dimension m×k (with k < n) without losing important information. We can achieve this by carrying out the PCA algorithm with the following steps:
- Shift the data set A so that each feature (column) has zero mean: A = A − mean(A), where the mean is taken column-wise.
- Compute the SVD of the centered data set: A = UΣVᵀ.
- Note that the variances of the data set are determined by the singular values of A, i.e. σ₁ ≥ σ₂ ≥ … ≥ σₙ: the variance along the i-th principal direction is proportional to σᵢ².
- Note that the columns of V represent the principal directions of the data set.
- Our new data set is A∗ := AV = UΣ.
- Since we want to reduce the dimension of the data set, we keep only the k most important principal directions, i.e. the first k columns of V, denoted Vₖ. The reduced data set A∗ = AVₖ then has the desired dimension m×k.
Note that the variance of the data set corresponds to the singular values: (A∗)ᵀA∗ = VᵀAᵀAV = ΣᵀΣ, since AᵀA = VΣᵀΣVᵀ and VᵀV = I. The matrix ΣᵀΣ is diagonal with entries σ₁², …, σₙ², as indicated in step 3.
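This identity is easy to verify numerically. A minimal sketch with a hypothetical random data set:

import numpy as np
import numpy.linalg as la

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 3))
A = A - np.mean(A, axis=0)            # center each feature

U, S, Vt = la.svd(A, full_matrices=False)
A_star = A @ Vt.T

# (A*)^T A* should be the diagonal matrix of squared singular values
print(np.allclose(A_star.T @ A_star, np.diag(S**2)))   # True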
Python code for PCA
Here we assume features are stored in columns, so each row of A is one sample.
import numpy as np
import numpy.linalg as la

# A is an m x n array (m samples in rows, n features in columns),
# and k is the target dimension; both are assumed to be given.
A = A - np.mean(A, axis=0)            # center each feature; broadcasts over rows
U, S, Vt = la.svd(A, full_matrices=False)
A_new = A @ Vt.T[:, :k]               # project onto the first k principal directions
var = A_new.T @ A_new                 # diagonal entries are sigma_i^2 for i = 1..k
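The notes do not specify how to choose k; one common heuristic (an assumption here, not part of the original algorithm) is to keep enough principal directions to explain a fixed fraction of the total variance:

# Hypothetical heuristic: choose the smallest k that explains 95% of the
# total variance (the 0.95 threshold is an assumption, not from the notes)
explained = S**2 / np.sum(S**2)       # fraction of variance per direction
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1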