Principal Component Analysis
Learning Objectives
- Understand why Principal Component Analysis is an important tool in analyzing data sets
- Know the pros and cons of PCA
- Be able to implement the PCA algorithm
What is PCA?
PCA, or Principal Component Analysis, is an algorithm to reduce a large data set without loss of important information. Basically, it detects the directions of maximum variance and projects the original data set onto a lower-dimensional subspace (up to a change of basis) that still contains most of the important information.
- Pros: Only the “least important” variables are omitted, while the more valuable variables are kept. Moreover, the new variables created are mutually uncorrelated, which is essential for linear models.
- Cons: The new variables created will have different meanings than the original dataset. (Loss of interpretability)
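As a sketch of the decorrelation claim, consider a hypothetical data set with two strongly correlated features: projecting onto the principal directions yields new variables whose covariance is zero (the data set and seed below are illustrative assumptions, not from the notes).

```python
import numpy as np
import numpy.linalg as la

rng = np.random.default_rng(0)
# Hypothetical data: two strongly correlated features (columns)
x = rng.normal(size=200)
A = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200)])

A = A - np.mean(A, axis=0)       # center each feature
U, S, Vt = la.svd(A, full_matrices=False)
A_new = A @ Vt.T                 # PCA-transformed variables

cov = A_new.T @ A_new / (len(A_new) - 1)
# Off-diagonal entry = covariance between the new variables, which vanishes
print(np.allclose(cov[0, 1], 0.0, atol=1e-8))  # True
```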
PCA Algorithm
Suppose we are given a large data set $A$ of dimension $m \times n$ (with $m$ samples stored in rows and $n$ features in columns), and we want to reduce it to a smaller data set of dimension $m \times k$, with $k < n$, without loss of important information. We can achieve this by carrying out the PCA algorithm with the following steps:
- Shift the data set so that each feature has zero mean: $A \leftarrow A - \bar{A}$, where $\bar{A}$ holds the column means.
- Compute the SVD of the shifted data set: $A = U \Sigma V^T$.
- Note that the variance of the data set is determined by the singular values of $A$: the variance along the $i$-th principal direction is proportional to $\sigma_i^2$.
- Note that the columns of $V$ represent the principal directions of the data set.
- Our new data set is $A_{new} = A V$.
- Since we want to reduce the dimension of the data set, we only use the most important principal directions, i.e. the first $k$ columns of $V$, denoted $V_k$. Thus in the above equation, $A_{new} = A V_k$ has the desired dimension $m \times k$.
Note that the variance of the new data set corresponds to the singular values: $A_{new}^T A_{new} = \Sigma^T \Sigma$, whose diagonal entries are $\sigma_i^2$, as indicated in step 3.
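Why the singular values capture the variance can be seen in one line, using the SVD $A = U \Sigma V^T$ of the zero-mean data set together with $U^T U = I$ and $V^T V = I$:

```latex
A_{new}^T A_{new} = (AV)^T (AV) = V^T \left( V \Sigma^T U^T \right) \left( U \Sigma V^T \right) V
                  = \Sigma^T \Sigma = \mathrm{diag}\!\left(\sigma_1^2, \ldots, \sigma_n^2\right)
```

So the transformed variables are uncorrelated, and the $i$-th one carries variance proportional to $\sigma_i^2$.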
Python code for PCA
Here we assume features are stored in columns.
import numpy as np
import numpy.linalg as la

# Shift each column (feature) of A to zero mean;
# np.mean(A, axis=0) broadcasts across the rows
A = A - np.mean(A, axis=0)
# Thin SVD of the centered data: A = U @ np.diag(S) @ Vt
U, S, Vt = la.svd(A, full_matrices=False)
# Project onto the principal directions (the columns of V = Vt.T)
A_new = A @ Vt.T
# The diagonal entries of var are the squared singular values
var = A_new.T @ A_new
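A minimal end-to-end example, using a hypothetical 5-by-3 data set and target dimension k = 2 (both chosen here purely for illustration), that keeps only the first k principal directions:

```python
import numpy as np
import numpy.linalg as la

# Hypothetical small data set: 5 samples (rows), 3 features (columns)
A = np.array([[ 2.0, 0.0, 1.0],
              [ 4.0, 1.0, 0.5],
              [ 6.0, 2.0, 1.5],
              [ 8.0, 3.0, 0.0],
              [10.0, 4.0, 2.0]])

A = A - np.mean(A, axis=0)             # zero-mean columns
U, S, Vt = la.svd(A, full_matrices=False)

k = 2                                  # desired reduced dimension
A_k = A @ Vt.T[:, :k]                  # keep the first k principal directions

# The diagonal of A_k^T A_k recovers the first k squared singular values
var = A_k.T @ A_k
print(np.allclose(np.diag(var), S[:k] ** 2))  # True
print(A_k.shape)                              # (5, 2)
```

Note that reducing to k dimensions only requires slicing the first k columns of `Vt.T` before projecting; the SVD itself is computed once on the full centered data.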