Principal Component Analysis
Learning Objectives
- Understand why Principal Component Analysis is an important tool in analyzing data sets
- Know the pros and cons of PCA
- Be able to implement the PCA algorithm
What is PCA?
PCA, or Principal Component Analysis, is an algorithm for reducing a large data set without losing important information. It detects the directions of maximum variance and projects the original data set onto a lower-dimensional subspace (up to a change of basis) that still contains most of the important information.
- Pros: Only the “least important” directions are discarded; the most informative ones are kept. Moreover, the new variables are mutually uncorrelated, which is valuable for linear models (see the sketch after this list).
- Cons: The new variables are linear combinations of the original ones, so they have different meanings than the original variables. (Loss of interpretability)
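As a quick check of the uncorrelatedness claim, here is a minimal sketch using the SVD-based procedure described below. The synthetic data set is a hypothetical example, not from the original notes:

import numpy as np
import numpy.linalg as la

rng = np.random.default_rng(0)
# Hypothetical 2-D data set whose two features are strongly correlated
x = rng.normal(size=500)
A = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=500)])

A = A - np.mean(A, axis=0)            # center each feature
U, S, Vt = la.svd(A, full_matrices=False)
A_new = A @ Vt.T                      # rotate onto the principal directions

print(np.corrcoef(A.T))               # off-diagonal entries near 1 (correlated)
print(np.corrcoef(A_new.T))           # off-diagonal entries near 0 (uncorrelated)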
PCA Algorithm
Suppose we are given a large data set A of dimension m×n, and we want to reduce it to a smaller data set A∗ of dimension m×k (with k < n) without losing important information. We can achieve this by carrying out the PCA algorithm with the following steps:
- Shift the data set A so that each feature (column) has zero mean: A = A − mean(A), where the mean is taken column-wise.
- Compute the SVD of the centered data set: A = UΣVᵀ.
- Note that the variances of the data set are determined by the singular values of A, i.e. σ₁ ≥ σ₂ ≥ … ≥ σₙ: the variance along the i-th principal direction is proportional to σᵢ².
- Note that the columns of V represent the principal directions of the data set.
- Our new data set is A∗ := AV = UΣ.
- Since we want to reduce the dimension of the data set, we keep only the k most important principal directions, i.e. the first k columns of V, denoted Vₖ. The reduced data set A∗ = AVₖ then has the desired dimension m×k.
Note that the variance of the data set corresponds to the singular values: (A∗)ᵀA∗ = VᵀAᵀAV = ΣᵀΣ, since AᵀA = VΣᵀΣVᵀ and VᵀV = I. The matrix ΣᵀΣ is diagonal with entries σ₁², …, σₙ², as indicated in step 3.
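This identity is easy to verify numerically. A minimal sketch with a hypothetical random data set:

import numpy as np
import numpy.linalg as la

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 3))
A = A - np.mean(A, axis=0)            # center each feature

U, S, Vt = la.svd(A, full_matrices=False)
A_star = A @ Vt.T

# (A*)^T A* should be the diagonal matrix of squared singular values
print(np.allclose(A_star.T @ A_star, np.diag(S**2)))   # True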
Python code for PCA
Here we assume features are stored in columns, so each row of A is one sample.
import numpy as np
import numpy.linalg as la

# A is an m x n array (m samples in rows, n features in columns),
# and k is the target dimension; both are assumed to be given.
A = A - np.mean(A, axis=0)            # center each feature; broadcasts over rows
U, S, Vt = la.svd(A, full_matrices=False)
A_new = A @ Vt.T[:, :k]               # project onto the first k principal directions
var = A_new.T @ A_new                 # diagonal entries are sigma_i^2 for i = 1..k
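The notes do not specify how to choose k; one common heuristic (an assumption here, not part of the original algorithm) is to keep enough principal directions to explain a fixed fraction of the total variance:

# Hypothetical heuristic: choose the smallest k that explains 95% of the
# total variance (the 0.95 threshold is an assumption, not from the notes)
explained = S**2 / np.sum(S**2)       # fraction of variance per direction
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1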