
Principal Component Analysis


Learning Objectives

What is PCA?

PCA, or Principal Component Analysis, is an algorithm for reducing the dimension of a large data set while retaining most of its important information. It detects the directions of maximum variance and projects the original data set onto a lower-dimensional subspace (up to a change of basis) that still contains most of the important information.

PCA Algorithm

Suppose we are given a large data set $A$ of dimension $m \times n$, and we want to reduce it to a smaller data set $A^*$ of dimension $m \times k$ without losing important information. We can achieve this by carrying out the PCA algorithm with the following steps:

  1. Shift the data set $A$ so that it has zero mean: $A = A - A.\mathrm{mean}()$ (the mean of each column is subtracted from that column).
  2. Compute the SVD of the shifted data set: $A = U \Sigma V^T$.
  3. Note that the variances of the data set are determined by the singular values of $A$, i.e. $\sigma_1, \dots, \sigma_n$.
  4. Note that the columns of $V$ represent the principal directions of the data set.
  5. Our new data set is $A^* := AV = U\Sigma$.
  6. Since we want to reduce the dimension of the data set, we use only the $k$ most important principal directions, i.e. the first $k$ columns of $V$. With $V$ restricted to these columns, $A^* = AV$ in the equation above has the desired dimension $m \times k$.
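
The steps above can be sketched as a single function. This is a minimal sketch under stated assumptions, not a definitive implementation: the name `pca_reduce`, its parameters, and the toy data are made up for illustration, and features are assumed to be stored in columns.

```python
import numpy as np

def pca_reduce(A, k):
    """Hypothetical helper: project an (m, n) data set onto its first k principal directions."""
    # Step 1: shift each column (feature) to zero mean.
    A_centered = A - A.mean(axis=0)
    # Step 2: SVD of the shifted data, A_centered = U @ diag(S) @ Vt.
    U, S, Vt = np.linalg.svd(A_centered, full_matrices=False)
    # Step 6: keep only the first k principal directions (first k columns of V).
    V_k = Vt[:k].T
    # Step 5: the new data set A V_k has shape (m, k).
    return A_centered @ V_k, S

# Toy data (made up for illustration): 100 samples, 5 features.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
A_star, S = pca_reduce(A, k=2)
print(A_star.shape)  # (100, 2)
```

The function also returns the singular values so that a caller can inspect how much variance each direction carries.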

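Note that the variance of the data set corresponds to the singular values: $(A^*)^T A^* = V^T A^T A V = \Sigma^T \Sigma$, as indicated in step 3. The diagonal entries of $\Sigma^T \Sigma$ are the squared singular values $\sigma_i^2$, so the variance along the $i$-th principal direction is proportional to $\sigma_i^2$.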
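
The identity can be checked numerically. The sketch below uses made-up random data and assumes features are stored in columns; it confirms that the projected data has a diagonal Gram matrix whose entries are the squared singular values, and that the sample variance of each new column is $\sigma_i^2 / (m - 1)$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))      # toy data, made up for illustration
A = A - A.mean(axis=0)            # shift columns to zero mean

U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_star = A @ Vt.T                 # project onto all principal directions

gram = A_star.T @ A_star          # should equal diag(S**2)
print(np.allclose(gram, np.diag(S**2)))  # True

# Sample variance of each new column is sigma_i**2 / (m - 1).
m = A.shape[0]
print(np.allclose(A_star.var(axis=0, ddof=1), S**2 / (m - 1)))  # True
```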

Python code for PCA

Here we assume features are stored in columns.

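import numpy as np
import numpy.linalg as la

# A is assumed to be an (m, n) NumPy array with one feature per column.
# Shift each column to zero mean; the (n,) vector of column means broadcasts across rows.
A = A - np.mean(A, axis=0)
# SVD of the shifted data: the rows of Vt are the principal directions.
U, S, Vt = la.svd(A)
# Project onto the principal directions; use Vt[:k].T to keep only the first k.
A_new = A @ Vt.T
# Diagonal matrix whose nonzero entries are the squared singular values S**2.
var = A_new.T @ A_new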
ChangeLog