# Principal Component Analysis

## Learning Objectives

- Understand why Principal Component Analysis is an important tool for analyzing data sets
- Know the pros and cons of PCA
- Be able to implement the PCA algorithm

## What is PCA?

*PCA*, or *Principal Component Analysis*, is an algorithm for reducing a large data set without losing important information. It detects the directions of maximum variance and projects the original data set onto a lower-dimensional subspace (up to a change of basis) that still retains most of the important information.

- Pros: Only the “least important” directions are omitted; the most valuable variation is kept. Moreover, the new variables created are mutually uncorrelated, which is helpful for linear models.
- Cons: The new variables have different meanings than the original variables. (Loss of interpretability)
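As a minimal sketch of the idea (the data set and all variable names here are made up for illustration), consider 2-D points that spread widely along one direction and narrowly along another; PCA recovers the dominant direction, and almost all of the variance survives a projection onto it:

```python
import numpy as np

rng = np.random.default_rng(0)
# Spread the points widely along one axis, narrowly along the other,
# then rotate by 45 degrees so the dominant direction is not axis-aligned
t = rng.normal(size=(200, 2)) * np.array([5.0, 0.5])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = t @ R.T

X = X - X.mean(axis=0)                        # center the data
U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(Vt[0])                                  # first principal direction (unit vector)
print(S**2 / (S**2).sum())                    # fraction of variance per direction
```

The first printed vector is (up to sign) close to the rotated axis `(cos 45°, sin 45°)`, and the first variance fraction is close to 1, so dropping the second coordinate loses little information.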

## PCA Algorithm

Suppose we are given a large data set $A$ of dimension $m \times n$ ($m$ samples, $n$ features), and we want to reduce it to a smaller one of dimension $m \times k$ with $k < n$, without loss of important information. We can achieve this by carrying out the PCA algorithm with the following steps:

- Shift the data set so that it has zero mean: subtract the mean of each column from $A$.
- Compute the *SVD* of the centered data set: $A = U \Sigma V^T$.
- Note that the *variance* of the data set is determined by the singular values of $A$: the variance along the $i$-th principal direction is $\sigma_i^2 / (m - 1)$.
- Note that the columns of $V$ represent the *principal directions* of the data set.
- Our new data set is $A^* = A V$.
- Since we want to reduce the dimension of the data set, we only use the most important principal directions, i.e. the first $k$ columns of $V$, denoted $V_k$. Thus in the above equation, $A^* = A V_k$ has the desired dimension $m \times k$.

Note that the variance of the new data set corresponds to the singular values: the $i$-th column of $A^*$ has variance $\sigma_i^2 / (m - 1)$, as indicated in the third step above.
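The steps above can be checked numerically on random data (the matrix and names here are illustrative). After projecting onto the first $k$ principal directions, the covariance of the new variables is diagonal, with entries $\sigma_i^2 / (m - 1)$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 100, 4, 2
A = rng.normal(size=(m, n))

A = A - A.mean(axis=0)                             # step 1: zero mean
U, S, Vt = np.linalg.svd(A, full_matrices=False)   # step 2: SVD, A = U S V^T
A_star = A @ Vt.T[:, :k]                           # project onto first k columns of V

# Covariance of the new variables: diagonal, entries sigma_i^2 / (m - 1)
C = A_star.T @ A_star / (m - 1)
print(np.allclose(C, np.diag(S[:k]**2 / (m - 1))))   # True
```

This works because $A V_k = U \Sigma V^T V_k = U_k \Sigma_k$, whose columns are orthogonal with squared norms $\sigma_i^2$.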

### Python code for PCA

Here we assume features are stored in columns.

```python
import numpy as np
import numpy.linalg as la

# Center the data: subtract the mean of each column (feature)
A = A - np.mean(A, axis=0)

# Reduced SVD of the centered data; rows of Vt are the principal directions
U, S, Vt = la.svd(A, full_matrices=False)

# Project the data onto the principal directions (columns of V)
A_new = A @ Vt.T

# Variance along each principal direction (m samples)
var = S**2 / (A.shape[0] - 1)
```
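As a usage sketch (the data set here is made up), the snippet can be run on a small matrix whose rows are samples and whose columns are features:

```python
import numpy as np
import numpy.linalg as la

# A hypothetical data set: 5 samples (rows), 3 features (columns)
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 2.0, 0.0],
              [3.0, 1.0, 2.0],
              [1.0, 1.0, 1.0]])

A = A - np.mean(A, axis=0)
U, S, Vt = la.svd(A, full_matrices=False)
A_new = A @ Vt.T

print(A_new.shape)              # (5, 3): same shape, rotated coordinates
print(S**2 / (A.shape[0] - 1))  # variance along each principal direction
```

Keeping only the first `k` columns of `A_new` (equivalently, projecting with `Vt.T[:, :k]`) gives the reduced data set.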
