# Principal Component Analysis

## Learning Objectives

• Understand why Principal Component Analysis is an important tool for analyzing data sets
• Know the pros and cons of PCA
• Be able to implement the PCA algorithm

## What is PCA?

PCA, or Principal Component Analysis, is an algorithm that reduces a large data set without losing the important information. It detects the directions of maximum variance and projects the original data onto a lower-dimensional subspace (up to a change of basis) that still retains most of the important information.

• Pros: Only the “least important” directions are discarded; the most valuable ones are kept. Moreover, the new variables are mutually uncorrelated, which helps avoid multicollinearity in linear models.
• Cons: The new variables are linear combinations of the original ones and therefore lose the original meanings. (Loss of interpretability)

## PCA Algorithm

Suppose we are given a large data set $\bf A$ of dimension $m \times n$, and we want to reduce it to a smaller one ${\bf A}^*$ of dimension $m \times k$ without loss of important information. We can achieve this by carrying out the PCA algorithm with the following steps:

1. Center the data set $\bf A$ so that each column (feature) has zero mean: subtract the column mean from each entry of that column, ${\bf A} \leftarrow {\bf A} - \bar{\bf A}$.
2. Compute the SVD of the centered data set: ${\bf A}= {\bf U \Sigma V}^T$.
3. Note that the variances of the data set are determined by the singular values of $\bf A$, i.e. $\sigma_1, \dots, \sigma_n$.
4. Note that the columns of $\bf V$ represent the principal directions of the data set.
5. Our new data set is ${\bf A}^* := {\bf AV} ={\bf U\Sigma}$.
6. Since we want to reduce the dimension of the data set, we only use the most important $k$ principal directions, i.e. the first $k$ columns of V. Thus in the above Equation ${\bf A}^* = {\bf AV}$, ${\bf A}^*$ has the desired dimension $m \times k$.

Note that the covariance structure of the new data set is diagonal, with entries given by the squared singular values: $({\bf A}^*)^T {\bf A}^*= {\bf V}^T{\bf A}^T{\bf AV}={\bf \Sigma}^T{\bf \Sigma}$, as indicated in step 3.
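This identity can be checked numerically on a small example. The sketch below uses a synthetic random matrix `A` (illustrative data, not from the text) and verifies that $({\bf A}^*)^T {\bf A}^*$ is the diagonal matrix of squared singular values:

```python
import numpy as np
import numpy.linalg as la

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))   # synthetic data: 100 samples, 3 features
A = A - np.mean(A, axis=0)          # center each column

U, S, Vt = la.svd(A, full_matrices=False)  # thin SVD: A = U @ diag(S) @ Vt
A_star = A @ Vt.T                   # project onto all principal directions

# (A*)^T A* equals diag(sigma_1^2, ..., sigma_n^2) since U has orthonormal columns
print(np.allclose(A_star.T @ A_star, np.diag(S**2)))  # True
```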

### Python code for PCA

Here we assume features are stored in columns (each row is a sample), and that the data matrix `A` and the target dimension `k` are already defined.

```python
import numpy as np
import numpy.linalg as la

# Center the data: subtract the mean of each column (feature)
A = A - np.mean(A, axis=0)

# Thin SVD: A = U @ diag(S) @ Vt
U, S, Vt = la.svd(A, full_matrices=False)

# Project onto the first k principal directions (the first k columns of V)
A_new = A @ Vt[:k].T                # shape (m, k)

# Covariance structure of the new data: diagonal with entries S[:k]**2
var = A_new.T @ A_new
```
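As a usage example, the sketch below runs the same steps end to end on a synthetic data set (the sizes `m`, `n`, `k` and the variable `retained` are illustrative choices, not from the text) and reports the fraction of total variance kept by the first `k` components:

```python
import numpy as np
import numpy.linalg as la

rng = np.random.default_rng(1)
m, n, k = 200, 5, 2                 # 200 samples, 5 features, reduce to 2

A = rng.standard_normal((m, n))     # synthetic data set
A = A - np.mean(A, axis=0)          # center each column

U, S, Vt = la.svd(A, full_matrices=False)
A_new = A @ Vt[:k].T                # reduced data set

print(A_new.shape)                  # (200, 2)

# Fraction of total variance retained by the first k components
retained = np.sum(S[:k]**2) / np.sum(S**2)
print(0.0 < retained <= 1.0)        # True
```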