# Principal Component Analysis

## Learning Objectives

• Understand why Principal Component Analysis (PCA) is an important tool for analyzing data sets
• Know the pros and cons of PCA
• Be able to implement the PCA algorithm

## What is PCA?

PCA, or Principal Component Analysis, is an algorithm for reducing a large data set without losing important information. It detects the directions of maximum variance and projects the original data set onto a lower-dimensional subspace (up to a change of basis) that still retains most of the important information.

• Pros: Only the “least important” directions are omitted; the most valuable ones are kept. Moreover, the new variables created are mutually uncorrelated, which is essential for linear models.
• Cons: The new variables have different meanings than the original variables. (Loss of interpretability)
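To see why uncorrelated variables matter, consider a small synthetic data set whose two features are strongly correlated (the data here are illustrative, not from the text): the off-diagonal entry of the covariance matrix is far from zero, and PCA rotates the data onto axes where this matrix becomes diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated features: the second is a noisy copy of the first.
x = rng.normal(size=200)
y = 0.9 * x + 0.1 * rng.normal(size=200)
A = np.column_stack([x, y])  # shape (200, 2), features in columns

cov = np.cov(A, rowvar=False)
print(cov)  # the off-diagonal entries are far from zero
```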

## PCA Algorithm

Suppose we are given a large data set $$\bf A$$ of dimension $$m \times n$$, and we want to reduce it to a smaller one $${\bf A}^*$$ of dimension $$m \times k$$ without loss of important information. We can achieve this by carrying out the PCA algorithm with the following steps:

1. Shift the data set $$\bf A$$ so that each column (feature) has zero mean: $${\bf A} = {\bf A} - {\bf A}.mean()$$, where the mean is taken over the rows.
2. Compute SVD for the original data set: $${\bf A}= {\bf U \Sigma V}^T$$.
3. Note that the variances of the data set are determined by the singular values of $$\bf A$$, i.e. $$\sigma_1, ... , \sigma_n$$.
4. Note that the columns of $$\bf V$$ represent the principal directions of the data set.
5. Our new data set is $${\bf A}^* := {\bf AV} ={\bf U\Sigma}$$.
6. Since we want to reduce the dimension of the data set, we keep only the most important $$k$$ principal directions, i.e. the first $$k$$ columns of $$\bf V$$, denoted $${\bf V}_k$$. Then $${\bf A}^* = {\bf AV}_k$$ has the desired dimension $$m \times k$$.

Note that the variances of the data set correspond to the squared singular values: $$({\bf A}^*)^T {\bf A}^*= {\bf V}^T{\bf A}^T{\bf AV}={\bf \Sigma}^T{\bf \Sigma}$$, which is the diagonal matrix with entries $$\sigma_1^2, ... , \sigma_n^2$$, as indicated in step 3.
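This identity can be checked numerically on a small random matrix (a sketch; the array sizes and seed here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))
A = A - A.mean(axis=0)  # step 1: zero-mean columns

U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_star = A @ Vt.T  # full projection, k = n

# (A*)^T A* should equal diag(sigma_1^2, ..., sigma_n^2).
lhs = A_star.T @ A_star
rhs = np.diag(S**2)
print(np.allclose(lhs, rhs))  # True
```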

### Python code for PCA

Here we assume features are stored in columns.

```python
import numpy as np
import numpy.linalg as la

# Center each column (feature) of A; the mean broadcasts over the rows.
A = A - np.mean(A, axis=0)

U, S, Vt = la.svd(A, full_matrices=False)

# Keep the first k principal directions (rows of Vt = columns of V).
A_new = A @ Vt[:k].T

# Diagonal entries are the squared singular values sigma_i^2.
var = A_new.T @ A_new
```
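As a quick sanity check (the data and the choice $$k = 1$$ below are illustrative), we can apply these steps to a nearly rank-1 data set: the reduced data reconstructs the centered original almost exactly, since the discarded singular value is small.

```python
import numpy as np
import numpy.linalg as la

rng = np.random.default_rng(2)
# Nearly rank-1 data: the second feature is almost a multiple of the first.
t = rng.normal(size=100)
A = np.column_stack([t, 2.0 * t + 0.01 * rng.normal(size=100)])

A = A - np.mean(A, axis=0)          # center the columns
U, S, Vt = la.svd(A, full_matrices=False)

k = 1
A_new = A @ Vt[:k].T                # reduced data, shape (100, 1)
A_approx = A_new @ Vt[:k]           # rank-k reconstruction in the original space

rel_err = la.norm(A - A_approx) / la.norm(A)
print(rel_err)  # small, since sigma_2 << sigma_1
```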