import pandas as pd
import numpy as np
from io import StringIO
import numpy.linalg as la
import matplotlib.pyplot as plt
from matplotlib import cm as cm
import seaborn as sns
sns.set(font_scale=2)
plt.style.use('seaborn-whitegrid')
%matplotlib inline
A tumor is a mass of abnormal tissue. Malignant and benign tumors have different cell growth characteristics.
For this activity, you will use python libraries such as seaborn
and scikit-learn
to:
params = ["radius", "texture", "perimeter", "area",
"smoothness", "compactness", "concavity",
"concave points", "symmetry", "fractal dimension"];
stats = ["(mean)", "(stderr)", "(worst)"]
labels = ["patient ID", "Malignant/Benign"]
for p in params:
for s in stats:
labels.append(p + " " + s)
tumor_data = pd.io.parsers.read_csv("breast-cancer-dataset.dat",header=None,names=labels)
feature_labels = labels[2:]
How many features in this dataset? How many patients?
You can inspect the first few lines of your data using tumor_data.head()
Define X
as the dataset with the features for all patients (not including the patient ID and diagnosis):
Inspect the mean value for each feature using X.mean()
Inspect the standard deviation for each feature using X.std()
Define Y, an array with the diagnosis for each patient. We will later use a classification algorithm to help predict patient's diagnosis (based on the features, should a patient tumor be classified as a M
or B
?)
For visualization purposes only, you can select a subset of the features, and use
seaborn.pairplot(data)
to plot pairwise relationships in the dataset (you could plot all of the feature pairs, but this would generate a lot of plots!)
mean_label = [labels[1]] + labels[2::3]
#sns.pairplot(tumor_data[mean_label], hue="Malignant/Benign", plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'})
Or we can look at the plot of the correlation matrix for all the featues:
corr_matrix = X.corr()
# plot correlation matrix
fig = plt.figure(figsize = (14,8))
ax1 = fig.add_subplot(111)
cax = ax1.imshow(corr_matrix, cmap=cm.get_cmap('jet'))
plt.title('Tumor features correlation matrix')
plt.grid(False)
ax1.set_xticks(np.arange(len(feature_labels)))
ax1.set_yticks(np.arange(len(feature_labels)))
ax1.set_xticklabels(feature_labels,fontsize=10,rotation=90)
ax1.set_yticklabels(feature_labels,fontsize=10)
fig.colorbar(cax)
Overwhelming? Don’t worry, in some cases the effective dimension of your problem might be much smaller than the number of the features, such as in data sets where some features are irrelevant. Is that the case here?
To assess the model’s performance later, we divide the dataset into two parts: a training set and a test set. The first is used to train the system, while the second is used to evaluate the learned or trained system.
We are going to use sklearn.model_selection.train_test_split
to split the dataset
from sklearn.model_selection import train_test_split
A common splitting choice is to take 2/3 of your original data set as the training set, while the 1/3 that remains will compose the test set. You should select this proportion by assigning the variable s
and setting the argument test_sizes = s
We will fix the seed for the random number generator, in order to get reproducible results
Use: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=s, random_state=seed)
to split the dataset into the set of features and diagnosis used for training (X_train,Y_train) and a set used for evaluating the trained system (X_test,Y_test)
When you use algorithms for classification tasks, not all features will contribute well to the generalization capacity of your model. Some irrelevant and correlated attributes can even decrease the performance of some algorithms, contributing to overfitting, for example.
It is now your responsibility to choose the best set of features that will make your models performs better! Use PCA to reduce the feature space of the training set while still retaining the most meaningful information about the features.
We will now center the training data
by making it have zero mean and unit standard deviation. We will first do this step explicility, like we have done in previous activities, before calling SVD to compute the variances. Let's revise how we were solving PCA before:
Xzero = X_train-X_train.mean()
Xscaled = Xzero/X_train.std()
U,sigma,Vt = la.svd(Xzero,full_matrices=False)
variance=sigma**2
total_variance = sum(variance)
print(variance/total_variance)
From now on, we will use the PCA class from sklearn:
from sklearn.decomposition import PCA
The PCA class scales the data to have zero mean. We can also use the method
pca.explained_variance_ratio_
to obtain the explained variance.
pca = PCA().fit(X_train)
print(pca.explained_variance_ratio_)
Note that we obtained the same explained variances when using SVD in 'Xzero'.
In this tutorial, we will use the library StandardScaler
to scale the training set for us.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
sklearn's transform's fit() just calculates the parameters (e.g. 𝜇 and 𝜎 in case of StandardScaler) and saves them as an internal objects state. Afterwards, you can call its transform() method to apply the transformation to a particular set of examples.
Xs_train = scaler.transform(X_train)
We will later apply the same transformation to the testing set, using the same two parameters 𝜇 and 𝜎 (values) that you used for centering the training set.
In summary, all we have to do now is:
pca = PCA().fit(Xs_train)
var_exp = pca.explained_variance_ratio_
print(var_exp)
You can check above that you would obtain the same explained variances if you have used Xscaled
to obtain the SVD
Plot the explained variance and the cummulative explained variance. If you want to retain the principal components that capture 80% of the variance, how many components should you choose?
cum_var_exp = np.cumsum(var_exp)
print(cum_var_exp)
plt.bar(range(len(var_exp)), var_exp, alpha=0.5, align='center',
label='individual explained variance')
plt.step(range(len(var_exp)), cum_var_exp, where='mid',
label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(bbox_to_anchor=(1.1, 1.05))
plt.tight_layout()
Define the variable Z_train
the projected training set with reduced feature space that captures 90% of the variance (remember Z_train = X_train V
where V
is the right singular vector matrix?)
How many components should you consider?
Two different approaches...
You can now project your test set onto the Principal Components of the training set.
Z_train = pca.transform(Xs_train)
Z_test = pca.transform(Xs_test)
# set colors for each data point
C = []
classes = list(set(Y_train))
for c in Y_train:
if c == classes[0]:
C.append('r')
else:
C.append('b')
plt.figure()
plt.title("Training Data Projected into 2D Subspace")
plt.xlabel("1st Component")
plt.ylabel("2nd Component")
plt.scatter(Z_train[:, 0], Z_train[:, 1], c=C)
plt.show()
We will first use a logistic regression model. This is used for classification tasks where data points can only be a member of one class. The model can be solved either using a modified version of least squares or newton's method.
from sklearn.linear_model import LogisticRegression
Make an instance of the model, using all the default parameters for now.
model = LogisticRegression(solver="lbfgs")
#Recall...
print(Z_train.shape,Xs_train.shape)
Training the model on the data
model.fit(Z_train,Y_train)
Once you have the trained model, you can get the predicted diagnosis for the patients in the set X_test
, and compare them with the actual
diagnosis in Y_test
Ypredict = model.predict(Z_test)
# Accuracy: Overall, how often is the classifier correct?
(Y_test==Ypredict).mean()
test_score = model.score(Z_test, Y_test)
print(test_score)
from sklearn.metrics import roc_curve, roc_auc_score, auc, f1_score, confusion_matrix
Confusion matrix: describes the performance of the classification model on a set of test data for which the true values are known.
# use "labels" argument to re-organize the entries in the confusion matrix
# (we typically treat malignant as positive and benign as negative)
cmat = confusion_matrix(Y_test,Ypredict, labels=["M", "B"])
print(cmat)
TP, FP, FN, TN = cmat.ravel()
npatients = Y_test.shape[0]
Confusion matrix = $ \begin{bmatrix} \textrm{Predicted positive and actual positive} & \textrm{Predicted positive and actual negative} \\ \textrm{Predicted negative and actual positive} & \textrm{Predicted negative and actual negative}\end{bmatrix} = \begin{bmatrix} \textrm{TP} & \textrm{FP} \\ \textrm{FN} & \textrm{TN} \end{bmatrix}$
TP: Predicted yes (malignant cancer), and the patients do have the disease. (How many malignant tumors are correctly identified? )
FP: Predicted yes (malignant cancer), but patients don't actually have the disease. (How many benign tumors are identified as malignant? Also known as a "Type I error.")
FN: Predicted no (benign cancer), but the patients actually have the disease. (How many malignant tumors are missed? Also known as a "Type II error.")
TN: Predicted no (benign cancer), and patients don't have the disease. (How mamy benign tumors are correctly identified?)
# Accuracy: Overall, how often is the classifier correct? We sum the diagonal of the confusion matrix!
(TP+TN)/npatients
Accuracy is not always a reliable metric for the real performance of a classifier. Let's say there are 100 patients and 5 have malignant cancer. If the model classifies all the 100 patients as not having cancer, than the overall accuracy is 95% (TP = 0, TN = 95, FN = 5, FP = 0).
Let's take a look at other metrics:
True positive rate (TPR), or sensitivity or recall: When it's actually yes, how often does it predict yes? TPR = TP/(TP+FN)
True negative rate (TNR) or specificity: When it's actually no, how often does it predict no? TNR = TN/(FP+TN)
For the example above, the classifier has 100% specificity (predicts 100% of the benign cancers) and 0% sensitivity (fails to classify all cancerous tumors).
The "best" metric to use will depend on the classification problem. Back to our original dataset...
# True positive rate (TPR), or sensitivity or recall: When it's actually yes, how often does it predict yes?
TPR = TP/(TP+FN)
print(TPR)
# True negative rate(TNR) or specificity: When it's actually no, how often does it predict no?
TNR = TN/(FP+TN)
print(TNR)
# Precision: When it predicts yes, how often is it correct?
p = TP/(TP+FP)
print(p)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
dict_classifiers = {
"Logistic Regression": LogisticRegression(),
"Nearest Neighbors": KNeighborsClassifier(),
"Linear SVM": SVC(),
"Gradient Boosting Classifier": GradientBoostingClassifier(),
"Random Forest": RandomForestClassifier(),
"Naive Bayes": GaussianNB()
}
# Determine accuracy of classifier given data pairs (X, Y)
def accuracy(classifier, X, Y):
Ypredict = classifier.predict(X)
return (Y==Ypredict).mean()
def batch_classify(X_train, Y_train, X_test, Y_test, verbose = True):
print("{:34s}{:15s}{:12s}{:12s}".format("Classifiers", "Test-score", "TPR", "TNR"))
train_score = []
test_score = []
labels = []
for classifier_name, classifier in list(dict_classifiers.items()):
# Fit the model
labels.append(classifier_name)
classifier.fit(X_train, Y_train)
# Make predictions on the test data
predictions = classifier.predict(X_test)
cmat = confusion_matrix(Y_test, predictions, labels=["M", "B"])
TP, FP, FN, TN = cmat.ravel()
# Get scores
train_score.append(accuracy(classifier, X_train, Y_train))
test_score.append(accuracy(classifier, X_test, Y_test))
if verbose:
print("{c:30}{t1:12f}{t2:12f}{t3:12f}".format(c=classifier_name,t1=test_score[-1], t2=TP/(TP+FN), t3=TN/(FP+TN) ), )
batch_classify(Z_train, Y_train, Z_test, Y_test, verbose = True)
batch_classify(Xs_train, Y_train, Xs_test, Y_test, verbose = True)