import pandas as pd
import numpy as np
from io import StringIO
import numpy.linalg as la
import matplotlib.pyplot as plt
from matplotlib import cm as cm
import seaborn as sns
sns.set(font_scale=2)
plt.style.use('seaborn-whitegrid')
%matplotlib inline
For this activity, you will explore the basics of machine learning. Machine learning describes a class of methods for automatically building mathematical models based on training data. We will work with a dataset of Pokemon.
In this activity, you will:
Load the dataset
# data source: https://www.kaggle.com/abcsds/pokemon/downloads/pokemon.zip/2
df = pd.read_csv("Pokemon.csv")
In the dataset, each row represents a Pokemon. How many Pokemon are in our dataset? How many features are in this dataset?
You can inspect the first few rows of your data using df.head()
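A minimal sketch of how these questions can be answered with the shape attribute. The small DataFrame below is a hypothetical stand-in for Pokemon.csv (its rows and column names are made up for illustration), so the numbers it prints are not the answers for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; values are made up for illustration.
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "HP": [45, 60, 80],
    "Legendary": [False, False, True],
})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# head() shows the first rows (up to 5 by default)
print(toy_df.head())
```

On the real data, the same two calls applied to df give the number of Pokemon and the number of columns.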
Define an array y, such that it contains whether a given Pokemon is legendary or not. The $i$th entry of y denotes whether the $i$th Pokemon is legendary (True) or not (False). We will later use a classification algorithm to help predict if a Pokemon is legendary.
Not every classifier can work with string or boolean types. Instead of having the array y as booleans, we can replace True with 1 and False with 0.
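One possible way to build such an array is sketched below on a tiny made-up DataFrame. The column name "Legendary" is an assumption about how the dataset stores this flag; check df.columns for the actual name.

```python
import pandas as pd

# Hypothetical miniature dataset; "Legendary" is an assumed column name.
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Legendary": [False, True, False],
})

# Boolean numpy array: one entry per Pokemon
y_bool = toy_df["Legendary"].values

# Convert True/False to 1/0 so that any classifier can use it
y_toy = y_bool.astype(int)
print(y_toy)
```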
What are the features in our data that can be used to determine the legendary status of a Pokemon?
Save these features in the variable labels. Hint: there are 7 features.
Create another DataFrame (name it X) with the relevant features.
Then get the numpy array x with the values of the DataFrame X.
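The selection pattern can be sketched as follows. The DataFrame and the two feature names are hypothetical and chosen only to show the mechanics; they are not the 7 features the hint refers to.

```python
import pandas as pd

# Hypothetical miniature dataset with made-up values.
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "HP": [45, 60, 80],
    "Attack": [49, 62, 100],
    "Legendary": [False, False, True],
})

# List of column names to use as features (illustrative choice)
toy_labels = ["HP", "Attack"]

# Indexing with a list of names returns a DataFrame with only those columns
X_toy = toy_df[toy_labels]

# .values gives the underlying numpy array, one row per Pokemon
x_toy = X_toy.values
print(x_toy.shape)
```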
To assess the model’s performance later, we divide the dataset into two parts: a training set and a test set. The first is used to train the system, while the second is used to evaluate the learned or trained model.
We are going to use sklearn.model_selection.train_test_split to split the dataset
from sklearn.model_selection import train_test_split
A common splitting choice is to take 2/3 of your original dataset as the training set, while the remaining 1/3 composes the test set. You should select this proportion by assigning the variable s and setting the argument test_size = s in sklearn.model_selection.train_test_split.
s = 0.33
We will fix the seed for the random number generator, in order to get reproducible results
seed = 41
Split the arrays x and y into training data (X_train, Y_train) and test data (X_test, Y_test)
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
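A minimal sketch of the call, run on synthetic arrays rather than the Pokemon data. The names x_demo, y_demo and the *_d outputs are hypothetical; the values 0.33 and 41 mirror s and seed above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples with 7 features and a made-up binary label
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 7))
y_demo = (x_demo.sum(axis=1) > 0).astype(int)

# test_size sets the fraction held out; random_state fixes the shuffle
X_train_d, X_test_d, Y_train_d, Y_test_d = train_test_split(
    x_demo, y_demo, test_size=0.33, random_state=41
)

print(X_train_d.shape, X_test_d.shape)
```

The function returns the pieces in the order train features, test features, train labels, test labels, so the unpacking order matters.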
Now that we have a dataset to train our model and a dataset to validate our model, we need to construct a model.
To introduce this, we will begin by using a logistic regression model. Logistic regression is used for classification tasks in which each data point belongs to exactly one class. The model can be fit using either a modified version of least squares or Newton's method.
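For reference, the standard form of the binary logistic regression model is written below. The notation is generic and not taken from this activity: $x$ is the feature vector of one Pokemon, $w$ is a vector of weights, and $b$ is an intercept, both learned from the training data.

$$ P(y = 1 \mid x) = \sigma(w^T x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} $$

A data point is typically assigned to class 1 (legendary) when this probability is above 0.5, which is the same as $w^T x + b > 0$.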
from sklearn.linear_model import LogisticRegression
Using the LogisticRegression class, make an instance of the model. Use all the default parameters for now.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
model = LogisticRegression(solver="lbfgs")
Using this instance of the model, let's use the training data to train the model.
Use model.fit(X_train, Y_train) to train the model.
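To illustrate what fitting does, here is a small sketch on made-up one-dimensional data where the two classes are clearly separated. The names X_demo, Y_demo and demo_model are hypothetical and independent of the Pokemon arrays.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: small feature values belong to class 0, large ones to class 1
X_demo = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
Y_demo = np.array([0, 0, 0, 1, 1, 1])

# Create the model and learn the weights from the labeled examples
demo_model = LogisticRegression(solver="lbfgs")
demo_model.fit(X_demo, Y_demo)

# After fitting, the model can classify points it has not seen before
demo_pred = demo_model.predict(np.array([[0.5], [9.5]]))
print(demo_pred)
```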
We now have a trained model and we can begin using it to make predictions. Recall that we want to use our model to predict whether a Pokemon is legendary or not.
Use the model to predict whether the Pokemon in the test dataset X_test are legendary.
You can make predictions with the model's predict method. Store the result in Ypredict.
Ypredict = model.predict(X_test)
# number of Pokemon predicted to be legendary, then the number that actually are
print(Ypredict.sum())
print(Y_test.sum())
One way of determining the performance of our model is using a confusion matrix. A confusion matrix describes the performance of the classification model on a set of test data for which the true values are known. A confusion matrix stores the true positives, false positives, false negatives, and true negatives for our test data.
from sklearn.metrics import confusion_matrix
Let's use the confusion_matrix function in sklearn to construct a confusion matrix for our dataset.
cmat = confusion_matrix(Y_test,Ypredict)
print("confusion matrix:\n",cmat)
TN, FP, FN, TP = cmat.ravel()
$$ \text{Confusion matrix} = \left[ \begin{array}{cc} TN & FP \\ FN & TP \end{array} \right] $$
TN: Predicted no (not legendary), and the Pokemon is not legendary. (How many non-legendary Pokemon are correctly identified?)
FP: Predicted yes (legendary), but the Pokemon is not legendary. (How many non-legendary Pokemon are identified as legendary?)
FN: Predicted no (not legendary), but the Pokemon is actually legendary. (How many legendary Pokemon are missed?)
TP: Predicted yes (legendary), and the Pokemon is legendary. (How many legendary Pokemon are correctly identified?)
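The four definitions above can be checked by counting directly. The sketch below uses two short made-up label arrays (y_true_demo and y_pred_demo are hypothetical names) and compares the manual counts with what confusion_matrix returns.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (1 = legendary, 0 = not legendary)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# Count each case directly from its definition
tn = int(np.sum((y_true_demo == 0) & (y_pred_demo == 0)))
fp = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))
fn = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))
tp = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))

# sklearn puts true labels on the rows and predictions on the columns
cmat_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cmat_demo)
print(tn, fp, fn, tp)
```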
1) Accuracy: fraction of correct classifications, (TP+TN)/(TP+TN+FP+FN) https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, Ypredict)
2) Precision: when it predicts yes (legendary), how often is the prediction correct? https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html
TP/(TP+FP)
from sklearn.metrics import precision_score
precision_score(Y_test, Ypredict)
3) Recall: when actually yes (legendary), how often is the prediction correct? https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html
TP/(TP+FN)
from sklearn.metrics import recall_score
recall_score(Y_test, Ypredict)
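As a sanity check, the three metrics can be computed from the confusion matrix counts and compared with the sklearn functions. The label arrays below are made up for illustration and are not Pokemon results.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up true labels and predictions (1 = legendary, 0 = not legendary)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Metrics written out from the counts
acc_manual = (tp + tn) / (tp + tn + fp + fn)
prec_manual = tp / (tp + fp)
rec_manual = tp / (tp + fn)

# The same metrics from sklearn
acc_sk = accuracy_score(y_true_demo, y_pred_demo)
prec_sk = precision_score(y_true_demo, y_pred_demo)
rec_sk = recall_score(y_true_demo, y_pred_demo)

print(acc_manual, prec_manual, rec_manual)
```

When one class is rare, as legendary Pokemon may be, accuracy alone can look high even if most legendary Pokemon are missed, which is why precision and recall are reported alongside it.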
Starting with an initial dataset, we learned how to prepare the data, split the data, construct a model, and then use the model to make predictions, all with sklearn.
Let's try and repeat this experiment now but with a different model. Below are 5 different classifiers (models) found in sklearn. Compare your results for each of the classifiers. Which works best for the task of determining legendary status of a Pokemon?
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
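One way to organize the comparison is to loop over the classifiers and record the same metric for each. The sketch below is only an outline under stated assumptions: it uses synthetic data from make_classification instead of the Pokemon arrays, every classifier keeps its default settings apart from a fixed random_state, and the names demo_models and demo_scores are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary classification problem with 7 features (not the Pokemon data)
x_syn, y_syn = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0
)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    x_syn, y_syn, test_size=0.33, random_state=41
)

# The five classifiers, with default settings
demo_models = {
    "KNeighbors": KNeighborsClassifier(),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
    "SVC": SVC(),
}

# Fit each model on the training set and score it on the test set
demo_scores = {}
for name, clf in demo_models.items():
    clf.fit(X_tr, Y_tr)
    demo_scores[name] = accuracy_score(Y_te, clf.predict(X_te))

for name, score in demo_scores.items():
    print(f"{name}: {score:.3f}")
```

For the Pokemon task, replacing the synthetic arrays with X_train, X_test, Y_train and Y_test and adding precision and recall to the loop gives a like-for-like comparison with the logistic regression results.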