import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(font_scale=2)
sns.set_style("whitegrid")
We will be looking at the FIFA 2018 dataset. Although FIFA 2018 is a video game, its developers strive to make it as accurate as possible, so the data is intended to reflect the skills of the real-life players.
Let's load the data frame using pandas.
df = pd.read_csv("FIFA_2018.csv", encoding="ISO-8859-1", index_col=0, low_memory=False)
We can take a brief look at the data by calling df.head(). The first 34 columns are attributes that describe the behavior (e.g. aggression) or the skills (e.g. ball control) of each player. The final columns show the player's position, name, nationality, and the club they play for.
The four positions are forward (FWD), midfielder (MID), defender (DEF), and goalkeeper (GK).
df.head()
We already know that identifying goalkeepers is quite straightforward, so let's remove the rows corresponding to goalkeepers, along with the goalkeeper-specific attribute columns:
df2 = df[df["Position"] != "GK"].copy()
# axis is passed by keyword; pandas 2.0+ no longer accepts it positionally
df2.drop(['GK diving',
          'GK handling',
          'GK kicking',
          'GK positioning',
          'GK reflexes'], axis=1, inplace=True)
We can get all the attribute names and store them as labels by using .columns.values.
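The train/test split further down expects a feature matrix X and a target vector Y, which are not defined in this section. As a minimal sketch of how they might be built from .columns.values, here is a toy stand-in for df2. The column names "Aggression", "Ball control", "Name", "Nationality", and "Club" are assumptions based on the description above; only "Position" appears in the original code.

```python
import pandas as pd

# Toy stand-in for df2 (made-up values; the real frame has many more attribute columns).
toy = pd.DataFrame({
    "Aggression": [70, 45, 80],
    "Ball control": [60, 85, 55],
    "Position": ["DEF", "MID", "FWD"],
    "Name": ["A", "B", "C"],
    "Nationality": ["X", "Y", "Z"],
    "Club": ["P", "Q", "R"],
})

# Assumed names of the trailing non-attribute columns.
meta_cols = ["Position", "Name", "Nationality", "Club"]

labels = toy.columns.values  # every column name, as a NumPy array
feature_cols = [c for c in labels if c not in meta_cols]

X = toy[feature_cols].values  # attribute matrix, shape (n_players, n_attributes)
Y = toy["Position"].values    # target vector: DEF / MID / FWD
```

With the real data, the same steps would be applied to df2 instead of toy, after checking the actual column names with df2.columns.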
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
validation_size = 0.3
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=validation_size, random_state=seed)
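The loop that follows iterates over dict_classifiers, which is not defined in this section. A plausible sketch, assuming it simply maps a display name to one instance of each classifier imported above, could look like this. The key names and the max_iter=1000 setting are our assumptions, not taken from the original notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical dictionary: display name -> untrained classifier instance.
dict_classifiers = {
    # max_iter raised from the default (assumption) to reduce convergence warnings
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Naive Bayes": GaussianNB(),
    "Gradient Boosting": GradientBoostingClassifier(),
}
```

Keeping the models in a dictionary lets one loop fit and score all of them with the same train/test split, so their accuracies are directly comparable.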
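print('%30s %16s' % ("Classifier", "accuracy"))
for name, clf in dict_classifiers.items():
    clf.fit(X_train, Y_train)        # train on the training split
    y_result = clf.predict(X_test)   # predict positions for the held-out players
    acc = accuracy_score(Y_test, y_result)
    print('%30s %16f' % (name, acc))
    # Rows are true positions, columns are predicted positions, both ordered DEF, MID, FWD
    cmat = confusion_matrix(Y_test, y_result, labels=["DEF", "MID", "FWD"])
    print(cmat)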
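Accuracy alone hides which positions get confused with each other. As a rough sketch, per-position recall and precision can be read directly off a confusion matrix laid out like the one printed above (rows are true positions, columns are predicted positions). The counts below are made up for illustration and are not results from the dataset.

```python
import numpy as np

# Hypothetical 3x3 confusion matrix in the order DEF, MID, FWD (made-up counts).
cmat = np.array([[50, 10,  0],
                 [ 8, 40, 12],
                 [ 0,  5, 25]])

correct = np.diag(cmat)                 # correctly classified players per position

recall = correct / cmat.sum(axis=1)     # correct / all players truly in that position
precision = correct / cmat.sum(axis=0)  # correct / all players predicted as that position
accuracy = correct.sum() / cmat.sum()   # overall fraction classified correctly

for pos, r, p in zip(["DEF", "MID", "FWD"], recall, precision):
    print('%5s recall=%.3f precision=%.3f' % (pos, r, p))
```

The same per-position numbers should also be obtainable from recall_score and precision_score, imported earlier, by passing average=None.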