In Fantasy Football, contestants choose from a pool of available (American) football players to build a team. Contestants' teams score points depending on how their chosen players performed in real-life. The more points scored, the better!
There are literally hundreds of websites and blogs dedicated to predicting who will have a good game. They use a variety of methodologies (including no methodology at all) to generate their predictions. We will try to develop a predictor using Linear Least Squares that will answer the question: "Should I pick this player?"
Bonus: This activity may help you with MP5, since you will be using similar data structures in that assignment.
We'll import our standard packages, along with pandas
, which is a python
data analysis library.
import numpy as np
import numpy.linalg as la
import pandas as pd
There are two data sets, FF-data-2018.csv
and FF-data-2019.csv
that were collected using scoring from the Yahoo Fantasy Football platform. The 2018 data was collected from here. You can choose other years going back to 2011 from a variety of platforms.
Let's read in the data and see what it looks like.
ff_2018 = pd.read_csv('FF-data-2018.csv')
ff_2018
There are 6,355 data points which have a number of fields. They are:
We can access the labels and put them in a list:
labels = list(ff_2018.columns)
print(labels)
We can print out the available values of the positions for the data set by passing the key Pos
as a string to the data set.
print(ff_2018['Pos'].values)
To remove all the duplicates, we can call the function numpy.unique
to access all distinct values. (Just like every other time you use a new function, review the documentation of numpy.unique
! You can do so by running a cell with the following command: np.unique?
)
positions = np.unique(ff_2018['Pos'])
print(positions)
Since the positions in football are so different, we really want to focus on one at a time. It would be very ambitious to try and create a general predictor for all positions. Let's focus on quarterbacks first.
How can we extract all the data for quarterbacks? We can find the rows in the dataframe that has position equal to QB
POS = 'QB'
ff_2018['Pos'] == POS
We will create another (smaller) dataframe that has the rows referring to the quarterback position.
df_POS = ff_2018[ff_2018['Pos'] == POS].copy()
df_POS.head()
We can access the names of all the quarterbacks by referring to the columns Name
df_POS['Name']
Linear Least Squares works with numerical data, not strings. Eventually, we will want our predictive models to incorporate whether the player played at home or on the road, or how good their opponent was. But the columns h/a
and Oppt
are strings:
df_POS['h/a']
df_POS['Oppt']
At this point, we need to make decisions about what numerical values these should take. For the home/away column:
let's make an array with the value +1.0 when the game is played at home, and -1.0 when the game is played away.
store this array as another column in the pandas dataframe, with label home_away
df_POS['home_away'] = np.where(df_POS['h/a']=='a',-1,1)
df_POS
For the opponents, we need some kind of information about how many points they give up to a position on average. We have compiled that information in a separate file, called team_rankings.py
. Importing this file will give us access to a collection of dictionaries that provides this information.
After importing this file, the number vs_2018[Pos][team]
will give us a relevant ranking.
from team_rankings import * # asterik just means we import everything from that namespace
We can take a look at the keys in the dictionary:
print( vs_2018.keys() )
Note that the keys are just the player positions. Let's see the information for the key QB
(we have been storing this string in the variable POS
)
vs_2018[POS]
print(vs_2018[POS]['atl'])
print(vs_2018[POS]['buf'])
There are 32 football teams in the NFL.
The fact that vs_2018['QB']['atl']
has the value 1.0, means that the Atlanta Falcons gave up the most points to quarterbacks on average in the 2018 season.
Since vs_2018['QB']['buf']
has the value 32.0, this means that the Buffalo Bills gave up the least points to quarterbacks on average in the 2018 season.
So, we would expect a better performance out of a quarterback if he is playing the Atlanta Falcons, compared to the Buffalo Bills.
The rankings can be very different for different positions:
print(vs_2018['RB']['atl'])
print(vs_2018['RB']['buf'])
print()
print(vs_2018['WR']['atl'])
print(vs_2018['WR']['buf'])
print()
print(vs_2018['TE']['atl'])
print(vs_2018['TE']['buf'])
print()
print(vs_2018['Def']['atl'])
print(vs_2018['Def']['buf'])
print()
For the quarterback position (POS = 'QB'), convert the strings in the column Oppt
into their corresponding numerical values using the dictionary vs_2018
. Store this as another column of the pandas dataframe oppt_rank
def get_rank(x):
return vs_2018[POS][x]
df_POS['oppt_rank'] = df_POS['Oppt'].apply(get_rank)
df_POS
Now, players' names will be repeated in the array names
for every game they played. We will find it convenient to have another array collecting the names without these repeats. We'll use pandas.Series.unique
to do this.
unique_players = df_POS['Name'].unique()
len(unique_players)
So 73 quarterbacks played in 2018. But there are only 32 teams! Who are all these people?
print(unique_players[7])
print(unique_players[72])
I know who Tom Brady is, but I've never heard of Nate Sudfeld. Let's count how many times a players played a game.
We can use groupby
to group players by Name, and then count the number of times each player appears:
df_POS.groupby('Name')['Name'].count()
We want to add the frequency (game count) back to the original dataframe, and for that we will use transform to return an aligned index.
df_POS['game_count'] = df_POS.groupby('Name')['Name'].transform('count')
df_POS
Note that Nate Sudfeld only played in 1 game in 2018. He probably took over when the starter was injured, or when his team was involved in a lopsided game. We probably want to remove his data, since it won't be very helpful.
Let's us create an array of the names of all the players that are relevant to our analysis. For that, we will exclude the names for all the players that participated in less than min_games
.
min_games = 5
relevant_players = df_POS[df_POS['game_count']>=min_games]['Name'].unique()
print(len(relevant_players))
relevant_players
Now we only consider 43 quarterbacks playing in 2018.
Write a function prepare_data
that creates the dataframe df_POS
for a given player position. The function also returns as an argument the list of relevant unique players.
def prepare_data(ff_data,POS,min_games):
# returns (new_df,relevant_players) as described above
...
return(df_POS, relevant_players)
Test out that your function works as expected:
df_test,players_test = prepare_data(ff_2018,'WR',3)
df_test
We'll start with a simple linear model. For now, we will keep using our example where we constructed a dataset for quarterbacks in the variable df_POS
, along with relevant_players
The points scored in the previous $n$ games will be the only data considered when making a prediction. Let's look at what the model would look like for only one player, say Andy Dalton, with $n = 3$.
pl = relevant_players[13]
pl_points = df_POS[df_POS['Name']==pl]['YH points'].values
print('Player:', pl)
print('Points:', pl_points)
Andy Dalton played 11 games. So we could try to build a model that predicted the points he scored in his 4th game, based on his first 3, and similarly try to predict the points he scored in the 5th games based on games 2,3, and 4.
I.e. a "local" least squares system might look something like
$$\mathbf{Ax}\cong \mathbf{b}$$
where
$$\mathbf{A} = \begin{pmatrix} 17.52 & 26.6 & 18.08\\ 26.6 & 18.08 & 25.78 \\ 18.08 & 25.78 & 13.92 \\ 25.78 & 13.92 & 17.16 \\ 13.92 & 17.16 & 8.92 \\ 17.16 & 8.92 & 20.2 \\ 8.92 & 20.2 & 8.92 \\ 20.2 & 8.92 & 19.34 \end{pmatrix}, \hspace{5mm} \mathbf{b}= \begin{pmatrix} 25.78 \\ 13.92 \\ 17.16 \\ 8.92 \\ 20.2 \\ 8.92 \\ 19.34 \\ 9.1\end{pmatrix}$$
This was with $n = 3$ games. If instead, we base our "local" least squares on the previous $n = 4$ games, then our system would instead look like:
$$\mathbf{A} = \begin{pmatrix} 17.52 & 26.6 & 18.08 & 25.78\\ 26.6 & 18.08 & 25.78 & 13.92 \\ 18.08 & 25.78 & 13.92 & 17.16\\ 25.78 & 13.92 & 17.16 & 8.92 \\ 13.92 & 17.16 & 8.92 & 20.2 \\ 17.16 & 8.92 & 20.2 & 8.92\\ 8.92 & 20.2 & 8.92 & 19.34 \end{pmatrix},\hspace{4mm} \mathbf{b}= \begin{pmatrix} 13.92 \\ 17.16 \\ 8.92 \\ 20.2 \\ 8.92 \\ 19.34 \\ 9.1\end{pmatrix} $$
Write a function that generates this local system for a given (relevant) player. Use the example above to debug your function (i.e., data for Andy Dalton)
def player_point_history(df, pl, n_games):
# df: dataframe
# rel_player (string): name of a player
# n_games (int): number of games used for the prediction
...
return A,b
A,b = player_point_history(df_POS, relevant_players[13], 4)
print(A)
print(b)
Now, with this function, we can loop over the relevant players, generate their local systems, and "stack" them on top of each other to generate the global system. We'll do this with $n = 3$
n_games = 3
# empty array for right hand side of size M x 1
pts_scored = np.array([])
# empty array for matrix of size M x n_games. We had to reshape to size 0 x n_games to allow for "stacking"
game_hist = np.array([]).reshape(0,n_games)
for pl in relevant_players:
# generate local system
a,c = player_point_history(df_POS,pl,n_games)
# use numpy.append to append local system to global vector
pts_scored = np.append(pts_scored,c)
# use numpy.vstack (i.e. "vertical stack") to stack the global matrix and the local matrix
game_hist = np.vstack((game_hist,a))
print(pts_scored.shape)
print(game_hist.shape)
It would be an overly ambitious task to try to predict a players exact point total. What we can do instead is set a "threshold". I.e. if a player's points exceed this threshold, then we can deem them "startable". If they don't exceed this threshold, then we should look choose a different player.
What threshold should we use? That's debatable, but I've compiled the following dictionary based on additional data I collected from nfl.com.
start_threshold = {'QB': 19.3999, 'RB': 14.599, 'WR': 15.099, 'TE': 7.899, 'Def': 7.499}
So, if a quarterback scores more than 19.3999, we declare them startable. If a defense scores less than 7.499, then we should pick a different defense, etc.
We can finally set up our least squares system. Set the matrix A
to the variable game_hist
defined above. The components of the vector b
should have a value of +1.0 if the corresponding component of pts_scored
exceeds the threshold, and -1.0 if it lies below the threshold. (I chose the thresholds so that it is impossible for the points to equal the threshold).
Set up the right hand side vector, and solve the Linear Least Squares problem for $\mathbf{x}$. You can use numpy.linalg.lstsq
to compute the least-squares solution. Then compute a numpy array b_predict
that tests how this linear model performs on the data.
threshold = start_threshold[POS]
A = game_hist
b = ...
x = ...
b_predict = ...
We can have the following situations:
Compute the number of false positives, false negatives, and correct prediction. What percentage of each do we obtain on the data?
The model is only correct 60.57% of the time. However, it only return a "false positive" 3.39% of the time, which is very nice: if the model tells you to start a player, there's a good chance you will be happy with the results.
Let's put it all together into a single function. This will mostly be copying and pasting from above. The function should return the variables A
, b
, x
.
def linear_predictor(ff_data, Pos, min_games, n_games, threshold):
# clear
...
return A, b, x
We can call the routine for any position, and we can tweak the number of min_games
and n_games
. You can also tweak the threshold. Try changing the input variables and see how this affects model accuracy
Pos = 'WR'
min_games = 5
n_games = 3
threshold = start_threshold[Pos]
A, b, x = linear_predictor(ff_2018, Pos, min_games, n_games, threshold)
b_predict = ...
Notice we didn't make use of the fact that a player is playing on home or on the road, or the ranking of the opponent. Let's try to enrich the features used in this problem to include this data. Let's go back to Andy Dalton:
pl = relevant_players[13]
pl_points = df_POS[df_POS['Name']==pl]['YH points'].values
pl_home_away = df_POS[df_POS['Name']==pl]['home_away'].values
pl_oppt_rank = df_POS[df_POS['Name']==pl]['oppt_rank'].values
print('Player:', pl)
print('Points:', pl_points)
print('Location:', pl_home_away)
print('Opp Rank:', pl_oppt_rank)
When $n = 3$ we had the following system when we only took previous games played:
$$\mathbf{A} = \begin{pmatrix} 17.52 & 26.6 & 18.08\\ 26.6 & 18.08 & 25.78 \\ 18.08 & 25.78 & 13.92 \\ 25.78 & 13.92 & 17.16 \\ 13.92 & 17.16 & 8.92 \\ 17.16 & 8.92 & 20.2 \\ 8.92 & 20.2 & 8.92 \\ 20.2 & 8.92 & 19.34 \end{pmatrix}, \hspace{5mm} \mathbf{b}= \begin{pmatrix} 25.78 \\ 13.92 \\ 17.16 \\ 8.92 \\ 20.2 \\ 8.92 \\ 19.34 \\ 9.1\end{pmatrix}$$
With the location and opponent data, it should now look like this:
$$\mathbf{A} = \begin{pmatrix} 17.52 & 26.6 & 18.08 & -1 & 1\\ 26.6 & 18.08 & 25.78 & 1 & 10 \\ 18.08 & 25.78 & 13.92 & 1 & 17\\ 25.78 & 13.92 & 17.16 & -1 & 5\\ 13.92 & 17.16 & 8.92 & 1 & 4\\ 17.16 & 8.92 & 20.2 & 1 & 2\\ 8.92 & 20.2 & 8.92 & -1 & 29 \\ 20.2 & 8.92 & 19.34 & 1 & 13\end{pmatrix}, \hspace{5mm} \mathbf{b}= \begin{pmatrix} 25.78 \\ 13.92 \\ 17.16 \\ 8.92 \\ 20.2 \\ 8.92 \\ 19.34 \\ 9.1\end{pmatrix}$$
Create an enriched linear regression, by adding these two extra columns to the matrix $\mathbf{A}$. The routine should return A
with the two added columns. It should also return the right hand side b
and least-squares solution x
.
def linear_predictor_enriched(ff_data, Pos, min_games, n_games, threshold):
...
return A, b, x
You should see that the enriched version is considerably better for running backs, with our standard inputs:
Pos = 'RB'
min_games = 5
n_games = 3
threshold = start_threshold[Pos]
A, b, x = linear_predictor(ff_2018, Pos, min_games, n_games, threshold)
b_predict = ...
print('Standard Model')
print('Fraction of false negatives: ', ...)
print('Fraction of false positives: ', ...)
print('Fraction of correct predictions:', ...)
print()
A, b, x = linear_predictor_enriched(ff_2018, Pos, min_games, n_games, threshold)
b_predict = ...
print('Enriched Model')
print('Fraction of false negatives: ', ...)
print('Fraction of false positives: ', ...)
print('Fraction of correct predictions:', ...)
print()
But you'll find it's not very effective for quarterbacks:
Pos = 'QB'
min_games = 5
n_games = 3
threshold = start_threshold[Pos]
A, b, x = linear_predictor(ff_2018, Pos, min_games, n_games, threshold)
b_predict = ...
print('Standard Model')
print('Fraction of false negatives: ', ...)
print('Fraction of false positives: ', ...)
print('Fraction of correct predictions:', ...)
print()
A, b, x = linear_predictor_enriched(ff_2018, Pos, min_games, n_games, threshold)
b_predict = ...
print('Enriched Model')
print('Fraction of false negatives: ', ...)
print('Fraction of false positives: ', ...)
print('Fraction of correct predictions:', ...)
print()
The number of false positives has shot up dramatically. Despite the (slightly) better accuracy, I would probably avoid this one.
It seems that running backs are more "matchup-dependent" than quarterbacks. That is, where they are playing and how good the other team is are bigger factors in their performance compared to quarterbacks.
Of course, you never want to conclude anything about your model based on the data you used to construct it. You should validate its accuracy on a different data set. We can do so on this years fantasy football data. We can also select the optimal hyperparameters (a fancy word for parameters) based on this validation set.
Some questions to ask as you test the model on the validation set:
I.e. the inclusion of the extra data, the minimum number of games, and the history length are the hyperparameters for this model.
ff_2019 = pd.read_csv('FF-data-2019.csv')
# position
Pos = 'TE'
# these are your hyperparameters
min_games = 6
n_games = 3
enriched = True
# build model on 2018 data and retrieve least squares solution x
if enriched:
OUT_2018 = linear_predictor_enriched(ff_2018, Pos, min_games,n_games,threshold)
x = OUT_2018[2]
else:
OUT_2018 = linear_predictor(ff_2018, Pos, min_games,n_games,threshold)
x = OUT_2018[2]
# retrieve Data matrix A and outcomes vector b using 2019 data
if enriched:
OUT_2019 = linear_predictor_enriched(ff_2019, Pos, min_games,n_games,threshold)
A,b = OUT_2019[0], OUT_2019[1]
else:
OUT_2019 = linear_predictor(ff_2019, Pos, min_games,n_games,threshold)
A,b = OUT_2019[0], OUT_2019[1]
# assess model
b_predict = ...
print('Enriched Model')
print('Fraction of false negatives: ', ...)
print('Fraction of false positives: ', ...)
print('Fraction of correct predictions:', ...)
print()