The first thing you need to do is download this file: mp01.zip. It contains the following:

submitted.py: Your homework. Edit it, and then submit it to Gradescope.

mp01_notebook.ipynb: A Jupyter notebook to help you debug. You can completely ignore it if you want, although you might find that it gives you useful instructions.

grade.py: Once your homework seems to be working, you can test it by typing python grade.py, which will run the tests in tests/test_visible.py.

tests/test_visible.py: This file contains about half of the unit tests that Gradescope will run in order to grade your homework. If you can get a perfect score on these tests, then you should also get a perfect score on the additional hidden tests that Gradescope uses.

solution.json: This file contains the solutions for the visible test cases, in JSON format. If the instructions are confusing you, look at this file to see if it can help to clear up your confusion.

data: This directory contains the data.

reader.py: An auxiliary program that you can use to read the data.

requirements.txt: This lists the python packages you need to have installed in order to run grade.py. You can install all of them by typing pip install -r requirements.txt or pip3 install -r requirements.txt.

This file (mp01_notebook.ipynb) will walk you through the whole MP, giving you instructions and debugging tips as you go.
There are two types of data: visible data (provided to you), and hidden data (available only to the autograder on Gradescope). If you get your code working for the visible data, it should also work for the hidden data.
The visible dataset consists of 500 emails, a subset of the Enron-Spam dataset provided by Ion Androutsopoulos. MP02 will use a larger portion of the same dataset.
In order to help you load the data, we provide you with a utility module called reader.py. Since its functions are documented by docstrings, you can find information about each one by using help:
import reader
help(reader)
Help on module reader:

NAME
    reader - This file is responsible for providing functions for reading the files

FUNCTIONS
    loadDir(dirname, stemming, lower_case, use_tqdm=True)
        Loads the files in the folder and returns a list of lists of words from the text in each file.

        Parameters:
        name (str): the directory containing the data
        stemming (bool): if True, use NLTK's stemmer to remove suffixes
        lower_case (bool): if True, convert letters to lowercase
        use_tqdm (bool, default:True): if True, use tqdm to show status bar

        Output:
        texts (list of lists): texts[m][n] is the n'th word in the m'th email
        count (int): number of files loaded

    loadFile(filename, stemming, lower_case)
        Load a file, and returns a list of words.

        Parameters:
        filename (str): the directory containing the data
        stemming (bool): if True, use NLTK's stemmer to remove suffixes
        lower_case (bool): if True, convert letters to lowercase

        Output:
        x (list): x[n] is the n'th word in the file

DATA
    bad_words = {'aed', 'eed', 'oed'}
    porter_stemmer = <PorterStemmer>
    tokenizer = RegexpTokenizer(pattern='\\w+', gaps=False, disc...ty=True...

FILE
    /Users/jhasegaw/Dropbox/mark/teaching/ece448/ece448labs/spring23/mp01/src/reader.py
Well, that's pretty straightforward. Let's use it to load the data directory.
import importlib
importlib.reload(reader)
texts, count = reader.loadDir('data',False,False)
100%|██████████████████████████████████████████████████| 500/500 [00:00<00:00, 6554.26it/s]
print("There were",count,"files loaded")
There were 500 files loaded
print("The first file contained the following words:",texts[0])
The first file contained the following words: ['Subject', 'done', 'new', 'sitara', 'desk', 'request', 'ref', 'cc', '20000813', 'carey', 'per', 'scott', 's', 'request', 'below', 'the', 'following', 'business', 'unit', 'aka', 'desk', 'id', 'portfolio', 'was', 'added', 'to', 'global', 'production', 'and', 'unify', 'development', 'test', 'production', 'and', 'stage', 'please', 'copy', 'to', 'the', 'other', 'global', 'environments', 'thanks', 'dick', 'x', '3', '1489', 'updated', 'in', 'global', 'production', 'environment', 'gcc', 'code', 'desc', 'p', 'ent', 'subenti', 'data', '_', 'cd', 'ap', 'data', '_', 'desc', 'code', '_', 'id', 'a', 'sit', 'deskid', 'imcl', 'a', 'ena', 'im', 'cleburne', '9273', 'from', 'scott', 'mills', '08', '30', '2000', '08', '27', 'am', 'to', 'samuel', 'schott', 'hou', 'ect', 'ect', 'richard', 'elwood', 'hou', 'ect', 'ect', 'debbie', 'r', 'brackett', 'hou', 'ect', 'ect', 'judy', 'rose', 'hou', 'ect', 'ect', 'vanessa', 'schulte', 'corp', 'enron', 'enron', 'david', 'baumbach', 'hou', 'ect', 'ect', 'daren', 'j', 'farmer', 'hou', 'ect', 'ect', 'dave', 'nommensen', 'hou', 'ect', 'ect', 'donna', 'greif', 'hou', 'ect', 'ect', 'shawna', 'johnson', 'corp', 'enron', 'enron', 'russ', 'severson', 'hou', 'ect', 'ect', 'cc', 'subject', 'new', 'sitara', 'desk', 'request', 'this', 'needs', 'to', 'be', 'available', 'in', 'production', 'by', 'early', 'afternoon', 'sorry', 'for', 'the', 'short', 'notice', 'srm', 'x', '33548']
In this week's MP, we will work with the following two random variables:
... where you can specify word1 and word2 as parameters of the function. In this section, we will compute the joint, conditional, and marginal distributions of $X_1$ and $X_2$. These will be estimated, from the available data, using the following formulas, where $N(X_1=x_1,X_2=x_2)$ is the number of texts in the dataset that contain $x_1$ instances of word1, and $x_2$ instances of word2:
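The formulas themselves are not reproduced in this text. For reference, the standard counting (maximum-likelihood) estimates that this kind of assignment typically uses, stated here as a plausible reconstruction rather than a quote of the original, are:

$$P(X_1=x_1,X_2=x_2)=\frac{N(X_1=x_1,X_2=x_2)}{\sum_{x_1'}\sum_{x_2'}N(X_1=x_1',X_2=x_2')}$$

$$P(X_1=x_1)=\sum_{x_2}P(X_1=x_1,X_2=x_2)$$

$$P(X_2=x_2|X_1=x_1)=\frac{P(X_1=x_1,X_2=x_2)}{P(X_1=x_1)}$$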
At this point, we'll load the file submitted.py.
The file submitted.py is the only part of your work that the autograder will see. The only purpose of this notebook is to help you debug submitted.py. Once you have revised submitted.py enough to make this notebook work, you should go to the command line and type python grade.py. Once that command returns without errors, you can go ahead and submit your file submitted.py to the autograder. You can submit to the autograder as often as you want, but it will save you trouble if you debug as much as you can on your local machine before you submit to the autograder.
We will use importlib in order to reload your submitted.py over and over again. That way, every time you make a modification in submitted.py, you can just re-run the corresponding block of this notebook, and it will reload submitted.py with your modified code.
Since the file is called submitted.py, python considers it to contain a module called submitted. As shown, you can read the module's docstring by printing submitted.__doc__. You can also type help(submitted) to get a lot of information about the module, including its docstring, a list of all the functions it defines, and all of their docstrings. For more about docstrings, see, for example, https://www.python.org/dev/peps/pep-0257/.
import submitted
import importlib
importlib.reload(submitted)
print(submitted.__doc__)
This is the module you'll submit to the autograder. There are several function definitions, here, that raise RuntimeErrors. You should replace each "raise RuntimeError" line with a line that performs the function specified in the function's docstring.
Now it's time for you to open submitted.py, and start editing it. You can open it in another Jupyter window by choosing "Open from Path" from the "File" menu, and then typing submitted.py. Alternatively, you can use any text editor.
Once you have it open, try editing the function joint_distribution_of_word_counts so that its functionality matches its docstring. Here is what its docstring says:
help(submitted.joint_distribution_of_word_counts)
Help on function joint_distribution_of_word_counts in module submitted:

joint_distribution_of_word_counts(texts, word0, word1)
    Parameters:
    texts (list of lists) - a list of texts; each text is a list of words
    word0 (str) - the first word to count
    word1 (str) - the second word to count

    Output:
    Pjoint (numpy array) - Pjoint[m,n] = P(X0=m,X1=n), where
      X0 is the number of times that word0 occurs in a given text,
      X1 is the number of times that word1 occurs in the same text.
Edit joint_distribution_of_word_counts
so that it does the task specified in its docstring. When you get the code working, you can count the number of times that the words "Mr." and "company" co-occur. It turns out that 96.4% of all texts contain neither word. 2.4% of texts contain the word "company" just once, 0.2% contain it twice, 0.2% contain it four times. 0.6% contain the word "Mr." just once, 0.2% contain it four times. There are no files in the whole database that contain both words together!
importlib.reload(submitted)
Pjoint = submitted.joint_distribution_of_word_counts(texts, 'mr', 'company')
print(Pjoint)
[[0.964 0.024 0.002 0.    0.002]
 [0.006 0.    0.    0.    0.   ]
 [0.    0.    0.    0.    0.   ]
 [0.    0.    0.    0.    0.   ]
 [0.002 0.    0.    0.    0.   ]]
Now, edit the functions marginal_distribution_of_word_counts and conditional_distribution_of_word_counts. The results you should get are shown below, and are also available to you in the file solution.json.
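A sketch of these two functions, under the assumption (consistent with the calls below) that the marginal index selects which variable to keep and that conditioning divides each row of the joint table by the corresponding marginal:

```python
import numpy as np

def marginal_distribution_of_word_counts(Pjoint, index):
    """Sketch: P(X_index) obtained by summing the joint over the other axis."""
    # For a 2-D joint table, summing over axis 1 keeps X0, axis 0 keeps X1.
    return Pjoint.sum(axis=1 - index)

def conditional_distribution_of_word_counts(Pjoint, Pmarginal):
    """Sketch: Pcond[m,n] = P(X1=n | X0=m) = Pjoint[m,n] / Pmarginal[m]."""
    # Rows whose marginal is zero become nan (0/0), as in the notebook output.
    with np.errstate(divide='ignore', invalid='ignore'):
        return Pjoint / Pmarginal[:, np.newaxis]
```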
importlib.reload(submitted)
P0 = submitted.marginal_distribution_of_word_counts(Pjoint, 0)
print(P0)
[0.992 0.006 0. 0. 0.002]
importlib.reload(submitted)
P1 = submitted.marginal_distribution_of_word_counts(Pjoint, 1)
print(P1)
[0.972 0.024 0.002 0. 0.002]
import numpy as np
importlib.reload(submitted)
Pcond = submitted.conditional_distribution_of_word_counts(Pjoint, P0)
print("Conditional distribution table:")
print(Pcond)
print("\nSums of the rows:")
print(np.sum(Pcond, axis=1))
Conditional distribution table:
[[0.97177419 0.02419355 0.00201613 0.         0.00201613]
 [1.         0.         0.         0.         0.        ]
 [       nan        nan        nan        nan        nan]
 [       nan        nan        nan        nan        nan]
 [1.         0.         0.         0.         0.        ]]

Sums of the rows:
[ 1.  1. nan nan  1.]
In order to study mean, variance, and covariance, let's first find the joint distribution of a pair of words that occur more frequently. How about "a" and "the"? Amazingly, as the following code shows, there is a small nonzero probability that "a" occurs 19 times and "the" occurs 58 times in the same text!
importlib.reload(submitted)
Pathe = submitted.joint_distribution_of_word_counts(texts, 'a', 'the')
print("Here is the joint distribution:")
print(Pathe)
print("\n It has size", Pathe.shape)
Here is the joint distribution:
[[0.248 0.078 0.056 ... 0.    0.    0.   ]
 [0.036 0.028 0.026 ... 0.    0.    0.   ]
 [0.006 0.006 0.014 ... 0.    0.    0.   ]
 ...
 [0.    0.    0.    ... 0.    0.    0.   ]
 [0.    0.    0.    ... 0.    0.    0.   ]
 [0.    0.    0.    ... 0.    0.    0.002]]

 It has size (20, 59)
importlib.reload(submitted)
Pthe = submitted.marginal_distribution_of_word_counts(Pathe, 1)
print("Counts of the word /the/ have the following distribution:")
print(Pthe)
Counts of the word /the/ have the following distribution: [0.296 0.122 0.106 0.09 0.076 0.056 0.026 0.04 0.032 0.026 0.016 0.01 0.014 0.008 0.014 0.006 0.008 0.004 0.008 0.002 0.004 0.002 0. 0.002 0. 0.008 0.01 0.002 0. 0.006 0. 0. 0. 0. 0. 0.004 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.002]
Now let's calculate mean, variance, and covariance. First, look at their docstrings:
importlib.reload(submitted)
help(submitted.mean_from_distribution)
Help on function mean_from_distribution in module submitted:

mean_from_distribution(P)
    Parameters:
    P (numpy array) - P[n] = P(X=n)

    Outputs:
    mu (float) - the mean of X
importlib.reload(submitted)
help(submitted.variance_from_distribution)
Help on function variance_from_distribution in module submitted:

variance_from_distribution(P)
    Parameters:
    P (numpy array) - P[n] = P(X=n)

    Outputs:
    var (float) - the variance of X
importlib.reload(submitted)
help(submitted.covariance_from_distribution)
Help on function covariance_from_distribution in module submitted:

covariance_from_distribution(P)
    Parameters:
    P (numpy array) - P[m,n] = P(X0=m,X1=n)

    Outputs:
    covar (float) - the covariance of X0 and X1
Now that you understand them, try editing submitted.py so that these functions perform the specified tasks. You should get the following results (which are also provided to you in the file solution.json):
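One way these three functions can be written, shown as a sketch using the standard identities $E[X]=\sum_n nP(X{=}n)$, $\mathrm{Var}[X]=E[X^2]-E[X]^2$, and $\mathrm{Cov}[X_0,X_1]=E[X_0X_1]-E[X_0]E[X_1]$:

```python
import numpy as np

def mean_from_distribution(P):
    # E[X] = sum_n n * P(X=n)
    return float(np.sum(np.arange(len(P)) * P))

def variance_from_distribution(P):
    n = np.arange(len(P))
    mu = np.sum(n * P)
    # Var[X] = E[X^2] - (E[X])^2
    return float(np.sum(n**2 * P) - mu**2)

def covariance_from_distribution(P):
    # Index grids for the two variables, shaped for broadcasting.
    m = np.arange(P.shape[0])[:, np.newaxis]
    n = np.arange(P.shape[1])[np.newaxis, :]
    mu0 = np.sum(m * P)
    mu1 = np.sum(n * P)
    # Cov[X0,X1] = E[X0*X1] - E[X0]*E[X1]
    return float(np.sum(m * n * P) - mu0 * mu1)
```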
importlib.reload(submitted)
mu_the = submitted.mean_from_distribution(Pthe)
print(mu_the)
4.432
importlib.reload(submitted)
var_the = submitted.variance_from_distribution(Pthe)
print(var_the)
41.601376
importlib.reload(submitted)
covar_a_the = submitted.covariance_from_distribution(Pathe)
print(covar_a_the)
9.244752
Now, let's calculate the expected value of an arbitrary function of a random variable. If $f(x_0,x_1)$ is some real-valued function of variables $x_0$ and $x_1$, then its expected value is:
$$E\left[f(X_0,X_1)\right]=\sum_{x_0,x_1} f(x_0,x_1) P(X_0=x_0,X_1=x_1)$$

Let's read the docstring:
importlib.reload(submitted)
help(submitted.expectation_of_a_function)
Help on function expectation_of_a_function in module submitted:

expectation_of_a_function(P, f)
    Parameters:
    P (numpy array) - joint distribution, P[m,n] = P(X0=m,X1=n)
    f (function) - f should be a function that takes two real-valued inputs, x0 and x1.
      The output, z=f(x0,x1), must be a real number for all values of (x0,x1)
      such that P(X0=x0,X1=x1) is nonzero.

    Output:
    expected (float) - the expected value, E[f(X0,X1)]
The function needs to produce real-valued outputs for all allowable (x0,x1) pairs, but otherwise, it can be as weird as we like. For example, let's define it as follows:
import numpy as np
def f(x0,x1):
    return np.log(x0+1) + np.log(x1+1)
print("f(0,0) is",f(0,0))
print("f(0,15) is",f(0,15))
print("f(1,1) is",f(1,1))
print("f(19,58) is",f(19,58))
f(0,0) is 0.0
f(0,15) is 2.772588722239781
f(1,1) is 1.3862943611198906
f(19,58) is 7.073269717459711
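Consistent with the docstring above, a sketch of expectation_of_a_function simply weights f by the joint probability at every cell, skipping zero-probability cells so that f is only ever evaluated where it is required to be real-valued:

```python
import numpy as np

def expectation_of_a_function(P, f):
    """Sketch: E[f(X0,X1)] = sum over all (x0,x1) of f(x0,x1)*P(X0=x0,X1=x1)."""
    expected = 0.0
    for x0 in range(P.shape[0]):
        for x1 in range(P.shape[1]):
            # Only evaluate f where the joint probability is nonzero.
            if P[x0, x1] > 0:
                expected += f(x0, x1) * P[x0, x1]
    return expected
```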
importlib.reload(submitted)
expected = submitted.expectation_of_a_function(Pathe, f)
print(expected)
1.7722821489053828
If you've reached this point, and all of the above sections work, then you're ready to try grading your homework! Before you submit it to Gradescope, try grading it on your own machine. This will run some visible test cases (which you can read in tests/test_visible.py), and compare the results to the solutions (which you can read in solution.json).
The exclamation point (!) tells Jupyter to run the rest of the line as a shell command. Obviously you don't need to run the code this way -- this usage is here just to remind you that you can also, if you wish, run this command in a terminal window.
!python grade.py
/Users/jhasegaw/Dropbox/mark/teaching/ece448/ece448labs/spring23/mp01/src/submitted.py:67: RuntimeWarning: invalid value encountered in true_divide
  Pcond[m,:] = Pjoint[m,:] / Pmarginal[m]
......
----------------------------------------------------------------------
Ran 6 tests in 0.076s

OK
If you got any 'E' marks, it means that your code generated some runtime errors, and you need to debug those.
If you got any 'F' marks, it means that your code ran without errors, but that it generated results that are different from the solutions in solution.json. Try debugging those differences.
If neither of those things happened, and your result was a series of dots, then your code works perfectly.
If you're not sure, you can try running grade.py with the -j option. This will produce a JSON results file, in which the best score you can get is 50.
!python grade.py -j
{
  "tests": [
    {
      "name": "test_cond (test_visible.TestStep)",
      "score": 8,
      "max_score": 8,
      "output": "\n/Users/jhasegaw/Dropbox/mark/teaching/ece448/ece448labs/spring23/mp01/src/submitted.py:67: RuntimeWarning: invalid value encountered in true_divide\n  Pcond[m,:] = Pjoint[m,:] / Pmarginal[m]\n"
    },
    { "name": "test_covariance (test_visible.TestStep)", "score": 8, "max_score": 8 },
    { "name": "test_expected (test_visible.TestStep)", "score": 8, "max_score": 8 },
    { "name": "test_joint (test_visible.TestStep)", "score": 9, "max_score": 9 },
    { "name": "test_marginal (test_visible.TestStep)", "score": 9, "max_score": 9 },
    { "name": "test_mean (test_visible.TestStep)", "score": 8, "max_score": 8 }
  ],
  "leaderboard": [],
  "visibility": "visible",
  "execution_time": "0.06",
  "score": 50
}
Now you should try uploading submitted.py to Gradescope.
Gradescope will run the same visible tests that you just ran on your own machine, plus some additional hidden tests. It's possible that your code passes all the visible tests, but fails the hidden tests. If that happens, then it probably means that you hard-coded a number into your function definition, instead of using the input parameter that you were supposed to use. Debug by running your function with a variety of different input parameters, and see if you can get it to respond correctly in all cases.
Once your code works perfectly on Gradescope, with no errors, then you are done with the MP. Congratulations!