CS440/ECE448 Spring 2026¶

MP02: Naive Bayes¶

The first thing you need to do is download mp02.zip. Its content is similar to MP01's.

This file (mp02_notebook.ipynb) will walk you through the whole MP, giving you instructions and debugging tips as you go.

Table of Contents¶

  1. Reading the Data
  2. Learning a Naive Bayes Model: Maximum Likelihood
  3. Learning a Naive Bayes Model: Stop Words
  4. Learning a Naive Bayes Model: Laplace Smoothing
  5. Decisions Using a Naive Bayes Model
  6. Optimizing Hyperparameters
  7. Grade Your Homework

Reading the Data¶

The dataset in your template package consists of sets of human and LLM text generated by prompting an LLM to continue human text in the same style, across academic, news article, fiction, podcast, blog, and TV/movie contexts. It is a subset of the HAP-E dataset, which was originally introduced by this paper.

In order to help you load the data, we provide you with a utility module called reader.py. This module contains the loadDataset function, which loads text from the provided dataset given specific subsets of data to load.

In [1]:
# First, install requirements
# You can reuse your environment from MP01
!python -m pip install pandas -q
!python -m pip install fastparquet -q
!python -m pip install tqdm -q
!python -m pip install nltk -q
In [2]:
import reader, importlib
importlib.reload(reader)
help(reader.loadDataset)
Help on function loadDataset in module reader:

loadDataset(filename, subsets, stemming=False)
    Loads a file and returns a list of texts.

    Parameters:
    filename (str): the .parquet file containing the data
    subsets (list of str): List of subsets
                           (in 'acad', 'blog', 'news', 'fic', 'spok', 'tvm') to load
    stemming (bool, optional): if True, use NLTK's stemmer to remove suffixes

    Output:
    texts (list of lists): texts[m][n] is the n'th word in the m'th text in the dataset
    count (int): number of files loaded

For the initial part of this MP, we will work on distinguishing between academic text, fiction text, and news text.

In [3]:
importlib.reload(reader)

train = {
    'acad': reader.loadDataset('data/train/human.parquet', ['acad']),
    'fic': reader.loadDataset('data/train/human.parquet', ['fic']),
    'news': reader.loadDataset('data/train/human.parquet', ['news']),
}
In [4]:
for y in train.keys():
    print("There were",len(train[y]),"texts loaded for class",y)
There were 1104 texts loaded for class acad
There were 1255 texts loaded for class fic
There were 1189 texts loaded for class news
In [5]:
print("The first fiction text is:",train['fic'][0])
The first fiction text is: ['he', 'was', 'a', 'wiry', 'little', 'chap', 'with', 'bright', 'eyes', 'for', 'ever', 'on', 'the', 'twinkle', 'and', 'black', 'hair', 'pasted', 'down', 'upon', 'his', 'head', 'so', 'as', 'not', 'to', 'show', 'the', 'slightest', 'vestige', 'of', 'curl', 'while', 'the', 'sharp', 'mischievous', 'look', 'on', 'his', 'face', 'and', 'the', 'quick', 'comical', 'movements', 'of', 'his', 'body', 'suggested', 'something', 'between', 'a', 'terrier', 'and', 'a', 'monkey', 'there', 'was', 'never', 'very', 'much', 'going', 'on', 'in', 'the', 'way', 'of', 'regular', 'sports', 'or', 'pastimes', 'at', 'the', 'birches', 'the', 'smallness', 'of', 'numbers', 'made', 'it', 'difficult', 'to', 'attempt', 'proper', 'games', 'of', 'cricket', 'or', 'football', 'and', 'the', 'boys', 'were', 'forced', 'to', 'content', 'themselves', 'with', 'such', 'substitutes', 'as', 'prisoner', 's', 'base', 'cross', 'tag', 'etc', 'or', 'in', 'carrying', 'out', 'the', 'projects', 'of', 'fred', 'acton', 'who', 'was', 'constantly', 'making', 'suggestions', 'for', 'the', 'employment', 'of', 'their', 'time', 'and', 'compelling', 'everybody', 'to', 'conform', 'to', 'his', 'wishes', 'mr', 'welsby', 'had', 'been', 'a', 'widower', 'for', 'many', 'years', 'he', 'was', 'a', 'grave', 'scholarly', 'man', 'who', 'spent', 'most', 'of', 'his', 'spare', 'time', 'in', 'his', 'own', 'library', 'mr', 'blake', 'was', 'supposed', 'to', 'take', 'charge', 'out', 'of', 'school', 'hours', 'he', 'was', 'as', 'every', 'one', 'said', 'a', 'jolly', 'fellow', 'and', 'the', 'fact', 'that', 'his', 'popularity', 'extended', 'far', 'and', 'wide', 'among', 'a', 'large', 'circle', 'of', 'friends', 'and', 'acquaintances', 'caused', 'him', 'to', 'have', 'a', 'good', 'many', 'irons', 'in', 'the', 'fire', 'of', 'one', 'sort', 'and', 'another', 'during', 'their', 'hours', 'of', 'leisure', 'therefore', 'the', 'birchites', 'were', 'left', 'pretty', 'much', 'to', 'their', 'own', 'devices', 'or', 'more', 'often', 'to', 'those', 'of', 'master', 'fred', 'acton', 'who', 'liked', 'as', 'has', 'already', 'been', 'stated', 'to', 'assume', 'the', 'office', 'of', 'bellwether', 'to', 'the', 'little', 'flock', 'at', 'the', 'time', 'when', 'our', 'story', 'commences', 'the', 'ground', 'was', 'covered', 'with', 'snow', 'but', 'acton', 'was', 'equal', 'to', 'the', 'occasion', 'and', 'as', 'soon', 'as', 'dinner', 'was', 'over', 'ordered', 'all', 'hands', 'to', 'come', 'outside', 'and', 'make', 'a', 'slide', 'the', 'garden', 'was', 'on', 'a', 'steep', 'slope', 'along', 'the', 'bottom', 'of', 'which', 'ran', 'the', 'brick', 'wall', 'bounding', 'one', 'side', 'of', 'the', 'playground', 'a', 'straight', 'steep', 'path', 'lay', 'between', 'this', 'and', 'the', 'house', 'and', 'the', 'youthful', 'dux', 'with', 'his', 'usual', 'disregard', 'of', 'life', 'and', 'limb', 'insisted', 'on', 'choosing', 'this', 'as', 'the', 'scene', 'of', 'operations', 'what', 'he', 'cried', 'in', 'answer', 'to', 'a', 'feeble', 'protest', 'on', 'the', 'part', 'of', 'mugford', 'make', 'it', 'on', 'level', 'ground', 'of', 'course', 'not', 'when', 'we', 've', 'got', 'this', 'jolly', 'hill', 'to', 'go', 'down', 'not', 'if', 'i', 'know', 'it', 'we', 'll', 'open', 'the', 'door', 'at', 'the', 'bottom', 'and', 'go', 'right', 'on', 'into', 'the', 'playground', 'but', 'how', 'if', 'any', 'one', 'goes', 'a', 'bit', 'crooked', 'and', 'runs', 'up', 'against', 'the', 'bricks', 'well', 'they', 'll', 'get', 'pretty', 'well', 'smashed', 'or', 'he', 'will', 'you', 'must', 'go', 'straight', 'that', 's', 'half', 
'the', 'fun', 'of', 'the', 'thing', 'it', 'll', 'make', 'it', 'all', 'the', 'more', 'exciting', 'come', 'on', 'and', 'begin', 'to', 'tread', 'down', 'the', 'snow', 'without', 'daring', 'to', 'show', 'any', 'outward', 'signs', 'of', 'reluctance', 'but', 'with', 'feelings', 'very', 'much', 'akin', 'to', 'those', 'of', 'men', 'digging', 'their', 'own', 'graves', 'before', 'being', 'shot', 'the', 'company', 'set', 'about', 'putting', 'this', 'fearful', 'project', 'into', 'execution']

Learning a Naive Bayes Model: Maximum Likelihood¶

Naive Bayes is a classification strategy that estimates the conditional probability of each class given the evidence, and chooses the most probable class. For text classification, it is often more useful to use bigrams (pairs of consecutive words) as evidence, instead of single words. In our implementation, we will consider both individual bigram tokens and the prevalence of bigram types:

  • Bigram token: A bigram token consists of two consecutive word tokens from the text. For grading purposes, all bigrams are represented as word1*-*-*-*word2, i.e., with *-*-*-* as a separator, and not as a tuple. In the context of bigrams, the tokens are these pairs of words. To emphasize, note that our definition of a bigram is two "consecutive" word tokens. Consecutive!
  • Bigram type: Bigram types are the set of unique pairs of consecutive words that occurred in a text. The number of bigram types in the $n^{\text{th}}$ text can be found by first generating all bigram tokens from the text and then counting the unique bigrams.
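For example, here is a throwaway illustration (not part of the graded code) of how bigram tokens and types relate:

words = ['the', 'quick', 'brown', 'the', 'quick']
bigram_tokens = [w1 + '*-*-*-*' + w2 for w1, w2 in zip(words, words[1:])]
print(bigram_tokens)             # 4 bigram tokens; 'the*-*-*-*quick' occurs twice
print(len(set(bigram_tokens)))   # 3 bigram types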

A Naive Bayes model consists of two types of probability distributions:

  • The prior is the distribution over classes, $P(\text{Class})$.
  • The likelihood is the probability of a bigram token given a particular class, $P(\text{Bigram}|\text{Class})$.

The prior can be estimated from the training data. In the training data we've loaded so far, $P(\text{Class}=\text{acad})=1104/3548\approx 0.311$.
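To see where that number comes from, you can compute the empirical prior directly from the loaded training data (a quick sanity check, not part of the graded code):

total = sum(len(texts) for texts in train.values())
for y in train:
    print(y, len(train[y]) / total)   # acad: 1104/3548 ≈ 0.311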

Often, though, the testing data will have a different class distribution than the training data. If you don't know the testing priors, then it's sometimes best to just assume a uniform distribution, i.e., in this case with three classes $P(\text{Class}=\text{acad})=P(\text{Class}=\text{fic})=P(\text{Class}=\text{news})\approx 0.333$.

The likelihood is the informative part of a Naive Bayes model: it tells you which words are used more often in one class versus another.

There are many ways in which you can estimate the likelihood. The following formula is called the maximum likelihood estimate, because it maximizes the likelihood of the words in your training dataset:

$$P(\text{Bigram}=x|\text{Class}=y)=\frac{\text{\# tokens of bigram}~x~\text{in texts of class}~y}{\text{\# tokens of any bigram in texts of class}~y}$$

In this part of the MP, you will first estimate what are called frequency tables. The frequency of $x$ given $y$ is the number of times that bigram $x$ occurred in texts of class $y$. The relevant method in submitted.py is the one called create_frequency_table:

In [6]:
import submitted, importlib
importlib.reload(submitted)
help(submitted.create_frequency_table)
Help on function create_frequency_table in module submitted:

create_frequency_table(train)
    Parameters:
    train (dict of list of lists)
        - train[y][i][k] = k'th token of i'th text of class y

    Output:
    frequency (dict of Counters):
        - frequency[y][x] = number of occurrences of bigram x in texts of class y,
          where x is in the format 'word1*-*-*-*word2'

Edit create_frequency_table so that it does what its docstring says it should do.

Hint: your code will be shorter if you use the python data structure called a Counter.
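If you're stuck, here is a minimal sketch of the idea, assuming the input format described in the docstring above (an illustration, not necessarily the graded solution):

from collections import Counter

def create_frequency_table_sketch(train):
    # frequency[y][x] = count of bigram x ('word1*-*-*-*word2') in class y
    frequency = {}
    for y, texts in train.items():
        frequency[y] = Counter()
        for text in texts:
            for w1, w2 in zip(text, text[1:]):   # consecutive word pairs
                frequency[y][w1 + '*-*-*-*' + w2] += 1
    return frequency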

When your code works, you should get the following results:

In [7]:
importlib.reload(submitted)
frequency = submitted.create_frequency_table(train)

print("Frequency of 'machine learning' in academic text: ",frequency['acad']['machine*-*-*-*learning'])
print("Frequency of 'machine learning' in fiction text:  ",frequency['fic']['machine*-*-*-*learning'])
print("Frequency of 'machine learning' in news text:     ",frequency['news']['machine*-*-*-*learning'])
print("\n")

print("Frequency of 'years ago' in academic text: ",frequency['acad']['years*-*-*-*ago'])
print("Frequency of 'years ago' in fiction text:  ",frequency['fic']['years*-*-*-*ago'])
print("Frequency of 'years ago' in news text:     ",frequency['news']['years*-*-*-*ago'])
print("\n")


print("Frequency of 'of the' in academic text: ",frequency['acad']['of*-*-*-*the'])
print("Frequency of 'of the' in fiction text:  ",frequency['fic']['of*-*-*-*the'])
print("Frequency of 'of the' in news text:     ",frequency['news']['of*-*-*-*the'])
print("\n")

print("Frequency of 'to be' in academic text: ",frequency['acad']['to*-*-*-*be'])
print("Frequency of 'to be' in fiction text:  ",frequency['fic']['to*-*-*-*be'])
print("Frequency of 'of the' in news text:     ",frequency['news']['to*-*-*-*be'])
print("\n")

print("Frequency of 'and the' in academic text: ",frequency['acad']['and*-*-*-*the'])
print("Frequency of 'and the' in fiction text:  ",frequency['fic']['and*-*-*-*the'])
print("Frequency of 'of the' in news text:     ",frequency['news']['to*-*-*-*be'])
print("\n")

print("--------------------------------------\n")

print("Total # tokens in acad texts is",sum(frequency['acad'].values()))
print("Total # tokens in fic texts is",sum(frequency['fic'].values()))
print("\n")

print("Total # types in acad texts is",len(frequency['acad'].keys()))
print("Total # types in fic texts is",len(frequency['fic'].keys()))
Frequency of 'machine learning' in academic text:  29
Frequency of 'machine learning' in fiction text:   0
Frequency of 'machine learning' in news text:      0


Frequency of 'years ago' in academic text:  7
Frequency of 'years ago' in fiction text:   50
Frequency of 'years ago' in news text:      101


Frequency of 'of the' in academic text:  3944
Frequency of 'of the' in fiction text:   4522
Frequency of 'of the' in news text:      3165


Frequency of 'to be' in academic text:  714
Frequency of 'to be' in fiction text:   989
Frequency of 'to be' in news text:      955


Frequency of 'and the' in academic text:  995
Frequency of 'and the' in fiction text:   1363
Frequency of 'to be' in news text:      955


--------------------------------------

Total # tokens in acad texts is 530053
Total # tokens in fic texts is 606614


Total # types in acad texts is 262370
Total # types in fic texts is 260166

Learning a Naive Bayes Model: Stop Words¶

There are many common word pairs (bigrams) that seem unrelated to the type of a text. Due to the nature of the training data, it's possible that some bigrams, consisting entirely of common words like "is", "of", and "and", are more frequent in one part of the training data than another. This can be problematic, because a test text might then be wrongly classified just because it contains many instances of an innocuous bigram like ("and", "the").

A "stopword list" is a list of words that should not be considered when you classify a test text. In the context of bigrams, we consider a bigram as a stopword if both of its constituent words are in the stopword list. There are many candidate stopword lists available on the internet. The stopword list that we've provided for you is based on this one: https://www.ranks.nl/stopwords.

To emphasize, we consider a bigram as a stopword if "both" of its constituent words are in the stopword list. Both!

Here is our stopword list:

In [8]:
importlib.reload(submitted)
print(sorted(submitted.stopwords))
["'d", "'ll", "'m", "'re", "'s", "'t", "'ve", 'a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'cannot', 'could', 'couldn', 'did', 'didn', 'do', 'does', 'doesn', 'doing', 'don', 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', 'has', 'hasn', 'have', 'haven', 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', 'it', 'its', 'itself', 'let', 'll', 'me', 'more', 'most', 'mustn', 'my', 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'same', 'shan', 'she', 'should', 'shouldn', 'so', 'some', 'such', 'than', 'that', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was', 'wasn', 'we', 'were', 'weren', 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'with', 'won', 'would', 'wouldn', 'you', 'your', 'yours', 'yourself', 'yourselves']

To effectively avoid counting bigrams that are considered stopwords, two steps are necessary:

  1. Treat the frequency of any bigram whose words are both stopwords as zero in the training corpus.
  2. Ignore such bigrams if they occur in testing data.

In this part of the MP, you should set the frequencies of those bigram stopwords to zero. Use the del command (see Counters), so that these bigrams don't get counted among either the bigram types or the bigram tokens.
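As a rough sketch (the function and variable names here are illustrative), the deletion might look like this, assuming the frequency dict from the previous part and a stopword set like submitted.stopwords:

def remove_stopwords_sketch(frequency, stopwords):
    # Delete every bigram whose words are BOTH stopwords, so that it
    # contributes neither to the bigram types nor to the bigram tokens.
    nonstop = {y: counter.copy() for y, counter in frequency.items()}
    for y in nonstop:
        for bigram in list(nonstop[y]):   # list() lets us delete while iterating
            w1, w2 = bigram.split('*-*-*-*')
            if w1 in stopwords and w2 in stopwords:
                del nonstop[y][bigram]
    return nonstop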

In [9]:
importlib.reload(submitted)
help(submitted.remove_stopwords)
Help on function remove_stopwords in module submitted:

remove_stopwords(frequency)
    Parameters:
    frequency (dict of Counters):
        - frequency[y][x] = number of occurrences of bigram x in texts of class y,
          where x is in the format 'word1*-*-*-*word2'
    stopwords (set of str):
        - Set of stopwords to be excluded

    Output:
    nonstop (dict of Counters):
        - nonstop[y][x] = frequency of bigram x in texts of class y,
          but only if at least one of the words in x is not a stopword. x is in the format 'word1*-*-*-*word2'

In [10]:
importlib.reload(submitted)
nonstop = submitted.remove_stopwords(frequency)

print("Frequency of 'machine learning' in academic text before stopword removal: ",frequency['acad']['machine*-*-*-*learning'])
print("Frequency of 'machine learning' in academic text after stopword removal:  ",nonstop['acad']['machine*-*-*-*learning'])
print("\n")


print("Frequency of 'of the' in academic text before stopword removal: ",frequency['acad']['of*-*-*-*the'])
print("Frequency of 'of the' in academic text after stopword removal:  ",nonstop['acad']['of*-*-*-*the'])
print("\n")

print("Frequency of 'to be' in academic text before stopword removal: ",frequency['acad']['to*-*-*-*be'])
print("Frequency of 'to be' in academic text after stopword removal:  ",nonstop['acad']['to*-*-*-*be'])
print("\n")

print("Frequency of 'and the' in academic text before stopword removal: ",frequency['acad']['and*-*-*-*the'])
print("Frequency of 'and the' in academic text after stopword removal:  ",nonstop['acad']['and*-*-*-*the'])
print("\n")

print("--------------------------------------\n")

print("Total acad frequency:",sum(frequency['acad'].values()))
print("Total acad non-stopwords",sum(nonstop['acad'].values()))
print("\n")

print("Total # types in acad texts is",len(frequency['acad'].keys()))
print("Total # non-stopwords in acad is",len(nonstop['acad'].keys()))

print("Length of the stopwords set is:",len(submitted.stopwords))
Frequency of 'machine learning' in academic text before stopword removal:  29
Frequency of 'machine learning' in academic text after stopword removal:   29


Frequency of 'of the' in academic text before stopword removal:  3944
Frequency of 'of the' in academic text after stopword removal:   0


Frequency of 'to be' in academic text before stopword removal:  714
Frequency of 'to be' in academic text after stopword removal:   0


Frequency of 'and the' in academic text before stopword removal:  995
Frequency of 'and the' in academic text after stopword removal:   0


--------------------------------------

Total acad frequency: 530053
Total acad non-stopwords 481950


Total # types in acad texts is 262370
Total # non-stopwords in acad is 259798
Length of the stopwords set is: 150

Learning a Naive Bayes Model: Laplace Smoothing¶

The maximum likelihood formula assigns zero probability to some bigram types, just because they were not contained in your training data. This is a major weakness of smoothing-free Naive Bayes: a single zero-probability bigram will cause the estimated probability of an entire text to be zero!

A better formula is given by Laplace smoothing, according to which

$$P(\text{Bigram}=x|\text{Class}=y)=\frac{\left(\text{\# tokens of bigram}~x~\text{in texts of class}~y\right)+k}{\left(\text{\# tokens of any bigram in texts of class}~y\right)+k\times\left(\text{\# of bigram types}+1\right)}$$

...where $k$ is a hyperparameter that is usually chosen by trying several different values, and choosing the value that gives you the best accuracy on your development dataset.

The +1 in the denominator is used to account for bigrams that were never seen in the training dataset for class $y$. All such bigrams are mapped to the type OOV (out of vocabulary), which has the likelihood

$$P(\text{Bigram}=\text{OOV}|\text{Class}=y)=\frac{k}{\left(\text{\# tokens of any bigram in texts of class}~y\right)+k\times\left(\text{\# of bigram types}+1\right)}$$

In this part of the MP, the method you'll create in submitted.py is called laplace_smoothing.
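The formulas above translate almost directly into code. Here is a sketch, assuming the nonstop dict from the previous part (illustrative, not necessarily the graded solution):

def laplace_smoothing_sketch(nonstop, k):
    likelihood = {}
    for y, counter in nonstop.items():
        n_tokens = sum(counter.values())      # tokens of any bigram in class y
        n_types = len(counter)                # bigram types seen at least once in y
        denom = n_tokens + k * (n_types + 1)  # the +1 reserves mass for OOV
        likelihood[y] = {x: (c + k) / denom for x, c in counter.items()}
        likelihood[y]['OOV'] = k / denom
    return likelihood

You can check that each likelihood[y] sums to 1: the seen bigrams contribute (n_tokens + k*n_types)/denom, and OOV contributes k/denom, which together equal 1.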

In [11]:
importlib.reload(submitted)
help(submitted.laplace_smoothing)
Help on function laplace_smoothing in module submitted:

laplace_smoothing(nonstop, smoothness)
    Parameters:
    nonstop (dict of Counters)
        - nonstop[y][x] = frequency of bigram x in y, where x is in the format 'word1*-*-*-*word2'
          and neither word1 nor word2 is a stopword
    smoothness (float)
        - smoothness = Laplace smoothing hyperparameter

    Output:
    likelihood (dict of dicts)
        - likelihood[y][x] = Laplace-smoothed likelihood of bigram x given y,
          where x is in the format 'word1*-*-*-*word2'
        - likelihood[y]['OOV'] = likelihood of an out-of-vocabulary bigram given y


    Important:
    Be careful that your vocabulary only counts bigrams that occurred at least once
    in the training data for class y.

In [12]:
importlib.reload(submitted)
likelihood = submitted.laplace_smoothing(nonstop, 0.001)

print("Likelihood of 'years ago' in academic text: ",likelihood['acad']['years*-*-*-*ago'])
print("Likelihood of 'years ago' in fiction text:  ",likelihood['fic']['years*-*-*-*ago'])
print("Likelihood of 'years ago' in news text:     ",likelihood['news']['years*-*-*-*ago'])
print("\n")

print("Likelihood of OOV bigram type in academic text: ",likelihood['acad']['OOV'])
print("Likelihood of OOV bigram type in fiction text:  ",likelihood['fic']['OOV'])
print("Likelihood of OOV bigram type in news text:     ",likelihood['news']['OOV'])
print("\n")

print("(should be approx. 1): Likelihood['acad'] sums to",sum(likelihood['acad'].values()))
print("(should be approx. 1): Likelihood['fic']  sums to",sum(likelihood['fic'].values()))
print("(should be approx. 1): Likelihood['news'] sums to",sum(likelihood['news'].values()))
Likelihood of 'years ago' in academic text:  1.451857679897542e-05
Likelihood of 'years ago' in fiction text:   0.0001045642072571625
Likelihood of 'years ago' in news text:      0.0002047836458885437


Likelihood of OOV bigram type in academic text:  2.073786144690104e-09
Likelihood of OOV bigram type in fiction text:   2.0912423202968444e-09
Likelihood of OOV bigram type in news text:      2.0275407757204748e-09


(should be approx. 1): Likelihood['acad'] sums to 0.9999999999999999
(should be approx. 1): Likelihood['fic']  sums to 0.9999999999999999
(should be approx. 1): Likelihood['news'] sums to 0.9999999999999999

Decisions Using a Naive Bayes Model¶

Suppose you are given a text, which is just a list of word tokens, $x=[x_1,\ldots,x_n]$. You want to decide whether this text is academic, fiction, or news. According to decision theory, the probability of error is minimized by the following rule:

$$\text{Estimated Class}=\begin{cases} \text{acad} & \text{if}~P(\text{Class}=\text{acad}|\text{Text}=x) > P(\text{Class}=\text{fic}|\text{Text}=x)~\text{and}~P(\text{Class}=\text{acad}|\text{Text}=x) > P(\text{Class}=\text{news}|\text{Text}=x)\\ \text{fic} & \text{if}~P(\text{Class}=\text{fic}|\text{Text}=x) > P(\text{Class}=\text{acad}|\text{Text}=x)~\text{and}~P(\text{Class}=\text{fic}|\text{Text}=x) > P(\text{Class}=\text{news}|\text{Text}=x)\\ \text{news} & \text{if}~P(\text{Class}=\text{news}|\text{Text}=x) > P(\text{Class}=\text{acad}|\text{Text}=x)~\text{and}~P(\text{Class}=\text{news}|\text{Text}=x) > P(\text{Class}=\text{fic}|\text{Text}=x)\\ \text{undecided} & \text{if}~P(\text{Class}=\text{acad}|\text{Text}=x) = P(\text{Class}=\text{fic}|\text{Text}=x)~\text{and}~P(\text{Class}=\text{fic}|\text{Text}=x) > P(\text{Class}=\text{news}|\text{Text}=x)\\ \text{undecided} & \text{if}~P(\text{Class}=\text{fic}|\text{Text}=x) = P(\text{Class}=\text{news}|\text{Text}=x)~\text{and}~P(\text{Class}=\text{news}|\text{Text}=x) > P(\text{Class}=\text{acad}|\text{Text}=x) \end{cases}$$

Note that the latter two cases only narrow the answer down to two equally probable classes, but for this MP we just treat such ties as undecided.

For a bigram model, the text $x$ is instead treated as a sequence of bigram tokens $[b_1, b_2, \ldots, b_{n-1}]$, where $b_i$ is the bigram formed by the $i$'th and $(i+1)$'th words. Note that each $b_i$ here is not a single word, but a bigram! The posterior probabilities $P(\text{Class}|\text{Text})$ can be estimated using the Naive Bayes model:

$$P(\text{Class}=y|\text{Text}=x)=\frac{P(\text{Class}=y)}{P(\text{Text}=x)}\prod_{\substack{1\le i\le n-1\\ b_i\not\in\text{stopword bigrams}}}P(\text{Bigram}=b_i|\text{Class}=y)$$

Implementation Details¶

Notice some details:

  1. The term $P(\text{Text}=x)$ doesn't depend on $y$. If you're trying to figure out which is bigger, $P(\text{acad}|x)$ or $P(\text{fic}|x)$, then you don't need to calculate it.
  2. Multiplying together many probabilities will result in a number that your computer might round down to 0. In order to prevent that, take the logarithm of both sides of the equation above.
  3. If $x_i$ is a stopword bigram, don't calculate its likelihood. If it isn't a stopword bigram, but it doesn't have an entry in likelihood[y], then you should use likelihood[y]["OOV"].
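Putting these details together, a log-space sketch of the decision might look like the following (names are illustrative; it assumes the likelihood dict and a stopword set like submitted.stopwords):

import math

def naive_bayes_sketch(texts, likelihood, prior, classes, stopwords):
    hypotheses = []
    for text in texts:
        # scores[j] = log P(y_j) + sum of log P(bigram|y_j) over non-stopword bigrams
        scores = []
        for y, p in zip(classes, prior):
            logpost = math.log(p)
            for w1, w2 in zip(text, text[1:]):
                if w1 in stopwords and w2 in stopwords:
                    continue   # stopword bigram: skip it entirely
                bigram = w1 + '*-*-*-*' + w2
                logpost += math.log(likelihood[y].get(bigram, likelihood[y]['OOV']))
            scores.append(logpost)
        best = max(scores)
        if scores.count(best) > 1:
            hypotheses.append('undecided')   # tie for the maximum
        else:
            hypotheses.append(classes[scores.index(best)])
    return hypotheses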

Implementation¶

For this part of the MP, finish the method called submitted.naive_bayes:

In [13]:
importlib.reload(submitted)
help(submitted.naive_bayes)
Help on function naive_bayes in module submitted:

naive_bayes(texts, likelihood, prior, classes)
    Parameters:
    texts (list of lists) -
        - texts[i][k] = k'th token of i'th text
    likelihood (dict of dicts)
        - likelihood[y][x] = Laplace-smoothed likelihood of bigram x given y,
          where x is in the format 'word1*-*-*-*word2'
    prior (list of floats)
        - The prior probability of each class, in the same order as classes
    classes (list of strings)
        - the classes represented in the likelihood dict

    Output:
    hypotheses (list)
        - hypotheses[i] = class label for the i'th text

Let's load the validation set, then try classifying it with, say, a uniform prior:

In [14]:
importlib.reload(reader)
validation = {
    'acad': reader.loadDataset('data/dev/human.parquet', ['acad']),
    'fic': reader.loadDataset('data/dev/human.parquet', ['fic']),
    'news': reader.loadDataset('data/dev/human.parquet', ['news']),
}
labels, texts = reader.val_set_helper(validation)
In [15]:
importlib.reload(submitted)
hypotheses = submitted.naive_bayes(texts, likelihood, [1./3, 1./3, 1./3], ['acad', 'fic', 'news'])

for y in ['acad','fic','news','undecided']:
    print("There are",hypotheses.count(y),'examples that were labeled with class',y)
There are 123 examples that were labeled with class acad
There are 141 examples that were labeled with class fic
There are 132 examples that were labeled with class news
There are 0 examples that were labeled with class undecided
In [16]:
print(len(hypotheses)) 
print(len(labels))
396
396
In [17]:
print("The accuracy of the classifier on the validation set is:")

count_correct = 0
for (y,yhat) in zip(labels, hypotheses):
    if y==yhat:
        count_correct += 1
        
print(count_correct / len(labels))
The accuracy of the classifier on the validation set is:
0.9974747474747475

Optimizing Hyperparameters¶

Your classifier should be working extremely well so far! Let's now see whether we can use it to determine if a text was generated by an LLM:

In [18]:
importlib.reload(reader)
train = {
    'human': reader.loadDataset('data/train/human.parquet', ['acad', 'blog', 'news', 'fic', 'spok', 'tvm']),
    'llm': reader.loadDataset('data/train/llm.parquet', ['acad', 'blog', 'news', 'fic', 'spok', 'tvm']),
}
validation = {
    'human': reader.loadDataset('data/dev/human.parquet', ['acad', 'blog', 'news', 'fic', 'spok', 'tvm']),
    'llm': reader.loadDataset('data/dev/llm.parquet', ['acad', 'blog', 'news', 'fic', 'spok', 'tvm']),
}
labels, texts = reader.val_set_helper(validation)
print ("Loading complete")
Loading complete
In [19]:
print("Making frequency table")
frequency = submitted.create_frequency_table(train)
print("Removing stopwords")
nonstop = submitted.remove_stopwords(frequency)
print("Calculating likelihood")
likelihood = submitted.laplace_smoothing(nonstop, 0.001)
print("Generating hypotheses")
hypotheses = submitted.naive_bayes(texts, likelihood, [1./2, 1./2], ['human', 'llm'])

for y in ['human','llm','undecided']:
    print("There are",hypotheses.count(y),'examples that were labeled with class',y)
Making frequency table
Removing stopwords
Calculating likelihood
Generating hypotheses
There are 1497 examples that were labeled with class human
There are 167 examples that were labeled with class llm
There are 0 examples that were labeled with class undecided
In [20]:
print("The accuracy of the classifier on the validation set is:")

count_correct = 0
for (y,yhat) in zip(labels, hypotheses):
    if y==yhat:
        count_correct += 1
        
print(count_correct / len(labels))
The accuracy of the classifier on the validation set is:
0.5751201923076923

This is much worse than before. Let's see if we can improve it! The performance of the model is heavily influenced by two parameters that can't be measured from the training data:

  1. The prior, $P(\text{Class}=\text{human})$. The training and testing data might have different priors, so estimating this from the training data is suboptimal.
  2. The Laplace smoothing parameter, $k$.

Since these two parameters can't be (correctly) estimated from the training data, they are called hyperparameters. Hyperparameters are usually determined based on your knowledge about the problem, or by running many experiments to see which values give the best result on the development set.

The function you'll write in this part of the MP is called optimize_hyperparameters.

In [21]:
importlib.reload(submitted)
help(submitted.optimize_hyperparameters)
Help on function optimize_hyperparameters in module submitted:

optimize_hyperparameters(texts, labels, classes, nonstop, priors, smoothnesses)
    Parameters:
    texts (list of lists) - dev set texts
        - texts[i][k] = k'th token of i'th text
    labels (list) - dev set labels
        - labels[i] = class label of i'th text
    classes (list of strings)
        - the classes represented in the likelihood dict
    nonstop (dict of Counters)
        - nonstop[y][x] = frequency of bigram x in class y, x not a stopword bigram
    priors (list of lists)
        - a list of different possible sets of values of the prior
    smoothnesses (list)
        - a list of different possible values of the smoothness

    Output:
    accuracies (numpy array, shape = len(priors) x len(smoothnesses))
        - accuracies[m,n] = dev set accuracy achieved using the
          m'th candidate prior and the n'th candidate smoothness
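
One plausible structure is a simple grid search that reuses the functions you've already written (a sketch; the nested loop is the point, not the exact names):

import numpy as np

def optimize_hyperparameters_sketch(texts, labels, classes, nonstop, priors, smoothnesses):
    # Reuses laplace_smoothing and naive_bayes from earlier parts of this MP.
    accuracies = np.zeros((len(priors), len(smoothnesses)))
    for m, prior in enumerate(priors):
        for n, k in enumerate(smoothnesses):
            likelihood = laplace_smoothing(nonstop, k)
            hypotheses = naive_bayes(texts, likelihood, list(prior), classes)
            correct = sum(y == yhat for y, yhat in zip(labels, hypotheses))
            accuracies[m, n] = correct / len(labels)
    return accuracies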

Let's use this function to test some different candidate values for the prior and the smoothness. The values we test are a little arbitrary, but let's try the following:

In [22]:
importlib.reload(submitted)
import numpy as np

priors = [(0.5,0.5), (0.65,0.35), (0.75,0.25)]
smoothnesses = [0.0001,0.001,0.01,0.1]
accuracies = submitted.optimize_hyperparameters(texts,labels,['human', 'llm'],nonstop,priors,smoothnesses)

(m,n) = np.unravel_index(np.argmax(accuracies), accuracies.shape)
print("The best accuracy achieved was",accuracies[m,n])
print("It was achieved for a prior of", priors[m], "and a smoothness of", smoothnesses[n])
The best accuracy achieved was 0.6814903846153846
It was achieved for a prior of (0.75, 0.25) and a smoothness of 0.01

Grade Your Homework¶

If you've reached this point, and all of the above sections work, then you're ready to try grading your homework! Before you submit it to Gradescope, try grading it on your own machine. This will run some visible test cases (which you can read in tests/test_visible.py), and compare the results to the solutions (which you can read in solution.json).

The exclamation point (!) tells the notebook to run what follows as a shell command. Obviously you don't need to run the code this way -- this usage is here just to remind you that you can also, if you wish, run this command in a terminal window.

In [23]:
!python grade.py
.....
----------------------------------------------------------------------
Ran 5 tests in 40.366s

OK

If you got any 'E' marks, it means that your code generated some runtime errors, and you need to debug those.

If you got any 'F' marks, it means that your code ran without errors, but that it generated results that are different from the solutions in solution.json. Try debugging those differences.

If neither of those things happened, and your result was a series of dots, then your code works perfectly.

If you're not sure, you can try running grade.py with the -j option. This will produce a JSON results file, in which the best score you can get is 50.

Now you should try uploading submitted.py to Gradescope.

Gradescope will run the same visible tests that you just ran on your own machine, plus some additional hidden tests. It's possible that your code passes all the visible tests, but fails the hidden tests. If that happens, then it probably means that you hard-coded a number into your function definition, instead of using the input parameter that you were supposed to use. Debug by running your function with a variety of different input parameters, and see if you can get it to respond correctly in all cases.

Once your code works perfectly on Gradescope, with no errors, then you are done with the MP. Congratulations!