CS 440/ECE 448
Fall 2023
Margaret Fleck
Quiz 1 skills list
The quiz will cover material on Probability, Naive Bayes, and related topics such as testing classifiers
and cleaning text data.
Historical and other trivia
We've seen a lot of trivia, most of it not worth memorizing. The following
items are the exceptions. Be able to explain (very briefly) what they are and (approximately)
what time period they come from.
- McCulloch and Pitts
- Fred Jelinek
- Pantel and Lin (SpamCop)
- Boulis and Ostendorf
- The Plato System
Probability
- Random variables, axioms of probability
- Joint, marginal, conditional probability
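The relationships among these three quantities can be sketched with a tiny example. The two-variable table below is invented purely for illustration; the numbers just need to sum to 1:

```python
# Toy joint distribution P(Weather, Traffic) over two binary variables.
# All probabilities here are made up for illustration.
joint = {
    ("rain", "jam"): 0.20,
    ("rain", "clear"): 0.10,
    ("sun", "jam"): 0.15,
    ("sun", "clear"): 0.55,
}

# Marginal: P(Weather = rain) sums the joint over all values of Traffic.
p_rain = sum(p for (w, t), p in joint.items() if w == "rain")

# Conditional: P(Traffic = jam | Weather = rain) = joint / marginal.
p_jam_given_rain = joint[("rain", "jam")] / p_rain

print(round(p_rain, 2))            # 0.3
print(round(p_jam_given_rain, 2))  # 0.67
```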
Modelling text data
- Word types vs. word tokens
- The Bag of Words model
- Bigrams, ngrams
- Data cleaning:
- tokenization
- stemming (including Julie Lovins and Martin Porter)
- making units of useful size: dividing words or grouping characters
- Special types of words and how we might handle them
- stop words
- rare words
- hapax legomena
- filler
- backchannel
- function vs. content
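The counting distinctions above (types vs. tokens, bag of words, bigrams) can be sketched in a few lines. The corpus below is invented, and tokenization is simplified to whitespace splitting; real tokenizers must also handle punctuation, case, and so on:

```python
from collections import Counter

# Tiny invented corpus; whitespace tokenization for illustration only.
text = "the cat sat on the mat the cat"
tokens = text.split()

# Word tokens count every occurrence; word types count distinct words.
n_tokens = len(tokens)       # 8 tokens
n_types = len(set(tokens))   # 5 types: the, cat, sat, on, mat

# Bag of words: a count of each type, with word order discarded.
bag = Counter(tokens)        # e.g. bag["the"] == 3

# Bigrams: adjacent token pairs, which preserve local word order.
bigrams = list(zip(tokens, tokens[1:]))  # ('the','cat'), ('cat','sat'), ...
```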
Testing
- Roles of training, development, and test datasets.
- Evaluation metrics for classification (true positive rate, accuracy, recall, confusion matrix, ...)
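These metrics all come from the confusion-matrix counts (true/false positives and negatives). A minimal sketch on invented labels, treating "spam" as the positive class:

```python
# Invented gold labels and classifier predictions, for illustration.
gold = ["spam", "spam", "ham", "ham", "spam", "ham"]
pred = ["spam", "ham",  "ham", "spam", "spam", "ham"]

# Confusion-matrix counts with "spam" as the positive class.
tp = sum(g == "spam" and p == "spam" for g, p in zip(gold, pred))
fn = sum(g == "spam" and p == "ham" for g, p in zip(gold, pred))
fp = sum(g == "ham" and p == "spam" for g, p in zip(gold, pred))
tn = sum(g == "ham" and p == "ham" for g, p in zip(gold, pred))

accuracy = (tp + tn) / len(gold)   # fraction of all predictions correct
recall = tp / (tp + fn)            # true positive rate
```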
Naive Bayes
Basic definitions and mathematical model:
- Bayes rule
- Likelihood, prior, posterior
- argmax operator
- Independence and conditional independence
- Maximum a posteriori (MAP) estimate, Maximum likelihood (ML) estimate,
factoring out P(evidence)
- How does prior affect these estimates?
- How do we combine several conditionally independent pieces of evidence
into one estimate of P(cause | evidence)?
- How do we choose the best value for the cause/class?
- How does the size of a Naive Bayes model compare to a full joint distribution?
- Why does it matter that Naive Bayes reduces the number of parameters we need to estimate?
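The core computation tying these pieces together — multiplying a prior by several conditionally independent likelihoods and taking an argmax over classes — can be sketched as follows. All of the probabilities below are invented for illustration; note that P(evidence) never appears, because it is the same for every class and factors out of the argmax:

```python
import math

# Hypothetical priors P(class) and per-word likelihoods P(word | class).
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.05, "meeting": 0.005},
    "ham":  {"free": 0.01, "meeting": 0.03},
}

words = ["free", "meeting"]  # the observed evidence

def score(c):
    # log P(c) + sum of log P(w | c); logs avoid numerical underflow
    # and turn the Naive Bayes product into a sum.
    s = math.log(prior[c])
    for w in words:
        s += math.log(likelihood[c][w])
    return s

best = max(prior, key=score)  # the MAP class
```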
Applying Naive Bayes to text classification
- MAP and MLE versions of the estimation equations
- Estimating probabilities from data
- Avoiding underflow (log transforms)
- Avoiding overfitting
- Smoothing
- Why is it important?
- Laplace smoothing
- Deleted estimation
- Ngram smoothing (high level ideas only)
- Headline results: spam detection (SpamCop, Pantel and Lin),
gender classification (Boulis and Ostendorf)
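Of the smoothing methods above, Laplace (add-one) smoothing is the simplest to write down: add a pseudo-count to every word so that unseen words get a small nonzero probability instead of zeroing out the whole product. A minimal sketch, with invented counts and vocabulary:

```python
from collections import Counter

# Invented training tokens for one class.
train_tokens = ["free", "money", "free", "offer"]
counts = Counter(train_tokens)
vocab = {"free", "money", "offer", "meeting"}  # "meeting" is unseen in training

alpha = 1  # Laplace add-one pseudo-count

def p_word(w):
    # P(w | class) = (count(w) + alpha) / (total tokens + alpha * |V|)
    return (counts[w] + alpha) / (len(train_tokens) + alpha * len(vocab))

# p_word("meeting") is small but nonzero, and the probabilities
# still sum to 1 over the vocabulary.
```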