CS 440/ECE 448
Fall 2023
Margaret Fleck
Quiz 1 skills list
The quiz will cover material on Probability, Naive Bayes, and related topics such as testing classifiers
and cleaning text data.
Historical and other trivia
We've seen a lot of trivia, most of it not worth memorizing. The following
items are the exceptions. Be able to explain (very briefly) what they are and (approximately)
what time period they come from.
- McCulloch and Pitts
- Fred Jelinek
- Pantel and Lin (SpamCop)
- Boulis and Ostendorf
- The Plato System
Probability
- Random variables, axioms of probability
- Joint, marginal, conditional probability
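The relationships among these three quantities can be sketched with a tiny example. The two-variable table below is invented purely for illustration; the numbers just need to sum to 1:

```python
# Toy joint distribution P(Weather, Traffic) over two binary variables.
# All probabilities here are made up for illustration.
joint = {
    ("rain", "jam"): 0.20,
    ("rain", "clear"): 0.10,
    ("sun", "jam"): 0.15,
    ("sun", "clear"): 0.55,
}

# Marginal: P(Weather = rain) sums the joint over all values of Traffic.
p_rain = sum(p for (w, t), p in joint.items() if w == "rain")

# Conditional: P(Traffic = jam | Weather = rain) = joint / marginal.
p_jam_given_rain = joint[("rain", "jam")] / p_rain

print(round(p_rain, 2))            # 0.3
print(round(p_jam_given_rain, 2))  # 0.67
```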
Modelling text data
- Word types vs. word tokens
- The Bag of Words model
- Bigrams, ngrams
- Data cleaning:
- tokenization
- stemming (including Julie Lovins and Martin Porter)
- making units of useful size: dividing words or grouping characters
- Special types of words and how we might handle them
- stop words
- rare words
- hapax legomena
- filler
- backchannel
- function vs. content
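The counting distinctions above (types vs. tokens, bag of words, bigrams) can be sketched in a few lines. The corpus below is invented, and tokenization is simplified to whitespace splitting; real tokenizers must also handle punctuation, case, and so on:

```python
from collections import Counter

# Tiny invented corpus; whitespace tokenization for illustration only.
text = "the cat sat on the mat the cat"
tokens = text.split()

# Word tokens count every occurrence; word types count distinct words.
n_tokens = len(tokens)       # 8 tokens
n_types = len(set(tokens))   # 5 types: the, cat, sat, on, mat

# Bag of words: a count of each type, with word order discarded.
bag = Counter(tokens)        # e.g. bag["the"] == 3

# Bigrams: adjacent token pairs, which preserve local word order.
bigrams = list(zip(tokens, tokens[1:]))  # ('the','cat'), ('cat','sat'), ...
```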
Testing
- Roles of training, development, and test datasets.
- Evaluation metrics for classification (true positive rate, accuracy, recall, confusion matrix, ...)
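These metrics all come from the confusion-matrix counts (true/false positives and negatives). A minimal sketch on invented labels, treating "spam" as the positive class:

```python
# Invented gold labels and classifier predictions, for illustration.
gold = ["spam", "spam", "ham", "ham", "spam", "ham"]
pred = ["spam", "ham",  "ham", "spam", "spam", "ham"]

# Confusion-matrix counts with "spam" as the positive class.
tp = sum(g == "spam" and p == "spam" for g, p in zip(gold, pred))
fn = sum(g == "spam" and p == "ham" for g, p in zip(gold, pred))
fp = sum(g == "ham" and p == "spam" for g, p in zip(gold, pred))
tn = sum(g == "ham" and p == "ham" for g, p in zip(gold, pred))

accuracy = (tp + tn) / len(gold)   # fraction of all predictions correct
recall = tp / (tp + fn)            # true positive rate
```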
Naive Bayes
Basic definitions and mathematical model:
- Bayes rule
- Likelihood, prior, posterior
- argmax operator
- Independence and conditional independence
- Maximum a posteriori (MAP) estimate, Maximum likelihood (ML) estimate,
factoring out P(evidence)
- How does prior affect these estimates?
- How do we combine several conditionally independent pieces of evidence
into one estimate of P(cause | evidence)?
- How do we choose the best value for the cause/class?
- How does the size of a Naive Bayes model compare to a full joint distribution?
- Why does it matter that Naive Bayes reduces the number of parameters we need to estimate?
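The core computation tying these pieces together — multiplying a prior by several conditionally independent likelihoods and taking an argmax over classes — can be sketched as follows. All of the probabilities below are invented for illustration; note that P(evidence) never appears, because it is the same for every class and factors out of the argmax:

```python
import math

# Hypothetical priors P(class) and per-word likelihoods P(word | class).
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.05, "meeting": 0.005},
    "ham":  {"free": 0.01, "meeting": 0.03},
}

words = ["free", "meeting"]  # the observed evidence

def score(c):
    # log P(c) + sum of log P(w | c); logs avoid numerical underflow
    # and turn the Naive Bayes product into a sum.
    s = math.log(prior[c])
    for w in words:
        s += math.log(likelihood[c][w])
    return s

best = max(prior, key=score)  # the MAP class
```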
Applying Naive Bayes to text classification
- MAP and MLE versions of the estimation equations
- Estimating probabilities from data
- Avoiding underflow (log transforms)
- Avoiding overfitting
- Smoothing
- Why is it important?
- Laplace smoothing
- Deleted estimation
- Ngram smoothing (high level ideas only)
- Headline results: spam detection (SpamCop, Pantel and Lin),
gender classification (Boulis and Ostendorf)
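Of the smoothing methods above, Laplace (add-one) smoothing is the simplest to write down: add a pseudo-count to every word so that unseen words get a small nonzero probability instead of zeroing out the whole product. A minimal sketch, with invented counts and vocabulary:

```python
from collections import Counter

# Invented training tokens for one class.
train_tokens = ["free", "money", "free", "offer"]
counts = Counter(train_tokens)
vocab = {"free", "money", "offer", "meeting"}  # "meeting" is unseen in training

alpha = 1  # Laplace add-one pseudo-count

def p_word(w):
    # P(w | class) = (count(w) + alpha) / (total tokens + alpha * |V|)
    return (counts[w] + alpha) / (len(train_tokens) + alpha * len(vocab))

# p_word("meeting") is small but nonzero, and the probabilities
# still sum to 1 over the vocabulary.
```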