CS440 Lectures

Processing math: 100%

CS 440/ECE 448
Fall 2019
Margaret Fleck

Lecture 15: Naive Bayes 2

Recap: naive Bayes document classification

Recall our document class ification problem from last time Our input document is a string of words (perhaps not all identical) $W_1, ..., W_n$

To classify the document, our naive Bayes, bag of words model, compares these two values

$P(\text{UK}) * \prod_{k=1}^n P(W_k | \text{UK} )$
$P(\text{US}) * \prod_{k=1}^n P(W_k | \text{US})$

We need to estimate the parameters P(W | UK) and P(W | US) for each word type W.

Forward Outline

We're going to look at three types of details needed to make this classification algorithm work well.

Defining the word tokens
Estimating the probabilities
Extending the model to sequences of adjacent words

Be aware that we're now talking about tweaks that might help for one application and hurt performance for another. "Your mileage may vary."

For a typical example of the picky detail involved in this data processing, see the method's section from the word cloud example in the previous lecture.

Defining the word tokens

Our first job is to produce a clean string of words, suitable for use in the classifier. Suppose we're dealing with a language like English. Then we would need to

divide at whitespace
normalize punctuation, html tags, capitalization, etc

This process is called tokenization. Format features such as punctuation and capitalization may be helpful or harmful, depending on the task. E.g. it may be helpful to know that "Bird" is different from "bird," because the former might be a proper name. But perhaps we'd like to convert "Bird" to "CAP bird" to make its structure explicit. So "normalize" may either involve throwing away the information or converting it into a format that's easier to process.

After that, it often helps to normalize the form of each word using a process called stemming.

Stemming

Stemming converts inflected forms of a word to a single standardized form. It removes prefixes and suffixes, leaving only the content part of the word (its "stem"). For example

       help, helping, helpful --->  help
       happy, happiness --> happi

Output is not always an actual word.

Stemmers are surprisingly old technology.

first one: Julie Beth Lovins 1968 (!!)
standard one: Martin Porter 1980

Porter made modest revisions to his stemmer over the years. But the version found in the current nltk is still very similar to the original 1980 algorithm.

Languages unlike English

English has relatively short words, with convenient spaces between them. Tokenization requires slightly different steps for languages with longer words or different writing system. For example,

Chinese: no spaces between charactesr. So we run a "word segmenter" that groups strings of 1-3 related characters into words.
German: complex words (esp. compounds) would be divided. E.g. "Wintermantel" might be divided into "winter" (winter) and "mantel" (coat).

Very frequent and very rare words

Classification algorithms frequently ignore very frequent and very rare words. Very frequent words ("stop words") often convey very little information about the topic. So they are typically deleted before doing a bag of words analysis.

Rare words include words for uncommon concepts (e.g. armadillo), proper names, foreign words, and simple mistakes (e.g. "kitcen"). Words that occur only one often represent a large fraction (e.g. 30%) of the word types in a dataset. So rare words cause two problems

Their probabilities are hard to estimate accurately. (See below.)
They are very numerous, so they make our data tables much larger.

Rare words may be deleted. More often, all all rare words are mapped into a single placeholder value (e.g. UNK). This allows us to treat them all as a single item, with a reasonable probability of being observed.

Simplistic parameter estimation

Now, let's look at estimating the probabilities for the clean word string. Suppose that we have a set of documents from a particular class (e.g. UK English),. and suppose that W is a word type. We can define:

count(W) = number of times W occurs in the documents
n = number of total words in the documents

A naive estimate of P(W|class) would be $P(W | \text{class}) = \frac{\text{count}(W)}{n}$ . This is not a stupid estimate, but a few adjustments will make it more accurate.

Problem 1: underflow

The probability of most words is very small. For example, the word "Markov" is uncommon outside technical discussions. (It was mistranscribed as "mark off" in the original Switchboard corpus.) Worse, our estimate for P(W|class) is the product of many small numbers. Our estimation process can produce numbers too small for standard floating point storage.

To avoid numerical problems, reserachers typically convert all probabilities to logs (and the products to sums). That is, our naive Bayes algorithm will be maximizing

$\log(P(\text{UK} | W_1,\ldots W_k))\ \ \propto \ \ \log(P(\text{UK})) + \sum_{k=1}^n \log(P(W_k | \text{UK}) )$

Problem 2: overfitting the training data

Recall that we train on one set of documents and then test on a different set of documents. For the gender classifcation experiment in the previous lecture, these were about 14M words and about 3M words respectively.

Training data doesn't contain all words of English and is too small to get a representative sampling of the words it contains. So for uncommon words:

Words that didn't appear in the training data get estimated zero probability.
Words that were uncommon in the training data get inaccurate estimates:

Smoothing

Smoothing assigns non-zero probabiities to unseen words. Since all probabilities have to add up to 1, this means we need to reduce the estimated probabilities of the words we have seen. Think of a probability distribution as being composed of stuff, so there is a certain amount of "probability mass" assigned to each word. Smoothing moves mass from the seen words to the unseen words.

Questions:

Which words should lose mass: mostly infrequent ones
Which words should gain mass? How much?

Some unseen words might be more likely than others, e.g. if it looks like a legit English word (e.g. boffin) it might be more likely than something using non-English characters (e.g. bête).

Smoothing is critical to the success of many algorithms, but current smoothing methods are witchcraft. We don't fully understand how to model the (large) set of words that haven't been seen yet, and how often different types might be expected to appear. Popular standard methods work well in practice but their theoretical underpinnings are suspect.

Laplace smoothing

Laplace smoothing is a simple technique for addressing the problem:

All unseen words are represented by a single word UNK. So the probability of UNK should represent all of these words collectively.

n = number of words in our UK training data
count(W) = number of times W appeared in UK training data
$\alpha$ is a tuning constant between 0 and 1 (typically small)
V = number of word TYPES seen in training data

To add some probability mass to UNK, we increase all of our estimates by $\alpha$ . So they are proportional to

P(UNK | UK) = $\alpha$
P(W | UK) = count(W) + $\alpha$

This isn't quite right, because probabilities need to add up to 1. So we revise the denominators to make this be the case. Our final estimates of the probabilities are

P(UNK | UK) = $\frac {\alpha}{ n + \alpha(V+1)}$
P(W | UK) = $\frac {\text{count}(W) + \alpha}{ n + \alpha(V+1)}$

Performance of Laplace smoothing

overestimates probability of unseen words
underestimates probability of common words

Deleted estimation

Also called cross-validation or held-out estimation.

Let's empirically map out how much counts change between two samples of text from the same source. So we divide our training data into two halves A and B. Suppose that each half contains n words. We compute estimates using counts from half A and the evaluate the accuracy of those estimates using the counts from half B.

These slides (from Mitch Marcus based on data from Mark Liberman) show that our counts for infrequent words overestimate their true probability. Estimates for common words are usually accurate.

Consider a fixed count r. Suppose

$W_1,...,W_k$ are the words that occur r times in half A.
$X_A$ = average count of $W_1,...,W_k$ in half B

Then we can use $X_A/n$ as estimate of the true probability for each word $W_i$ .

Tweaks on deleted estimation

Make the estimate symmetrical. Let's assume for simplicity that k is the same for both halves. Compute

$X_A$ as above
$X_B$ reversing the roles of the two datasets
$(X_A + X_B)/2$ = estimate of the true frequency for a word with observed count r

Effectiveness of this depends on how we split the training data in half. Text tends to stay on one topic for a while. For example, if we see "boffin," it's likely that we'll see a repeat of that word soon afterwards. Therefore

Splitting the data in the middle works poorly.
Better to split by even/odd sentences.

These corrected estimates still have biases, but not as bad as the original ones. The method assumes that our eventual test data will be very similar to our training data, which is not always true.

AI in action

AI algorithms used in government decision making don't always work as well as advertised.