Processing math: 0%

CS 440/ECE 448
Margaret Fleck

Naive Bayes

A key task in building AI systems is classification, e.g. identifying what's in a picture. We'll start by looking at statistical methods of classification, particularly Naive Bayes.

Motivation for statistical methods

Here are two types of ivy common in the northeastern US:

Five-finger ivy on the left and poison ivy on the right.

Poison ivy causes an itchy rash, so it's obviously important to be able to tell it from other types of ivy. The exact color isn't a reliable cue because it varies a lot even within the same species. The texture on the leaf margins is a bit different. But the cue that children are taught to use is that poison ivy has three leaflets in each cluster.

Here are two CS faculty members. How would you describe the difference between Eric Shaffer and Paris Smaragdis in a way that would allow someone else to tell them apart, remembering that the color of their shirts isn't stable over time?

Eric Shaffer (left) and Paris Smaragdis (right)

What makes this hard?

We can use discrete, logic-like features on simple situations. E.g. poison ivy has three leaflets in each group. However, our two faculty members share many easily-articulated features, e.g. both have gray hair. This style of feature doesn't generalize well to many real-world situations.

too complex on realistic situations
too much need for hand crafting representations
too much need for expert knowledge

In practice, this means that models based on yes/no answers to discrete features are incomplete and fragile A better approach is to learn models from examples, e.g. pictures, audio data, big text collections. However, this input data presents a number of challenges:

It has uncertainty (e.g. from sensors).
It has errors (e.g. "Department of Computer Silence").
It is incomplete, often selectively so. E.g. we have much more training data for English than Zulu.
It may lack annotation (e.g. explicit type labels).

So we use probabilistic models to capture what we know, and how well we know it.

Where do probabilities come from?

Probabilities can be estimated from direct observation of examples (e.g. large corpus of text data). Constraints on probabilities may also come from sceientific beliefs

Theory connects observable probabilities to ones we can't observe. (e.g. icard access fails --> icard broken or access list wrong, because EngIT told me)
The underlying model must have some simple form (e.g. virus infections grow exponentially) which can be used to fill in data from incomplete observations.
Anything is possible, i.e. P(A) > 0. (We'll see smoothing techniques that fill zero values with small numbers.)
Nothing is guaranteed, i.e. P(A) < 1.

The assumption that anything is possible is usually made to deal with the fact that our training data is incomplete, so we need to reserve some probability for all the possible events that we haven't happened to see yet.

Example: campus bikes

Red and teal veoride bikes from /u/civicsquid reddit post

We have two types of bike: Veoride and private. (Bird bikes are supposed to arrive soon.) Veorides are mostly teal (with a few exceptions), but other (privately owned) bikes rarely are.

Joint distribution for these two variables

                 veoride        private      |
    teal            0.19          0.02       |   0.21
    red             0.01          0.78       |   0.79
                 ----------------------------
                    0.20          0.80

For this toy example, we can fill out the joint distribution table.

Bayes Rule

Remember that P(A | C) is the probability of A in a context where C is true. For example, the probability of a bike being teal increases dramatically if we happen to know that it's a veobike.

P(teal) = 0.21
P(teal | veoride) = 0.95 (19/20)

The formal definition:of conditional probability is

P(A | C) = P(A,C)/P(C)

Let's write this in the equivalent form P(A,C) = P(C) * P(A | C).

Flipping the roles of A and C (seems like middle school but just keep reading):

equivalently P(C,A) = P(A) * P(C | A)

P(A,C) and P(C,A) are the same quantity. (AND is commutative.) So we have

P(A) * P(C | A) = P(A,C) = P(C) * P(A | C)

P(A) * P(C | A) = P(C) * P(A | C)

So we have Bayes rule

P(C | A) = P(A | C) * P(C) / P(A)

Here's how to think about the pieces of this formula

       P(cause | evidence) =  P(evidence | cause) *  P(cause) / P(evidence)

         posterior            likelihood             prior     normalization

The normalization doesn't have a proper name because we'll typically organize our computation so as to eliminate it. Or, sometimes, we won't be able to measure it directly, so the number will be set at whatever is required to make all the probability values add up to 1.

Example: we see a teal bike. What is the chance that it's a veoride? We know

P(teal | veoride) = 0.95 [likelihood]
P(veoride) = 0.20 [prior]
P(teal) = 0.21 [normalization]

So we can calculate

P(veoride | teal) = 0.95 x 0.2 / 0.21 = 0.19/0.21 = 0.905

Why will this be useful?

In non-toy examples, it will be easier to get estimates for P(evidence | cause) than direct estimates for P(cause | evidence).

P(evidence | cause) tends to be stable, because it is due to some underlying mechanism. In this example, veoride likes teal and regular bike owners don't. In a disease scenario, we might be looking at the tendency of measles to cause spots, which is due to a property of the virus that is fairly stable over time.

P(cause | evidence) is less stable, because it depends on the set of possible causes and how common they are right now.

For example, suppose that a critical brake problem is discovered and Veoride suddenly has to pull bikes into the shop for repairs. So then only 2% of the bikes are veorides. Then we might have the following joint distribution. Now a teal bike is more likely to be private rather than veoride.

                 veoride        private      |
    teal            0.019          0.025     |   0.044
    red             0.001          0.955     |   0.956
                 ----------------------------
                    0.02           0.98

Dividing P(cause | evidence) into P(evidence | cause) and P(cause) allows us to easily adjust for changes in P(cause).

The MAP estimate

Recall Bayes Rule:

       P(cause | evidence) =  P(evidence | cause) *  P(cause) / P(evidence)

       posterior            likelihood             prior     normalization

Example: We have two underlying types of bikes: veoride and private. We observe a teal bike. What type should we guess that it is?

"Maximum a posteriori" (MAP) estimate chooses the type with the highest posterior probability:

pick the type X such that P(X | evidence) is highest

Suppose that type X comes from a set of types T. Then MAP chooses

$\arg\!\max_{X \in T} P(X | \text{evidence})$

The argmax operator iterates through a series of values for the dummy variable (X in this case). When it determines the maximum value for the quantity (P(X | evidence) in this example), it returns the value for X that produced that maximum value.

Completing our example, we compute two posterior probabilities:

P(veoride | teal) = 0.905
P(private | teal) = 0.095

The first probability is bigger, so we guess that it's a veoride. Notice that these two numbers add up to 1, as probabilities should.

Ignoring the normalizing factor

P(evidence) can usually be factored out of these equations, because we are only trying to determine the relative probabilities. For the MAP estimate, in fact, we are only trying to find the largest probability. In our equation

P(cause | evidence) = P(evidence | cause) * P(cause) / P(evidence)

P(evidence) is the same for all the causes we are considering. Said another way, P(evidence) is the probability that we would see this evidence if we did a lot of observations. But our current situation is that we've actually seen this particular evidence and we don't really care if we're analyzing a common or unusual situation.

So Bayesian estimation often works with the equation

P(cause | evidence) $\propto$ P(evidence | cause) * P(cause)

where we know (but may not always say) that the normalizing constant is the same across various such quantities.

Specifically, for our example

P(veoride | teal) = P(teal | veoride) * P(veoride) / P(teal)
P(private | teal) = P(teal | private) * P(private) / P(teal)

P(teal) is the same in both quantities. So we can remove it, giving us

P(veoride | teal) $\propto$ P(teal | veoride) * P(veoride) = 0.95 * 0.2 = 0.19
P(private | teal) $\propto$ P(teal | private) * P(private) = 0.025 * 0.8 = 0.02

So veoride is the MAP choice, and we never needed to know P(teal). Notice that the two numbers (0.19 and 0.02) don't add up to 1. So they aren't probabilities, even though their ratio is the ratio of the probabilities we're interested in.

The MLE estimate

As we saw above, our joint probabilities depend on the relative frequencies of the two underlying causes/types. Our MAP estimate reflects this by including the prior probability.

If we know that all causes are equally likely, we can set all P(cause) to the same value for all causes. In that case, we have

P(cause | evidence) $\propto$ P(evidence | cause)

So we can pick the cause that maximizes P(evidence | cause). This is called the "Maximum Likelihood Estimate" (MLE).

The MLE estimate can be very inaccurate if the prior probabilities of different causes are very different. On the other hand, it can be a sensible choice if we have poor information about the prior probabilities.

Multiple features

Red and teal veoride bikes from /u/civicsquid reddit post

Suppose we are building a more complete bike recognition system. It might observe (e.g. with its camera) features like:

where the bike is parked
whether it has a kickstand.
whether it has a filled triangular part
"Veoride" written on the frame

So our classification alogorithm looks like:

Input: a collection of objects of several types
Extract features from each object
Output: Use the features to label each object with its correct type

Bag of words text classification

For example, we might wish to take a collection of documents (i.e. text files) and classify each document by type. For example:

spam or not spam
politics vs. science vs. recipe
British vs. American English

Let's look at the British vs. American classification task. Certain words are characteristic of one dialect or the other. For example

UK: boffin, petrol, chips, lorry, queue, pushchair
US: egghead, gas, French fries, truck, line, stroller

See this list of British and American words. The word "boffin" differs from "egghead" because it can be modified to make a wide range of derived words like "boffinry" and "astroboffin". Dialects have special words for local foods, e.g. the South African English words "biltong," "walkie-talkie," "bunny chow."

Bag of words model: We can use the individual words as features.

A bag of words model determines the class of a document based on how frequently each individual word occurs. It ignores the order in which words occur, which ones occur together, etc. So it will miss some details, e.g. the difference between "poisonous" and "not poisonous,"

Challenges with multiple features

How do we combine these types of evidence into one decision about whether the bike is a veoride?

However, when we have more variables (typical of AI problems), this isn't practical. The joint distribution contains

too many probability values to estimate ( $2^n$ values for n variables, each with two possible values)
too many probability values to store

The estimation is the more important problem because we usually have only limited amounts of data, esp. clean annotated data. Many value combinations won't be observed simply because we haven't seen enough examples.

As we saw earlier, a large number of parameters would be required to model all of the $2^n$ interactions between n variables. The Naive Bayes algorithm reduces the number of parameters to O(n) by making assumptions about how the variables interact,

Conditional independence

Remember that two events A and B are independent iff

P(A,B) = P(A) * P(B)

Sadly, independence rarely holds. A more useful notion is conditional independence. That is, are two variables independent when we're in some limited context. Formally, two events A and B are conditionally independent given event C if and only if

P(A, B | C) = P(A|C) * P(B|C)

Conditional independence is also unlikely to hold exactly, but it is a sensible approximation in many modelling problems. For example, having a music stand is highly correlated with owning a violin. However, we may be able to treat the two as independent (without too much modelling error) in the context where the owner is known to be a musician.

The definiton of conditional independence is equivalent to either of the following equations:

P(A | B,C) = P(A | C)
P(B | A,C) = P(B | C)

It's a useful exercise to prove that this is true, because it will help you become familiar with the definitions and notation.

Basic Naive Bayes model

Suppose we're observing two types of evidence A and B related to cause C. So we have

P(C | A,B) $\propto$ P(A,B | C) * P(C)

Suppose that A and B are conditionally independent given C. Then

P(A, B | C) = P(A|C) * P(B|C)

Substituting, we get

P(C | A,B) $\propto$ P(A|C) * P(B|C) * P(C)

This model generalizes in the obvious way when we have more than two types of evidence to observe. Suppose our cause C has n types of evidence, i.e. random variables $E_1,...,E_n$ . Then we have

$P(C | E_1,...E_n) \ \propto\ P(E_1|C) * ....* P(E_n|C) * P(C) \ = \ P(C) * \prod_{k=1}^n P(E_k|C)$

So we can estimate the relationship to the cause separately for each type of evidence, then combine these pieces of information to get our MAP estimate. This means we don't have to estimate (i.e. from observational data) as many numbers. For n variables, each with two possible values (e.g. red vs teal), a naive Bayes model has only O(n) parameters to estimate, whereas the full joint distribution has $2^n$ parameters.

What is a word? (Version 1)

Consider this sentence

The big cat saw the small cat.

It contains two pairs of repeated words. For example, $W_3$ is "cat" and $W_7$ is also "cat". Our estimation process will consider the two copies of "cat" separately

Jargon:

word type (aka unique word): a dictionary entry such as "cat"
word token: a word in a specific position in the text

So our example sentence has seven tokens drawn from five types.

Depending on the specific types of text, it may be useful to normalize the words. For example

Remove capitalization
Remove very common words "stop words"
Run the words through a "stemmer"

Stemming

Stemming converts inflected forms of a word to a single standardized form. It removes prefixes and suffixes, leaving only the content part of the word (its "stem"). For example

       help, helping, helpful --->  help
       happy, happiness --> happi

Output is not always an actual word.

Stemmers are surprisingly old technology.

first one: Julie Beth Lovins 1968 (!!)
standard one: Martin Porter 1980

Porter made modest revisions to his stemmer over the years. But the version found in nltk (a currently popular natural language toolkit) is still very similar to the original 1980 algorithm.

Text classification with Naive Bayes.

Suppose that cause C produces n observable effects $E_1,...,E_n$ . Then our Naive Bayes model calculates the MAP estimate as

$P(C | E_1,...E_n) \ \ \propto\ \ P(E_1|C) * ....* P(E_n|C) * P(C)$

Rewriting slightly:

$P(C | E_1,...E_n) \ \ \propto\ \ P(C) * \prod_{k=1}^n P(E_k|C)$

Let's map this idea over to our Bag of Words model of text classification. Suppose our input document contains the string of words $W_1, ..., W_n$ . So $W_k$ is the word token at position k. Suppose that $P(W_k)$ is the probability of the corresponding word type.

To classify the document, naive Bayes compares these two values

$P(\text{UK} | W_1, \ldots, W_n) \ \ \ \propto\ \ \ P(\text{UK}) \ * \ \prod_{k=1}^n P(W_k | \text{UK} )$
$P(\text{US} | W_1, \ldots, W_n) \ \ \ \propto\ \ \ P(\text{US}) \ * \ \prod_{k=1}^n P(W_k | \text{US})$

To do this, we need to estimate two types of parameters

the likelihoods P(W | UK) and P(W | US) for each word type W.
the priors P(UK) and P(US)

Headline results

Naive Bayes works quite well in practice, especially when coupled with a number of tricks that we'll see later. The conditional independence assumption is accurate enough for many applications and means that we have relatively few parameters to estimate from training data. The Naive Bayes spam classifier SpamCop, by Patrick Pantel and Dekang Lin (1998) got down to a 4.33% error rate using bigrams and some other tweaks.

A Quantitative Analysis of Lexical Differences Between Genders in Telephone Conversations Constantinos Boulis and Mari Ostendorf, ACL 2005

Fisher corpus of more than 16,000 transcribed telephone conversations (about 10 minutes each). Speakers were told what subject to discuss. Can we determine gender of the speaker from what words they used?

Words most useful for discriminating speaker's gender (i.e. used much more by men than women or vice versa)

men dude [expletive] [expletive] wife wife's matt steve bass ben [expletive]
women husband husband's refunding goodness boyfriend coupons crafts linda gosh cute

This paper compares the accuracy of Naive Bayes to that of an SVM classifier, which is a more sophisticated classification technique.

Naive Bayes 83%
SVM 88.6%

The more sophisticated classifier makes an improvement, but not a dramatic one.

Training and testing

Training data: estimate the probability values we need for Naive Bayes

Development data: tuning algorithm details and high-level parameters

Test data: final evalution (not always available to developer) or the data seen only when system is deployed

Training data only contains a sampling of possible inputs, hopefully typically but nowhere near exhaustive. So development and test data will contain items that weren't in the training data. E.g. the training data might focus on apples where the development data has more discussion of oranges. Development data is used to make sure the system isn't too sensitive to these differences. The developer typically tests the algorithm both on the training data (where it should do well) and the development data. The final test is on the test data.

Simplistic parameter estimation

Now, let's look at estimating the probabilities. Suppose that we have a set of documents from a particular class C (e.g. CLA English) and suppose that W is a word type. We can define:

count(W) = number of times W occurs in the documents
n = number of total words in the documents

A naive estimate of P(W|C) would be $P(W | \text{C}) = \frac{\text{count}(W)}{n}$ . This is not a stupid estimate, but a few adjustments will make it more accurate.

Problem 1: underflow

The probability of most words is very small. For example, the word "Markov" is uncommon outside technical discussions. (It was mistranscribed as "mark off" in the original Switchboard corpus.) Worse, our estimate for P(W|C) is the product of many small numbers. Our estimation process can produce numbers too small for standard floating point storage.

To avoid numerical problems, researchers typically convert all probabilities to logs. That is, our naive Bayes algorithm will be maximizing

$\log(P(\text{C} | W_1,\ldots W_k))\ \ \propto \ \ \log(P(\text{C})) + \sum_{k=1}^n \log(P(W_k | \text{C}) )$

Notice that when we do the log transform, we need to replace multiplication with addition.

Problem 2: overfitting the training data

Recall that we train on one set of documents and then test on a different set of documents. For example, in the Boulis and Ostendorf gender classifcation experiment, there were about 14M words of training data and about 3M words of test data.

Training data doesn't contain all words of English and is too small to get a representative sampling of the words it contains. So for uncommon words:

Words that didn't appear in the training data get estimated zero probability.
Words that were uncommon in the training data get inaccurate estimates:

Zeroes destroy the Naive Bayes algorithm. Suppose there is a zero on each side of the comparison. Then we're comparing zero to zero, although the other numbers may give us considerable information about the merits of the two outputs. We use "smoothing" algorithms to assign non-zero probabiities to unseen words.

Laplace smoothing

Laplace smoothing is a simple technique for addressing the problem:

All unseen words are represented by a single word UNK. So the probability of UNK should represent all of these words collectively.

n = number of words in our Class C training data
count(W) = number of times W appeared in Class C training data

To add some probability mass to UNK, we increase all of our estimates by $\alpha$ , where $\alpha$ is a positive tuning constant, usually (but not always) no larger than 1. So our probabilities for UNK and for some other word W are:

P(UNK | C) = $\alpha \over n$
P(W | C) = ${count(W) + \alpha} \over n$

This isn't quite right. Probabilities must add up to 1 and these numbers add up to $n + \alpha(V+1)$ where V = number of word types seen in training data So we revise the denominators to make this be the case. Our final estimates of the probabilities are

V = number of word TYPES seen in training data
P(UNK | C) = $\frac {\alpha}{ n + \alpha(V+1)}$
P(W | C) = $\frac {\text{count}(W) + \alpha}{ n + \alpha(V+1)}$

When implementing Naive Bayes, notice that smoothing should be done separately for each class of document. For example, suppose you are classifying documents as UK vs. US English. Then Laplace smoothing for the UK English n and V would be the number of word tokens and word types seen in UK training data.