CS 440/ECE 448
Margaret Fleck

Classification

Input: a collection of objects of several types
Output: correct type label for each object

In practice, this means

Input: a collection of objects of several types
Extract features from each object
Output: Use the features to label each object with its correct type

For example, we might wish to take a collection of documents (i.e. text files) and classify each document as

• spam or not spam
• politics vs. science vs. recipe
• British vs. American English

Bag of words model

Let's look at the British vs. American classification task. Certain words are characteristic of one dialect or the other. For example

• UK: boffin, petrol, chips, lorry, queue, pushchair
• US: egghead, gas, French fries, truck, line, stroller

See this list of British and American words.

The word "boffin" differs from "egghead" because it can easily be modified to make words like "boffinry" and "astroboffin".

Dialects have special words for local foods, e.g. the South African English words "biltong," "walkie-talkie," "bunny chow."

The word clouds at the top of this page show which words (content words only) were said most by Clinton (left) and Trump (right) in their first debate.

See more details at Martin Krzywinski's Word Analysis of 2016 Presidential Debates Bag of words model: We can use the individual words as features.

A bag of words model determines the class of a document based on how frequently each individual word occurs. It ignores the order in which words occur, which ones occur together, etc. So it will miss some details, e.g. the difference between "poisonous" and "not poisonous,"

Text classification with Naive Bayes.

Suppose that cause C produces n observable effects $$E_1,...,E_n$$. Then our Naive Bayes model calculates the MAP estimate as

$$P(C | E_1,...E_n) \ \propto\ P(E_1|C) * ....* P(E_n|C) * P(C) \ = \ P(C) * \prod_{k=1}^n P(E_k|C)$$

Let's map this idea over to our Bag of Words model of text classification. Suppose our input document contains the string of words $$W_1, ..., W_n$$

Notice that some of the words might be identical. E.g the sentence "The big cat saw the small cat" contains two pairs of repeated words. For example, $$W_2$$ is "cat" and $$W_7$$ is also "cat". Our estimation process will consider the two copies of "cat" separately

Jargon:

• word type (aka unique word): a dictionary entry such as "cat"
• word token: a word in a specific position in the text

So our example sentence has seven tokens drawn from five types.

To classify the document, naive Bayes compares these two values

$$P(\text{UK} | W_1, \ldots, W_n) \ \propto\ P(\text{UK}) * \prod_{k=1}^n P(W_k | \text{UK} )$$
$$P(\text{US} | W_1, \ldots, W_n) \ \propto\ P(\text{US}) * \prod_{k=1}^n P(W_k | \text{US})$$

To do this, we need to estimate two types of parameters

• the likelihoods P(W | UK) and P(W | US) for each word type W.
• the priors P(UK) and P(US)

Naive Bayes works quite well in practice, especially when coupled with a number of tricks that we'll see later. The Naive Bayes spam classifier SpamCop, by Patrick Pantel and Dekang Lin (1998) got down to a 4.33% error rate using bigrams and some other tweaks.

Suppose we have n word types, then creating our bag of words model requires estimating O(n) probabilities. This is much smaller than the full conditional probability table which has $$2^n$$ entries. The big problem with too many parameters is having enough training data to estimate them reliably. (I said this last lecture but it deserves a repeat.)

Gender classification example

A Quantitative Analysis of Lexical Differences Between Genders in Telephone Conversations Constantinos Boulis and Mari Ostendorf, ACL 2005

Fisher corpus of more than 16,000 transcribed telephone conversations (about 10 minutes each). Speakers were told what subject to discuss. Can we determine gender of the speaker from what words they used?

Most frequent words in any spoken corpus are very common words that are usually ignored in estimation, called "stop words." These include

• function words: the, a, you, in,....
• fillers: um, uh, I mean, so, ...
• backchannel: uh-uh, yeah

Stop words are often omitted in classification, because they don't tell us much about the topic. However, this is task dependent. Stop words can provide useful information about the speaker's mental state, relation to the listener, etc. See James Pennebaker "The secret life of pronouns".

Words most useful for discriminating speaker's gender (i.e. used much more by men than women or vice versa)

• men dude [expletive] [expletive] wife wife's matt steve bass ben [expletive]
• women husband husband's refunding goodness boyfriend coupons crafts linda gosh cute

This paper compares the accuracy of Naive Bayes to that of an SVM classifier, which is a more sophisticated classification technique.

• Naive Bayes 83%
• SVM 88.6%

The more sophisticated classifier makes an improvement, but not a dramatic one.