CS 440/ECE 448
Margaret Fleck

## Naive Bayes 3

Armadillo (picture from Prairie Rivers Network)

## Forward Outline

The high level classification model is simple, but a lot of details are required to get high quality results. Over the next three videos, we'll look at four aspects of this problem:

• Testing and evaluation
• Defining the word tokens
• Estimating the probabilities
• Extending the model to sequences of adjacent words

## The process of testing

Training data: estimate the probability values we need for Naive Bayes

Development data: tuning algorithm details

Test data: final evalution (not always available to developer) or the data seen only when system is deployed

Training data only contains a sampling of possible inputs, hopefully typically but nowhere near exhaustive. So development and test data will contain items that weren't in the training data. E.g. the training data might focus on apples where the development data has more discussion of oranges. Development data is used to make sure the system isn't too sensitive to these differences. The developer typically tests the algorithm both on the training data (where it should do well) and the development data. The final test is on the test data.

## Evaluation metrics

Results of classification experiments are often summarized into a few key numbers. There is often an implicit assumption that the problem is asymmetrical: one of the two classes (e.g. Spam) is the target class that we're trying to identify.

 Labels from Algorithm Spam True Positive (TP) False Negative (FN) False Positive (FP) True Negative (TN)

We can summarize performance using the rates at which errors occur:

• False positive rate = FP/(FP+TN) [how many wrong things are in the negative outputs]
• False negative rate = FN/(TP+FN) [how many wrong things are in the positive outputs]
• Accuracy = (TP+TN)/(TP+TN+FP+FN)
• Error rate = 1-accuracy

Or we can ask how well our output set contains all, and only, the desired items:

• precision (p) = TP/(TP+FP) [how many of our outputs were correct?]
• recall (r) = TP/(TP+FN) [how many of the correct answers did we find?]
• F1 = 2pr/(p+r)

F1 is the harmonic mean of precision and recall. Both recall and precision need to be good to get a high F1 value.

We can also display a confusion matrix, showing how often each class is mislabelled as a different class. These usually appear when there are more than two class labels. They are most informative when there is some type of normalization, either in the original test data or in constructing the table. So in the table below, each row sums to 100. This makes it easy to see that the algorithm is producing label A more often than it should, and label C less often.

 Labels from Algorithm A B 95 0 5 15 83 2 18 22 60

## Forward Outline

We're now going to look at details required to make this classification algorithm work well.

Be aware that we're now talking about tweaks that might help for one application and hurt performance for another. "Your mileage may vary."

For a typical example of the picky detail involved in this data processing, see the method's section from the word cloud example in the previous lecture.

## Defining the word tokens

Our first job is to produce a clean string of words, suitable for use in the classifier. Suppose we're dealing with a language like English. Then we would need to

• divide at whitespace
• normalize punctuation, html tags, capitalization, etc

This process is called tokenization. Format features such as punctuation and capitalization may be helpful or harmful, depending on the task. E.g. it may be helpful to know that "Bird" is different from "bird," because the former might be a proper name. But perhaps we'd like to convert "Bird" to "CAP bird" to make its structure explicit. Similarly, "tyre" vs. "tire" might be a distraction or an important cue for distinguishing US from UK English. So "normalize" may either involve throwing away the information or converting it into a format that's easier to process.

After that, it often helps to normalize the form of each word using a process called stemming.

## Stemming

Stemming converts inflected forms of a word to a single standardized form. It removes prefixes and suffixes, leaving only the content part of the word (its "stem"). For example

       help, helping, helpful --->  help
happy, happiness --> happi


Output is not always an actual word.

Stemmers are surprisingly old technology.

• first one: Julie Beth Lovins 1968 (!!)
• standard one: Martin Porter 1980

Porter made modest revisions to his stemmer over the years. But the version found in nltk (a currently popular natural language toolkit) is still very similar to the original 1980 algorithm.

## Languages unlike English

English has relatively short words, with convenient spaces between them. Tokenization requires slightly different steps for languages with longer words or different writing system. For example, Chinese writing does not put spaces between words. So "I went to the store" is written as a string of characters, each representing one syllable (below). So programs processing Chinese text typically run a "word segmenter" that groups adjacent characters (usually 2-3) into a word.

In German, we have the opposite problem: long words. So we might divide words like "Wintermantel" into "winter" (winter) and "mantel" (coat).

In either case, the output tokens should usually form coherent units, whose meaning cannot easily be predicted from smaller pieces.

## Very frequent and very rare words

Classification algorithms frequently ignore very frequent and very rare words. Very frequent words ("stop words") often convey very little information about the topic. So they are typically deleted before doing a bag of words analysis.

Rare words include words for uncommon concepts (e.g. armadillo), proper names, foreign words, and simple mistakes (e.g. "Computer Silence"). Words that occur only one often represent a large fraction (e.g. 30%) of the word types in a dataset. So rare words cause two problems

• Their probabilities are hard to estimate accurately.
• They are very numerous, so they make our data tables much larger.

Rare words may be deleted. More often, all all rare words are mapped into a single placeholder value (e.g. UNK). This allows us to treat them all as a single item, with a reasonable probability of being observed.

## N-gram models

The Bag of Words model uses individual words as features, ignoring the order in which they appeared. Text classification is often improved by using pairs of words (bigrams) as features. Some applications (notably speech recognition) use trigrams (triples of words) or even longer n-grams. E.g. "Tony the Tiger" conveys much more topic information than either "Tony" or "tiger". In the Boulis and Ostendorf study of gender classification bigram features proved significantly significantly more powerful.

For n words, there are $$n^2$$ possible bigrams. But our amount of training data stays the same. So the training data is even less adequate for bigrams than it was for single words.