CS 440/ECE 448

Margaret Fleck

## Hidden Markov Models 2

[Figure: a real-world trellis (supporting a passionfruit vine)]

## Formulating the estimation problem

Suppose that W is our input word sequence (e.g. a sentence) and T is
a candidate tag sequence. That is

\(W = w_1, w_2, \ldots w_n \)

\(T = t_1, t_2, \ldots t_n \)

Our goal is to find a tag sequence T that maximizes P(T | W).
Using Bayes rule, this is equivalent to maximizing \( P(W \mid T) * P(T)/P(W) \).
Since P(W) doesn't depend on T, this is the same as maximizing \( P(W \mid T) * P(T) \).

So we need to estimate P(W | T) and P(T) for one fixed word
sequence W but for all choices of the tag sequence T.

## What can we realistically estimate?

Obviously we can't directly measure the probability of an entire sentence
given a similarly long tag sequence. Using (say) the Brown corpus,
we can get reasonable estimates for

- individual word frequencies
- tag usage for each word
- tag bigrams

If we have enough tagged data, we can also estimate tag trigrams.
Tagged data is hard to get in quantity, because human intervention is needed
for checking and correction.
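As a rough sketch of how such counts might be collected (using a toy list of
(word, tag) pairs standing in for a real tagged corpus such as Brown; the
variable names here are just for illustration, and no smoothing is applied):

```python
from collections import Counter

# Toy stand-in for a tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

word_counts = Counter()        # individual word frequencies
word_tag_counts = Counter()    # tag usage for each word
tag_bigram_counts = Counter()  # tag bigrams, with a START pseudo-tag
tag_counts = Counter()         # how often each tag appears
prev_tag_counts = Counter()    # how often each tag appears as a left context

for sentence in corpus:
    prev_tag = "START"
    for word, tag in sentence:
        word_counts[word] += 1
        word_tag_counts[(word, tag)] += 1
        tag_bigram_counts[(prev_tag, tag)] += 1
        prev_tag_counts[prev_tag] += 1
        tag_counts[tag] += 1
        prev_tag = tag

# Unsmoothed relative-frequency estimates of P(word | tag) and P(tag | previous tag).
p_word_given_tag = {(w, t): c / tag_counts[t]
                    for (w, t), c in word_tag_counts.items()}
p_tag_given_prev = {(prev, t): c / prev_tag_counts[prev]
                    for (prev, t), c in tag_bigram_counts.items()}
```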

## Markov assumptions

To simplify the problem, we make "Markov assumptions." A Markov
assumption is that the value of interest depends only on a short
window of context. For example, we might assume that the probability
of the next word depends only on the two previous words. You'll often
see "Markov assumption" defined to allow only a single item of
context. However, short finite contexts can easily be simulated using
one-item contexts.
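For instance, a two-word context can be simulated by bundling adjacent items
into a single composite state: define \(s_k = (w_{k-1}, w_k)\). Then a dependence
on the two previous words, \(P(w_{k+1} \mid w_k, w_{k-1})\), becomes a dependence
on one previous composite state, \(P(s_{k+1} \mid s_k)\), since \(s_{k+1}\)
determines \(w_{k+1}\) and must agree with \(s_k\) on the shared word \(w_k\).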

For POS tagging, we make two Markov assumptions. For each position
k:

the word \(w_k\) depends only on the tag \(t_k\)

the tag \(t_k\) depends only on the previous tag \(t_{k-1}\)

Therefore

\( P(W \mid T) = \prod_{i=1}^n P(w_i \mid t_i)\)

\( P(T) = P(t_1 \mid \text{START}) * \prod_{k=2}^n P(t_k \mid t_{k-1}) \)

where \( P(t_1 \mid \text{START}) \) is the probability of tag \(t_1\) at the start of
a sentence.

Folding this all together,
we're finding the T that maximizes

\( P (T \mid W)\)

\(\propto P(W \mid T) * P(T) \)

\( = \prod_{i=1}^n P(w_i \mid t_i)\ \ * \ \
P(t_1 \mid \text{START})\ \ *\ \ \prod_{k=2}^n P(t_k \mid t_{k-1}) \)
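As a sketch of what this objective looks like in code (reusing the hypothetical
p_word_given_tag and p_tag_given_prev tables from the counting sketch above),
we can score any one candidate tag sequence in log space; efficiently searching
over all possible tag sequences is a separate problem:

```python
import math

def sequence_score(words, tags, p_word_given_tag, p_tag_given_prev):
    """Log of P(W | T) * P(T) for one candidate tag sequence, using the
    two Markov assumptions above.  The probability tables are assumed to
    be dicts keyed by (word, tag) and (prev_tag, tag); missing entries
    are treated as probability 0 (score -inf)."""
    assert len(words) == len(tags)
    score = 0.0
    prev_tag = "START"   # so the first transition is P(t_1 | START)
    for word, tag in zip(words, tags):
        emit = p_word_given_tag.get((word, tag), 0.0)       # P(w_k | t_k)
        trans = p_tag_given_prev.get((prev_tag, tag), 0.0)  # P(t_k | t_{k-1})
        if emit == 0.0 or trans == 0.0:
            return float("-inf")
        score += math.log(emit) + math.log(trans)
        prev_tag = tag
    return score

# e.g. sequence_score(["the", "dog", "barks"], ["DET", "NOUN", "VERB"],
#                     p_word_given_tag, p_tag_given_prev)
```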

## HMM picture

Like Bayes nets, HMMs are a type of graphical model and so we can use
similar pictures to visualize the statistical dependencies:

In this case, we have a sequence of tags at the top. Each
tag depends only on the previous tag. Each
word (bottom row) depends only on the corresponding tag.
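Schematically:

```
t_1 ---> t_2 ---> t_3 ---> ... ---> t_n      (tags, top row)
 |        |        |                 |
 v        v        v                 v
w_1      w_2      w_3      ...      w_n      (words, bottom row)
```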