MP4: Hidden Markov Models

In this lab, you'll train and test hidden Markov models in order to recognize spoken digits.

We're going to use python-style numbering in this MP, so the states of each HMM are numbered $0$ through nstates$-1$, and the cepstral coefficients are numbered $0$ through nceps$-1$.

We're going to assume, throughout this MP, that $$\pi_i=\left\{\begin{array}{ll}1&i=0\\0&\mbox{otherwise}\end{array}\right.$$

With that assumption, the only parameters that we need to learn, for the $y^{\textrm{th}}$ word, are

$$\Lambda_y=\left\{a_{ij},\mu_{id},\Sigma_{icd}:0\le i,j<\mbox{nstates},0\le c,d<\mbox{nceps}\right\}.$$

The data for this MP are drawn from the Free Spoken Digit Dataset; its contributors are listed in the dataset's repository.

Since that dataset is CC-BY-SA 4.0, and since this MP distribution includes some data from there, this MP is also CC-BY-SA 4.0. That means that if you revise and redistribute it, your revision should also be CC-BY-SA, or a compatible license.

Please consider downloading the original dataset, and contributing your own voice to the dataset!


Part 0: Loading and playing the audio examples

The audio files are provided for you in the file data.hdf5, because that seemed like the easiest way to distribute them. If you prefer, you can download the original data; we're just using all examples of the words "1", "2", and "3". We're taking lucas to be the dev speaker and yweweler to be the test speaker; all other speakers are training speakers.

This file uses HDF5 hierarchical groups to organize its contents. Here is how you can access the contents:
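For example, here's a minimal sketch using the h5py library. The group and dataset names in the last line are hypothetical; use visit to discover the real hierarchy.

```python
import h5py

with h5py.File('data.hdf5', 'r') as f:
    # Recursively print the name of every group and dataset in the file
    f.visit(print)
    # Hypothetical example of reading one waveform into a numpy array;
    # substitute the group/dataset names you actually see printed above.
    waveform = f['train']['1']['0'][:]
```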

Let's plot the first waveform of each class.

OK, now let's plot a spectrogram. We'll use 25ms frames, overlapping by 15ms (the sampling rate is 8kHz).
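The original plotting cell isn't reproduced here, but the following sketch shows one way such a spectrogram can be computed with numpy. At 8kHz, a 25ms frame is 200 samples, and a 15ms overlap leaves a 10ms (80-sample) hop; the Hamming window is an assumption.

```python
import numpy as np

def spectrogram(waveform, fs=8000, frame_ms=25, overlap_ms=15):
    # Frame length and hop, in samples: 200 and 80 at fs=8000
    frame_len = int(fs * frame_ms / 1000)
    hop = frame_len - int(fs * overlap_ms / 1000)
    nframes = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[t*hop : t*hop + frame_len] for t in range(nframes)])
    frames = frames * np.hamming(frame_len)        # taper each frame
    return np.abs(np.fft.rfft(frames, axis=1))     # (nframes, frame_len//2 + 1)
```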

In these spectrograms, you can see both broad resonant peaks (the formants) and fine horizontal striations at multiples of the fundamental frequency (the pitch).

In order to keep the formant information (which tells you about the word), and discard pitch information (which tells you about the speaker), we can do something called "liftering". Liftering a spectrogram can be done by computing the Fourier transform of its log magnitude (called the "cepstrum"), then windowing the cepstrum, and transforming it back into a log spectrum, as shown in the next block.
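Here is a sketch of that computation, assuming the (nframes, nfreqs) magnitude spectrogram from above; the cutoff nceps=25 is a placeholder, and compute_features in submitted.py may differ in detail.

```python
import numpy as np

def lifter(spec, nceps=25):
    log_spec = np.log(np.maximum(spec, 1e-30))    # log magnitude; avoid log(0)
    cepstrum = np.fft.irfft(log_spec, axis=1)     # real, even cepstrum of each frame
    window = np.zeros(cepstrum.shape[1])
    window[:nceps] = 1                            # keep the low quefrencies...
    window[-(nceps - 1):] = 1                     # ...and their symmetric mirror
    liftered = np.fft.rfft(cepstrum * window, axis=1).real   # liftered log spectrum
    return cepstrum[:, :nceps], liftered          # windowed cepstrum, liftered spectrum
```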

Either the windowed cepstrum or the liftered spectrum will be reasonable representations to use for speech recognition, so they are provided for you as a utility function in submitted.py, as the method compute_features.


Part 1: Initialize the HMM

Expectation-Maximization finds a local maximum of the likelihood function, not a global maximum. The model parameters you end up with are guaranteed to be better than the parameters you start with, but that doesn't mean they're guaranteed to be good: the old maxim "garbage in, garbage out" applies.

In order to start with sensible parameters, a reasonable heuristic is as follows:

  1. Divide each spectrogram into nstates segments of uniform duration.
  2. Initialize each state's parameters using the frames assigned to that state.

Here is a docstring describing this process:
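The actual docstring lives in submitted.py; the following is a sketch of its content, and the exact signature may differ:

```python
def initialize_hmm(X_list, nstates):
    """Initialize hidden Markov model parameters by uniform segmentation.

    @param:
    X_list (list of (nframes[n], nceps) ndarrays): feature matrix for each utterance
    nstates (int): number of HMM states

    @return:
    A (nstates, nstates): A[i,j] is the fraction of frames in segment i that are
        followed by a frame in segment j, under the uniform segmentation
    Mu (nstates, nceps): Mu[i,:] is the mean of the frames assigned to state i
    Sigma (nstates, nceps, nceps): Sigma[i,:,:] is the covariance of the frames
        assigned to state i
    """
```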

The training data contains examples for three different words: '1', '2', and '3'. Let's initialize each HMM using the data from the corresponding word.

It's not clear whether we want to use the cepstral features or the spectral features. For now, let's use the cepstra.

For example, if the function works, here's a way we can put all of the parameters into a single dict:
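For instance, something like the following sketch; the container name train_cepstra and the choice of nstates=5 are placeholders.

```python
import submitted

# train_cepstra is assumed to map each word to a list of (nframes, nceps)
# feature matrices; nstates=5 is an arbitrary placeholder.
Lambda = {
    y: submitted.initialize_hmm(train_cepstra[y], nstates=5)
    for y in ['1', '2', '3']
}
```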

It's easy to see that the $A$ matrices are OK by scrolling through the output above. To check whether the $\mu$ vectors are reasonable, let's compare them to the global average cepstrum.

Similarly, let's plot the standard deviations (square roots of the main diagonal of each covariance), to see if those are reasonable.

Finally, let's check the eigenvalues of each covariance matrix, to make sure they're also reasonable.
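For instance, a quick positive-definiteness check might look like this, assuming the Lambda dict sketched above with each entry an (A, Mu, Sigma) tuple:

```python
import numpy as np

for y in ['1', '2', '3']:
    A, Mu, Sigma = Lambda[y]
    eigvals = np.linalg.eigvalsh(Sigma)   # (nstates, nceps), ascending per state
    # All eigenvalues should be strictly positive for a valid covariance
    print('word', y, 'smallest eigenvalue per state:', eigvals[:, 0])
```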

The following code blocks are provided to help you debug. They compare your results, in each of the areas above, to the results saved in solutions.hdf5. If any of the reported differences are nonzero, you should try to figure out why.


Part 2: Calculate the observation pdf

The observation pdf is a multivariate normal distribution. For numerical stability, we floor it at $10^{-100}$, to make sure that none of its values gets smaller than that.

$$b_i(\vec{x}_t) = \max\left(\frac{1}{(2\pi)^{D/2}|\Sigma_i|^{1/2}}e^{-\frac{1}{2}(\vec{x}_t-\vec\mu_i)^T\Sigma_i^{-1}(\vec{x}_t-\vec\mu_i)}, 10^{-100}\right)$$

You can program this yourself, but it's also OK to use scipy.stats.multivariate_normal. The autograder will have scipy available this time.

Notice that what we're doing is replacing each $X$ matrix (size (nframes, nceps), with one cepstral vector per frame) with a $B$ matrix (size (nframes, nstates)), in which B[t,i] contains $b_i(\vec{x}_t)$ for that utterance.
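Here's a sketch of that computation using scipy; the function name observation_pdf is a placeholder for whatever submitted.py calls it.

```python
import numpy as np
from scipy.stats import multivariate_normal

def observation_pdf(X, Mu, Sigma):
    """B[t,i] = max(Normal(X[t]; Mu[i], Sigma[i]), 1e-100)."""
    nframes, nstates = X.shape[0], Mu.shape[0]
    B = np.zeros((nframes, nstates))
    for i in range(nstates):
        # scipy evaluates all frames at once when X is (nframes, nceps)
        B[:, i] = multivariate_normal.pdf(X, mean=Mu[i], cov=Sigma[i])
    return np.maximum(B, 1e-100)   # floor for numerical stability
```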

The $B$ matrix has a huge, huge dynamic range. If you try to visualize the raw probability, most of what you see will look like zero. Instead, it is more useful to visualize the log probability, as shown here:

Here's some code to help you debug, by showing the percentage difference between your solutions and the solutions in solutions.hdf5 in the first example of the word "1".


Part 3: Scaled Forward Algorithm

The first step in training the HMM is the scaled forward algorithm. It computes two variables, $\hat\alpha_t(i)$ and $g_t$. The definitions of these variables are:

$$\hat\alpha_t(i) = p(q_t=i|\vec{x}_1,\ldots,\vec{x}_t,\Lambda)$$

$$g_t = p(\vec{x}_t|\vec{x}_1,\ldots,\vec{x}_{t-1},\Lambda)$$

These can be computed using the scaled forward algorithm (a code sketch follows the list):

  1. Initialize (note, we assume $\pi_0=1$, and $\pi_i=0\forall i\ne 0$): $$\tilde\alpha_1(i)=\pi_i b_i(\vec{x}_1)$$ $$g_1 = \sum_i\tilde\alpha_1(i)$$ $$\hat\alpha_1(i) = \frac{\tilde\alpha_1(i)}{g_1}$$

  2. Iterate $$\tilde\alpha_t(j)=\sum_{i=0}^{N-1}\hat\alpha_{t-1}(i)a_{ij}b_j(\vec{x}_t)$$ $$g_t = \sum_j\tilde\alpha_t(j)$$ $$\hat\alpha_t(j) = \frac{\tilde\alpha_t(j)}{g_t}$$

  3. Terminate (note: this part will be calculated later, in the function submitted.recognize) $$p(X|\Lambda) = \prod_{t=1}^T g_t$$
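Here's a minimal numpy sketch of steps 1 and 2; the termination step, as noted, comes later in submitted.recognize.

```python
import numpy as np

def scaled_forward(A, B):
    """A: (nstates, nstates) transition matrix; B: (nframes, nstates) observation pdf.
    Returns Alpha_hat (nframes, nstates) and G (nframes,)."""
    nframes, nstates = B.shape
    Alpha_hat = np.zeros((nframes, nstates))
    G = np.zeros(nframes)
    # Initialize: pi = [1, 0, ..., 0], so only state 0 has probability at t=0
    alpha_tilde = np.zeros(nstates)
    alpha_tilde[0] = B[0, 0]
    G[0] = alpha_tilde.sum()
    Alpha_hat[0] = alpha_tilde / G[0]
    # Iterate: alpha_tilde_t(j) = sum_i alpha_hat_{t-1}(i) a_ij b_j(x_t)
    for t in range(1, nframes):
        alpha_tilde = (Alpha_hat[t-1] @ A) * B[t]
        G[t] = alpha_tilde.sum()
        Alpha_hat[t] = alpha_tilde / G[t]
    return Alpha_hat, G
```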



Part 4: Scaled backward

As in the scaled forward algorithm, the scaling in the scaled backward algorithm is supposed to compensate for numerical underflow and overflow errors. Unfortunately, with Gaussians, the standard scaled backward algorithm (from the Rabiner article) doesn't compensate well enough. Instead, let's use the following definition:

$$\hat\beta_t(i) = \frac{p(\vec{x}_{t+1},\ldots,\vec{x}_T|q_t=i,\Lambda)}{\max_j p(\vec{x}_{t+1},\ldots,\vec{x}_T|q_t=j,\Lambda)}$$

In order to avoid any numerical overflow or underflow, the normalization has to be done at every single time step, like this (a code sketch follows the list):

  1. Initialize: $$\hat\beta_T(i)=1$$

  2. Iterate $$\tilde\beta_t(i)=\sum_{j=0}^{N-1}a_{ij}b_j(\vec{x}_{t+1})\hat\beta_{t+1}(j)$$ $$c_t=\max_i\tilde\beta_t(i)$$ $$\hat\beta_t(i) = \frac{\tilde\beta_t(i)}{c_t}$$
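A matching numpy sketch:

```python
import numpy as np

def scaled_backward(A, B):
    """Returns Beta_hat (nframes, nstates), max-normalized at every step."""
    nframes, nstates = B.shape
    Beta_hat = np.zeros((nframes, nstates))
    Beta_hat[-1] = 1.0                 # initialize: beta_hat_T(i) = 1
    for t in range(nframes - 2, -1, -1):
        # beta_tilde_t(i) = sum_j a_ij b_j(x_{t+1}) beta_hat_{t+1}(j)
        beta_tilde = A @ (B[t+1] * Beta_hat[t+1])
        Beta_hat[t] = beta_tilde / beta_tilde.max()
    return Beta_hat
```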


Part 5: Find the state and segment posterior probabilities

The state posterior probability is $\gamma_t(i)=p(q_t=i|X,\Lambda)$. The segment posterior probability is $\xi_t(i,j)=p(q_t=i,q_{t+1}=j|X,\Lambda)$.
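Both can be computed from the scaled forward and backward variables: the per-frame scaling constants cancel when each distribution is normalized to sum to one. A sketch:

```python
import numpy as np

def posteriors(A, B, Alpha_hat, Beta_hat):
    """Returns Gamma (nframes, nstates) and Xi (nframes-1, nstates, nstates)."""
    # gamma_t(i) is proportional to alpha_hat_t(i) * beta_hat_t(i)
    gamma = Alpha_hat * Beta_hat
    Gamma = gamma / gamma.sum(axis=1, keepdims=True)
    # xi_t(i,j) is proportional to alpha_hat_t(i) a_ij b_j(x_{t+1}) beta_hat_{t+1}(j)
    nframes, nstates = B.shape
    Xi = np.zeros((nframes - 1, nstates, nstates))
    for t in range(nframes - 1):
        xi = Alpha_hat[t, :, None] * A * (B[t+1] * Beta_hat[t+1])[None, :]
        Xi[t] = xi / xi.sum()
    return Gamma, Xi
```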


Part 6: E-Step

For the E-step in an HMM, we want to find the expected values of the numerators and denominators that will be used in the M-step. These are things like:

$$A_{num}[i,j] = \sum_t \xi_t(i,j)$$

$$A_{den}[i] = \sum_j\sum_t\xi_t(i,j)$$

$$\mu_{num}[i,:] = \sum_t\gamma_t(i)\vec{x}_t$$

$$\mu_{den}[i]=\sum_t\gamma_t(i)$$

$$\Sigma_{num}[i,:,:]=\sum_t\gamma_t(i)(\vec{x}_t-\vec\mu_i)(\vec{x}_t-\vec\mu_i)^T$$

$$\Sigma_{den}[i]=\sum_t\gamma_t(i)$$

Notice a couple of weird things here: the denominators $\mu_{den}$ and $\Sigma_{den}$ are identical (both are the expected number of frames spent in state $i$), and $\Sigma_{num}$ is computed using the mean vector $\vec\mu_i$ from the current model, since the E-step runs before the M-step re-estimates it.
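Here's a sketch of the accumulators for a single utterance; in the MP you would sum them over all training utterances of each word before dividing.

```python
import numpy as np

def estep(X, Mu, Gamma, Xi):
    A_num = Xi.sum(axis=0)                      # (nstates, nstates)
    A_den = A_num.sum(axis=1)                   # (nstates,): summed over j
    Mu_num = Gamma.T @ X                        # (nstates, nceps)
    Mu_den = Gamma.sum(axis=0)                  # (nstates,)
    nstates, nceps = Mu.shape
    Sigma_num = np.zeros((nstates, nceps, nceps))
    for i in range(nstates):
        dev = X - Mu[i]                         # deviations from the current mean
        Sigma_num[i] = (Gamma[:, i, None] * dev).T @ dev
    Sigma_den = Mu_den.copy()                   # identical to Mu_den
    return A_num, A_den, Mu_num, Mu_den, Sigma_num, Sigma_den
```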


Part 7: M-Step

The M-Step divides each numerator by its corresponding denominator, in order to re-estimate the model parameters.

The only new concept here is Tikhonov regularization. In order to keep $\Sigma_i$ from becoming singular, we add $\lambda I$ to it, where $I$ is the identity matrix, and $\lambda$ is a scalar constant called the regularizer:

$$\Sigma_i = \frac{\Sigma_{num}[i]}{\Sigma_{den}[i]} + \lambda I$$
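A sketch of the whole M-step; the default value of the regularizer is an arbitrary placeholder.

```python
import numpy as np

def mstep(A_num, A_den, Mu_num, Mu_den, Sigma_num, Sigma_den, regularizer=1.0):
    A = A_num / A_den[:, None]                  # each row now sums to one
    Mu = Mu_num / Mu_den[:, None]
    nceps = Mu.shape[1]
    # Tikhonov regularization keeps each covariance nonsingular
    Sigma = Sigma_num / Sigma_den[:, None, None] + regularizer * np.eye(nceps)
    return A, Mu, Sigma
```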

If you're paying attention, you will notice that the new model parameters $\Lambda_{new}$ are pretty similar to the old model parameters! This probably means that our initial parameter estimates were close to a local optimum.


Part 8: Recognize Speech!!

In order to recognize speech, we want to calculate the log probability of each devtest utterance under each of our three models, and then choose the model with the highest log probability. If $X_n$ is the feature matrix for the $n^{\textrm{th}}$ devtest utterance, and $\Lambda_y$ is the model of word $y$, then:

$$\hat{y}_n = \arg\max_y p(X_n | \Lambda_y)$$

We can measure accuracy by keeping the results separated according to the reference label ('1', '2', or '3') of each devtest sample:
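Since $p(X_n|\Lambda_y)=\prod_t g_t$, the log probability is just the sum of $\log g_t$ from the scaled forward algorithm. Here's a sketch that reuses the functions above; dev_cepstra and Lambda are the hypothetical containers from earlier.

```python
import numpy as np

def logprob(X, params):
    A, Mu, Sigma = params
    B = observation_pdf(X, Mu, Sigma)
    _, G = scaled_forward(A, B)
    return np.log(G).sum()            # log p(X | Lambda) = sum_t log g_t

for y in ['1', '2', '3']:
    hyps = [max(['1', '2', '3'], key=lambda w: logprob(X, Lambda[w]))
            for X in dev_cepstra[y]]
    accuracy = np.mean([yhat == y for yhat in hyps])
    print('word %s: accuracy %.2f' % (y, accuracy))
```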


Extra Credit

You can earn up to 10% extra credit on this MP by finishing the file called extra.py, and submitting it to the autograder.

When you unpack the file mp4_extra.zip, it will give you the following files:

The extra credit assignment this time is to get the best speech recognition accuracy that you can!

You can test your accuracy like this: