MP4: Hidden Markov Models

In this lab, you'll train and test hidden Markov models in order to find phoneme segment boundaries in a large speech corpus.

The data for this MP are drawn from the LJSpeech corpus. Transcriptions have been converted to phonemes using phonemizer. Waveforms have been converted to MFCC using librosa.feature.mfcc, with settings of hop_length=10ms, win_length=25ms, n_mfcc=16, fmax=8000.

Warning: This Jupyter notebook will use librosa extensively, but the autograder will not have librosa. All of the code in your submitted.py should use numpy only, not librosa.

  1. Transition Posterior Probabilities
  2. reestimate: find model parameters given transition posteriors
  3. Calculate the observation pdf
  4. Viterbi Training
  5. Show segmentations
  6. Extra Credit: Recognize Speech

Part 0: Loading the MFCC and transcripts.

The transcripts are provided to you in the file transcripts.json.

In [1]:
import json, random

with open('transcripts.json') as f:
    transcripts=json.load(f)

print(transcripts.keys())
dict_keys(['train', 'dev', 'eval', 'phoneset', 'senoneset'])
In [2]:
print('Training transcripts are available for %d files'%(len(transcripts['train'])))
k0 = list(transcripts['train'].keys())[0]
print('The first file is',k0)
print('Its word transcript is:',transcripts['train'][k0]['word'])
print('Its phone transcript is:',transcripts['train'][k0]['phone'])
print('Its senone transcript is:',transcripts['train'][k0]['senone'])
Training transcripts are available for 1000 files
The first file is LJ001-0001
Its word transcript is: Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition
Its phone transcript is: pɹɪntɪŋ, ɪnðɪ oʊnli sɛns wɪð wɪtʃ wi ɑɹ æt pɹɛzənt kənsɜnd, dɪfɚz fɹʌm moʊst ɪf nɑt fɹʌm ɔl ðɪ ɑɹts ænd kɹæfts ɹɛpɹɪzɛntᵻd ɪnðɪ ɛksɪbɪʃən
Its senone transcript is: $1 $2 $3 p1 p2 p3 ɹ1 ɹ2 ɹ3 ɪ1 ɪ2 ɪ3 n1 n2 n3 t1 t2 t3 ɪ1 ɪ2 ɪ3 ŋ1 ŋ2 ŋ3 $1 $2 $3 ɪ1 ɪ2 ɪ3 n1 n2 n3 ð1 ð2 ð3 ɪ1 ɪ2 ɪ3 # o1 o2 o3 ʊ1 ʊ2 ʊ3 n1 n2 n3 l1 l2 l3 i1 i2 i3 # s1 s2 s3 ɛ1 ɛ2 ɛ3 n1 n2 n3 s1 s2 s3 # w1 w2 w3 ɪ1 ɪ2 ɪ3 ð1 ð2 ð3 # w1 w2 w3 ɪ1 ɪ2 ɪ3 t1 t2 t3 ʃ1 ʃ2 ʃ3 # w1 w2 w3 i1 i2 i3 # ɑ1 ɑ2 ɑ3 ɹ1 ɹ2 ɹ3 # æ1 æ2 æ3 t1 t2 t3 # p1 p2 p3 ɹ1 ɹ2 ɹ3 ɛ1 ɛ2 ɛ3 z1 z2 z3 ə1 ə2 ə3 n1 n2 n3 t1 t2 t3 # k1 k2 k3 ə1 ə2 ə3 n1 n2 n3 s1 s2 s3 ɜ1 ɜ2 ɜ3 n1 n2 n3 d1 d2 d3 $1 $2 $3 d1 d2 d3 ɪ1 ɪ2 ɪ3 f1 f2 f3 ɚ1 ɚ2 ɚ3 z1 z2 z3 # f1 f2 f3 ɹ1 ɹ2 ɹ3 ʌ1 ʌ2 ʌ3 m1 m2 m3 # m1 m2 m3 o1 o2 o3 ʊ1 ʊ2 ʊ3 s1 s2 s3 t1 t2 t3 # ɪ1 ɪ2 ɪ3 f1 f2 f3 # n1 n2 n3 ɑ1 ɑ2 ɑ3 t1 t2 t3 # f1 f2 f3 ɹ1 ɹ2 ɹ3 ʌ1 ʌ2 ʌ3 m1 m2 m3 # ɔ1 ɔ2 ɔ3 l1 l2 l3 # ð1 ð2 ð3 ɪ1 ɪ2 ɪ3 # ɑ1 ɑ2 ɑ3 ɹ1 ɹ2 ɹ3 t1 t2 t3 s1 s2 s3 # æ1 æ2 æ3 n1 n2 n3 d1 d2 d3 # k1 k2 k3 ɹ1 ɹ2 ɹ3 æ1 æ2 æ3 f1 f2 f3 t1 t2 t3 s1 s2 s3 # ɹ1 ɹ2 ɹ3 ɛ1 ɛ2 ɛ3 p1 p2 p3 ɹ1 ɹ2 ɹ3 ɪ1 ɪ2 ɪ3 z1 z2 z3 ɛ1 ɛ2 ɛ3 n1 n2 n3 t1 t2 t3 ᵻ1 ᵻ2 ᵻ3 d1 d2 d3 # ɪ1 ɪ2 ɪ3 n1 n2 n3 ð1 ð2 ð3 ɪ1 ɪ2 ɪ3 # ɛ1 ɛ2 ɛ3 k1 k2 k3 s1 s2 s3 ɪ1 ɪ2 ɪ3 b1 b2 b3 ɪ1 ɪ2 ɪ3 ʃ1 ʃ2 ʃ3 ə1 ə2 ə3 n1 n2 n3 $1 $2 $3

The data provided for each file include:

  • An ID such as LJ001-0001. The MFCC for this file will have the same ID. The waveform name in the original corpus, if you want to listen, is LJ001-0001.wav. The original corpus contains 13,100 files, of which our data distribution contains only 1500: 1000 training files, 250 dev files, and 250 eval files.
  • A word transcript, giving the words the narrator read, all in a single string, separated by spaces and punctuation.
  • A phone transcript, showing the phonemes that make up those words, all in a single string. These are written using the symbols of the International Phonetic Alphabet (IPA), with punctuation and spacing copied from the text.
  • A senone transcript, showing the sequence of HMM states (senones) making up those phonemes, all in a single string, separated by spaces (so you need to use transcripts['train'][k0]['senone'].split() to get a list). In general, there are three senones per phoneme, modeling the beginning, middle, and end of the phoneme. Word boundaries are marked with #. The longer silences implied by punctuation, and by the beginning and ending of each file, are marked with the three-senone sequence $1,$2,$3.

The goal of this MP is to develop an HMM in which every senone is a state, train the senone models, and then use the trained models to find the time-alignment of phones to the waveform.

The list of all distinct phones is provided as transcripts['phoneset']. The list of all distinct senones is provided as transcripts['senoneset'].

In [7]:
phones = transcripts['phoneset'].split()
print('There are %d distinct phones:'%(len(phones)))
print(phones)
print('')
senones = transcripts['senoneset'].split()
print('There are %d distinct senones:'%(len(senones)))
print(senones)
There are 46 distinct phones:
['#', '$', 'a', 'b', 'd', 'e', 'f', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'z', 'æ', 'ð', 'ŋ', 'ɐ', 'ɑ', 'ɔ', 'ə', 'ɚ', 'ɛ', 'ɜ', 'ɡ', 'ɪ', 'ɫ', 'ɹ', 'ɾ', 'ʃ', 'ʊ', 'ʌ', 'ʒ', 'ʔ', 'θ', 'ᵻ']

There are 136 distinct senones:
['#', '$1', '$2', '$3', 'a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'd1', 'd2', 'd3', 'e1', 'e2', 'e3', 'f1', 'f2', 'f3', 'h1', 'h2', 'h3', 'i1', 'i2', 'i3', 'j1', 'j2', 'j3', 'k1', 'k2', 'k3', 'l1', 'l2', 'l3', 'm1', 'm2', 'm3', 'n1', 'n2', 'n3', 'o1', 'o2', 'o3', 'p1', 'p2', 'p3', 'r1', 'r2', 'r3', 's1', 's2', 's3', 't1', 't2', 't3', 'u1', 'u2', 'u3', 'v1', 'v2', 'v3', 'w1', 'w2', 'w3', 'x1', 'x2', 'x3', 'z1', 'z2', 'z3', 'æ1', 'æ2', 'æ3', 'ð1', 'ð2', 'ð3', 'ŋ1', 'ŋ2', 'ŋ3', 'ɐ1', 'ɐ2', 'ɐ3', 'ɑ1', 'ɑ2', 'ɑ3', 'ɔ1', 'ɔ2', 'ɔ3', 'ə1', 'ə2', 'ə3', 'ɚ1', 'ɚ2', 'ɚ3', 'ɛ1', 'ɛ2', 'ɛ3', 'ɜ1', 'ɜ2', 'ɜ3', 'ɡ1', 'ɡ2', 'ɡ3', 'ɪ1', 'ɪ2', 'ɪ3', 'ɫ1', 'ɫ2', 'ɫ3', 'ɹ1', 'ɹ2', 'ɹ3', 'ɾ1', 'ɾ2', 'ɾ3', 'ʃ1', 'ʃ2', 'ʃ3', 'ʊ1', 'ʊ2', 'ʊ3', 'ʌ1', 'ʌ2', 'ʌ3', 'ʒ1', 'ʒ2', 'ʒ3', 'ʔ1', 'ʔ2', 'ʔ3', 'θ1', 'θ2', 'θ3', 'ᵻ1', 'ᵻ2', 'ᵻ3']

The audio waveforms are not provided to you (you can download them from https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2). Instead, mfcc.hdf5 provides you with mel-frequency cepstral coefficients for each audio file.

There are a total of 1000 training files (sampled from the 12,600 training files in the original LJSpeech distribution), plus 250 dev files, and 250 eval files.

In [8]:
import h5py
data = {}
with h5py.File('mfcc.hdf5','r') as f:
    for group in f.keys():
        data[group] = {}
        for id in list(f[group].keys())[:1000]:
            data[group][id] = f[group][id][:]
print(data.keys())
dict_keys(['dev', 'eval', 'train'])
In [9]:
print('Training MFCCs are provided for %d files'%(len(data['train'].keys())))
mfcc0 = data['train'][k0]
print('mfcc[%s]: %d frames, each w/%d MFCCs'%(k0,mfcc0.shape[1],mfcc0.shape[0]))
Training MFCCs are provided for 1000 files
mfcc[LJ001-0001]: 968 frames, each w/20 MFCCs

The MFCC (mel-frequency cepstral coefficients) are computed from an audio waveform using the following steps:

  1. Compute the spectrogram
  2. Warp the frequency axis, from linear frequency to mel frequency. The result of this step is called a melspectrogram
  3. Compute the discrete cosine transform (DCT) of the log (dB) melspectrogram

The DCT step is used to (approximately) decorrelate the features. Melspectrogram samples at neighboring frequencies (and spectrogram samples at neighboring frequencies) are highly correlated, which means that the ASR would need to waste parameters modeling the correlation. Taking the DCT helps with modeling, a lot.

However, taking the IDCT makes the features much harder for a human being to understand. For this reason, I recommend that, whenever you want to visualize the time alignment of any phoneme transcript to the MFCC, you should first:

  1. inverse-DCT to get back the melspectrogram. Librosa makes this easy: you can use the function mfcc_to_mel.
  2. Convert from amplitude to dB, with a dynamic range of 50 or 60dB
  3. Plot only a reasonable amount of time, e.g., 1.5 seconds (150 frames).
In [11]:
import librosa
import matplotlib.pyplot as plt

def invert_mfcc(mfcc):
    step1 = librosa.feature.inverse.mfcc_to_mel(mfcc)
    step2 = librosa.amplitude_to_db(step1,top_db=50)
    return step2

fs = 22050

fig, ax = plt.subplots(2,1,figsize=(14,8))
im0=librosa.display.specshow(
    invert_mfcc(data['train'][k0]),
    sr=22050,
    hop_length=int(0.01*fs),
    win_length=int(0.025*fs),
    y_axis='mel',
    x_axis='s',
    fmax=8000,
    ax=ax[0]
)
ax[0].set_title('Melspectrogram of %s: %s...'%(k0,transcripts['train'][k0]['word'][:21]),fontsize=18)
plt.colorbar(im0,ax=ax[0])
im1=ax[1].imshow(data['train'][k0][:,:150],aspect='auto')
ax[1].set_xlabel('Time (frames)',fontsize=18)
ax[1].set_ylabel('Quefrency (samples)',fontsize=18)
ax[1].set_title('MFCC of %s: %s...'%(k0,transcripts['train'][k0]['phone'][:19]),fontsize=18)
plt.colorbar(im1,ax=ax[1])
fig.tight_layout()

The MFCC has a lot less information than the original waveform (only 20 numbers every 10ms, as opposed to 220.5 samples every 10ms in the original waveform), so you might reasonably ask:

Is that enough information for speech recognition?

You can test this by resynthesizing speech audio from the MFCC. The function mfcc_to_audio inverts the MFCC to melspectrogram, then unwarps to a linear spectrogram, then uses Griffin-Lim to resynthesize speech. The result sounds horribly artificial, but is usually intelligible!

In [12]:
import IPython.display
wav = librosa.feature.inverse.mfcc_to_audio(
    data['train'][k0],
    sr=22050,
    hop_length=int(0.01*fs),
    win_length=int(0.025*fs),
    fmax=8000
)
IPython.display.Audio(data=wav,rate=fs)
Out[12]:

A warning about computation time

As you've seen, there are 1000 training files available. Every function in this MP will include a parameter ntrain that specifies how many of those files you want to process.

  • The more training files you use, the better your results will look.
  • You are not graded on how good your results are; you are only graded on whether or not your code produces correct results for one file at a time.
  • Test your code with ntrain=1. Debug that way until it works.
  • Then time your code using ntrain=1. Try different implementations to see if you can reduce the compute time while maintaining correct results. (This step is not required to get points for the MP; it only matters if you want to be able to train on more files.)
  • Once you've got a reasonable compute time, try training with a larger value of ntrain, perhaps something like ntrain=250. (This step is not required to get points for the MP; it only matters if you're trying to get better speech recognition performance.)
In [13]:
ntrain=250


Part 1: Transition Posterior Probabilities

Training a speech recognizer basically alternates among three steps:

  1. Use the current model to (hard- or soft-)resegment the audio
  2. Use the segmentation to compute the posterior probabilities of every possible transition in every frame of the training data
  3. Use the transition posteriors to retrain the model

Which comes first: the segmentation, the transition posteriors, or the model?

A reasonable place to start is by uniformly segmenting every utterance. Find the number of senones in each utterance, and then give each senone the same number of frames.

Let's write the alignment matrices for the $u^{\text{th}}$ utterance using the capital letters Alpha and Beta, which, unfortunately, look exactly like A and B, but in a non-italic font:

$$\mathrm{A}_u=\begin{bmatrix}\alpha_{u,1,1}&\cdots&\alpha_{u,1,T_u}\\ \vdots&\ddots&\vdots\\ \alpha_{u,M_u,1}&\cdots&\alpha_{u,M_u,T_u}\end{bmatrix},\qquad \mathrm{B}_u=\begin{bmatrix}\beta_{u,1,1}&\cdots&\beta_{u,1,T_u}\\ \vdots&\ddots&\vdots\\ \beta_{u,M_u,1}&\cdots&\beta_{u,M_u,T_u}\end{bmatrix},$$

where $M_u$ is the number of senones in the transcription of utterance $u$, $T_u$ is the number of MFCC frames, and $\alpha$ and $\beta$ are defined as

$$\alpha_{u,m,t}=c_1(u,t)\Pr\{x_1,\ldots,x_t,q_t=m\}$$
$$\beta_{u,m,t}=c_2(u,t)\Pr\{x_t,\ldots,x_{T_u}\mid q_t=m\}$$

where the constants $c_1$ and $c_2$ can be anything, as long as they are independent of $m$.

A "hard" alignment is one that allows $\alpha_{u,m,t}$ and $\beta_{u,m,t}$ to each be nonzero for only one value of $m$ for each $t$, which we can call $m^*(t)$. Since $c_1(u,t)$ and $c_2(u,t)$ can be anything, a "hard" alignment can be signaled by just setting $\alpha_{u,m^*(t),t}=\beta_{u,m^*(t),t}=1$, and $\alpha_{u,m,t}=\beta_{u,m,t}=0$ for $m\neq m^*(t)$.

When we don't yet have any useful segmentation information, a reasonable place to start is by giving every senone the same number of frames:

$$\alpha_{u,m,t}=\beta_{u,m,t}=\begin{cases}1 & \frac{(m-1)T_u}{M_u}\le t<\frac{mT_u}{M_u}\\ 0 & \text{otherwise}\end{cases}$$
In [14]:
import numpy as np
Alpha0 = {}
Beta0 = {}
for u in list(data['train'].keys())[:ntrain]:
    Y = transcripts['train'][u]['senone'].split()
    M = len(Y)
    T = data['train'][u].shape[1]
    Alpha = np.zeros((M,T))
    for m in range(M):
        Alpha[m,int(m*T/M):int((m+1)*T/M)]=1
    Alpha0[u]=Alpha
    Beta0[u]=Alpha
In [15]:
fig, ax = plt.subplots(2,1,figsize=(14,8))
im0=librosa.display.specshow(
    invert_mfcc(data['train'][k0])[:,:150],
    sr=fs,
    hop_length=int(0.01*fs),
    win_length=int(0.025*fs),
    y_axis='mel',
    x_axis='s',
    fmax=8000,
    ax=ax[0]
)
ax[0].set_title('Melspectrogram of %s: %s...'%(k0,transcripts['train'][k0]['word'][:21]),fontsize=18)
plt.colorbar(im0,ax=ax[0])
im1=ax[1].imshow(Alpha0[k0][:60,:150],aspect='auto')
ax[1].set_xlabel('Time (frames)',fontsize=18)
ax[1].set_ylabel('Senone index',fontsize=18)
N,T = Alpha0[k0].shape
ax[1].set_title('Uniform initial segmentation of %s: %d senones, %d frames'%(k0,N,T),fontsize=18)
fig.tight_layout()

Let $y_u=[y_1,\ldots,y_{M_u}]^T$ be the sequence of senone labels for the $u^{\text{th}}$ utterance, and let $q_u=[q_1,\ldots,q_{T_u}]^T$ be the state alignment, where $q_t$ is an integer in the range $1\le q_t\le M_u$.

HMM re-estimation is easiest (and much faster, computationally) if we first convert the alignment probabilities $\alpha$ and $\beta$ to a set of transition posterior probabilities, defined as

$$\xi_{u,m,t,d}=\Pr\{q_t=m,q_{t+1}=m+d\mid x_1,\ldots,x_{T_u}\}$$

Usually we assume that every senone in the transcription needs to claim at least one frame in the utterance. With that assumption, from any state $(m,t)$, it is only possible to transition to either $(m,t+1)$ or $(m+1,t+1)$, so the transition posterior probabilities are:

$$\xi_{u,m,t,d}=\begin{cases}\alpha_{u,m,t}\,a_{y_m,y_{m+d}}\,\beta_{u,m+d,t+1} & 0\le d\le 1,\ 1\le m,\ m+d\le M_u,\ 1\le t<T_u\\ \alpha_{u,m,T_u} & d=0,\ 1\le m\le M_u,\ t=T_u\\ 0 & \text{otherwise}\end{cases}$$
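Before writing submitted.transition_posteriors, it may help to see the formula as code. The sketch below handles one utterance at a time; the per-frame normalization (dividing so that the posteriors at each time sum to one) is an assumption about how the arbitrary constants $c_1$ and $c_2$ should be handled, and the helper name transition_posteriors_sketch is illustrative, not the reference solution.

import numpy as np

def transition_posteriors_sketch(Alphadict, Betadict, transcripts, A, ntrain):
    # Sketch only: turn the xi formula above into code, one utterance at a time.
    xi = {}
    for u in list(Alphadict.keys())[:ntrain]:
        Y = transcripts[u]['senone'].split()     # senone labels y_1 ... y_M
        Alpha, Beta = Alphadict[u], Betadict[u]
        M, T = Alpha.shape
        out = np.zeros((M, T, 2))
        for t in range(T - 1):
            for m in range(M):
                for d in (0, 1):
                    if m + d < M:
                        out[m, t, d] = Alpha[m, t] * A[Y[m]][Y[m + d]] * Beta[m + d, t + 1]
        out[:, T - 1, 0] = Alpha[:, T - 1]       # final frame: self-transition term only
        # assumed normalization: make the posteriors at each frame sum to one
        norm = out.sum(axis=(0, 2), keepdims=True)
        xi[u] = out / np.maximum(norm, 1e-300)
    return xi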
In [16]:
import importlib,submitted
importlib.reload(submitted)
help(submitted.transition_posteriors)
Help on function transition_posteriors in module submitted:

transition_posteriors(Alphadict, Betadict, transcripts, A, ntrain)
    Calculate the posterior probability of each
    transition given the training data and its
    transcript.
    
    @param:
    Alphadict[u][m,t] is the probability of the path 
     ending at the m'th senone at time t
    Betadict[u][m,t] is the probability of the path 
     starting at the m'th senone at time t
    transcripts[u]['senone'].split() is a 
     list of the senones in utterance u
    mfcc[u][:,t] is the t'th MFCC vector 
     in utterance u
    A[i][j] is probability of a transition from 
     senone i to senone j
    ntrain is the number of utterances to process
    
    @return:
    xi[u][m,t,d] is the posterior probability of a
     transition from the m'th to the (m+d)'th senone
     in utterance u at time t.  len(xi)==ntrain.

As a first step, debug using ntrain=1.

As an initial A matrix, we can use a matrix of all ones. This violates the laws of probability, but it makes the code easier to debug.

In [17]:
import numpy as np
import importlib, submitted, time
importlib.reload(submitted)

A_ones = {i:{j:1 for j in senones} for i in senones}

processing_start = time.time()
xi0=submitted.transition_posteriors(
    Alpha0,Beta0,
    transcripts['train'],
    A_ones,1
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 0.00750899 seconds
In [18]:
fig, ax = plt.subplots(2,1,figsize=(14,8))
im0=librosa.display.specshow(
    invert_mfcc(data['train'][k0])[:,:150],
    sr=22050,
    hop_length=int(0.01*fs),
    win_length=int(0.025*fs),
    y_axis='mel',
    x_axis='s',
    fmax=8000,
    ax=ax[0]
)
ax[0].set_title('Melspectrogram of %s: %s...'%(k0,transcripts['train'][k0]['word'][:21]),fontsize=18)
plt.colorbar(im0,ax=ax[0])
im1=ax[1].imshow(xi0[k0][:60,:150,0]+2*xi0[k0][:60,:150,1],aspect='auto')
plt.colorbar(im1,ax=ax[1])
ax[1].set_xlabel('Time (frames)',fontsize=18)
ax[1].set_ylabel('Senone index',fontsize=18)
ax[1].set_title('Transition posteriors (1=self,2=forward)',fontsize=18)
fig.tight_layout()

Once your code is working, try computing xi matrices for ntrain of the training files.

In [19]:
processing_start = time.time()
xi0=submitted.transition_posteriors(
    Alpha0,Beta0,
    transcripts['train'],
    A_ones,ntrain
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 1.32532 seconds


Part 2: reestimate: find model parameters given transition posteriors

The three types of model parameters for a Gaussian HMM are the transition probabilities, the state mean vectors, and the state variance vectors, defined as:

$$a_{i,j}=\Pr\{y_{q_t}=j\mid y_{q_{t-1}}=i\},\quad 1\le i,j\le N$$
$$\mu_i=E[x_t\mid y_{q_t}=i],\quad 1\le i\le N$$
$$\sigma_i^2=E[(x_t-\mu_i)^2\mid y_{q_t}=i],\quad 1\le i\le N,$$

where $N$ is the number of distinct senones (the size of the list transcripts['senoneset'].split()), and $(x_t-\mu_i)^2$ denotes squaring each scalar element of the vector. Given the transition posteriors $\xi_{u,m,t,d}$, the model re-estimation formulas are:

$$a_{i,j}=\frac{\sum_u\sum_{m:y_m=i}\sum_{d:y_{m+d}=j}\sum_{t=1}^{T_u}\xi_{u,m,t,d}}{\sum_u\sum_{m:y_m=i}\sum_{d=0}^{\min(M_u-m,1)}\sum_{t=1}^{T_u}\xi_{u,m,t,d}}$$
$$\mu_i=\frac{\sum_u\sum_{m:y_m=i}\sum_{d=0}^{\min(M_u-m,1)}\sum_{t=1}^{T_u}\xi_{u,m,t,d}\,x_t}{\sum_u\sum_{m:y_m=i}\sum_{d=0}^{\min(M_u-m,1)}\sum_{t=1}^{T_u}\xi_{u,m,t,d}}$$
$$\sigma_i^2=\max\left(\sigma_{\min}^2,\ \frac{\sum_u\sum_{m:y_m=i}\sum_{d=0}^{\min(M_u-m,1)}\sum_{t=1}^{T_u}\xi_{u,m,t,d}\,(x_t-\mu_i)^2}{\sum_u\sum_{m:y_m=i}\sum_{d=0}^{\min(M_u-m,1)}\sum_{t=1}^{T_u}\xi_{u,m,t,d}}\right),$$

where $\sigma_{\min}^2$ is a hyperparameter, used to make sure that the variance-weighted feature distances don't get too large.

Notice that each denominator is just the sum of the $a_{i,j}$ numerator, summed across $j$. For this reason, we only need accumulators for the numerators, not for the denominators.

BTW, to make all the denominators equal, we had to assume that there is a transition starting from the last time step. Can you see what extra transition was assumed by this formula?
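To make the accumulator idea concrete, here is a rough per-utterance sketch. The helper name accumulate_numerators_sketch and the exact loop organization are illustrative assumptions, not the reference reestimate; in particular, the variance numerator needs a second pass once the means are known.

import numpy as np

def accumulate_numerators_sketch(xi_u, Y, X):
    # xi_u[m,t,d]: transition posteriors for one utterance
    # Y: list of senone labels in the transcript; X: (nfeats, T) MFCC matrix
    A_num, mu_num, denom = {}, {}, {}
    M, T, _ = xi_u.shape
    for m in range(M):
        i = Y[m]
        gamma = xi_u[m, :, 0] + xi_u[m, :, 1]    # occupancy of the m'th senone at each frame
        A_num.setdefault(i, {})
        for d in (0, 1):
            if m + d < M:
                A_num[i][Y[m + d]] = A_num[i].get(Y[m + d], 0.0) + xi_u[m, :, d].sum()
        mu_num[i] = mu_num.get(i, 0.0) + X @ gamma            # sum_t gamma_t * x_t
        denom[i] = denom.get(i, 0.0) + gamma.sum()
    return A_num, mu_num, denom

# mu[i] would then be mu_num[i]/denom[i]; a second pass with the same gamma weights,
# accumulating ((X - mu[i][:, None])**2) @ gamma, gives the variance numerator.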

In [20]:
import importlib, submitted
importlib.reload(submitted)
help(submitted.reestimate)
Help on function reestimate in module submitted:

reestimate(xi, senones, transcripts, mfcc, minvar, ntrain)
    Maximum-likelihood re-estimation 
    of HMM transition probabilities.
    
    @param:
    xi[u][m,t,d] is the posterior probability of a
     transition from the m'th to the (m+d)'th senone
     in utterance u at time t.
    senones is a list of the senone labels
    transcripts[u]['senone'].split() is a 
     list of the senones in utterance u
    mfcc[u][:,t] is the t'th MFCC vector 
     in utterance u
    minvar (scalar): make sure all variances
     are greater than or equal to this number.
     Also, use this value for any variance terms
     whose calculation would cause a divide-by-zero error.
    ntrain is the number of utterances to process
    
    @return:
    A[i][j] is the probability of a transition
      from senone i to senone j (i,j strings).
      If calculating A[i][j] would divide by zero,
      set A[i][j]=0.
    mu[i] is the reestimated mean (an np.ndarray)
      vector for senone i (a string).
      If calculating mu[i] would divide by zero,
      set mu[i]=np.zeros(nfeats).
    var[i] is the reestimated variance vector
      (an ndarray) for senone i (a string).
      If calculating var[i] would divide by zero,
      set var[i]=np.tile(minvar, nfeats).
    A_numerator[i][j] is the reestimation numerator 
      for transitions from senone i to j, including
      the transition from time T to time T+1
    mu_numerator[i] is the reestimation numerator
     for the mean vector of senone i
    var_numerator[i] is the reestimation numerator
     for the variance vector of senone i

In [21]:
import numpy as np
import importlib, submitted, time
importlib.reload(submitted)

processing_start = time.time()
A0,mu0,var0,A_num,mu_num,var_num=submitted.reestimate(
    xi0,
    transcripts['senoneset'].split(),
    transcripts['train'],
    data['train'],
    1,
    1
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 0.102285 seconds

Now let's test some values.

  • A_num['$1']['$2'] should equal 4, because $1 occurs 4 times in this utterance
  • A_num['$1']['$1'] should be more than A_num['$1']['$2'], because there are more than 2 frames per senone in this file.
  • A_num['$1']['$3'] should be zero, because that transition should never occur.
  • A['o1']['o2'] should be A_num['o1']['o2']/sum(A_num['o1'].values())
  • sum(A_num['a1'].values()) should be zero, because the phoneme a does not occur in this training file
  • A['a1']['a2'] and mu['a1'][0] should both be zero, but var['a1'][0] should be minvar, because those are the default settings specified for variables whose computation would cause a divide-by-zero error.
  • mu['o1'] should be mu_num['o1']/sum(A_num['o1'].values())
  • var['o1'] should be var_num['o1']/sum(A_num['o1'].values())
  • If the model is well-trained, the mean vector for vowels like o should have higher energy (zeroth cepstral coefficient) than the mean vector for silent phonemes like $. It's not certain that this will happen for this initial training, because the starting segmentation is so bad.
In [22]:
print('A_num["$1"]["$2"] is',A_num['$1']['$2'])
print('A_num["$1"]["$1"] is',A_num['$1']['$1'])
print('A_num["$1"]["$3"] is',A_num['$1']['$3'])
print('A0["o1"]["o2"] is',A0['o1']['o2'])
print('A_num["o1"]["o2"]/denom is',A_num['o1']['o2']/sum(A_num['o1'].values()))
print('sum(A_num["a1"].values()) is',sum(A_num["a1"].values()))
print('A0["a1"]["a2"] is',A0["a1"]["a2"])
print('mu0["a1"][0] is',mu0["a1"][0])
print('var0["a1"][0] is',var0["a1"][0])
A_num["$1"]["$2"] is 4.0
A_num["$1"]["$1"] is 5.0
A_num["$1"]["$3"] is 0
A0["o1"]["o2"] is 0.3333333333333333
A_num["o1"]["o2"]/denom is 0.3333333333333333
sum(A_num["a1"].values()) is 0
A0["a1"]["a2"] is 0
mu0["a1"][0] is 0.0
var0["a1"][0] is 1
In [23]:
import matplotlib.pyplot as plt
nfeats = len(mu0['o1'])
n = np.arange(nfeats)
fig, ax = plt.subplots(2,2,figsize=(14,8))
ax[0,0].plot(n,mu0['o1'],n,mu_num['o1']/sum(A_num['o1'].values()))
ax[0,0].set_title('Mean of senone o1')
ax[1,0].plot(n,var0['o1'],n,var_num['o1']/sum(A_num['o1'].values()))
ax[1,0].set_title('Variance of senone o1')
ax[0,1].plot(n,mu0['$1'],n,mu_num['$1']/sum(A_num['$1'].values()))
ax[0,1].set_title('Mean of senone $1')
ax[1,1].plot(n,var0['$1'],n,var_num['$1']/sum(A_num['$1'].values()))
ax[1,1].set_title('Variance of senone $1')
Out[23]:
Text(0.5, 1.0, 'Variance of senone $1')

Once your code works well, and reasonably quickly, with one training file, try it with ntrain training files.

In [24]:
import numpy as np
import importlib, submitted, time
importlib.reload(submitted)

processing_start = time.time()
A0,mu0,var0,A_num,mu_num,var_num=submitted.reestimate(
    xi0,
    transcripts['senoneset'].split(),
    transcripts['train'],
    data['train'],
    1,
    ntrain
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 11.3008 seconds


Part 3: Calculate the observation pdf

The log probability of a vector $x_t$ given the $i^{\text{th}}$ senone is

$$\log b_i(x_t)=-\frac{1}{2}\sum_{d=1}^{D}\left(\frac{(x_{t,d}-\mu_{i,d})^2}{\sigma_{i,d}^2}+\log\left(2\pi\sigma_{i,d}^2\right)\right)$$
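As a sanity check on the formula, a vectorized numpy version for one utterance and one senone might look like the following. This is an illustration of the math, not the required implementation, and the helper name log_gaussian_sketch is made up for this example.

import numpy as np

def log_gaussian_sketch(X, mu_i, var_i):
    # X: (nfeats, T) MFCC matrix; mu_i, var_i: (nfeats,) vectors for senone i.
    # Returns a length-T array of log Pr{X[:,t] | senone i}.
    diff = X - mu_i[:, None]
    return -0.5 * np.sum(diff**2 / var_i[:, None] + np.log(2*np.pi*var_i[:, None]), axis=0)

# e.g. logB[u][i] could be filled with log_gaussian_sketch(mfcc[u], mu[i], var[i]) for each senone i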
In [25]:
importlib.reload(submitted)
help(submitted.logB)
Help on function logB in module submitted:

logB(mfcc, mu, var, ntrain)
    Calculate the log observation probability densities for 
    each state at each time assuming a diagonal-covariance Gaussian 
    probability density model.
    
    @param:
    mfcc[u][:,t] is the t'th MFCC feature vector in 
      utterance u (u is a string, t an int)
    mu[i] is the mean feature vector for senone i 
    var[i] is the vector of feature variances for senone i
    ntrain is the number of utterances to process
    
    @return:
    logB[u][i][t]=log Pr{mfcc[u][:,t] | senone=i}
      u and i are strings, t is an int.
      len(logB)==ntrain.

In [26]:
importlib.reload(submitted)

processing_start=time.time()
logB0 = submitted.logB(data['train'],mu0,var0,1)
print('len(logB0["LJ001-0001"]["$1"])=',len(logB0['LJ001-0001']['$1']))
print('Processing required %g seconds'%(time.time()-processing_start))
len(logB0["LJ001-0001"]["$1"])= 968
Processing required 0.0484869 seconds

We should find that the number of frames is 968.

  • This method of initialization (uniform segmentation) is so crude that we wind up with all senones having very similar parameters (basically, the average across all frames). However, ...
  • Silence should have slightly higher likelihood in the first few frames, and in other silences, e.g., frames 60-90
  • Other phonemes should have slightly higher likelihood in frames that contain speech
In [27]:
fig, ax = plt.subplots(1,1,figsize=(14,6))
for i in transcripts['train'][k0]['senone'].split()[:20:3]:
    ax.plot(logB0[k0][i][:150],label=i)
ax.legend(fontsize=18)
ax.set_title('Log likelihoods of a few different senones',fontsize=18)
ax.set_xlabel('Time (frame index)',fontsize=18)
Out[27]:
Text(0.5, 0, 'Time (frame index)')
In [28]:
fig, ax = plt.subplots(1,1,figsize=(14,6))
Y = transcripts['train'][k0]['senone'].split()
logB_image = np.empty((len(Y),968))
for m in range(len(Y)):
    logB_image[m,:]=logB0[k0][Y[m]]
im0=ax.imshow(logB_image)
ax.set_title('logB0 of "%s..."'%(transcripts['train'][k0]['word'][:20]),fontsize=18)
ax.set_ylabel('Senone index',fontsize=18)
ax.set_xlabel('Frame index',fontsize=18)
plt.colorbar(im0,ax=ax)
Out[28]:
<matplotlib.colorbar.Colorbar at 0x7fccbeeea0b0>

When your code works correctly and reasonably fast, try applying it to ntrain training utterances.

In [29]:
importlib.reload(submitted)

processing_start=time.time()
logB0 = submitted.logB(data['train'],mu0,var0,ntrain)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 7.6602 seconds


Part 4: Viterbi Training

Now that we've initialized the model, and calculated the log observation probabilities, we need to re-align the senones to the utterances. This can be done using the Viterbi algorithm.

  1. Initialize: $\delta_1(1)=\log b_{y_1}(x_1)$, and $\delta_1(m)=-\infty$ for $m\neq 1$.

  2. Iterate: $\delta_t(n)=\max_{m\in\{n-1,n\}}\delta_{t-1}(m)+\log a_{y_m,y_n}+\log b_{y_n}(x_t)$ and $\psi_t(n)=\arg\max_{m\in\{n-1,n\}}\delta_{t-1}(m)+\log a_{y_m,y_n}+\log b_{y_n}(x_t)$.

  3. Terminate and backtrace: $m^*(T_u)=M_u$, $m^*(t)=\psi_{t+1}(m^*(t+1))$, and $\alpha_{u,m,t}=\begin{cases}1 & m=m^*(t)\\ 0 & \text{otherwise}\end{cases}$
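For orientation, here is a minimal numpy sketch of this forced-alignment recursion for a single utterance. It assumes you have already built per-transcript matrices logA[m,n] and logb[m,t] (similar to the small logA_test and logB_test matrices constructed in the checking cell further below); it is a sketch of the algorithm, not the reference viterbitrain.

import numpy as np

def viterbi_align_sketch(logA, logb):
    # logA[m,n]: log transition prob from the m'th to the n'th senone in the transcript
    # logb[m,t]: log observation prob of frame t under the m'th senone
    M, T = logb.shape
    Delta = np.full((M, T), -np.inf)
    Psi = np.zeros((M, T), dtype=int)
    Delta[0, 0] = logb[0, 0]
    for t in range(1, T):
        for n in range(M):
            cands = [(Delta[n, t-1] + logA[n, n], n)]          # self-loop
            if n > 0:
                cands.append((Delta[n-1, t-1] + logA[n-1, n], n-1))  # advance by one senone
            best, Psi[n, t] = max(cands)
            Delta[n, t] = best + logb[n, t]
    # backtrace: the path must end in the last senone at the last frame
    Alpha = np.zeros((M, T))
    m = M - 1
    for t in range(T - 1, -1, -1):
        Alpha[m, t] = 1
        m = Psi[m, t]
    return Alpha, Delta, Psi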

In [30]:
importlib.reload(submitted)
help(submitted.viterbitrain)
Help on function viterbitrain in module submitted:

viterbitrain(A, logB, transcripts, ntrain)
    Find the maximum likelihood time alignment of each transcript 
    to its MFCC matrix.
    
    @param:
    A[i][j] is probability of a transition from senone i 
      to senone j (i and j are strings)
    logB[u][i][t]=log Pr{mfcc[u][:,t] | senone=i}
      u and i are strings, t is an int
    transcripts[u]['senone'].split() is a list 
      of the senones in utterance u
    ntrain is the number of utterances to process
    
    @return:
    Alphadict[u][m,t]==1 if m'th senone at time t is 
      in the maximum likelihood path from (0,0) to (M-1,T-1),
      otherwise 0.  len(Alphadict)==ntrain.
    Delta[u][m,t] is the log probability of the best 
      path up to m'th senone at time t.  len(Delta)==ntrain.
    Backpointer[u][m,t] is either m-1 or m, the most 
      likely predecessor of m at time t.
      len(Backpointer)==ntrain

In [31]:
importlib.reload(submitted)

processing_start = time.time()
Alpha1,Delta1,Psi1 = submitted.viterbitrain(
    A0,
    logB0,
    transcripts['train'],
    1
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 0.644256 seconds

One way to check our code: $\delta_t(m)$ should be $-\infty$ for any $t<m$, because we only allow the HMM to advance by at most one senone per frame.

In [32]:
print(Delta1[k0][:4,:4])
[[ -77.3509607  -155.07789671 -242.35723103 -330.93734968]
 [         -inf -157.92764231 -241.83262821 -329.31514669]
 [         -inf          -inf -245.11691661 -329.81442507]
 [         -inf          -inf          -inf -336.02052783]]

We can also check the calculations, to make sure that $\delta_t(n)$ is either $\delta_{t-1}(n)+\log a_{y_n,y_n}+\log b_{y_n}(x_t)$ or $\delta_{t-1}(n-1)+\log a_{y_{n-1},y_n}+\log b_{y_n}(x_t)$, whichever is larger.

In [33]:
logA_test = np.tile(-np.inf,(4,4))
logB_test = np.empty((4,4))
Y = transcripts['train'][k0]['senone'].split()
for m in range(4):
    logB_test[m,:] = logB0[k0][Y[m]][:4]
    for n in range(4):
        if A0[Y[m]][Y[n]]>0:
            logA_test[m,n]=np.log(A0[Y[m]][Y[n]])
print('logA[m,n] is:')
print(logA_test)
print('')
print('logB[m,t] is:')
print(logB_test)
print('')
print('For example, delta[1,3] should be either')
print(Delta1[k0][1,2]+logA_test[1,1]+logB_test[1,3])
print('or')
print(Delta1[k0][0,2]+logA_test[0,1]+logB_test[1,3])
logA[m,n] is:
[[-0.46833654 -0.98361393        -inf        -inf]
 [       -inf -0.43201381 -1.04754018        -inf]
 [-2.58215916        -inf -0.30285175 -5.23010544]
 [       -inf        -inf        -inf -0.44995724]]

logB[m,t] is:
[[-77.3509607  -77.25859947 -86.81099779 -88.11178211]
 [-79.63115263 -79.59306768 -85.77111758 -87.05050467]
 [-78.93853412 -78.9200743  -86.14173412 -86.93425668]
 [-83.22614421 -83.09596919 -84.216702   -85.67350578]]

For example, delta[1,3] should be either
-329.3151466935237
or
-330.3913496280543

We should be able to see the same phenomena (a triangle of -np.inf values at the start of the file; otherwise, values of delta getting gradually more negative as we go from left to right through the utterance) in images of the whole file.

Also, notice that the Alpha1 trajectory should start at the top left, and end at the bottom right.

In [34]:
fig, ax = plt.subplots(3,1,figsize=(14,12))
im0=librosa.display.specshow(
    invert_mfcc(data['train'][k0]),
    sr=fs,
    hop_length=int(0.01*fs),
    win_length=int(0.025*fs),
    y_axis='mel',
    x_axis='s',
    fmax=8000,
    ax=ax[0]
)
ax[0].set_title('Melspectrogram of %s: %s...'%(k0,transcripts['train'][k0]['word'][:21]),fontsize=18)
plt.colorbar(im0,ax=ax[0])
im1=ax[1].imshow(Delta1[k0][:,:],aspect='auto')
plt.colorbar(im1,ax=ax[1])
ax[1].set_xlabel('Time (frames)',fontsize=18)
ax[1].set_ylabel('Senone index',fontsize=18)
M,T = Delta1[k0].shape
ax[1].set_title('logprob best path %s: %d senones, %d frames'%(k0,M,T),fontsize=18)
im2=ax[2].imshow(Alpha1[k0][:,:],aspect='auto')
plt.colorbar(im2,ax=ax[2])
ax[2].set_xlabel('Time (frames)',fontsize=18)
ax[2].set_ylabel('Senone index',fontsize=18)
M,T = Alpha1[k0].shape
ax[2].set_title('Viterbi segmentation of %s: %d senones, %d frames'%(k0,M,T),fontsize=18)
fig.tight_layout()

As you can see, different implementations of the same equation may lead to small differences (about 0.01%) in the values computed. The max operation is sensitive to such differences, so your Alpha1 may differ slightly from the reference Alpha1, but your Delta1 should not differ by more than about 0.01% per frame.

When your code seems to be working, and is fast enough to use on the whole dataset, try using ntrain training files.

In [35]:
importlib.reload(submitted)

processing_start = time.time()
Alpha1,Delta1,Psi1 = submitted.viterbitrain(
    A0,
    logB0,
    transcripts['train'],
    ntrain
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 71.3123 seconds

... and if that worked, you might want to try one more complete iteration of Viterbi training.

  • Use Alpha to recompute xi
  • Use xi to recompute the model
  • Use the model to recompute logB
  • Use logB to recompute the Viterbi alignment
In [36]:
importlib.reload(submitted)

processing_start = time.time()
xi1=submitted.transition_posteriors(
    Alpha1,
    Alpha1,
    transcripts['train'],
    A0,
    ntrain
)
A1,mu1,var1,A_num,mu_num,var_num=submitted.reestimate(
    xi1,
    transcripts['senoneset'].split(),
    transcripts['train'],
    data['train'],
    1,
    ntrain
)
logB1 = submitted.logB(
    data['train'],
    mu1,
    var1,
    ntrain
)
Alpha2,Delta2,Psi2 = submitted.viterbitrain(
    A1,
    logB1,
    transcripts['train'],
    ntrain
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 91.5484 seconds
In [38]:
fig, ax = plt.subplots(3,1,figsize=(14,12))
im0=librosa.display.specshow(
    invert_mfcc(data['train'][k0]),
    sr=fs,
    hop_length=int(0.01*fs),
    win_length=int(0.025*fs),
    y_axis='mel',
    x_axis='s',
    fmax=8000,
    ax=ax[0]
)
ax[0].set_title('Melspectrogram of %s: %s...'%(k0,transcripts['train'][k0]['word'][:21]),fontsize=18)
plt.colorbar(im0,ax=ax[0])
im1=ax[1].imshow(Delta2[k0][:,:],aspect='auto')
plt.colorbar(im1,ax=ax[1])
ax[1].set_xlabel('Time (frames)',fontsize=18)
ax[1].set_ylabel('Senone index',fontsize=18)
M,T = Delta2[k0].shape
ax[1].set_title('logprob best path %s: %d senones, %d frames'%(k0,M,T),fontsize=18)
im2=ax[2].imshow(Alpha2[k0][:,:],aspect='auto')
plt.colorbar(im2,ax=ax[2])
ax[2].set_xlabel('Time (frames)',fontsize=18)
ax[2].set_ylabel('Senone index',fontsize=18)
M,T = Alpha2[k0].shape
ax[2].set_title('Viterbi segmentation of %s: %d senones, %d frames'%(k0,M,T),fontsize=18)
fig.tight_layout()


Part 5: Show segmentations

For the last part of the main MP, let's just find a good way to show the force-aligned segmentations. For this purpose, we will use matplotlib's xticks and xticklabels.

In [39]:
importlib.reload(submitted)
help(submitted.show_segmentations)
Help on function show_segmentations in module submitted:

show_segmentations(Alphadict, transcripts, ntrain)
    Generate xticks and xticklabels to show phone segmentations.
    
    @param:
    Alphadict[u][m,t] is the probability of the path 
     ending at the m'th senone at time t
    transcripts[u]['senone'].split() is a 
     list of the senones in utterance u
    ntrain is the number of utterances to process
    
    @return:
    xticks[u] is a list of frame indices at which
      phones start, i.e., for Y[m] that is the
      first senone of the k'th phone, 
      xticks[u][k] should be the first value of 
      t such that Alphadict[u][m,t]>0.5.
      Include long-silence phones ($), but not
      word-boundary phones (#).
    xticklabels[u] is a list of the corresponding
      phone strings.
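Given that specification, a possible numpy-only sketch for one utterance might look like the following. The phone-grouping rule (the first senone of every phone ends in '1', and '#' is skipped) is inferred from the senone naming convention described in Part 0, so double-check it against your data; this is not the reference show_segmentations.

import numpy as np

# Sketch for a single utterance u, using the variables named in the docstring above.
Y = transcripts[u]['senone'].split()
Alpha = Alphadict[u]
xticks_u, xticklabels_u = [], []
for m, senone in enumerate(Y):
    if senone == '#':                 # word-boundary marker: not shown
        continue
    if senone.endswith('1'):          # first senone of a phone (e.g. 'o1', '$1')
        first_frame = int(np.argmax(Alpha[m, :] > 0.5))   # first t with Alpha[m,t] > 0.5
        xticks_u.append(first_frame)
        xticklabels_u.append(senone[:-1])                 # 'o1' -> 'o', '$1' -> '$'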

First, let's show the initial uniform segmentations:

In [40]:
importlib.reload(submitted)
xticks, xticklabels=submitted.show_segmentations(
    Alpha0,
    transcripts['train'],
    1
)
print(xticks[k0][:19])
print(xticklabels[k0][:19])
[0, 7, 15, 23, 31, 39, 47, 55, 63, 71, 79, 87, 94, 105, 113, 121, 129, 137, 147]
['$', 'p', 'ɹ', 'ɪ', 'n', 't', 'ɪ', 'ŋ', '$', 'ɪ', 'n', 'ð', 'ɪ', 'o', 'ʊ', 'n', 'l', 'i', 's']
In [41]:
fig, ax = plt.subplots(1,1,figsize=(14,6))
im0=librosa.display.specshow(
    invert_mfcc(data['train'][k0])[:,:150],
    sr=fs,
    hop_length=int(0.01*fs),
    win_length=int(0.025*fs),
    y_axis='mel',
    x_axis='frames',
    fmax=8000,
    ax=ax
)
ax.set_xticks(xticks[k0][:19])
ax.set_xticklabels(xticklabels[k0][:19],fontsize=18)
ax.set_xlabel('Phone alignment times, iteration 0',fontsize=18)
ax.set_title(transcripts['train'][k0]['word'][:21]+'...',fontsize=18)
Out[41]:
Text(0.5, 1.0, 'Printing, in the only...')

Then, after the first round of Viterbi segmentation:

In [42]:
importlib.reload(submitted)
xticks, xticklabels=submitted.show_segmentations(
    Alpha1,
    transcripts['train'],
    1
)
fig, ax = plt.subplots(1,1,figsize=(14,6))
im0=librosa.display.specshow(
    invert_mfcc(data['train'][k0])[:,:150],
    sr=fs,
    hop_length=int(0.01*fs),
    win_length=int(0.025*fs),
    y_axis='mel',
    x_axis='frames',
    fmax=8000,
    ax=ax
)
ax.set_xticks(xticks[k0][:19])
ax.set_xticklabels(xticklabels[k0][:19],fontsize=18)
ax.set_xlabel('Phone alignment times, iteration 1',fontsize=18)
ax.set_title(transcripts['train'][k0]['word'][:21]+'...',fontsize=18)
Out[42]:
Text(0.5, 1.0, 'Printing, in the only...')

After two rounds:

In [43]:
importlib.reload(submitted)
xticks, xticklabels=submitted.show_segmentations(
    Alpha2,
    transcripts['train'],
    1
)
fig, ax = plt.subplots(1,1,figsize=(14,6))
im0=librosa.display.specshow(
    invert_mfcc(data['train'][k0])[:,:150],
    sr=fs,
    hop_length=int(0.01*fs),
    win_length=int(0.025*fs),
    y_axis='mel',
    x_axis='frames',
    fmax=8000,
    ax=ax
)
ax.set_xticks(xticks[k0][:19])
ax.set_xticklabels(xticklabels[k0][:19],fontsize=18)
ax.set_xlabel('Phone alignment times, iteration 2',fontsize=18)
ax.set_title(transcripts['train'][k0]['word'][:21]+'...',fontsize=18)
Out[43]:
Text(0.5, 1.0, 'Printing, in the only...')


Extra Credit: Recognize Speech!!

In order to transcribe an unknown speech file, the usual method is to apply the Viterbi algorithm, but with one twist: there is no fixed transcription that we need to match. Any senone, $j$, can follow any other senone, $i$, as long as $a_{i,j}>0$.

With this twist, the Viterbi variables need to be indexed by the senone label $i$ rather than the transcription index $m$, because there is no transcription.

  1. Initialize: $\delta_1(i)=\log b_i(x_1),\ \forall i$

  2. Iterate: $\delta_t(j)=\max_i\,\delta_{t-1}(i)+\log a_{i,j}+\log b_j(x_t),\ \forall t,\forall j$ and $\psi_t(j)=\arg\max_i\,\delta_{t-1}(i)+\log a_{i,j}+\log b_j(x_t),\ \forall t,\forall j$

  3. Terminate and backtrace: $i^*(T_u)=\arg\max_i\delta_{T_u}(i)$, $i^*(t)=\psi_{t+1}(i^*(t+1)),\ \forall t<T_u$, and $\alpha_{u,i,t}=\begin{cases}1 & i=i^*(t)\\ 0 & \text{otherwise}\end{cases}$

Although the form of this algorithm is just like Viterbi training, the computational complexity is about 34X higher, because, at each time, for each senone, the number of possible predecessors has increased from 2 to the set of all senones (136).
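To make the recursion concrete, here is a minimal sketch of the unconstrained recognition Viterbi for one utterance, using plain dictionaries keyed by senone strings. It ignores the Num bookkeeping mentioned in the docstring below, and the helper name recognize_sketch is made up here; it is not the reference extra.recognize.

import numpy as np

def recognize_sketch(A, logb_u, senones):
    # A[i][j]: transition probabilities; logb_u[i]: length-T array of log obs probs for senone i
    T = len(logb_u[senones[0]])
    logA = {i: {j: (np.log(A[i][j]) if A[i][j] > 0 else -np.inf) for j in senones} for i in senones}
    Delta = {i: np.full(T, -np.inf) for i in senones}
    Psi = {i: [''] * T for i in senones}
    for i in senones:
        Delta[i][0] = logb_u[i][0]
    for t in range(1, T):
        for j in senones:
            best_i = max(senones, key=lambda i: Delta[i][t-1] + logA[i][j])
            Delta[j][t] = Delta[best_i][t-1] + logA[best_i][j] + logb_u[j][t]
            Psi[j][t] = best_i
    # backtrace from the best final senone
    path = [max(senones, key=lambda i: Delta[i][T-1])]
    for t in range(T-1, 0, -1):
        path.append(Psi[path[-1]][t])
    return path[::-1], Delta, Psi

# e.g. path, Delta, Psi = recognize_sketch(A1, logB[u], transcripts['senoneset'].split())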

In [44]:
import importlib, extra
importlib.reload(extra)
help(extra.recognize)
Help on function recognize in module extra:

recognize(A, logB, ntest)
    Find the maximum likelihood transcription of test 
    utterances
    
    @param:
    A[i][j] is probability of a transition 
      from senone i to senone j (i and j are strings)
    logB[u][i][t]=log Pr{mfcc[u][:,t] | senone=i}, 
      u and i are strings, t is an int
    ntest: number of utterances to process
    
    @return:
    Alpha[u][i][t]==1 if i is senone in the best 
      path at time t, otherwise 0
    Delta[u][i][t] is the log probability of the best 
      path up to senone i (str) at time t (int)
    Psi[u][i][t] is a string, 
      specifying the most likely predecessor of senone i 
      at time t.
    Num[u][i][t] is the number of frames that senone i
      has been repeated on the best path at time t
      len(hyp)==len(Delta)==len(Psi)==len(Num)==ntest

Let's try to recognize just the first utterance in the train subcorpus.

In [46]:
importlib.reload(extra)
logB = submitted.logB(
    data['train'],
    mu1,
    var1,
    1
)
Alpha, Delta, Psi = extra.recognize(
    A1,
    logB,
    1
)
In [47]:
bestpath = []
u = list(Alpha.keys())[0]
for t in range(len(Alpha[u]['$1'])):
    for i in transcripts['senoneset'].split():
        if Alpha[u][i][t]==1:
            bestpath.append(i)

print('There were %d frames in the best path'%(len(bestpath)))

recognized = '$'
for s in bestpath:
    if s[0] != recognized[-1]:
        recognized += s[0]
print(recognized.replace('$',' ').replace('#',' '))
There were 968 frames in the best path
 θ ɐ tɪn ɪndeɪlɪsɛnz ðə wɪʃ ɪ oɹɛpɛz ns ðɚnt sɛfɹɛz fɑmɑm stɪʃənaʊfɑmɔl ɪ ɑɹ ksᵻkæf kstɹɛz ɛntᵻtɪni ɛksɪʃn 

Well! It's not correct, but some of the strings sound similar to the correct utterance (e.g., the last word is "exhition" instead of "exhibition"). With more training iterations we should converge to 100% accuracy on the training corpus. Then, with more training data, we might start to get good performance on a test corpus.

The tests tests/test_visible.py and tests/test_extra.py are both run by the grade.py code, so you can test both by running python grade.py. Once you confirm that you pass the tests on your own machine, upload your code to Gradescope!

In [ ]: