In this lab, you'll train and test hidden Markov models in order to find phoneme segment boundaries in a large speech corpus.
The data for this MP are drawn from the LJSpeech corpus. Transcriptions have been converted to phonemes using phonemizer. Waveforms have been converted to MFCC using librosa.feature.mfcc, with settings of hop_length=10ms, win_length=25ms, n_mfcc=20, fmax=8000.
Warning: This Jupyter notebook will use librosa extensively, but the autograder will not have librosa. All of the code in your submitted.py should use numpy only, not librosa.
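For reference, the provided MFCCs could have been generated with a librosa call along the following lines (a sketch only; the waveform filename is just an example, and you do not need to run this):

import librosa

# Hypothetical example of how the provided features were generated:
# 22050 Hz audio, 10ms hop, 25ms window, fmax=8000, 20 coefficients per frame.
y, sr = librosa.load('LJ001-0001.wav', sr=22050)
mfcc = librosa.feature.mfcc(
    y=y,
    sr=sr,
    n_mfcc=20,
    hop_length=int(0.010*sr),
    win_length=int(0.025*sr),
    fmax=8000
)
print(mfcc.shape)   # (n_mfcc, n_frames)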
The transcripts are provided to you in the file transcripts.json.
import json, random
with open('transcripts.json') as f:
    transcripts=json.load(f)
print(transcripts.keys())
dict_keys(['train', 'dev', 'eval', 'phoneset', 'senoneset'])
print('Training transcripts are available for %d files'%(len(transcripts['train'])))
k0 = list(transcripts['train'].keys())[0]
print('The first file is',k0)
print('Its word transcript is:',transcripts['train'][k0]['word'])
print('Its phone transcript is:',transcripts['train'][k0]['phone'])
print('Its senone transcript is:',transcripts['train'][k0]['senone'])
Training transcripts are available for 1000 files The first file is LJ001-0001 Its word transcript is: Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition Its phone transcript is: pɹɪntɪŋ, ɪnðɪ oʊnli sɛns wɪð wɪtʃ wi ɑɹ æt pɹɛzənt kənsɜnd, dɪfɚz fɹʌm moʊst ɪf nɑt fɹʌm ɔl ðɪ ɑɹts ænd kɹæfts ɹɛpɹɪzɛntᵻd ɪnðɪ ɛksɪbɪʃən Its senone transcript is: $1 $2 $3 p1 p2 p3 ɹ1 ɹ2 ɹ3 ɪ1 ɪ2 ɪ3 n1 n2 n3 t1 t2 t3 ɪ1 ɪ2 ɪ3 ŋ1 ŋ2 ŋ3 $1 $2 $3 ɪ1 ɪ2 ɪ3 n1 n2 n3 ð1 ð2 ð3 ɪ1 ɪ2 ɪ3 # o1 o2 o3 ʊ1 ʊ2 ʊ3 n1 n2 n3 l1 l2 l3 i1 i2 i3 # s1 s2 s3 ɛ1 ɛ2 ɛ3 n1 n2 n3 s1 s2 s3 # w1 w2 w3 ɪ1 ɪ2 ɪ3 ð1 ð2 ð3 # w1 w2 w3 ɪ1 ɪ2 ɪ3 t1 t2 t3 ʃ1 ʃ2 ʃ3 # w1 w2 w3 i1 i2 i3 # ɑ1 ɑ2 ɑ3 ɹ1 ɹ2 ɹ3 # æ1 æ2 æ3 t1 t2 t3 # p1 p2 p3 ɹ1 ɹ2 ɹ3 ɛ1 ɛ2 ɛ3 z1 z2 z3 ə1 ə2 ə3 n1 n2 n3 t1 t2 t3 # k1 k2 k3 ə1 ə2 ə3 n1 n2 n3 s1 s2 s3 ɜ1 ɜ2 ɜ3 n1 n2 n3 d1 d2 d3 $1 $2 $3 d1 d2 d3 ɪ1 ɪ2 ɪ3 f1 f2 f3 ɚ1 ɚ2 ɚ3 z1 z2 z3 # f1 f2 f3 ɹ1 ɹ2 ɹ3 ʌ1 ʌ2 ʌ3 m1 m2 m3 # m1 m2 m3 o1 o2 o3 ʊ1 ʊ2 ʊ3 s1 s2 s3 t1 t2 t3 # ɪ1 ɪ2 ɪ3 f1 f2 f3 # n1 n2 n3 ɑ1 ɑ2 ɑ3 t1 t2 t3 # f1 f2 f3 ɹ1 ɹ2 ɹ3 ʌ1 ʌ2 ʌ3 m1 m2 m3 # ɔ1 ɔ2 ɔ3 l1 l2 l3 # ð1 ð2 ð3 ɪ1 ɪ2 ɪ3 # ɑ1 ɑ2 ɑ3 ɹ1 ɹ2 ɹ3 t1 t2 t3 s1 s2 s3 # æ1 æ2 æ3 n1 n2 n3 d1 d2 d3 # k1 k2 k3 ɹ1 ɹ2 ɹ3 æ1 æ2 æ3 f1 f2 f3 t1 t2 t3 s1 s2 s3 # ɹ1 ɹ2 ɹ3 ɛ1 ɛ2 ɛ3 p1 p2 p3 ɹ1 ɹ2 ɹ3 ɪ1 ɪ2 ɪ3 z1 z2 z3 ɛ1 ɛ2 ɛ3 n1 n2 n3 t1 t2 t3 ᵻ1 ᵻ2 ᵻ3 d1 d2 d3 # ɪ1 ɪ2 ɪ3 n1 n2 n3 ð1 ð2 ð3 ɪ1 ɪ2 ɪ3 # ɛ1 ɛ2 ɛ3 k1 k2 k3 s1 s2 s3 ɪ1 ɪ2 ɪ3 b1 b2 b3 ɪ1 ɪ2 ɪ3 ʃ1 ʃ2 ʃ3 ə1 ə2 ə3 n1 n2 n3 $1 $2 $3
The data provided for each file include:
- A file ID, such as LJ001-0001. The MFCC for this file will have the same ID. The waveform name in the original corpus, if you want to listen, is LJ001-0001.wav. The original corpus contains 13,100 files, of which our data distribution contains only 1500: 1000 training files, 250 dev files, and 250 eval files.
- A word transcript, giving the words the narrator read, all in a single string, separated by spaces and punctuation.
- A phone transcript, showing the phonemes that make up those words, all in a single string. These are written using the symbols of the International Phonetic Alphabet (IPA), with punctuation and spacing copied from the text.
- A senone transcript, showing the sequence of HMM states (senones) making up those phonemes, all in a single string, separated by spaces (so you need to use transcripts[k0]['senone'].split() to get a list). In general, there are three senones per phoneme, modeling the beginning, middle, and end of the phoneme. Word boundaries are marked with #. The longer silences implied by punctuation, and by the beginning and ending of each file, are marked with the three-senone sequence $1, $2, $3.

The goal of this MP is to develop an HMM in which every senone is a state, train the senone models, and then use the trained models to find the time-alignment of phones to the waveform.
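For intuition about that senone convention, here is a minimal sketch (the function name is made up for illustration, and is not part of the MP interface) of how a phone sequence expands into senones:

def phones_to_senones(phones):
    '''Expand a list of phone symbols into senone labels: '#' stays a single
    word-boundary senone; every other phone, including the long-silence
    phone '$', becomes three senones (beginning, middle, end).'''
    senones = []
    for p in phones:
        if p == '#':
            senones.append('#')
        else:
            senones.extend([p+'1', p+'2', p+'3'])
    return senones

print(phones_to_senones(['$', 'p', 'ɹ', 'ɪ', 'n', 't', 'ɪ', 'ŋ', '$']))
# ['$1', '$2', '$3', 'p1', 'p2', 'p3', ..., 'ŋ1', 'ŋ2', 'ŋ3', '$1', '$2', '$3']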
The list of all distinct phones is provided as transcripts['phoneset']. The list of all distinct senones is provided as transcripts['senoneset'].
phones = transcripts['phoneset'].split()
print('There are %d distinct phones:'%(len(phones)))
print(phones)
print('')
senones = transcripts['senoneset'].split()
print('There are %d distinct senones:'%(len(senones)))
print(senones)
There are 46 distinct phones: ['#', '$', 'a', 'b', 'd', 'e', 'f', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'z', 'æ', 'ð', 'ŋ', 'ɐ', 'ɑ', 'ɔ', 'ə', 'ɚ', 'ɛ', 'ɜ', 'ɡ', 'ɪ', 'ɫ', 'ɹ', 'ɾ', 'ʃ', 'ʊ', 'ʌ', 'ʒ', 'ʔ', 'θ', 'ᵻ'] There are 136 distinct senones: ['#', '$1', '$2', '$3', 'a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'd1', 'd2', 'd3', 'e1', 'e2', 'e3', 'f1', 'f2', 'f3', 'h1', 'h2', 'h3', 'i1', 'i2', 'i3', 'j1', 'j2', 'j3', 'k1', 'k2', 'k3', 'l1', 'l2', 'l3', 'm1', 'm2', 'm3', 'n1', 'n2', 'n3', 'o1', 'o2', 'o3', 'p1', 'p2', 'p3', 'r1', 'r2', 'r3', 's1', 's2', 's3', 't1', 't2', 't3', 'u1', 'u2', 'u3', 'v1', 'v2', 'v3', 'w1', 'w2', 'w3', 'x1', 'x2', 'x3', 'z1', 'z2', 'z3', 'æ1', 'æ2', 'æ3', 'ð1', 'ð2', 'ð3', 'ŋ1', 'ŋ2', 'ŋ3', 'ɐ1', 'ɐ2', 'ɐ3', 'ɑ1', 'ɑ2', 'ɑ3', 'ɔ1', 'ɔ2', 'ɔ3', 'ə1', 'ə2', 'ə3', 'ɚ1', 'ɚ2', 'ɚ3', 'ɛ1', 'ɛ2', 'ɛ3', 'ɜ1', 'ɜ2', 'ɜ3', 'ɡ1', 'ɡ2', 'ɡ3', 'ɪ1', 'ɪ2', 'ɪ3', 'ɫ1', 'ɫ2', 'ɫ3', 'ɹ1', 'ɹ2', 'ɹ3', 'ɾ1', 'ɾ2', 'ɾ3', 'ʃ1', 'ʃ2', 'ʃ3', 'ʊ1', 'ʊ2', 'ʊ3', 'ʌ1', 'ʌ2', 'ʌ3', 'ʒ1', 'ʒ2', 'ʒ3', 'ʔ1', 'ʔ2', 'ʔ3', 'θ1', 'θ2', 'θ3', 'ᵻ1', 'ᵻ2', 'ᵻ3']
The audio waveforms are not provided to you (you can download them from https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2). Instead, mfcc.hdf5 provides you with mel-frequency cepstral coefficients for each audio file.
There are a total of 1000 training files (sampled from the 12,600 training files in the original LJSpeech distribution), plus 250 dev files, and 250 eval files.
import h5py
data = {}
with h5py.File('mfcc.hdf5','r') as f:
    for group in f.keys():
        data[group] = {}
        for id in list(f[group].keys())[:1000]:
            data[group][id] = f[group][id][:]
print(data.keys())
dict_keys(['dev', 'eval', 'train'])
print('Training MFCCs are provided for %d files'%(len(data['train'].keys())))
mfcc0 = data['train'][k0]
print('mfcc[%s]: %d frames, each w/%d MFCCs'%(k0,mfcc0.shape[1],mfcc0.shape[0]))
Training MFCCs are provided for 1000 files mfcc[LJ001-0001]: 968 frames, each w/20 MFCCs
The MFCC (mel-frequency cepstral coefficients) are computed from an audio waveform using the following steps:
- Window the waveform (25ms windows, 10ms hop) and compute the magnitude of the short-time Fourier transform (the spectrogram).
- Apply a mel-scale filterbank to each frame (the melspectrogram).
- Take the logarithm of each filterbank output.
- Take the inverse discrete cosine transform (IDCT) of each frame's log melspectrum, and keep the first 20 coefficients.
The IDCT step is used to (approximately) decorrelate the features. Melspectrogram samples at neighboring frequencies (and spectrogram samples at neighboring frequencies) are highly correlated, which means that the ASR would need to waste parameters modeling the correlation. Taking the IDCT helps with modeling, a lot.
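As a rough numpy-only sketch of the last two steps (log compression and the decorrelating transform), assuming you already have a melspectrogram melspec of shape (n_mels, n_frames) — the function and variable names here are mine, and this is not guaranteed to match librosa's implementation exactly:

import numpy as np

def log_mel_to_mfcc(melspec, n_mfcc=20):
    '''Illustrative sketch: log-compress a melspectrogram, then apply an
    orthonormal type-II DCT (the step the text calls the IDCT) to
    decorrelate it. melspec has shape (n_mels, n_frames).'''
    n_mels = melspec.shape[0]
    logmel = 10.0*np.log10(np.maximum(melspec, 1e-10))   # log compression (dB)
    n = np.arange(n_mels)
    basis = np.cos(np.pi*np.outer(np.arange(n_mfcc), n+0.5)/n_mels)
    basis *= np.sqrt(2.0/n_mels)
    basis[0] /= np.sqrt(2.0)
    return basis @ logmel    # shape (n_mfcc, n_frames)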
However, taking the IDCT makes the features much harder for a human being to understand. For this reason, I recommend that, whenever you want to visualize the time alignment of any phoneme transcript to the MFCC, you should first:
import librosa
import matplotlib.pyplot as plt
def invert_mfcc(mfcc):
    step1 = librosa.feature.inverse.mfcc_to_mel(mfcc)
    step2 = librosa.amplitude_to_db(step1,top_db=50)
    return step2
fs = 22050
fig, ax = plt.subplots(2,1,figsize=(14,8))
im0=librosa.display.specshow(
invert_mfcc(data['train'][k0]),
sr=22050,
hop_length=int(0.01*fs),
win_length=int(0.025*fs),
y_axis='mel',
x_axis='s',
fmax=8000,
ax=ax[0]
)
ax[0].set_title('Melspectrogram of %s: %s...'%(k0,transcripts['train'][k0]['word'][:21]),fontsize=18)
plt.colorbar(im0,ax=ax[0])
im1=ax[1].imshow(data['train'][k0][:,:150],aspect='auto')
ax[1].set_xlabel('Time (frames)',fontsize=18)
ax[1].set_ylabel('Quefrency (samples)',fontsize=18)
ax[1].set_title('MFCC of %s: %s...'%(k0,transcripts['train'][k0]['phone'][:19]),fontsize=18)
plt.colorbar(im1,ax=ax[1])
fig.tight_layout()
The MFCC has a lot less information than the original waveform (only 20 numbers every 10ms, as opposed to 220.5 samples every 10ms in the original waveform), so you might reasonably ask:
Is that enough information for speech recognition?
You can test this by resynthesizing speech audio from the MFCC. The function mfcc_to_audio inverts the MFCC to melspectrogram, then unwarps to a linear spectrogram, then uses Griffin-Lim to resynthesize speech. The result sounds horribly artificial, but is usually intelligible!
import IPython.display
wav = librosa.feature.inverse.mfcc_to_audio(
data['train'][k0],
sr=22050,
hop_length=int(0.01*fs),
win_length=int(0.025*fs),
fmax=8000
)
IPython.display.Audio(data=wav,rate=fs)
As you've seen, there are 1000 training files available. Every function in this MP will include a parameter ntrain that specifies how many of those files you want to process. The recommended workflow is:
- First, debug your code using ntrain=1. Debug that way until it works.
- Then, time your code, still using ntrain=1. Try different implementations to see if you can reduce the compute time while maintaining correct results. (This step is not required to get points for the MP; it only matters if you want to be able to train on more files.)
- Finally, train with a larger ntrain, perhaps something like ntrain=250. (This step is not required to get points for the MP; it only matters if you're trying to get better speech recognition performance.)

ntrain=250
Training a speech recognizer basically alternates three steps:
- segmenting the data (aligning the senone transcription to the MFCC frames),
- converting the segmentation into transition posterior probabilities, and
- re-estimating the model parameters from those posteriors.

Which comes first: the segmentation, the transition posteriors, or the model?
A reasonable place to start is by uniformly segmenting every utterance. Find the number of senones in each utterance, and then give each senone the same number of frames.
Let's write the alignment matrices for the $u^{\textrm{th}}$ utterance using the capital letters Alpha and Beta, which, unfortunately, look exactly like A and B, but in a non-italic font: $$ \mathbf{\mathrm{A}}_u=\left[\begin{array}{ccc} \alpha_{u,1,1}&\cdots&\alpha_{u,1,T_u}\\ \vdots&\ddots&\vdots\\ \alpha_{u,M_u,1}&\cdots&\alpha_{u,M_u,T_u}\end{array}\right],~~~ \mathbf{\mathrm{B}}_u=\left[\begin{array}{ccc} \beta_{u,1,1}&\cdots&\beta_{u,1,T_u}\\ \vdots&\ddots&\vdots\\ \beta_{u,M_u,1}&\cdots&\beta_{u,M_u,T_u}\end{array}\right], $$
where $M_u$ is the number of senones in the transcription of utterance $u$, $T_u$ is the number of MFCC frames, and $\alpha$ and $\beta$ are defined as
$$\alpha_{u,m,t}=c_1(u,t)\Pr\left\{\mathbf{x}_1,\ldots,\mathbf{x}_t,q_t=m\right\}$$$$\beta_{u,m,t}=c_2(u,t)\Pr\left\{\mathbf{x}_t,\ldots,\mathbf{x}_{T_u}|q_t=m\right\}$$where the constants $c_1$ and $c_2$ can be anything, as long as they are independent of $m$.
A "hard" alignment is one that allows $\alpha_{u,m,t}$ and $\beta_{u,m,t}$ to each be nonzero for only one value of $m$ for each $t$, which we can call $m^*(t)$. Since $c_1(u,t)$ and $c_2(u,t)$ can be anything, a "hard" alignment can be signaled by just making $\alpha_{u,m^*(t),t}=\beta_{u,m^*(t),t}=1$, and $\alpha_{u,m,t}=\beta_{u,m,t}=0$ for $m\ne m^*(t)$.
When we don't yet have any useful segmentation information, a reasonable place to start is by giving every senone the same number of frames:
$$\alpha_{u,m,t}=\beta_{u,m,t}=\left\{\begin{array}{ll} 1 & \frac{(m-1)T_u}{M_u}\le t< \frac{mT_u}{M_u}\\ 0 & \mbox{otherwise}\end{array}\right.$$

import numpy as np
Alpha0 = {}
Beta0 = {}
for u in list(data['train'].keys())[:ntrain]:
    Y = transcripts['train'][u]['senone'].split()
    M = len(Y)
    T = data['train'][u].shape[1]
    Alpha = np.zeros((M,T))
    for m in range(M):
        Alpha[m,int(m*T/M):int((m+1)*T/M)]=1
    Alpha0[u]=Alpha
    Beta0[u]=Alpha
fig, ax = plt.subplots(2,1,figsize=(14,8))
im0=librosa.display.specshow(
invert_mfcc(data['train'][k0])[:,:150],
sr=fs,
hop_length=int(0.01*fs),
win_length=int(0.025*fs),
y_axis='mel',
x_axis='s',
fmax=8000,
ax=ax[0]
)
ax[0].set_title('Melspectrogram of %s: %s...'%(k0,transcripts['train'][k0]['word'][:21]),fontsize=18)
plt.colorbar(im0,ax=ax[0])
im1=ax[1].imshow(Alpha0[k0][:60,:150],aspect='auto')
ax[1].set_xlabel('Time (frames)',fontsize=18)
ax[1].set_ylabel('Senone index',fontsize=18)
N,T = Alpha0[k0].shape
ax[1].set_title('Uniform initial segmentation of %s: %d senones, %d frames'%(k0,N,T),fontsize=18)
fig.tight_layout()
Let $\mathbf{y}_u=[y_1,\ldots,y_{M_u}]^T$ be the sequence of senone labels for the $u^{\text{th}}$ utterance, and let $\mathbf{q}_u=[q_1,\ldots,q_{T_u}]^T$ be the state alignment, where $q_t$ is an integer in the range $1\le q_t\le M_u$.
HMM re-estimation is easiest (and much faster, computationally) if we first convert the alignment probabilities $\alpha$ and $\beta$ to a set of transition posterior probabilities, defined as
$$\xi_{u,m,t,d}=\Pr\{q_t=m,q_{t+1}=m+d|\mathbf{x}_1,\ldots,\mathbf{x}_{T_u}\}$$

Usually we assume that every senone in the transcription needs to claim at least one frame in the utterance. With that assumption, from any state $(m,t)$, it is only possible to transition to either $(m,t+1)$ or $(m+1,t+1)$, so the transition posterior probabilities are:
$$\xi_{u,m,t,d}=\begin{cases} \alpha_{u,m,t}a_{y_m,y_{m+d}}\beta_{u,m+d,t+1} & 0\le d\le 1,1\le m, m+d\le M_u,1\le t<T_u\\ \alpha_{u,m,T_u} & d=0,1\le m\le M_u,t=T_u\\ 0 & \mbox{otherwise} \end{cases}$$

import importlib, submitted
importlib.reload(submitted)
help(submitted.transition_posteriors)
Help on function transition_posteriors in module submitted: transition_posteriors(Alphadict, Betadict, transcripts, A, ntrain) Calculate the posterior probability of each transition given the training data and its transcript. @param: Alphadict[u][m,t] is the probability of the path ending at the m'th senone at time t Betadict[u][m,t] is the probability of the path starting at the m'th senone at time t transcripts[u]['senone'].split() is a list of the senones in utterance u mfcc[u][:,t] is the t'th MFCC vector in utterance u A[i][j] is probability of a transition from senone i to senone j ntrain is the number of utterances to process @return: xi[u][m,t,d] is the posterior probability of a transition from the m'th to the (m+d)'th senone in utterance u at time t. len(xi)==ntrain.
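As a rough sketch of what the formula means for a single utterance (hard 0/1 alignments, no normalization; the helper below is illustrative only and is not the required submitted.py interface):

import numpy as np

def transition_posteriors_one_utterance(Alpha, Beta, Y, A):
    '''Alpha, Beta: (M,T) alignment matrices; Y: list of senone labels;
    A[i][j]: transition probability from senone i to senone j (strings).
    Returns xi of shape (M,T,2): xi[m,t,0]=self-loop, xi[m,t,1]=advance.'''
    M, T = Alpha.shape
    xi = np.zeros((M,T,2))
    for t in range(T-1):
        for m in range(M):
            for d in (0,1):
                if m+d < M:
                    xi[m,t,d] = Alpha[m,t]*A[Y[m]][Y[m+d]]*Beta[m+d,t+1]
    # at the last frame, the formula assigns xi[m,T-1,0]=Alpha[m,T-1] for every m
    xi[:,T-1,0] = Alpha[:,T-1]
    return xi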
As a first step, debug using ntrain=1.

As an initial A matrix, we can use a matrix of all ones. This violates the laws of probability, but it makes the code easier to debug.
import numpy as np
import importlib, submitted, time
importlib.reload(submitted)
A_ones = {i:{j:1 for j in senones} for i in senones}
processing_start = time.time()
xi0=submitted.transition_posteriors(
Alpha0,Beta0,
transcripts['train'],
A_ones,1
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 0.00750899 seconds
fig, ax = plt.subplots(2,1,figsize=(14,8))
im0=librosa.display.specshow(
invert_mfcc(data['train'][k0])[:,:150],
sr=22050,
hop_length=int(0.01*fs),
win_length=int(0.025*fs),
y_axis='mel',
x_axis='s',
fmax=8000,
ax=ax[0]
)
ax[0].set_title('Melspectrogram of %s: %s...'%(k0,transcripts['train'][k0]['word'][:21]),fontsize=18)
plt.colorbar(im0,ax=ax[0])
im1=ax[1].imshow(xi0[k0][:60,:150,0]+2*xi0[k0][:60,:150,1],aspect='auto')
plt.colorbar(im1,ax=ax[1])
ax[1].set_xlabel('Time (frames)',fontsize=18)
ax[1].set_ylabel('Senone index',fontsize=18)
ax[1].set_title('Transition posteriors (1=self,2=forward)',fontsize=18)
fig.tight_layout()
Once your code is working, try computing xi matrices for ntrain of the training files.
processing_start = time.time()
xi0=submitted.transition_posteriors(
Alpha0,Beta0,
transcripts['train'],
A_ones,ntrain
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 1.32532 seconds
The three types of model parameters for a Gaussian HMM are the transition probabilities, the state mean vectors, and the state variance vectors, defined as:
$$a_{i,j}=\Pr\{y_{q_t}=j|y_{q_{t-1}}=i\},~~~1\le i,j\le N$$
$$\mathbf{\mu}_i=E\left[\mathbf{x}_t|y_{q_t}=i\right],~~1\le i\le N$$
$$\mathbf{\sigma}^2_i=E\left[(\mathbf{x}_t-\mathbf{\mu}_i)^2|y_{q_t}=i\right],~~1\le i\le N,$$

where $N$ is the number of distinct senones (the size of the list transcripts['senoneset'].split()), and $(\mathbf{x}_t-\mathbf{\mu}_i)^2$ denotes squaring each scalar element of the vector. Given the transition posteriors $\xi_{u,m,t,d}$, the model re-estimation formulas are:

$$a_{i,j}=\frac{\sum_u\sum_{t=1}^{T_u}\sum_{m:y_m=i}\sum_{d:y_{m+d}=j}\xi_{u,m,t,d}}{\sum_u\sum_{t=1}^{T_u}\sum_{m:y_m=i}\sum_{d}\xi_{u,m,t,d}}$$
$$\mathbf{\mu}_i=\frac{\sum_u\sum_{t=1}^{T_u}\sum_{m:y_m=i}\sum_{d}\xi_{u,m,t,d}\,\mathbf{x}_{u,t}}{\sum_u\sum_{t=1}^{T_u}\sum_{m:y_m=i}\sum_{d}\xi_{u,m,t,d}}$$
$$\mathbf{\sigma}^2_i=\max\left(\sigma^2_{min},\frac{\sum_u\sum_{t=1}^{T_u}\sum_{m:y_m=i}\sum_{d}\xi_{u,m,t,d}\,(\mathbf{x}_{u,t}-\mathbf{\mu}_i)^2}{\sum_u\sum_{t=1}^{T_u}\sum_{m:y_m=i}\sum_{d}\xi_{u,m,t,d}}\right)$$

where $\sigma^2_{min}$ is a hyperparameter, used to make sure that the variance-weighted feature distances don't get too large.
Notice that each denominator is just the sum of the $a_{i,j}$ numerator, summed across $j$. For this reason, we only need accumulators for the numerators, not for the denominators.
BTW, to make all the denominators equal, we had to assume that there is a transition starting from the last time step. Can you see what extra transition was assumed by this formula?
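A rough numpy sketch of the accumulator idea for a single utterance (illustrative only; the names echo the help text below, the accumulators are assumed pre-initialized to zero, and the loops are deliberately unoptimized):

import numpy as np

def accumulate_one_utterance(xi_u, Y, mfcc_u, A_num, mu_num, var_num, mu=None):
    '''Add one utterance's contribution to the re-estimation numerators.
    xi_u: (M,T,2) transition posteriors; Y: senone list; mfcc_u: (nfeats,T).
    A_num[i][j] is a scalar, mu_num[i] and var_num[i] are vectors.
    If mu is None, skip the variance terms (they need the new means).'''
    M, T, _ = xi_u.shape
    for m in range(M):
        i = Y[m]
        for t in range(T):
            for d in (0,1):
                if m+d >= M or xi_u[m,t,d] == 0:
                    continue
                A_num[i][Y[m+d]] += xi_u[m,t,d]
                mu_num[i] += xi_u[m,t,d]*mfcc_u[:,t]
                if mu is not None:
                    var_num[i] += xi_u[m,t,d]*(mfcc_u[:,t]-mu[i])**2
    return A_num, mu_num, var_num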
import importlib, submitted
importlib.reload(submitted)
help(submitted.reestimate)
Help on function reestimate in module submitted: reestimate(xi, senones, transcripts, mfcc, minvar, ntrain) Maximum-likelihood re-estimation of HMM transition probabilities. @param: xi[u][m,t,d] is the posterior probability of a transition from the m'th to the (m+d)'th senone in utterance u at time t. senones is a list of the senone labels transcripts[u]['senone'].split() is a list of the senones in utterance u mfcc[u][:,t] is the t'th MFCC vector in utterance u minvar (scalar): make sure all variances are greater than or equal to this number. Also, use this value for any variance terms whose calculation would cause a divide-by-zero error. ntrain is the number of utterances to process @return: A[i][j] is the probability of a transition from senone i to senone j (i,j strings). If calculating A[i][j] would divide by zero, set A[i][j]=0. mu[i] is the reestimated mean (an np.ndarray) vector for senone i (a string). If calculating mu[i] would divide by zero, set mu[i]=np.zeros(nfeats). var[i] is the reestimated variance vector (an ndarray) for senone i (a string). If calculating var[i] would divide by zero, set var[i]=np.tile(minvar, nfeats). A_numerator[i][j] is the reestimation numerator for transitions from senone i to j, including the transition from time T to time T+1 mu_numerator[i] is the reestimation numerator for the mean vector of senone i var_numerator[i] is the reestimation numerator for the variance vector of senone i
import numpy as np
import importlib, submitted, time
importlib.reload(submitted)
processing_start = time.time()
A0,mu0,var0,A_num,mu_num,var_num=submitted.reestimate(
xi0,
transcripts['senoneset'].split(),
transcripts['train'],
data['train'],
1,
1
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 0.102285 seconds
Now let's test some values.
- A_num['$1']['$2'] should equal 4, because $1 occurs 4 times in this utterance.
- A_num['$1']['$1'] should be more than A_num['$1']['$2'], because there are more than 2 frames per senone in this file.
- A_num['$1']['$3'] should be zero, because that transition should never occur.
- A['o1']['o2'] should be A_num['o1']['o2']/sum(A_num['o1'].values()).
- sum(A_num['a1'].values()) should be zero, because the phoneme a does not occur in this training file.
- A['a1']['a2'] and mu['a1'][0] should both be zero, but var['a1'][0] should be minvar, because those are the default settings specified for variables whose computation would cause a divide-by-zero error.
- mu['o1'] should be mu_num['o1']/sum(A_num['o1'].values()).
- var['o1'] should be var_num['o1']/sum(A_num['o1'].values()).
- The mean vector for the phoneme o should have higher energy (zeroth cepstral coefficient) than the mean vector for silent phonemes like $. It's not certain that this will happen for this initial training, because the starting segmentation is so bad.

print('A_num["$1"]["$2"] is',A_num['$1']['$2'])
print('A_num["$1"]["$1"] is',A_num['$1']['$1'])
print('A_num["$1"]["$3"] is',A_num['$1']['$3'])
print('A0["o1"]["o2"] is',A0['o1']['o2'])
print('A_num["o1"]["o2"]/denom is',A_num['o1']['o2']/sum(A_num['o1'].values()))
print('sum(A_num["a1"].values()) is',sum(A_num["a1"].values()))
print('A0["a1"]["a2"] is',A0["a1"]["a2"])
print('mu0["a1"][0] is',mu0["a1"][0])
print('var0["a1"][0] is',var0["a1"][0])
A_num["$1"]["$2"] is 4.0 A_num["$1"]["$1"] is 5.0 A_num["$1"]["$3"] is 0 A0["o1"]["o2"] is 0.3333333333333333 A_num["o1"]["o2"]/denom is 0.3333333333333333 sum(A_num["a1"].values()) is 0 A0["a1"]["a2"] is 0 mu0["a1"][0] is 0.0 var0["a1"][0] is 1
import matplotlib.pyplot as plt
nfeats = len(mu0['o1'])
n = np.arange(nfeats)
fig, ax = plt.subplots(2,2,figsize=(14,8))
ax[0,0].plot(n,mu0['o1'],n,mu_num['o1']/sum(A_num['o1'].values()))
ax[0,0].set_title('Mean of senone o1')
ax[1,0].plot(n,var0['o1'],n,var_num['o1']/sum(A_num['o1'].values()))
ax[1,0].set_title('Variance of senone o1')
ax[0,1].plot(n,mu0['$1'],n,mu_num['$1']/sum(A_num['$1'].values()))
ax[0,1].set_title('Mean of senone $1')
ax[1,1].plot(n,var0['$1'],n,var_num['$1']/sum(A_num['$1'].values()))
ax[1,1].set_title('Variance of senone $1')
Text(0.5, 1.0, 'Variance of senone $1')
Once your code works well, and reasonably quickly, with one training file, try it with ntrain training files.
import numpy as np
import importlib, submitted, time
importlib.reload(submitted)
processing_start = time.time()
A0,mu0,var0,A_num,mu_num,var_num=submitted.reestimate(
xi0,
transcripts['senoneset'].split(),
transcripts['train'],
data['train'],
1,
ntrain
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 11.3008 seconds
The log probability of a vector $\mathbf{x}_t$ given the $i^{\text{th}}$ senone is
$$\log b_i(\mathbf{x}_t)=-\frac{1}{2}\sum_{d=1}^D\left( \frac{(x_{t,d}-\mu_{i,d})^2}{\sigma_{i,d}^2}+\log(2\pi\sigma_{i,d}^2)\right)$$

importlib.reload(submitted)
help(submitted.logB)
Help on function logB in module submitted: logB(mfcc, mu, var, ntrain) Calculate the log observation probability densities for each state at each time assuming a diagonal-covariance Gaussian probability density model. @param: mfcc[u][:,t] is the t'th MFCC feature vector in utterance u (u is a string, t an int) mu[i] is the mean feature vector for senone i var[i] is the vector of feature variances for senone i ntrain is the number of utterances to process @return: logB[u][i][t]=log Pr{mfcc[u][:,t] | senone=i} u and i are strings, t is an int. len(logB)==ntrain.
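A minimal numpy sketch of that formula for one senone and one utterance, vectorized over frames (variable names are mine):

import numpy as np

def log_gaussian(mfcc_u, mu_i, var_i):
    '''Log diagonal-covariance Gaussian density of every frame of mfcc_u
    (shape (nfeats,T)) given mean mu_i and variance var_i (each (nfeats,)).
    Returns a length-T vector.'''
    diff2 = (mfcc_u - mu_i[:,np.newaxis])**2
    return -0.5*np.sum(diff2/var_i[:,np.newaxis]
                       + np.log(2*np.pi*var_i[:,np.newaxis]), axis=0)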
importlib.reload(submitted)
processing_start=time.time()
logB0 = submitted.logB(data['train'],mu0,var0,1)
print('len(logB1["LJ001-0001"]["$1"])=',len(logB0['LJ001-0001']['$1']))
print('Processing required %g seconds'%(time.time()-processing_start))
len(logB1["LJ001-0001"]["$1"])= 968 Processing required 0.0484869 seconds
We should find that the number of frames is 968.
fig, ax = plt.subplots(1,1,figsize=(14,6))
for i in transcripts['train'][k0]['senone'].split()[:20:3]:
    ax.plot(logB0[k0][i][:150],label=i)
ax.legend(fontsize=18)
ax.set_title('Log likelihoods of a few different senones',fontsize=18)
ax.set_xlabel('Time (frame index)',fontsize=18)
Text(0.5, 0, 'Time (frame index)')
fig, ax = plt.subplots(1,1,figsize=(14,6))
Y = transcripts['train'][k0]['senone'].split()
logB_image = np.empty((len(Y),968))
for m in range(len(Y)):
    logB_image[m,:]=logB0[k0][Y[m]]
im0=ax.imshow(logB_image)
ax.set_title('logB0 of "%s..."'%(transcripts['train'][k0]['word'][:20]),fontsize=18)
ax.set_ylabel('Senone index',fontsize=18)
ax.set_xlabel('Frame index',fontsize=18)
plt.colorbar(im0,ax=ax)
<matplotlib.colorbar.Colorbar at 0x7fccbeeea0b0>
When your code works correctly and reasonably fast, try applying it to ntrain training utterances.
importlib.reload(submitted)
processing_start=time.time()
logB0 = submitted.logB(data['train'],mu0,var0,ntrain)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 7.6602 seconds
Now that we've initialized the model, and calculated the log observation probabilities, we need to re-align the senones to the utterances. This can be done using the Viterbi algorithm.
Initialize: $$\delta_1(1)=\log b_{y_1}(\mathbf{x}_1)$$ $$\delta_1(m)=-\infty,~~m\ne 1$$
Iterate $$\delta_t(n)=\max_{m\in\{n-1,n\}}\delta_{t-1}(m)+\log a_{y_m,y_n}+\log b_{y_n}(\mathbf{x}_t)$$ $$\psi_t(n)=\arg\max_{m\in\{n-1,n\}}\delta_{t-1}(m)+\log a_{y_m,y_n}+\log b_{y_n}(\mathbf{x}_t)$$
Terminate and Backtrace $$m^*(T_u)=M_u$$ $$m^*(t)=\psi_{t+1}(m^*(t+1))$$ $$\alpha_{u,m,t}=\begin{cases}1&m=m^*(t)\\0&\mbox{otherwise}\end{cases}$$
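A rough numpy sketch of that recursion for a single utterance (illustrative only; it assumes the log transition probabilities have already been gathered into per-senone self-loop and advance vectors, which is not the submitted.py interface):

import numpy as np

def viterbi_align_one_utterance(logA_self, logA_next, logB_u):
    '''logA_self[m] = log a_{y_m,y_m}, logA_next[m] = log a_{y_m,y_{m+1}},
    logB_u[m,t] = log b_{y_m}(x_t).  Returns delta, psi, and the 0/1
    alignment matrix alpha.'''
    M, T = logB_u.shape
    delta = np.full((M,T), -np.inf)
    psi = np.zeros((M,T), dtype=int)
    delta[0,0] = logB_u[0,0]
    for t in range(1,T):
        for m in range(M):
            stay = delta[m,t-1] + logA_self[m]
            advance = delta[m-1,t-1] + logA_next[m-1] if m > 0 else -np.inf
            psi[m,t] = m if stay >= advance else m-1
            delta[m,t] = max(stay, advance) + logB_u[m,t]
    # backtrace from the last senone at the last frame
    alpha = np.zeros((M,T))
    m = M-1
    for t in range(T-1,-1,-1):
        alpha[m,t] = 1
        m = psi[m,t]
    return delta, psi, alpha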
importlib.reload(submitted)
help(submitted.viterbitrain)
Help on function viterbitrain in module submitted: viterbitrain(A, logB, transcripts, ntrain) Find the maximum likelihood time alignment of each transcript to its MFCC matrix. @param: A[i][j] is probability of a transition from senone i to senone j (i and j are strings) logB[u][i][t]=log Pr{mfcc[u][:,t] | senone=i} u and i are strings, t is an int transcripts[u]['senone'].split() is a list of the senones in utterance u ntrain is the number of utterances to process @return: Alphadict[u][m,t]==1 if m'th senone at time t is in the maximum likelihood path from (0,0) to (M-1,T-1), otherwise 0. len(Alphadict)==ntrain. Delta[u][m,t] is the log probability of the best path up to m'th senone at time t. len(Delta)==ntrain. Backpointer[u][m,t] is either m-1 or m, the most likely predecessor of m at time t. len(Backpointer)==ntrain
importlib.reload(submitted)
processing_start = time.time()
Alpha1,Delta1,Psi1 = submitted.viterbitrain(
A0,
logB0,
transcripts['train'],
1
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 0.644256 seconds
One way to check our code: $\delta_t(m)$ should be $-\infty$ for any $t<m$, because we only allow the HMM to advance by at most one senone per frame.
print(Delta1[k0][:4,:4])
[[ -77.3509607 -155.07789671 -242.35723103 -330.93734968] [ -inf -157.92764231 -241.83262821 -329.31514669] [ -inf -inf -245.11691661 -329.81442507] [ -inf -inf -inf -336.02052783]]
We can also check the calculations, to make sure that $\delta_t(n)$ is either $\delta_{t-1}(n)+\log a_{y_n,y_n}+\log b_{y_n}(t)$ or $\delta_{t-1}(n-1)+\log a_{y_{n-1},y_n}+\log b_{y_n}(t)$, whichever is larger.
logA_test = np.tile(-np.inf,(4,4))
logB_test = np.empty((4,4))
Y = transcripts['train'][k0]['senone'].split()
for m in range(4):
    logB_test[m,:] = logB0[k0][Y[m]][:4]
    for n in range(4):
        if A0[Y[m]][Y[n]]>0:
            logA_test[m,n]=np.log(A0[Y[m]][Y[n]])
print('logA[m,n] is:')
print(logA_test)
print('')
print('logB[m,t] is:')
print(logB_test)
print('')
print('For example, delta[1,3] should be either')
print(Delta1[k0][1,2]+logA_test[1,1]+logB_test[1,3])
print('or')
print(Delta1[k0][0,2]+logA_test[0,1]+logB_test[1,3])
logA[m,n] is: [[-0.46833654 -0.98361393 -inf -inf] [ -inf -0.43201381 -1.04754018 -inf] [-2.58215916 -inf -0.30285175 -5.23010544] [ -inf -inf -inf -0.44995724]] logB[m,t] is: [[-77.3509607 -77.25859947 -86.81099779 -88.11178211] [-79.63115263 -79.59306768 -85.77111758 -87.05050467] [-78.93853412 -78.9200743 -86.14173412 -86.93425668] [-83.22614421 -83.09596919 -84.216702 -85.67350578]] For example, delta[1,3] should be either -329.3151466935237 or -330.3913496280543
We should be able to see the same phenomena (a triangle of -np.inf values at the start of the file; otherwise, values of delta getting gradually more negative as we go from left to right through the utterance) in images of the whole file. Also, notice that the Alpha1 trajectory should start at the top left, and end at the bottom right.
fig, ax = plt.subplots(3,1,figsize=(14,12))
im0=librosa.display.specshow(
invert_mfcc(data['train'][k0]),
sr=fs,
hop_length=int(0.01*fs),
win_length=int(0.025*fs),
y_axis='mel',
x_axis='s',
fmax=8000,
ax=ax[0]
)
ax[0].set_title('Melspectrogram of %s: %s...'%(k0,transcripts['train'][k0]['word'][:21]),fontsize=18)
plt.colorbar(im0,ax=ax[0])
im1=ax[1].imshow(Delta1[k0][:,:],aspect='auto')
plt.colorbar(im1,ax=ax[1])
ax[1].set_xlabel('Time (frames)',fontsize=18)
ax[1].set_ylabel('Senone index',fontsize=18)
M,T = Delta1[k0].shape
ax[1].set_title('logprob best path %s: %d senones, %d frames'%(k0,M,T),fontsize=18)
im2=ax[2].imshow(Alpha1[k0][:,:],aspect='auto')
plt.colorbar(im2,ax=ax[2])
ax[2].set_xlabel('Time (frames)',fontsize=18)
ax[2].set_ylabel('Senone index',fontsize=18)
M,T = Alpha1[k0].shape
ax[2].set_title('Viterbi segmentation of %s: %d senones, %d frames'%(k0,M,T),fontsize=18)
fig.tight_layout()
As you can see, different implementations of the same equation may lead to small differences (about 0.01%) in the values computed. The max operation is sensitive to such differences, so your Alpha1 may differ slightly from the reference Alpha1, but your Delta1 should not differ by more than about 0.01% per frame.
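If you want to check that kind of tolerance yourself, a small sketch (the reference filename is hypothetical):

import numpy as np

# Hypothetical reference array; matching -inf entries compare as equal.
ref = np.load('Delta1_reference.npy')
print(np.allclose(Delta1[k0], ref, rtol=1e-4))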
When your code seems to be working, and is fast enough to use on the whole dataset, try using ntrain training files.
importlib.reload(submitted)
processing_start = time.time()
Alpha1,Delta1,Psi1 = submitted.viterbitrain(
A0,
logB0,
transcripts['train'],
ntrain
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 71.3123 seconds
... and if that worked, you might want to try one more complete iteration of Viterbi training:
- use Alpha to recompute xi,
- use xi to recompute the model and logB,
- use logB to recompute the Viterbi alignment.

importlib.reload(submitted)
processing_start = time.time()
xi1=submitted.transition_posteriors(
Alpha1,
Alpha1,
transcripts['train'],
A0,
ntrain
)
A1,mu1,var1,A_num,mu_num,var_num=submitted.reestimate(
xi1,
transcripts['senoneset'].split(),
transcripts['train'],
data['train'],
1,
ntrain
)
logB1 = submitted.logB(
data['train'],
mu1,
var1,
ntrain
)
Alpha2,Delta2,Psi2 = submitted.viterbitrain(
A1,
logB1,
transcripts['train'],
ntrain
)
print('Processing required %g seconds'%(time.time()-processing_start))
Processing required 91.5484 seconds
fig, ax = plt.subplots(3,1,figsize=(14,12))
im0=librosa.display.specshow(
invert_mfcc(data['train'][k0]),
sr=fs,
hop_length=int(0.01*fs),
win_length=int(0.025*fs),
y_axis='mel',
x_axis='s',
fmax=8000,
ax=ax[0]
)
ax[0].set_title('Melspectrogram of %s: %s...'%(k0,transcripts['train'][k0]['word'][:21]),fontsize=18)
plt.colorbar(im0,ax=ax[0])
im1=ax[1].imshow(Delta2[k0][:,:],aspect='auto')
plt.colorbar(im1,ax=ax[1])
ax[1].set_xlabel('Time (frames)',fontsize=18)
ax[1].set_ylabel('Senone index',fontsize=18)
M,T = Delta2[k0].shape
ax[1].set_title('logprob best path %s: %d senones, %d frames'%(k0,M,T),fontsize=18)
im2=ax[2].imshow(Alpha2[k0][:,:],aspect='auto')
plt.colorbar(im2,ax=ax[2])
ax[2].set_xlabel('Time (frames)',fontsize=18)
ax[2].set_ylabel('Senone index',fontsize=18)
M,T = Alpha2[k0].shape
ax[2].set_title('Viterbi segmentation of %s: %d senones, %d frames'%(k0,M,T),fontsize=18)
fig.tight_layout()
For the last part of the main MP, let's just find a good way to show the force-aligned segmentations. For this purpose, we will use matplotlib's xticks and xticklabels.
importlib.reload(submitted)
help(submitted.show_segmentations)
Help on function show_segmentations in module submitted: show_segmentations(Alphadict, transcripts, ntrain) Generate xticks and xticklabels to show phone segmentations. @param: Alphadict[u][m,t] is the probability of the path ending at the m'th senone at time t transcripts[u]['senone'].split() is a list of the senones in utterance u ntrain is the number of utterances to process @return: xticks[u] is a list of frame indices at which phones start, i.e., for Y[m] that is the first senone of the k'th phone, xticks[u][k] should be the first value of t such that Alphadict[u][m,t]>0.5. Include long-silence phones ($), but not word-boundary phones (#). xticklabels[u] is a list of the corresponding phone strings.
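A rough sketch of the idea for a single utterance (illustrative only, not the required interface; it assumes a 0/1 alignment matrix and the senone naming conventions described earlier):

def segmentation_ticks(Alpha, Y):
    '''Alpha: (M,T) 0/1 alignment matrix; Y: list of senone labels.
    A phone starts wherever its first senone (e.g. 'o1' or '$1') first
    claims a frame; word-boundary senones '#' are skipped.'''
    xticks, xticklabels = [], []
    M, T = Alpha.shape
    for m in range(M):
        if Y[m] == '#' or not Y[m].endswith('1'):
            continue
        frames = [t for t in range(T) if Alpha[m,t] > 0.5]
        if frames:
            xticks.append(frames[0])
            xticklabels.append(Y[m][:-1])   # strip the '1' to get the phone
    return xticks, xticklabels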
First, let's show the initial uniform segmentations:
importlib.reload(submitted)
xticks, xticklabels=submitted.show_segmentations(
Alpha0,
transcripts['train'],
1
)
print(xticks[k0][:19])
print(xticklabels[k0][:19])
[0, 7, 15, 23, 31, 39, 47, 55, 63, 71, 79, 87, 94, 105, 113, 121, 129, 137, 147] ['$', 'p', 'ɹ', 'ɪ', 'n', 't', 'ɪ', 'ŋ', '$', 'ɪ', 'n', 'ð', 'ɪ', 'o', 'ʊ', 'n', 'l', 'i', 's']
fig, ax = plt.subplots(1,1,figsize=(14,6))
im0=librosa.display.specshow(
invert_mfcc(data['train'][k0])[:,:150],
sr=fs,
hop_length=int(0.01*fs),
win_length=int(0.025*fs),
y_axis='mel',
x_axis='frames',
fmax=8000,
ax=ax
)
ax.set_xticks(xticks[k0][:19])
ax.set_xticklabels(xticklabels[k0][:19],fontsize=18)
ax.set_xlabel('Phone alignment times, iteration 0',fontsize=18)
ax.set_title(transcripts['train'][k0]['word'][:21]+'...',fontsize=18)
Text(0.5, 1.0, 'Printing, in the only...')
Then, after the first round of Viterbi segmentation:
importlib.reload(submitted)
xticks, xticklabels=submitted.show_segmentations(
Alpha1,
transcripts['train'],
1
)
fig, ax = plt.subplots(1,1,figsize=(14,6))
im0=librosa.display.specshow(
invert_mfcc(data['train'][k0])[:,:150],
sr=fs,
hop_length=int(0.01*fs),
win_length=int(0.025*fs),
y_axis='mel',
x_axis='frames',
fmax=8000,
ax=ax
)
ax.set_xticks(xticks[k0][:19])
ax.set_xticklabels(xticklabels[k0][:19],fontsize=18)
ax.set_xlabel('Phone alignment times, iteration 1',fontsize=18)
ax.set_title(transcripts['train'][k0]['word'][:21]+'...',fontsize=18)
Text(0.5, 1.0, 'Printing, in the only...')
After two rounds:
importlib.reload(submitted)
xticks, xticklabels=submitted.show_segmentations(
Alpha2,
transcripts['train'],
1
)
fig, ax = plt.subplots(1,1,figsize=(14,6))
im0=librosa.display.specshow(
invert_mfcc(data['train'][k0])[:,:150],
sr=fs,
hop_length=int(0.01*fs),
win_length=int(0.025*fs),
y_axis='mel',
x_axis='frames',
fmax=8000,
ax=ax
)
ax.set_xticks(xticks[k0][:19])
ax.set_xticklabels(xticklabels[k0][:19],fontsize=18)
ax.set_xlabel('Phone alignment times, iteration 2',fontsize=18)
ax.set_title(transcripts['train'][k0]['word'][:21]+'...',fontsize=18)
Text(0.5, 1.0, 'Printing, in the only...')
In order to transcribe an unknown speech file, the usual method is to apply the Viterbi algorithm, but with one twist: there is no fixed transcription that we need to match. Any senone, $j$, can follow any other senone, $i$, as long as $a_{i,j}>0$.
With this twist, the Viterbi variables need to be indexed by the senone label $i$ rather than the transcription index $m$, because there is no transcription.
Initialize: $$\delta_1(i)= \log b_{i}(\mathbf{x}_t),~~\forall i$$
Iterate $$\delta_t(j)= \max_{i}\delta_{t-1}(i)+\log a_{i,j}+\log b_{j}(\mathbf{x}_t), ~~\forall t,\forall j $$ $$\psi_t(j)= \arg\max_{i}\delta_{t-1}(i)+\log a_{i,j}+\log b_{j}(\mathbf{x}_t), ~~\forall t,\forall j$$
Terminate and Backtrace $$i^*(T_u)=\arg\max_i\delta_{T_u}(i)$$ $$i^*(t)=\psi_{t+1}(i^*(t+1)),~~\forall t<T_u$$ $$\alpha_{u,i,t}=\begin{cases}1&i=i^*(t)\\0&\mbox{otherwise}\end{cases}$$
Although the form of this algorithm is just like Viterbi training, the computational complexity is about 34X higher, because, at each time, for each senone, the number of possible predecessors has increased from 2 to the set of all senones (136).
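A rough numpy sketch of the recognition recursion (illustrative only; it indexes senones by integer position rather than by string, and omits the Num bookkeeping described in the help text below):

import numpy as np

def recognize_one_utterance(senones, logA, logB_u):
    '''senones: list of N senone labels; logA: (N,N) log transition matrix;
    logB_u: (N,T) log observation probabilities.
    Returns the best senone label for every frame.'''
    N, T = logB_u.shape
    delta = np.full((N,T), -np.inf)
    psi = np.zeros((N,T), dtype=int)
    delta[:,0] = logB_u[:,0]
    for t in range(1,T):
        # scores[i,j] = delta[i,t-1] + logA[i,j]: every predecessor i, every successor j
        scores = delta[:,t-1][:,np.newaxis] + logA
        psi[:,t] = np.argmax(scores, axis=0)
        delta[:,t] = scores[psi[:,t], np.arange(N)] + logB_u[:,t]
    # backtrace from the most likely final senone
    best = np.argmax(delta[:,T-1])
    path = [None]*T
    for t in range(T-1,-1,-1):
        path[t] = senones[best]
        best = psi[best,t]
    return path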
import importlib, extra
importlib.reload(extra)
help(extra.recognize)
Help on function recognize in module extra: recognize(A, logB, ntest) Find the maximum likelihood transcription of test utterances @param: A[i][j] is probability of a transition from senone i to senone j (i and j are strings) logB[u][i][t]=log Pr{mfcc[u][:,t] | senone=i}, u and i are strings, t is an int ntest: number of utterances to process @return: Alpha[u][i][t]==1 if i is senone in the best path at time t, otherwise 0 Delta[u][i][t] is the log probability of the best path up to senone i (str) at time t (int) Psi[u][i][t] is a string, specifying the most likely predecessor of senone i at time t. Num[u][i][t] is the number of frames that senone i has been repeated on the best path at time t len(hyp)==len(Delta)==len(Psi)==len(Num)==ntest
Let's try to recognize just the first utterance in the train subcorpus.
importlib.reload(extra)
logB = submitted.logB(
data['train'],
mu1,
var1,
1
)
Alpha, Delta, Psi = extra.recognize(
A1,
logB,
1
)
bestpath = []
u = list(Alpha.keys())[0]
for t in range(len(Alpha[u]['$1'])):
    for i in transcripts['senoneset'].split():
        if Alpha[u][i][t]==1:
            bestpath.append(i)
print('There were %d frames in the best path'%(len(bestpath)))
recognized = '$'
for s in bestpath:
    if s[0] != recognized[-1]:
        recognized += s[0]
print(recognized.replace('$',' ').replace('#',' '))
There were 968 frames in the best path θ ɐ tɪn ɪndeɪlɪsɛnz ðə wɪʃ ɪ oɹɛpɛz ns ðɚnt sɛfɹɛz fɑmɑm stɪʃənaʊfɑmɔl ɪ ɑɹ ksᵻkæf kstɹɛz ɛntᵻtɪni ɛksɪʃn
Well! It's not correct, but some of the strings sound similar to the correct utterance (e.g., the last word is "exhition" instead of "exhibition"). With more training iterations we should converge to 100% accuracy on the training corpus. Then, with more training data, we might start to get good performance on a test corpus.
The tests tests/test_visible.py and tests/test_extra.py are both run by the grade.py code, so you can test both by running python grade.py. Once you confirm that you pass the tests on your own machine, upload your code to Gradescope!