MP1: LPC

In this lab, you'll use linear predictive coding (LPC) to analyze and then resynthesize a speech sound.

In order to make sure everything works, you might want to go to the command line, and run

pip install -r requirements.txt

This will install the modules that are used on the autograder, including soundfile, numpy, h5py, and the gradescope utilities.


Part 1: Plotting and understanding the spectrogram

First, let's load a speech waveform. This is an extract from Nicholas James Bridgewater's reading of the Universal Declaration of Human Rights (https://librivox.org/universal-declaration-of-human-rights-volume-03-by-united-nations/).

LPC-10 synthesis was designed for an 8kHz sampling rate, so this file has been resampled to 8kHz.

Let's zoom in on part of it.

Let's also look at a spectrogram of the first 1.5 seconds or so, using a window length that's much shorter than one pitch period, so you can see the vertical striations corresponding to glottal closure instants. For this we'll use librosa. Note: librosa is not available on the autograder, so don't get too dependent on it.
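
For example, here's a rough sketch of how you might load the file and plot that short-window spectrogram. The filename and window length below are illustrative assumptions, not necessarily what the notebook uses:

import soundfile as sf
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

speech, fs = sf.read('humanrights.wav')     # hypothetical filename; fs should be 8000
segment = speech[:int(1.5 * fs)]            # first 1.5 seconds

# Use a window much shorter than one pitch period (~10ms), e.g. 2ms = 16 samples at 8kHz
S = librosa.stft(segment, n_fft=256, win_length=16, hop_length=8)
librosa.display.specshow(librosa.amplitude_to_db(np.abs(S)), sr=fs, hop_length=8,
                         x_axis='time', y_axis='hz')
plt.colorbar(format='%+2.0f dB')
plt.show()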

Just to make sure we know what's going on here, let's listen to just that same part of the utterance.

Notice how the vowels show up as strongly voiced regions (with vertical striations once per glottal pulse). The /sh/ in the middle shows up as a loud fricative burst at around 0.8 seconds. The /k/, /g/, and the pause between words show up as silences at 0.4, 0.6, and 1.25 seconds, respectively.


Part 2: Chop it into frames

At this point, we'll load the file submitted.py.

The file submitted.py is the only part of your work that the autograder will see. The only purpose of this notebook is to help you debug submitted.py. Once you have revised submitted.py enough to make this notebook work, go to the command line and type python grade.py. Once that command runs without errors, you can submit your file submitted.py to the autograder. You can submit to the autograder as often as you want, but it will save you trouble if you debug as much as you can on your local machine before you submit.

We will use importlib in order to reload your submitted.py over and over again. That way, every time you make a modification in submitted.py, you can just re-run the corresponding block of this notebook, and it will reload submitted.py with your modified code.

Since the file is called submitted.py, python considers it to contain a module called submitted. As shown, you can read the module's docstring by printing submitted.__doc__. You can also type help(submitted) to get a lot of information about the module, including its docstring, a list of all the functions it defines, and all of their docstrings. For more about docstrings, see, for example, https://www.python.org/dev/peps/pep-0257/.

Now it's time for you to open submitted.py, and start editing it. You can open it in another Jupyter window by choosing "Open from Path" from the "File" menu, and then typing submitted.py.

Once you have it open, try editing the function make_frames so that its functionality matches its docstring. Here is what its docstring says:

If this is working, then dividing the speech signal into 30ms frames (240 samples), with a 15ms hop (120 samples), should yield 1189 frames.
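
To give you an idea of what's being asked, here is a minimal sketch of one way make_frames could work; the exact signature, and how any final partial frame is handled, are determined by the docstring, not by this sketch:

import numpy as np

def make_frames(speech, hop_length, win_length):
    # Frame t contains samples t*hop_length through t*hop_length + win_length - 1
    nframes = 1 + (len(speech) - win_length) // hop_length
    frames = np.zeros((nframes, win_length))
    for t in range(nframes):
        frames[t, :] = speech[t * hop_length : t * hop_length + win_length]
    return frames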


Part 3: Calculating autocorrelation of each frame

Now write the function submitted.correlate, to satisfy its docstring:

The autocorrelation of frame $x[n]$ is given by

$$r_{xx}[m] = \sum_n x[n] x[n+m]$$

You can compute this efficiently using https://numpy.org/doc/stable/reference/generated/numpy.correlate.html, but be sure to specify mode='full'.

Notice that the autocorrelation coefficient $R[0]$ is right at the center of each row (at autocor[:,win_length-1]), and the autocorrelation is symmetric. So for example, autocor[0,win_length-3:win_length+2] should equal

[0.000736 0.00078295 0.00081344 0.00078295 0.000736 ]
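
Here is a minimal sketch of correlate, assuming frames is an nframes-by-win_length array and the output is nframes-by-(2*win_length-1); check the docstring for the real conventions:

import numpy as np

def correlate(frames):
    nframes, win_length = frames.shape
    autocor = np.zeros((nframes, 2 * win_length - 1))
    for t in range(nframes):
        # mode='full' returns all lags, with R[0] at index win_length-1
        autocor[t, :] = np.correlate(frames[t, :], frames[t, :], mode='full')
    return autocor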

Part 4: Make matrices

Now that you've computed the autocorrelation of each frame, you need to re-arrange it into matrices, so that you can find the LPC coefficients.

Remember how these matrices are defined:

$$R = \left[\begin{array}{cccc} R[0] & R[1] & \cdots & R[p-1]\\ R[1] & R[0] & \cdots & R[p-2] \\ \vdots & \vdots & \ddots & \vdots \\ R[p-1] & R[p-2] & \cdots & R[0]\end{array}\right],~~~ \gamma = \left[\begin{array}{c}R[1]\\R[2]\\\vdots\\R[p]\end{array}\right]$$

The only thing you have to be careful about is that, in the $t^{\textrm{th}}$ frame, $R[0]$ is at the center of autocor[t,:]. So you have to figure out where the center is, and align things accordingly. If you get everything working, then with $p=10$, R[0,:6,:6] and gamma[0,:6] should look like this:

In frame 0, R is 
[[0.00081344 0.00078295 0.000736   0.00069581 0.00066528 0.00063886]
 [0.00078295 0.00081344 0.00078295 0.000736   0.00069581 0.00066528]
 [0.000736   0.00078295 0.00081344 0.00078295 0.000736   0.00069581]
 [0.00069581 0.000736   0.00078295 0.00081344 0.00078295 0.000736  ]
 [0.00066528 0.00069581 0.000736   0.00078295 0.00081344 0.00078295]
 [0.00063886 0.00066528 0.00069581 0.000736   0.00078295 0.00081344]]

and gamma is
[0.00078295 0.000736   0.00069581 0.00066528 0.00063886 0.00060859]
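
Here is a sketch of one way to build those matrices from autocor. The shapes shown (R is nframes-by-p-by-p, gamma is nframes-by-p) are assumptions, so defer to the docstring:

import numpy as np

def make_matrices(autocor, p):
    nframes, ncols = autocor.shape
    c = (ncols - 1) // 2                      # column index of R[0] in each row
    R = np.zeros((nframes, p, p))
    gamma = np.zeros((nframes, p))
    for t in range(nframes):
        for i in range(p):
            for j in range(p):
                R[t, i, j] = autocor[t, c + abs(i - j)]   # Toeplitz: R[i,j] = R[|i-j|]
            gamma[t, i] = autocor[t, c + i + 1]           # gamma[i] = R[i+1]
    return R, gamma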

Part 5: Calculating the LPC coefficients

Now you should calculate the LPC coefficients in each frame!

Remember that the normal equations result in the following easy form:

$$a = R^{-1}\gamma$$
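
In code, this is just a linear solve in each frame. A sketch, assuming R and gamma have the shapes from the previous sketch (np.linalg.solve is preferable to forming the inverse explicitly):

import numpy as np

def lpc(R, gamma):                            # hypothetical name; check the docstring
    nframes, p, _ = R.shape
    a = np.zeros((nframes, p))
    for t in range(nframes):
        try:
            a[t, :] = np.linalg.solve(R[t], gamma[t])
        except np.linalg.LinAlgError:
            pass                              # silent frames can make R singular; leave a[t]=0
    return a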

Here's a neat trick we can use, in order to see if the LPC coefficients we've calculated are reasonable. Remember that the LPC synthesis filter has the following frequency response:

$$H(\omega) = \frac{G}{1-\sum_{k=1}^p a_k e^{-jk\omega}}$$

If we calculate $H(\omega)$ explicitly, for every frame, then we can show it as if it were a spectrogram. It should look similar to the spectrogram of the original speech signal. It won't have the right amplitude (because we don't know $G$ yet), but it should have the right general shape.

Remember that we're using a 15ms hop_length, so, to match the 1.5 seconds that we plotted earlier, we'll just do this for the first 100 frames.
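
Here's a sketch of that visualization, assuming a is the nframes-by-p coefficient array from the previous sketch, an 8kHz sampling rate, and a 15ms hop:

import numpy as np
import matplotlib.pyplot as plt

nfft = 256
H = np.zeros((nfft // 2 + 1, 100))
for t in range(100):
    denom = np.concatenate(([1.0], -a[t, :]))     # 1 - a_1 z^{-1} - ... - a_p z^{-p}
    H[:, t] = 1.0 / np.abs(np.fft.rfft(denom, nfft))
plt.imshow(20 * np.log10(H), origin='lower', aspect='auto',
           extent=[0, 100 * 0.015, 0, 4000])      # 1.5 seconds, 0-4kHz
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.show()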


Part 6: Calculating the pitch period

In order to synthesize speech, we need to calculate the pitch period in each frame. In frames that are not voiced, we will signal that by setting framepitch to zero.

Let's see how this will work. If you look at the spectrogram above, you can see that there is a vowel at about $t=0.3$ seconds, i.e., frame number 20. There is a loud unvoiced fricative just before $t=0.9$s, i.e., frame numbers just before 60. Let's plot those autocorrelations, to compare them. In order to make the comparison easier, we'll normalize each of the two autocorrelations by their $R[0]$ values:

$$C[m] = \frac{R[m]}{R[0]}$$

You can see that the voiced frame has a strong peak at about 80 samples (about $80/8000=10$ ms), which seems likely to be the pitch period. The unvoiced frame has a lot of energy at very high frequencies, but at lags longer than about 20 samples ($20/8000=2.5$ ms), its normalized autocorrelation stays below about $|C[m]|=0.3$. So we'll use the following strategy:

Here is what the result looks like, for the first 100 frames. You will notice that the voiced/unvoiced (V/UV) decision is actually not very good. We would do better if we used more features for the V/UV decision, but this will be good enough for the purposes of this MP.
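
For reference, here is a rough sketch of a voicing/pitch detector in the spirit of the discussion above. The threshold of 0.3 and the 4ms-13ms search range are assumptions for illustration only; the graded version must follow the docstring in submitted.py:

import numpy as np

def framepitch(autocor, Fs):
    nframes, ncols = autocor.shape
    c = (ncols - 1) // 2                          # column index of R[0]
    lo, hi = int(0.004 * Fs), int(0.013 * Fs)     # assumed pitch-period search range
    pitch = np.zeros(nframes, dtype=int)
    for t in range(nframes):
        if autocor[t, c] > 0:
            C = autocor[t, c + lo : c + hi + 1] / autocor[t, c]
            if np.amax(C) > 0.3:                  # voiced if the peak is strong enough
                pitch[t] = lo + int(np.argmax(C))
    return pitch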


Part 7: Calculating the spectral level in every frame

Spectral level is defined to be the average spectral power, expressed in decibels:

$$L = 10\log_{10} \left(\frac{1}{N} \sum_{n=0}^{N-1} x^2[n]\right)$$

Let's calculate that for each frame.
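
A minimal sketch, assuming frames is the nframes-by-win_length array from Part 2 (the small floor is just to avoid log(0) in silent frames, and may or may not match the docstring):

import numpy as np

def framelevel(frames):
    power = np.mean(np.square(frames), axis=1)        # (1/N) * sum of x^2[n]
    return 10.0 * np.log10(np.maximum(power, 1e-12))  # level in dB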

If we plot this for the first 100 frames, we should see levels that match the brightness of the corresponding frames in the spectrogram.


Part 8: Linear interpolation of the pitch and level

OK, now we're finally done with the analysis! We've converted the input speech into a small set of parameters per frame:

Given those three parameters per frame, now we want to start the process of resynthesizing the speech signal.

The first thing we need to do is to linearly interpolate the pitch period and the spectral level between frames.

If we plot the first 100 frames of framelevel and framepitch, and compare it to the first 100*hop_length samples of samplelevel and samplepitch, they should look basically the same.

Notice that there is one difference between them: the frame pitch (top plot) appears to interpolate linearly between zero and large values (because pyplot interpolates linearly between plotted points by default), whereas the sample pitch does not.
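
Here is a sketch of the kind of interpolation being described. The alignment of frame t with sample t*hop_length, and the behavior at voiced/unvoiced boundaries, are assumptions; the docstring in submitted.py is authoritative:

import numpy as np

def interpolate(framelevel, framepitch, hop_length):
    nframes = len(framelevel)
    nsamples = nframes * hop_length
    t_frame = np.arange(nframes) * hop_length          # sample index assigned to each frame
    samplelevel = np.interp(np.arange(nsamples), t_frame, framelevel)

    samplepitch = np.zeros(nsamples)
    for t in range(nframes - 1):
        seg = slice(t * hop_length, (t + 1) * hop_length)
        if framepitch[t] > 0 and framepitch[t + 1] > 0:
            w = np.arange(hop_length) / hop_length     # interpolate only between voiced frames
            samplepitch[seg] = (1 - w) * framepitch[t] + w * framepitch[t + 1]
        else:
            samplepitch[seg] = framepitch[t]           # hold across voiced/unvoiced boundaries
    samplepitch[(nframes - 1) * hop_length:] = framepitch[-1]
    return samplelevel, samplepitch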


Part 9: Create the LPC Excitation Signal

The LPC excitation signal is

$$e[n] = \left\{\begin{array}{ll} \mbox{Zero-mean Gaussian white noise}& \mbox{unvoiced}\\\mbox{Impulse train}&\mbox{voiced}\end{array}\right.$$

There are two difficult points here:

  1. First, in order to keep track of when you should generate an impulse, you need to keep track of the phase of the fundamental frequency. Remember that phase is just the integral of frequency, so it's given by
$$\phi[n] = \phi[n-1] + 2\pi F_0[n]$$

where $F_0[n]=1/T_0[n]$ is the pitch frequency in cycles/sample (or $F_0[n]=0$ if $T_0[n]=0$). Every time $\phi[n]$ passes $2\pi$, we do two things: (1) Subtract $2\pi$, and (2) output an impulse in the excitation signal.

  2. Second: what scalar should you multiply each sample of $e[n]$ by, so that the local average level is always samplelevel[n]? Notice that your answer is different for white noise (where every sample is nonzero) versus an impulse train (where very few samples are nonzero). For white noise, the answer is the RMS signal amplitude, $a_{RMS}$, which is related to the level by just inverting the dB formula: $$a_{RMS} = \sqrt{10^{L/10}}$$ For an impulse train, you also need to account for the fact that only $1/T_0$ of the samples are nonzero. This gives you a delta-function amplitude, $a_{delta}$, of $$a_{delta}=a_{RMS}\sqrt{T_0}$$
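
Putting those two points together, here is a sketch of a phase-accumulator excitation generator. The function name and conventions are assumptions; follow the docstring in submitted.py:

import numpy as np

def excitation(samplepitch, samplelevel):
    e = np.zeros(len(samplepitch))
    phase = 0.0
    for n in range(len(samplepitch)):
        a_rms = np.sqrt(10.0 ** (samplelevel[n] / 10.0))
        if samplepitch[n] > 0:                       # voiced: impulse train
            phase += 2 * np.pi / samplepitch[n]      # phi[n] = phi[n-1] + 2*pi*F0[n]
            if phase >= 2 * np.pi:
                phase -= 2 * np.pi
                e[n] = a_rms * np.sqrt(samplepitch[n])   # a_delta = a_rms * sqrt(T0)
        else:                                        # unvoiced: Gaussian white noise
            e[n] = a_rms * np.random.randn()
    return e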

Here's what the phase looks like. Notice that during unvoiced regions, it remains constant. During voiced regions, it rotates from $\phi=0$ to $\phi=2\pi$ once every pitch period.

And here's what the excitation signal looks like. Notice that there's an impulse every time the phase returns to zero. During unvoiced frames, it is low-amplitude white noise.


Part 10: LPC Synthesis

Now we have the excitation signal $e[n]$, and we have the LPC coefficients $a_k$, so we can use those things to resynthesize the speech signal!

The synthesis formula is really just the standard direct-form IIR filter formula:

$$y[n] = e[n] + \sum_{k=1}^p a_k y[n-k]$$
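
A sketch of that filter, assuming e comes from Part 9, a is the nframes-by-p coefficient array, and sample n uses the coefficients of the frame it falls in (that last convention is an assumption; check the docstring):

import numpy as np

def synthesize(e, a, hop_length):
    nframes, p = a.shape
    y = np.zeros(len(e))
    for n in range(len(e)):
        t = min(n // hop_length, nframes - 1)    # frame containing sample n
        y[n] = e[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                y[n] += a[t, k - 1] * y[n - k]   # y[n] = e[n] + sum_k a_k y[n-k]
    return y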

Grade your code on your own machine before submitting it!

If you reached this point in the notebook, then probably your code is working well, but before you run the autograder on the server, you should first run it on your own machine.

You can do that by going to a terminal, and running the following command line:

python grade.py

If everything worked perfectly, you'll get a message that looks like this:

.........
----------------------------------------------------------------------
Ran 9 tests in 6.192s

OK

But suppose that something didn't work well. For example, suppose you run python grade.py, and you get the following:

........F
======================================================================
FAIL: test_synthesize (test_visible.TestStep)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jhasegaw/Dropbox/mark/teaching/ece417/ece417labs/21fall/mp1/src/tests/test_visible.py", line 159, in test_synthesize
    self.assertAlmostEqual(
AssertionError: 0.01161102437459162 != 0.02322204874918324 within 7 places (0.01161102437459162 difference) : *** y[8252] should be 0.01161102437459162
----------------------------------------------------------------------
Ran 9 tests in 6.533s

FAILED (failures=1)

This error message means that the function test_synthesize, in the file tests/test_visible.py, failed (all of the other tests succeeded -- only that one failed).

The error message specifies sample $y[8252]$ should have been $0.01161102437459162$, but instead, it was $0.02322204874918324$.

How can you debug this?

The autograder works as follows:

  1. Load reference inputs and outputs of each function from the file solutions.hdf5
  2. Put in the reference inputs, and get a hypothesis output.
  3. Compare the reference and the hypothesis at a few sample points. If they differ by more than one part in $10^7$, then call it an error.

You are strongly encouraged to load the file solutions.hdf5, and use it to debug your code. Here's how to do it:

As you can see, this file contains a lot of objects, created during a sample run of the solution code with random parameters. Let's load the reference inputs and outputs of synthesize, compute a hypothesis output, and compare them in the neighborhood of sample $8252$:
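
Here is a sketch of that debugging workflow. The dataset names 'excitation', 'a', and 'y' below are placeholders, and the synthesize signature is an assumption; print the keys to see what solutions.hdf5 really contains:

import h5py
import numpy as np
import matplotlib.pyplot as plt
import submitted

with h5py.File('solutions.hdf5', 'r') as h5:
    print(list(h5.keys()))                  # list the stored objects
    e_ref = h5['excitation'][:]             # placeholder name: reference input
    a_ref = h5['a'][:]                      # placeholder name: reference input
    y_ref = h5['y'][:]                      # placeholder name: reference output

y_hyp = submitted.synthesize(e_ref, a_ref)  # hypothesis output (signature assumed)
n = np.arange(8200, 8300)
plt.plot(n, y_ref[n], '-', label='reference (solid)')
plt.plot(n, y_hyp[n], '--', label='hypothesis (dashed)')
plt.legend()
plt.show()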

Right away, we notice that the expected solution (solid) is half the amplitude of the solution being generated by my file! Upon inspection, I discover that I accidentally multiplied the excitation by $2$ at the input to the filter. Fixing that line in submitted.py fixes all of the results.

When gradescope grades your code, it will run grade.py. It will test your code using the solutions in solutions.hdf5, and using the test code tests/test_visible.py. It will also test your code using some hidden tests. The hidden tests are actually exactly the same as those in tests/test_visible.py, but with different input parameters.


Extra Credit

You can earn up to 10% extra credit on this MP by finishing the file called extra.py, and submitting it to the autograder.

When you unpack the file mp1_extra.zip, it will give you the following files:

If you look in the file tests/test_extra.py, you find the following grading scheme:

Hint: if you just implement the same algorithm that was used in submitted.framepitch, you won't get full credit, but you'll do pretty well.

Other features to think about include:

If you have a large external data source with labeled voiced/unvoiced frames, you could try training a neural net using the spectrogram. I don't think that's necessary, though; with a three-dimensional feature vector, you should be able to tweak a VAD on extra_train.wav so that it gets low error on extra_test.wav.

When you think you have it working, you can test it by running:

python grade.py

in your terminal. Yes, indeed, this is the same code you ran for the regular MP! The only difference is that, when you unzipped mp1_extra.zip, it gave you the test file tests/test_extra.py. So now, when you run grade.py, it will grade both your regular MP and the extra credit part.

Having implemented only the default algorithm from submitted.framepitch, here's what I get already:

..F..........
======================================================================
FAIL: test_extra_better_than_6_percent (test_extra.TestStep)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jhasegaw/Dropbox/mark/teaching/ece417/ece417labs/21fall/mp1/src/tests/test_extra.py", line 33, in test_extra_better_than_6_percent
    self.assertLess(errorrate, 0.06)
AssertionError: 0.07996223039459298 not less than 0.06
----------------------------------------------------------------------
Ran 13 tests in 6.234s

FAILED (failures=1)

... so in case you didn't notice, that's already pretty good: the default algorithm passes every test except the 6% error-rate threshold.

Congratulations! That's the end of MP1. Good luck!