MP6: AutoVC

In this MP, you will construct neural network layers using PyTorch for use with the system described in "AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss" (ICML 2019) by Kaizhi Qian et al. (of the professor's own group!). The file you actually need to complete is submitted.py. The unit tests, provided in tests/test_visible.py, may be run using grade.py.

PyTorch basics

For this MP, we will be primarily using the PyTorch machine learning framework.

Compared to other machine learning frameworks in current use, PyTorch has achieved wide adoption across disciplines, both in research and in production, thanks to its deep repertoire of functionality and to how much the library handles automatically (especially its simple automatic gradient calculation interface). This leaves the user with relatively little to implement by hand for a given machine learning model.

A few comparisons to NumPy hint at how easy PyTorch is to use. To start with the basics: just as the primary object of manipulation in NumPy is the N-dimensional array (ndarray), PyTorch's object of choice is the N-dimensional tensor.

The two behave very similarly, since many methods used in PyTorch are designed with their direct equivalents in NumPy in mind:
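As a small illustration (the particular arrays here are arbitrary), note how closely the two libraries' calls mirror each other:

```python
import numpy as np
import torch

a_np = np.zeros((2, 3))                    # NumPy ndarray of zeros
a_pt = torch.zeros(2, 3)                   # the PyTorch tensor equivalent

b_np = np.arange(6).reshape(2, 3)
b_pt = torch.arange(6).reshape(2, 3)

print((a_np + b_np).sum())                 # 15.0
print((a_pt + b_pt).sum())                 # tensor(15.)

# Conversions between the two are also one-liners.
c_pt = torch.from_numpy(b_np)              # shares memory with b_np
c_np = b_pt.numpy()
```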

Of course, as useful as these simple functions may be, they're certainly not the entire story.

A typical PyTorch model consists of, at minimum, a class with two procedures: a constructor (__init__) that creates the model's layers and parameters, and a forward method that defines how an input is transformed into an output.

The layers of a neural network are all typically organized into modules, which when combined form a graph of computations (think the graph in lecture 22 slide 24) based on which gradients can be computed:
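For example, here is a small hypothetical network (the layer sizes are arbitrary); each attribute assigned in __init__ is itself a module, and chaining them in forward builds the computation graph:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Sub-modules assigned as attributes are registered automatically,
        # so their parameters appear in TinyNet's parameter list.
        self.hidden = nn.Linear(4, 8)
        self.activation = nn.ReLU()
        self.output = nn.Linear(8, 2)

    def forward(self, x):
        # Chaining the layers defines the computation graph
        # through which gradients will later flow.
        return self.output(self.activation(self.hidden(x)))
```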

Since each of the layers takes in exactly one input tensor and returns exactly one output tensor, a more concise way to define the class above might be as follows:
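Assuming the TinyNet sketch above, the equivalent nn.Sequential definition is:

```python
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)
```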

We can of course directly manipulate an instance of this module after having constructed it, for instance when needing to load saved parameters:
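For example (the checkpoint filename here is hypothetical):

```python
# Load a dictionary of parameter tensors saved earlier with torch.save(...)
state = torch.load("checkpoint.pt")
model.load_state_dict(state)
```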

To obtain an output from this model, we can just call it with an input (the arguments after self in the forward method):
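Continuing with the model above:

```python
x = torch.randn(3, 4)    # a batch of 3 inputs, each with 4 features
y = model(x)             # calls model.forward(x) under the hood; y has shape (3, 2)
```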

We can use any of the loss functions provided in torch.nn to obtain a metric for model performance. To then obtain the gradients of this loss with respect to the model's parameters, all we need to do is call the backward method of the resulting loss tensor:
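For instance, using a mean-squared-error loss against a made-up target:

```python
criterion = nn.MSELoss()
target = torch.randn(3, 2)        # an arbitrary regression target for illustration
loss = criterion(y, target)       # a scalar tensor measuring model error
loss.backward()                   # populates p.grad for every parameter p in the model
```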

AutoVC

AutoVC, put very simply, is a zero-shot style transfer autoencoder for voice conversion.

(There's a lot to unpack in that sentence, so read on...)

"style transfer"

The primary assumption that AutoVC makes is that any given speech utterance is dependent on two parts, each separately distributed (Sec. 3.1):

1) a content-specific component, corresponding roughly to the information about a sentence that would be captured in a textual transcription, and

2) a speaker-specific component, imparting information about how a given individual vocally produces that sentence.

It is important that an utterance converted to use the speaker-specific information of a target speaker sound as much like that target speaker as possible, while keeping its content-specific information unchanged (Eq. 2). Achieving this therefore requires that the two components can be readily disentangled.

"autoencoder"

An autoencoder (Fig. 1) is a combination of an encoder network and a decoder network, the output of the former serving as the input of the latter. It is often used to learn a lower-dimensional representation or 'embedding' of a given piece of data; because some information is lost in this dimension reduction, the reduction may be considered an information 'bottleneck'. An autoencoder is often trained to approximate its input as closely as possible, improving the quality of the embedding in the process.
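As a toy example (the dimensions here are arbitrary and unrelated to AutoVC's actual sizes):

```python
class ToyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(80, 16)   # squeeze 80 features into a 16-dim bottleneck
        self.decoder = nn.Linear(16, 80)   # try to reconstruct the original 80 features

    def forward(self, x):
        embedding = self.encoder(x)        # the lossy, lower-dimensional representation
        return self.decoder(embedding)

# Training typically minimizes the reconstruction error of an instance of this
# class, e.g. loss = nn.MSELoss()(model(x), x).
```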

AutoVC's encoder attempts to output an embedding for the content-specific component of an utterance by one speaker. This output, together with a similarly produced speaker-specific embedding from an utterance by another speaker, is then fed into AutoVC's decoder to yield a converted utterance. It is the size of the bottleneck (Sec. 3.3) that is tuned to ensure that the content embedding contains as little residual information about the first speaker as possible.

"zero-shot"

Most, if not all, prior voice conversion approaches require that the source and target speakers be known to the system during training. AutoVC, however, is able to handle speakers that it did not encounter in training. This ability stems largely from the speaker embedding being more than just a one-hot vector: a separate encoder is trained to generate it.

Trying it yourself

Provided for you are two files, source_utterance.flac and target_utterance.flac. Once you have completed the main MP, you can try AutoVC out, attempting to convert the voice in the source utterance into the voice in the target utterance, by running python _main.py and viewing the file converted_utterance.flac. You may also specify the files to use manually (for instance, transferring the voice in a.wav into b.wav and saving the result in c.wav), by running python _main.py b.wav a.wav c.wav.

What to deliver

This MP primarily consists of the implementation of different PyTorch modules. Some of them correspond to existing PyTorch layers, while others are direct re-implementations of AutoVC components. (Most of the code you don't have to write is adapted from Kaizhi's original code and from an adjusted version thereof.)

Each function to write has type hints in its signature for both inputs and output. In the line def f(x: int, y: float) -> str:, x is an integer, y is a floating-point number, and the output from calling f(x,y) is a string.

The hints used for individual tensors are supplied by the torchtyping package, which can make PyTorch code you write somewhat easier to understand since it allows you to specify information about dimensions. Here's a brief summary of what you will encounter in the hints:
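In brief, each tensor argument is annotated with torchtyping's TensorType, whose bracketed entries name the tensor's dimensions in order. For instance, a hypothetical function annotated this way might look like the following:

```python
from torchtyping import TensorType

def frame_energy(x: TensorType["batch", "length", "input_dim"]
                 ) -> TensorType["batch", "length"]:
    # x is expected to be a 3-dimensional tensor (batch, time, features);
    # the return value drops the feature dimension.
    return (x ** 2).sum(dim=-1)
```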

1. Linear layers

The first Module you will design is a simple linear, fully-connected layer, appropriately named LineEar.
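A minimal sketch of such a layer is shown below, assuming a constructor taking input and output sizes; the actual LineEar signature and required initialization are specified in submitted.py.

```python
class LineEarSketch(nn.Module):
    def __init__(self, input_size: int, output_size: int):
        super().__init__()
        # Learnable weight matrix and bias vector; the initialization here
        # is arbitrary and only for illustration.
        self.weight = nn.Parameter(0.01 * torch.randn(output_size, input_size))
        self.bias = nn.Parameter(torch.zeros(output_size))

    def forward(self, x):
        # y = x W^T + b, applied along the last dimension of x
        return x @ self.weight.T + self.bias
```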

2. Long Short-Term Memory

The next Module you will design is an LSTM module, appropriately named EllEssTeeEmm, constructed from torch.nn.LSTMCell modules. In addition to handling more than one layer, it needs to handle an optional bidirectional mode:
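One possible structure is sketched below, assuming a constructor taking input size, hidden size, number of layers, and a bidirectional flag; the actual EllEssTeeEmm signature is given in submitted.py, so treat this only as a sketch, not a reference implementation.

```python
class EllEssTeeEmmSketch(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, bidirectional=False):
        super().__init__()
        self.bidirectional = bidirectional
        dirs = 2 if bidirectional else 1
        # Layer l > 0 sees the (possibly concatenated) outputs of layer l - 1.
        self.fwd_cells = nn.ModuleList([
            nn.LSTMCell(input_size if l == 0 else dirs * hidden_size, hidden_size)
            for l in range(num_layers)])
        if bidirectional:
            self.bwd_cells = nn.ModuleList([
                nn.LSTMCell(input_size if l == 0 else dirs * hidden_size, hidden_size)
                for l in range(num_layers)])

    def forward(self, x):                      # x: (batch, length, input_size)
        batch, length, _ = x.shape
        for l, cell in enumerate(self.fwd_cells):
            h = x.new_zeros(batch, cell.hidden_size)
            c = x.new_zeros(batch, cell.hidden_size)
            fwd = []
            for t in range(length):            # left-to-right pass
                h, c = cell(x[:, t], (h, c))
                fwd.append(h)
            out = torch.stack(fwd, dim=1)
            if self.bidirectional:
                bcell = self.bwd_cells[l]
                h = x.new_zeros(batch, bcell.hidden_size)
                c = x.new_zeros(batch, bcell.hidden_size)
                bwd = []
                for t in reversed(range(length)):   # right-to-left pass
                    h, c = bcell(x[:, t], (h, c))
                    bwd.append(h)
                bwd.reverse()
                out = torch.cat([out, torch.stack(bwd, dim=1)], dim=-1)
            x = out                            # feed into the next layer
        return x
```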

3. Gated Recurrent Units

The GRU was introduced in Kyunghyun Cho et al., "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". It omits the LSTM's output gate and couples the input and forget gates into a single update gate: the hidden state at time $t$ is a convex combination of the state at time $t-1$ and a candidate state, with the mixing controlled by the update gate; the candidate state itself is computed from the current input and a reset-gated version of the previous state (think of the LSTM's memory cell update). Although the GRU is theoretically somewhat more limited than the LSTM when processing very long sequences, it has been shown to rival the LSTM (and in some cases outperform it) in many of the same experiments, with a noticeable reduction in training time and complexity.
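As a concrete reference, these are the update equations in the gate convention used by torch.nn.GRUCell, where $\sigma$ is the logistic sigmoid and $\odot$ denotes elementwise multiplication:

$$
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh\bigl(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})\bigr) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
$$

Here $r_t$ is the reset gate, $z_t$ the update gate, and $n_t$ the candidate state.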

In addition to the LSTM layer, you will also design a GRU module, appropriately named GeeArrYou, constructed from torch.nn.GRUCell modules. While this does not need to handle a bidirectional mode, it still needs to support multiple GRU layers as well as an optional dropout value.
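A sketch along the same lines as the LSTM sketch above follows; the actual GeeArrYou signature is in submitted.py, and applying dropout between layers (but not after the last) is an assumption made here for illustration.

```python
class GeeArrYouSketch(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, dropout=0.0):
        super().__init__()
        self.cells = nn.ModuleList([
            nn.GRUCell(input_size if l == 0 else hidden_size, hidden_size)
            for l in range(num_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, length, input_size)
        batch, length, _ = x.shape
        for l, cell in enumerate(self.cells):
            h = x.new_zeros(batch, cell.hidden_size)
            outputs = []
            for t in range(length):
                h = cell(x[:, t], h)
                outputs.append(h)
            x = torch.stack(outputs, dim=1)
            if l < len(self.cells) - 1:        # no dropout after the final layer
                x = self.dropout(x)
        return x
```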

4.1. AutoVC's Content encoder

The first of the AutoVC modules you will be implementing is the content encoder (Fig. 3(a)), appropriately named Encoder.

Note that you do not need to handle the concatenation of speaker embedding and spectrogram at the beginning, nor the dimensionality reduction at the end, as these are performed for you in _main.py.

4.2. AutoVC's decoder

The second of the AutoVC modules you will be implementing is the decoder (much of Fig. 3(c)), appropriately named Decoder.

4.3. AutoVC's decoder post-network

The third of the AutoVC modules you will be implementing is the decoder post-network (part of Fig. 3(c)), appropriately named Postnet.

4.4. Speaker Embedder

The last module you will implement is a speaker embedding encoder, appropriately named SpeakerEmbedderGeeArrYou. This is not exactly the same encoder as the one used in the original AutoVC; it is simplified somewhat by the use of GRUs.

Extra Credit: Custom LSTMCells and GRUCells

As an extra credit option, you may implement your own LSTM or GRU cell classes (EllEssTeeEmmCell and GeeArrYouCell). These should have the same parameters as LSTMCell and GRUCell, respectively, and must exhibit the same behavior as those classes in forward propagation. Note that your implementations of EllEssTeeEmm and GeeArrYou must be unchanged from the main MP except for the substitution of the PyTorch cell classes with your own.

To check the behavior of your cell classes, it is enough to add from extra import * to the list of imports in submitted.py, substitute LSTMCell with EllEssTeeEmmCell and GRUCell with GeeArrYouCell wherever they occur, and reload the notebook/run grade.py as before.
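Concretely, the change in submitted.py looks something like the following sketch (the exact locations depend on your implementation):

```python
from extra import *   # added alongside the existing imports in submitted.py

# ...then, wherever a cell is constructed, e.g. inside EllEssTeeEmm.__init__:
#     nn.LSTMCell(in_size, hidden_size)   ->   EllEssTeeEmmCell(in_size, hidden_size)
# and inside GeeArrYou.__init__:
#     nn.GRUCell(in_size, hidden_size)    ->   GeeArrYouCell(in_size, hidden_size)
```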

Extra Credit: Transferring your voice

As an extra credit option, you may record an utterance (about five to seven seconds in length, saved as a 16 kHz WAV file) and attempt voice transfer both to it and from it using an utterance from the VCTK corpus.

In either case, half that amount will be awarded if the voice transfer is evident but the resulting utterance is unintelligible.

Alternatively, you and a partner may record different five- to seven-second utterances and attempt to transfer them between each other's voices. A full ten points will be awarded if both directions are intelligible; five points will be awarded if one direction is intelligible but the other has issues.

(These point valuations may be adjusted upward if intelligibility issues persist across attempts at this task.)

Caveats for this MP