Implementation of Capacitron - an expressive text-to-speech VAE model
A master's thesis project

Voder: A speech synthesizer from the 1930s by Bell Labs [Pinterest]
This post is a short and selected technical summary of my implementation of an expressive text-to-speech model from Google as part of my master's thesis at TU Berlin.
As my submitted thesis is available online, this post is a hybrid work showing some audio examples from the model as well as some interesting technical implementation details that might be of interest to the reader.
Brief summary
As part of my master's thesis, I successfully implemented the first open-source realization of the Capacitron VAE extension of Google's standard Tacotron 1 system in Coqui AI's TTS library. Thanks to the modular design of the VAE module of the prosody encoder, this encoder can now also be used with the Tacotron 2 architecture, which significantly improves stability and quality and thus constitutes an extended, improved and open-source version of the original method presented by Battenberg et al. [5]. By providing training recipes, two pre-trained Capacitron models and a pre-trained HiFiGAN vocoder model based on Catherine Byers' Blizzard2013 dataset within Coqui AI's "Model Zoo", the community can now immediately experiment with and train new expressive Tacotron-based systems.
Choice of topic and motivation
In a 2013 article, Aida-Zade et al. [1] defined two criteria for evaluating speech synthesis systems. The first is "speech intelligibility", which can be roughly described by the question "What is being uttered?". Earlier speech synthesis techniques such as concatenative and parametric methods are significantly outperformed by newer deep-learning-based models on the Mean Opinion Score, a scale for the subjective evaluation of quality and naturalness. These newer models are very close to closing the gap between the subjective perceptual quality of synthetic and real human speech, and thus to meeting the speech intelligibility criterion, as shown in Figure 1.

The second criterion is the "naturalness of sound", which can be roughly defined by the question "How is a particular text prompt pronounced?". Speech synthesis is an underdetermined problem, since a single text prompt can have many different realizations as an utterance. These realizations are embodied by intonation, stress, rhythm and speaking style - we refer to these aspects collectively as "prosody". Controlling these attributes in a way that allows us to synthesize custom, expressive synthetic speech is one of the holy grails of this technology and a heavily researched topic. Such systems would also alleviate the one-to-many mapping problem of speech synthesis, where previous methods could only synthesize a given text prompt in a single way.
When I listened to some of the samples on the Google TTS Research Team website a few years ago, I discovered an incredibly impressive set of audio samples that enabled the precise transfer of prosody between a reference speech signal and a synthesized utterance. Some of these samples sounded so stunning that I decided I wanted to be able to understand and manipulate this system.
Tacotron-based VAE

In Figure 2, we see the standard seq2seq Tacotron text-to-speech architecture, where the model takes text as input and converts it into character embeddings, which are then passed through the many different layers to produce output spectrogram frames - these frames can then be converted to audio using Griffin-Lim reconstruction or neural vocoders.

In Figure 3, we see an extension of this standard system, where a separate subnetwork takes in reference spectrogram slices to generate a single prosody embedding, which is then concatenated with the character embeddings. During training, the reference spectrogram is the actual spectrogram that the model attempts to reconstruct. We can thus encourage this reference encoder network to learn a representation of the prosody space of the input speaker if we parameterize it well enough. It is precisely one particular parameterization of this network architecture that this post is concerned with - a variational autoencoder extension of the system shown on the right, called Capacitron [5].

Figure 4 shows a standard autoencoder network that works as a kind of compression algorithm - the input data is passed through an encoder network to create a fixed, compressed vector z, which is then decoded by the decoder network to reconstruct the original input.

In Figure 5, we see an extended version of this architecture, where the encoder no longer outputs a compressed, fixed representation of the input, but instead generates the parameters of a distribution - here, for example, mu and sigma could be the parameters of a multivariate normal distribution, which forms the approximate posterior distribution q(z|x). We can then sample from this distribution to generate the actual latent embedding - this embedding is no longer fixed, so this architecture introduces a kind of controlled randomness into the network.
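This sampling step is commonly implemented with the reparameterization trick, which keeps the sampling differentiable with respect to mu and sigma. A minimal PyTorch sketch (the latent dimensionality of 128 is a hypothetical example, not the model's actual setting):

```python
import torch

def reparameterize(mu, log_sigma):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Because the randomness lives in eps, gradients can still flow
    back into mu and log_sigma during training.
    """
    sigma = torch.exp(log_sigma)
    eps = torch.randn_like(sigma)
    return mu + sigma * eps

# hypothetical batch of 4 samples with a 128-dimensional latent space
mu = torch.zeros(4, 128)
log_sigma = torch.zeros(4, 128)
z = reparameterize(mu, log_sigma)  # a different z on every call
```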
The VAE architecture allows three different ways to run inference with a trained model. First, we can feed a reference spectrogram to the network and ask the encoder to retrieve the prosody embedding from that reference, then either apply it to the same text as in the reference - which gives us Same-Text Prosody Transfer (STT) - or provide arbitrary new text and thereby transfer the general style from a reference to an entirely new text - we call this Inter-Text Style Transfer (ITT).
The second way to run inference is not to input a reference at all, but to let the model sample from the prior distribution of the latent space. This yields a true generative model where a realistic but random prosody is sampled each time we initiate synthesis, which alleviates the one-to-many mapping problem of the standard Tacotron system.
You can listen to some of these examples in these two videos:
Model loss function
Without going too much into the details of how VAEs work (you can read more about the theory in Chapter 3 of my thesis), I will now introduce the extended loss function of the Capacitron model and give an intuitive idea of how the capacity of this model controls the expressiveness of the synthesized speech.

The first term of Equation 1 is the expected reconstruction loss of the generative model, where we use the standard Tacotron decoder's L1 loss as a proxy for the negative log-likelihood of a generated spectrogram x given the latent embedding z and the text y_{T}. The second term of Equation 1 is the augmented KL term between the approximate posterior and the prior distribution, including the auto-tuned constant beta, which acts like a Lagrange multiplier, and the variational capacity limit C. We constrain beta to be non-negative by passing an unconstrained parameter through a softplus nonlinearity, making the capacity constraint a limit rather than a target [5].
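Based on the description above and on [5], Equation 1 can be sketched in LaTeX as follows (a reconstruction from the surrounding text, not a verbatim copy of the figure):

```latex
\mathcal{L}(x, y_T) =
  \mathbb{E}_{q(z \mid x, y_T)}\!\left[ -\log p(x \mid z, y_T) \right]
  + \beta \left( D_{\mathrm{KL}}\!\left( q(z \mid x, y_T) \,\|\, p(z) \right) - C \right),
\qquad
\beta = \mathrm{softplus}(\hat{\beta}) \ge 0
```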
The two goals of this loss function are to reduce the decoder loss of the standard Tacotron model and to drive the KL divergence between the approximate posterior and prior distributions (an upper bound on the mutual information between the data X and the latent prosody space Z) towards a desired variational capacity value C. The higher the value of C, the more the reference encoder is encouraged to encode information into the prosody space Z - so by controlling the upper bound of the mutual information between the data and the latent space, we can control the degree of expressiveness that the prosody encoder learns from the data. On the other hand, the structural capacity of the VAE architecture can easily be adjusted by changing the dimensionality of the latent embedding z. Setting this dimensionality forces the reference encoder to map the latent space to a fixed-size tensor, which can be manipulated to control how large the variation embedding of the network should be.
Various implementation details
This section presents some selected examples of model implementation that may be of interest to the reader.
Dual optimization
A large part of the implementation focused on the dual optimization routine described in [5]. The loss function defined in Equation 1 is optimized by two separate processes: the main Adam optimizer is used for all model parameters except the single scalar parameter beta, while a separate SGD optimizer is used for beta only. This dual optimization was implemented by splitting the loss function into two separate terms, each containing only the parameters to be minimized; PyTorch's .detach() method was used to exclude tensors from the automatic differentiation graph where needed. The implementation of this routine can be found in the TacotronLoss class.
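The splitting described above can be sketched in a few lines. This is a hypothetical simplification, not the actual TacotronLoss code; the class name, capacity value and sign conventions are illustrative:

```python
import torch

class CapacitronLossSketch(torch.nn.Module):
    """Sketch of the dual-optimizer loss split (hypothetical simplification)."""

    def __init__(self, capacity=10.0):
        super().__init__()
        self.capacity = capacity
        # unconstrained scalar; softplus keeps the effective beta non-negative
        self.beta_raw = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self, recon_loss, kl):
        beta = torch.nn.functional.softplus(self.beta_raw)
        # model term: beta is detached so Adam never updates it
        model_loss = recon_loss + beta.detach() * (kl - self.capacity)
        # beta term: KL is detached from the model parameters; the negated
        # sign turns SGD's descent into ascent, raising beta while KL > C
        beta_loss = -beta * (kl.detach() - self.capacity)
        return model_loss, beta_loss
```

In training, `model_loss` would be stepped with Adam over all model parameters and `beta_loss` with a separate SGD optimizer over `beta_raw` only.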
Convolution masking
One of the crucial aspects of the implementation is a convolutional-network detail of the reference encoder that is not explicitly mentioned in [5]. Basic convolutional networks typically take an input and compress it to a desired output size by using different values for the kernel_size and stride parameters, which define the receptive field and the step size of the convolutional filters. These basic networks work with fixed input sizes - usually images of static size - on which they perform the convolutional steps to reduce and compress the data.
A unique feature of working with speech data is that the audio fed to the networks has varying lengths. Within a training batch, these individual samples of different lengths are sorted in descending order so that samples of similar length follow each other. The longest sample in the batch defines the final size of the input tensor within a particular training step, and the shorter samples are zero-padded to match the size of the longest sample so that the input tensor has a uniform shape.
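This batching behavior can be illustrated with PyTorch's pad_sequence; the frame counts and the 80 mel bins here are hypothetical example shapes:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# three hypothetical mel-spectrogram excerpts of different lengths
specs = [torch.randn(t, 80) for t in (120, 95, 60)]  # (frames, mel bins)
lengths = torch.tensor([s.shape[0] for s in specs])

# sort longest-first, then zero-pad shorter samples to the longest
order = torch.argsort(lengths, descending=True)
batch = pad_sequence([specs[i] for i in order], batch_first=True)
# batch now has shape (3, 120, 80); the shortest sample's trailing
# frames (60 onwards) are filled with zeros
```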

Within the ReferenceEncoder module, this input tensor with the zero-padded instances is passed through a stack of 6 convolutional layers with 3x3 filters, 2x2 stride and batch normalization. The 6 layers have 32, 32, 64, 64, 128 and 128 filters respectively. To demonstrate why this process requires special attention, Figure 7 shows a simple 1D example of an input tensor with input_length=5. PyTorch's convolution modules allow additional zero padding to be specified on different axes of the input tensor; in this case a zero padding of 2 is shown on the width dimension. This ensures that the receptive field of the convolution filter correctly covers the information at the edges of the input tensor. With filter_width=3 and step_width=2, the filter convolves 3 input values into a single output value, then shifts 2 values along the width axis to cover the next triple of values, and so on. This single convolution operation compresses the input tensor from input_length=5 to output_length=4.
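The compression in this example follows PyTorch's standard convolution output-size formula, which can be checked in a few lines (the function name and defaults are illustrative, chosen to match the example above):

```python
def conv_out_length(length, kernel_size=3, stride=2, padding=2):
    """Output length of a 1D convolution: floor((L + 2p - k) / s) + 1."""
    return (length + 2 * padding - kernel_size) // stride + 1

# the Figure 7 example: input_length=5 compresses to output_length=4
assert conv_out_length(5) == 4
```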

The previous case obviously didn't need to be treated any differently from normal convolution operations - PyTorch has all the built-in helper functions and methods to account for this data compression. However, in our particular case of working with variable-length audio data, Figure 8 shows where this process needs closer attention. Here, the same sample is in a different batch where it is no longer the longest sample, so zero padding has already been applied at the end of the signal by the data loader - in this particular case, the input arrives already padded with two zero values. With the same convolution operation as described above, the output of this convolution will no longer have length 4, but 5. The last filter position convolves over zero values, but due to the bias term, these outputs will not be zero after the convolution step. Even if they are small, in our case of a stack of 6 such convolutions, the accumulated invalid information makes this convolutional net essentially incapable of properly processing the input data and producing meaningful output at test time.

The solution to this problem is to calculate the valid length of each instance in the batch after every convolution pass and to mask out any invalid values before feeding the output into the next convolution. With this convolutional masking, shown in Figure 9, the variable-length input audio is properly downsampled into a valid, compressed representation. This implementation detail proved to be an essential aspect of the Capacitron method - models trained without this convolutional masking were unable to learn the latent space of the input data at all and produced unintelligible audio at test time.
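A minimal sketch of this masking step (a hypothetical simplification of what the actual implementation does; function names and shapes are illustrative):

```python
import torch

def conv_valid_length(length, kernel_size=3, stride=2, padding=2):
    """Valid output frames for a sample with `length` valid input frames."""
    return (length + 2 * padding - kernel_size) // stride + 1

def mask_invalid(x, valid_lengths):
    """Zero out conv outputs beyond each sample's valid length.

    x: (batch, channels, time) output of one conv layer
    valid_lengths: (batch,) valid frames per sample after that layer
    """
    time = x.shape[-1]
    mask = torch.arange(time)[None, :] < valid_lengths[:, None]
    return x * mask[:, None, :].to(x.dtype)
```

Between layers, the valid lengths would be recomputed with `conv_valid_length` and the mask reapplied, so the non-zero garbage produced over padded positions never leaks into the next convolution.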
See the CapacitronLayers class to learn how this masking is implemented.
References:
[1] Aida-Zade et al.: The Main Principles of Text-to-Speech Synthesis System. 2013
[2] WaveNet: A generative model for raw audio. DeepMind blog post. deepmind.com/blog/article/wavenet-generative-model-raw-audio
[3] Wang et al.: Tacotron: Towards End-to-End Speech Synthesis. 2017
[4] Skerry-Ryan et al.: Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron. 2018
[5] Battenberg et al.: Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis. 2019