Using generative modelling to produce varied intonation for speech synthesis

The following speech samples relate to a paper submitted for review at the Speech Synthesis Workshop 2019.

The systems are as follows (as described in the paper):

System name	System description
RNN	Standard RNN-based SPSS model, using MSE.
MDN	MDN with 4 mixture components, using NLL.
VAE–MEAN	VAE decoder using z_MEAN, i.e. the zero vector.
VAE–TAIL	VAE decoder using z_TAIL with r = 3, i.e. points sampled uniformly on the surface of a hyper-sphere with radius 3.
COPY–SYNTH	Natural F0.
BASELINE	A quadratic polynomial fitted to natural F0.
RNN–SCALED	F0 from RNN, scaled vertically by a factor of 3.
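
As a rough illustration of two of the descriptions above, the sketch below draws points uniformly on the surface of a hyper-sphere of radius 3 (as for z_TAIL) and fits a quadratic polynomial to an F0 contour (as for BASELINE). It is a minimal sketch under stated assumptions, not the code used in the paper: the function names, the latent dimensionality, and the convention that unvoiced frames are stored as 0 are all hypothetical.

```python
import numpy as np


def sample_on_hypersphere(n_samples, latent_dim, radius=3.0, seed=None):
    """Draw points uniformly on the surface of a hyper-sphere.

    Normalising isotropic Gaussian samples to unit length gives a uniform
    distribution over directions; scaling by `radius` sets the distance
    from the origin. `latent_dim` is an assumption, not a value from the paper.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, latent_dim))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return radius * z


def quadratic_baseline(f0):
    """Fit a quadratic polynomial to the voiced frames of an F0 contour.

    Assumes (hypothetically) that unvoiced frames are marked with 0.
    Returns the fitted curve evaluated at every frame index.
    """
    f0 = np.asarray(f0, dtype=float)
    frames = np.arange(len(f0))
    voiced = f0 > 0
    coeffs = np.polyfit(frames[voiced], f0[voiced], deg=2)
    return np.polyval(coeffs, frames)


# Example usage with made-up values.
z_tail = sample_on_hypersphere(n_samples=4, latent_dim=16, radius=3.0, seed=0)
z_mean = np.zeros(16)  # z_MEAN is the zero vector
```

Each row of `z_tail` has Euclidean norm 3, so every sample lies the same distance from z_MEAN and only the direction varies.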

Listening test stimuli

The following are four sentences from the listening test (the same stimuli were used for the naturalness and variedness tests).
If you click an audio item, the image will change to show the F0 prediction (in red) and the Savitzky-Golay-smoothed F0 prediction (in blue). The black contour is the ground-truth F0.
The BASELINE system was not included in the variedness test, but, as a quadratic fit to natural F0, it would clearly be perceived as the flattest.
The final row of audio clips for each sentence gives multiple samples for VAE–TAIL (the first sample was used in the listening test).
In the paper we visualised the range of F0 contours produced by VAE–TAIL (Figure 6 in the paper). This density plot is shown below each set of audio clips.
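
For readers unfamiliar with the Savitzky-Golay smoothing mentioned above, the sketch below shows the idea: a low-order polynomial is fitted by least squares within a sliding window, and the fitted value at the window centre replaces the original sample. This is a minimal sketch, not the implementation used for the plots; the function name, window length, polynomial order, and edge padding are all assumptions. SciPy's `scipy.signal.savgol_filter` provides the same filter as a library routine.

```python
import numpy as np


def savgol_smooth(x, window_length=11, polyorder=3):
    """Savitzky-Golay smoothing of a 1-D contour (illustrative only).

    `window_length` and `polyorder` are assumed values, not those used
    for the plots on this page. Edges are handled by repeating the first
    and last samples, which is one of several possible choices.
    """
    x = np.asarray(x, dtype=float)
    if window_length % 2 == 0 or window_length <= polyorder:
        raise ValueError("window_length must be odd and greater than polyorder")

    half = window_length // 2
    offsets = np.arange(-half, half + 1, dtype=float)

    # Columns are offsets**0, offsets**1, ..., offsets**polyorder.
    design = np.vander(offsets, polyorder + 1, increasing=True)

    # The least-squares polynomial evaluated at offset 0 equals its constant
    # term, which is a fixed weighted sum of the samples in the window.
    weights = np.linalg.pinv(design)[0]

    padded = np.pad(x, half, mode="edge")
    # np.convolve flips its kernel, so reverse the weights to get correlation.
    return np.convolve(padded, weights[::-1], mode="valid")


# Example usage on a made-up noisy contour.
rng = np.random.default_rng(0)
noisy_f0 = 120.0 + 20.0 * np.sin(np.linspace(0, 3, 200)) + rng.normal(0, 3, 200)
smoothed_f0 = savgol_smooth(noisy_f0)
```

Because the filter fits a polynomial locally, it preserves the broad shape of peaks and valleys better than a moving average of the same width.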

Morgana viz