Using generative modelling to produce varied intonation for speech synthesis

The following speech samples accompany a paper submitted for review at the Speech Synthesis Workshop 2019.

The systems, as described in the paper, are as follows:

System name    System description
RNN            Standard RNN-based SPSS model, using MSE.
MDN            MDN with 4 mixture components, using NLL.
VAE–MEAN       VAE decoder using zMEAN, i.e. the zero vector.
VAE–TAIL       VAE decoder using zTAIL with r = 3, i.e. points sampled uniformly on the surface of a hyper-sphere with radius 3.
COPY–SYNTH     Natural F0.
BASELINE       A quadratic polynomial fitted to natural F0.
RNN–SCALED     F0 from RNN, scaled vertically by a factor of 3.
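
To make the difference between the zMEAN and zTAIL latents concrete, the following is a minimal sketch of how such vectors could be drawn. It is not taken from the paper's code: the function names, the latent dimensionality, and the seed argument are assumptions made for illustration. It uses the standard trick of normalising Gaussian samples to obtain directions that are uniform on the unit sphere, then scales them to radius r.

```python
import numpy as np


def z_mean(latent_dim):
    """Return the zero vector, i.e. the latent used by VAE-MEAN."""
    return np.zeros(latent_dim)


def sample_z_tail(n_samples, latent_dim, radius=3.0, seed=None):
    """Draw points uniformly on the surface of a hyper-sphere.

    Normalising isotropic Gaussian samples gives directions that are
    uniformly distributed on the unit sphere; scaling by `radius`
    moves them onto the sphere of the requested radius.
    """
    rng = np.random.default_rng(seed)
    directions = rng.standard_normal((n_samples, latent_dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return radius * directions


# Hypothetical usage: four latents, matching the four VAE-TAIL samples
# per sentence. The latent size of 16 is an assumed value, not one
# stated on this page.
z_tail = sample_z_tail(n_samples=4, latent_dim=16, radius=3.0, seed=0)
```

Each row of `z_tail` has Euclidean norm 3, so every VAE–TAIL sample sits at the same distance from the zMEAN latent but in a different direction.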

Listening test stimuli

Morgana viz

GoldilocksAndTheThreeBears_001_001

COPY–SYNTH BASELINE RNN–SCALED
RNN MDN VAE–MEAN
VAE–TAIL(1) VAE–TAIL(2) VAE–TAIL(3) VAE–TAIL(4)

GoldilocksAndTheThreeBears_006_002

COPY–SYNTH BASELINE RNN–SCALED
RNN MDN VAE–MEAN
VAE–TAIL(1) VAE–TAIL(2) VAE–TAIL(3) VAE–TAIL(4)

GoldilocksAndTheThreeBears_007_004

COPY–SYNTH BASELINE RNN–SCALED
RNN MDN VAE–MEAN
VAE–TAIL(1) VAE–TAIL(2) VAE–TAIL(3) VAE–TAIL(4)

TheBoyWhoCriedWolf_014_002

COPY–SYNTH BASELINE RNN–SCALED
RNN MDN VAE–MEAN
VAE–TAIL(1) VAE–TAIL(2) VAE–TAIL(3) VAE–TAIL(4)