High-fidelity speech synthesis with WaveNet
In October we announced that our state-of-the-art speech synthesis model WaveNet was being used to generate realistic-sounding voices for the Google Assistant globally in US English and Japanese. This production model, known as parallel WaveNet, is more than 1,000 times faster than the original and also capable of creating higher-quality audio.

Our latest paper introduces details of the new model and the probability density distillation technique we developed to allow the system to work in a massively parallel computing environment.

The original WaveNet model used autoregressive connections to synthesise the waveform one sample at a time, with each new sample conditioned on the previous samples. While this produces high-quality audio with up to 24,000 samples per second, this sequential generation is too slow for production environments.
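To see why sequential generation is a bottleneck, here is a minimal sketch of an autoregressive sampling loop. The `toy_model` function is a hypothetical stand-in for WaveNet's next-sample predictor, and the receptive field size is illustrative; neither reflects the real network.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(context):
    # Hypothetical stand-in for WaveNet's next-sample predictor:
    # predicts the next value from the recent context.
    mean = 0.9 * context[-1] if len(context) else 0.0
    return mean + 0.01 * rng.standard_normal()

def generate(model, num_samples, receptive_field=3000):
    # Each sample depends on the ones before it, so the loop
    # cannot be parallelised across time steps.
    waveform = np.zeros(num_samples, dtype=np.float32)
    for t in range(num_samples):
        context = waveform[max(0, t - receptive_field):t]
        waveform[t] = model(context)
    return waveform

# One second of 24 kHz audio requires 24,000 sequential model
# evaluations, which is the source of the production-speed problem.
audio = generate(toy_model, num_samples=24_000)
```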
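The distillation idea mentioned above can also be sketched in miniature: a pretrained "teacher" density, which can score samples but is slow to sample from, trains a feed-forward "student" that samples in parallel, by minimising the KL divergence between the two distributions using samples drawn from the student. The Gaussian teacher and the affine student below (parameters `mu` and `log_sigma`) are toy assumptions for illustration, not the paper's networks.

```python
import torch

# Toy "teacher": a fixed density we can only score, playing the role
# of the pretrained autoregressive WaveNet.
teacher = torch.distributions.Normal(loc=2.0, scale=0.5)

# Toy "student": an affine transform of standard noise, playing the
# role of the parallel feed-forward network.
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    noise = torch.randn(256)
    sigma = log_sigma.exp()
    samples = mu + sigma * noise  # parallel sampling: no loop over time
    # log q(x) under the student and log p(x) under the teacher
    student_log_prob = torch.distributions.Normal(mu, sigma).log_prob(samples)
    teacher_log_prob = teacher.log_prob(samples)
    # Monte Carlo estimate of KL(student || teacher) = E_q[log q - log p]
    loss = (student_log_prob - teacher_log_prob).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The student converges toward the teacher: mu -> 2.0, sigma -> 0.5
print(mu.item(), log_sigma.exp().item())
```

The key design point this captures is that the student is only ever asked to produce samples and report their log-density, both of which it can do in parallel, while the teacher is only ever asked to score those samples, which an autoregressive model can also do in parallel during training.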