Over two months, our Generative Models study group worked through key papers on generative models. To share the insights and knowledge we gained, we offer our honest, subjective opinions on those papers, and more generally on the strengths and limitations of these strikingly popular models. Thanks to the study group participants for preparing this overview.
The first part, an overview of Plug and Play Generative Networks, is here: http://www.aihelsinki.com/gans-overview-plug-and-play/
We looked at a paper published by Google DeepMind on September 8th, 2016, describing a neural network architecture that they claimed produced excellent results at generating audio: https://arxiv.org/pdf/1609.03499. The paper is accompanied by a blog post describing the approach and providing samples of the results they obtained: https://deepmind.com/blog/wavenet-generative-model-raw-audio/
WaveNet is a deep generative model of audio data that operates directly at the waveform level. It was inspired by PixelCNN. WaveNets are autoregressive and combine causal filters with dilated convolutions, which lets their receptive fields grow exponentially with depth; this is important for modeling the long-range temporal dependencies in audio signals. The authors also show how WaveNets can be conditioned on other inputs, either globally (e.g. speaker identity) or locally (e.g. linguistic features).
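To see why stacking dilated causal convolutions gives an exponentially growing receptive field, here is a minimal NumPy sketch (our own illustration, not DeepMind's code). Each layer uses a kernel of size 2 with dilations 1, 2, 4, 8; feeding a unit impulse through the stack and tracking which outputs become non-zero traces out the receptive field:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution with kernel size 2 and the given dilation:
    output[i] = w[0] * x[i - dilation] + w[1] * x[i], left-padded with
    zeros so each output depends only on present and past samples."""
    pad = np.concatenate([np.zeros(dilation), x])
    return w[0] * pad[:-dilation] + w[1] * pad[dilation:]

# Stack layers with dilations 1, 2, 4, 8: with kernel size 2 the
# receptive field doubles at every layer, reaching 2**n_layers samples.
x = np.zeros(32)
x[0] = 1.0  # unit impulse: the spread of non-zero outputs shows the reach
h = x
for d in [1, 2, 4, 8]:
    h = causal_dilated_conv(h, np.array([1.0, 1.0]), d)

receptive_field = np.nonzero(h)[0].max() + 1
print(receptive_field)  # 16 samples = 2**4 after 4 layers
```

With only 4 layers the network already sees 16 time steps; the paper stacks many such blocks so the receptive field covers hundreds of milliseconds of raw audio.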
When applied to text-to-speech (TTS), WaveNet produced samples that outperformed the best existing TTS systems in subjective naturalness.
WaveNets also showed very promising results when applied to modeling music audio for solo instruments. One notable strength of the approach is that the network is trained on raw audio, so it picks up subtleties, such as non-spoken sounds (e.g. breathing) in speech or instrument noises, that you would not get from vocoders or synthesizers. One noticeable characteristic of the network is its performance: while training is not that different in cost from other approaches, generation is slow. That is because each sample depends causally on all previous ones, which prevents efficient parallel execution on GPUs. The Google team has, unfortunately, not released any code, which left a lot of people on the forums wondering about some of the tricks they mentioned but did not fully explain in the paper. We ran a basic Keras implementation: https://github.com/basveeling/wavenet
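The generation bottleneck is easy to see in a toy sketch of the sampling loop (our own illustration; `next_sample_distribution` is a random stand-in for the trained network, which in the real model is a full forward pass through the dilated stack):

```python
import numpy as np

rng = np.random.default_rng(0)

def next_sample_distribution(history):
    """Placeholder for the network: returns a softmax over 256 quantized
    amplitude bins (WaveNet uses mu-law companding to get 256 levels).
    A real WaveNet would condition this on `history`."""
    logits = rng.normal(size=256)
    return np.exp(logits) / np.exp(logits).sum()

def generate(n_samples):
    """Autoregressive generation: each new sample requires a fresh forward
    pass conditioned on everything generated so far, so the loop is
    inherently sequential and cannot be parallelized across time."""
    audio = []
    for _ in range(n_samples):
        probs = next_sample_distribution(audio)
        audio.append(int(rng.choice(256, p=probs)))
    return audio

samples = generate(100)
print(len(samples))  # 100 sequential network evaluations for 100 samples
```

At 16,000 samples per second of audio, one second of output requires 16,000 sequential forward passes, which is why generation, unlike training, cannot exploit GPU parallelism across time steps.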
This approach has attracted a lot of attention in the field, especially because of the good results it yields; however, the long generation times are a clear downside.
But things move fast, and the Baidu team has already come up with its own answer, Deep Speech 2, which also sounds quite promising: https://arxiv.org/pdf/1512.02595.pdf