Keras2-based WaveNet published on GitHub

At Munich Artificial Intelligence Laboratories (M-AILABS), we experiment with a lot of technologies and methods. We also look for novel ways of doing things in the area of ML & AI.

One of my personal dreams is to have an AI-based solution that can “compose” music. And of course, it has to be music like the great Ludwig van. So, we started looking for some papers.

TL;DR: Go ahead to GitHub; you’ll find the new Keras2-based WaveNet 🙂

Last year, Google published a paper named “WaveNet: A Generative Model for Raw Audio” (PDF), along with a blog entry about it. This year, Google used an improved version of WaveNet to implement the speech output for the Google Assistant on Android.

The great thing about WaveNet is that it can output not only speech but, speech being just a different kind of “music”, any kind of audio, depending on its training material.

You should definitely listen to the audio samples they have on their blog-page.

After finding this paper, we also found lots of WaveNet-implementations based on TensorFlow and other technologies.

Keras being our most beloved library, we were (desperately) looking for someone who had done something in Keras in that direction. The problem with Keras is that it is (unfortunately) a moving target. Francois Chollet is releasing new versions so fast that we can’t really keep up 🙂 (Thanks Francois, fantastic job btw).

Finally, a few days ago, I found a Keras implementation. But to my horror it was based on Keras 1.2.1 and Theano.

Since the developers of Theano announced a while ago that they would stop any further development of Theano, we didn’t want to invest any more time into it. There are also technical issues, such as no support for multi-GPU/distributed training and so on.

We love Keras and like TensorFlow (and CNTK and PyTorch); we are also actively looking into Neon as well as MXNet, but Keras (w/ TensorFlow) is currently our core Deep Learning Framework, even though we try to limit our reliance on TensorFlow as much as we can…

Anyway, I found this repository called Keras WaveNet Implementation, which needed some work.

Long story short: you can find the new repository, named WaveNet implementation in Keras2, on GitHub.

Here is the gist of changes:

  • It does not use Theano anymore. In fact, in the single-GPU version it doesn’t even care whether it is Theano or TensorFlow or CNTK
  • It is completely ported to Keras 2
  • And a few bugs were fixed
  • Some renaming happened

While I was at it, and since I had some time, I wanted to get the thing training really, really, really fast. You know: “Hey, I have two servers with two GPUs each here, why not distributed training?”

Horovod Support

Thus was born Horovod/Keras support for the Keras2-based WaveNet.

Horovod is a great framework for distributing neural network training across GPUs, or even across multiple servers and their GPUs. With this implementation, we have only touched the tip of the iceberg of what we can do with Horovod.

Using Horovod for parallel training is much, much, much easier than using TensorFlow towers. You have to understand a few concepts (rank, local rank, and size), but once you get them, it is very easy.
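To give a feel for the moving parts, here is a tiny illustrative sketch (not code from the repository): Horovod launches one copy of your script per GPU via MPI, and each copy asks Horovod who it is and how many peers there are.

 import horovod.keras as hvd

 hvd.init()                                  # must be called before anything else Horovod-related
 print('total workers:', hvd.size())         # number of processes across all servers
 print('global rank  :', hvd.rank())         # unique id of this process (0 = the primary worker)
 print('local rank   :', hvd.local_rank())   # id on this server; used to pick the local GPU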

For example, here is the change in the code that we had to make:

At the top of “wavenet.py”, I had to add this:

 import tensorflow as tf
 import horovod.keras as hvd
 from keras import backend as K  # keras backend (may already be imported elsewhere in wavenet.py)

 hvd.init()
 # Pin the GPU to be used to this process's local rank (one GPU per process)
 config = tf.ConfigProto()
 config.gpu_options.allow_growth = True
 config.gpu_options.visible_device_list = str(hvd.local_rank())
 print('GPU-Options', config.gpu_options.visible_device_list)
 K.set_session(tf.Session(config=config))

Towards the end of the file, I had to replace:

optim = make_optimizer()

with

optim = make_optimizer()
optim = hvd.DistributedOptimizer(optim)

and change the callbacks to be:

callbacks = [
    # Broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # Average metrics among workers at the end of every epoch.
    #
    # Note: This callback must be in the list before the ReduceLROnPlateau,
    # TensorBoard or other metrics-based callbacks.
    hvd.callbacks.MetricAverageCallback(),
    # Using `lr = 1.0 * hvd.size()` from the very beginning leads to worse final
    # accuracy. Scale the learning rate `lr = 1.0` ---> `lr = 1.0 * hvd.size()` during
    # the first five epochs. See https://arxiv.org/abs/1706.02677 for details.
    hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1),
    ReduceLROnPlateau(patience=early_stopping_patience / 2, cooldown=early_stopping_patience / 4, verbose=1),
    EarlyStopping(patience=early_stopping_patience, verbose=1),
]

Then, do this:

if not debug and hvd.rank() == 0:
    callbacks.extend([
        ModelCheckpoint(os.path.join(checkpoint_dir, 'checkpoint.{epoch:05d}-{val_loss:.3f}.hdf5'), save_best_only=True),
        CSVLogger(os.path.join(run_dir, 'history.csv')),
    ])

The above part is important because we only want the primary process (rank 0) to write the checkpoint files and logs; otherwise, every worker would write its own.

And lastly, for training, change the fit_generator call from:

 model.fit_generator(data_generators['train'],
                        nb_examples['train'],
                        epochs=nb_epoch,
                        validation_data=data_generators['test'],
                        validation_steps=nb_examples['test'],
                        callbacks=callbacks,
                        verbose=keras_verbose)

to this:

 model.fit_generator(data_generators['train'],
                        nb_examples['train'] // hvd.size(),
                        epochs=nb_epoch,
                        validation_data=data_generators['test'],
                        validation_steps=nb_examples['test'] // hvd.size(),
                        callbacks=callbacks,
                        verbose=keras_verbose)

In this implementation, it is important that your data generator returns batches randomly (I was lazy). It would be even better if the data generator returned data intelligently depending on hvd.rank(), so that each worker trains on its own shard of the data (next version).
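Just to illustrate that idea, here is a hypothetical sketch (not code from the repository) of a rank-aware generator; `examples` and `make_batch` stand in for whatever the real data pipeline uses:

 import numpy as np
 import horovod.keras as hvd

 def rank_sharded_generator(examples, batch_size, make_batch):
     # Each worker keeps every hvd.size()-th example, offset by its rank,
     # so no two workers train on the same items.
     shard = examples[hvd.rank()::hvd.size()]
     while True:
         np.random.shuffle(shard)  # still shuffle within the shard
         for start in range(0, len(shard) - batch_size + 1, batch_size):
             yield make_batch(shard[start:start + batch_size])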

After installing Open MPI and Horovod, the only thing left to do was to figure out the right command line to run the training:

/usr/local/bin/mpirun -np 2 -H localhost:2 -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -mca btl_tcp_if_exclude eno1 \
python wavenet_mgpu.py

Great, isn’t it?
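For the two-server, quad-GPU run mentioned below (which we have not tested yet), the same command should only need a longer host list; the hostnames here are placeholders, and both servers need the code and data available:

/usr/local/bin/mpirun -np 4 -H server1:2,server2:2 -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -mca btl_tcp_if_exclude eno1 \
python wavenet_mgpu.py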

We achieve around 85% scaling efficiency:

  • 1 epoch, single GPU, no Horovod: ~6:00:00 hrs
  • 1 epoch, dual GPU, single server, using Horovod: ~3:30:00 hrs
  • 1 epoch, quad GPU, two servers, using Horovod: TBA (not tested yet)

(That is, 6 hrs / (2 × 3.5 hrs) ≈ 86% of an ideal linear speedup.) According to the Horovod website, the maximum reported so far was about 90%. Let’s see if we can increase our efficiency over time.

So, please, go ahead, clone the repository (there is some test-train data you can immediately use) and have fun. Let us know what you think.

Next Steps

We will test in a multi-server environment, check for optimization possibilities and, finally, one day, generate music based on Chopin’s and Ludwig van’s works. While we are training, we will test against checkpoints and publish any new audio that comes out of it. Stay tuned…

Oh, BTW

Oh, btw, the original reason for doing this was not really generating music but rather writing a text-to-speech engine that generates absolutely natural-sounding speech. We are working on that, too, and have some great initial results with an English voice. Currently, we are generating German training data and hope to show something in the next 1-2 months… so, again, stay tuned…