
How To Generate Music With AI Using TensorFlow

In this post, we’ll build a simple AI model that generates MIDI music for us. To clarify, MIDI files contain only the notes, not the audio itself; a device or software synthesizer plays them back. Therefore, the same file may sound different from device to device.

In our example, we’ll be working with a collection of piano MIDI files from the MAESTRO dataset. However, we’re only going to use a handful of them in order to speed up the whole process.

Getting started

As you may notice from the title of this post, we’re going to work with TensorFlow. To explain, TensorFlow is Google’s machine learning library, which gives us the tools to create state-of-the-art algorithms.

However, we’ll keep things simple in this tutorial and use a fairly basic model, built mainly around LSTM units. In case you’re not familiar with LSTMs, I suggest you check out a post I made about recurrent neural networks.

Before we begin, we need to set up and import all the libraries and tools that we’ll need for this project.

import collections
import glob
import numpy as np
import pathlib
import pandas as pd
import pretty_midi
import tensorflow as tf
from tqdm import tqdm
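
One optional extra, not strictly required for this tutorial: fixing the random seeds makes runs a bit easier to reproduce.

seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)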

Preparing the dataset

Okay, now that we have all the tools at the ready, we can begin with dataset preparation. The purpose of this process is to convert the MIDI data into a model-friendly format. Basically, we’ll turn it into a table of numerical values.

But before we can do that, we first need to download the dataset.

data_dir = pathlib.Path('data/maestro-v2.0.0')
if not data_dir.exists():
    tf.keras.utils.get_file(
        'maestro-v2.0.0-midi.zip',
        origin='https://storage.googleapis.com/magentadata/datasets/maestro/v2.0.0/maestro-v2.0.0-midi.zip',
        extract=True,
        cache_dir='.', cache_subdir='data',
    )

filenames = glob.glob(str(data_dir/'**/*.mid*'))
print('Number of files:', len(filenames))

In the following part of this process, we’re going to define a function that converts a MIDI file. It will return a pandas DataFrame holding all the note values of that file.

def midi_to_notes(midi_file: str) -> pd.DataFrame:
    pm = pretty_midi.PrettyMIDI(midi_file)
    instrument = pm.instruments[0]
    notes = collections.defaultdict(list)

    # Sort the notes by start time
    sorted_notes = sorted(instrument.notes, key=lambda note: note.start)
    prev_start = sorted_notes[0].start

    for note in sorted_notes:
        start = note.start
        end = note.end
        notes['pitch'].append(note.pitch)
        notes['start'].append(start)
        notes['end'].append(end)
        notes['step'].append(start - prev_start)
        notes['duration'].append(end - start)
        prev_start = start

    return pd.DataFrame({name: np.array(value) for name, value in notes.items()})
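
As a quick sanity check, we can run this function on one of the downloaded files and peek at the first few rows. The variable names here are just illustrative, and the exact values will depend on which files you downloaded.

sample_file = filenames[0]
sample_notes = midi_to_notes(sample_file)
print(sample_notes.head())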

What we’ll do next is join together the note data from all MIDI files into one pandas DataFrame. Furthermore, we’re only going to keep the columns we care about: pitch, step, and duration.

all_notes = []
for f in tqdm(filenames[:10]):
  notes = midi_to_notes(f)
  all_notes.append(notes)

all_notes = pd.concat(all_notes)

n_notes = len(all_notes)
print('Number of notes parsed:', n_notes)

key_order = ['pitch', 'step', 'duration']
train_notes = np.stack([all_notes[key] for key in key_order], axis=1)

notes_ds = tf.data.Dataset.from_tensor_slices(train_notes)
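
If you want to verify that the conversion worked, you can take a few elements from the dataset and print them. Each element should be a (pitch, step, duration) triple.

for note in notes_ds.take(3):
    print(note.numpy())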

Since we’ll be working with an LSTM, we need to split the dataset into sequences of equal length. For that, we’re going to define a function that windows the data into sequences and also normalizes the pitch values.

def create_sequences(
    dataset: tf.data.Dataset, 
    seq_length: int,
    vocab_size = 128,
    ) -> tf.data.Dataset:

    seq_length = seq_length+1

    # Take 1 extra for the labels
    windows = dataset.window(seq_length, shift=1, stride=1,
                                drop_remainder=True)

    # `flat_map` flattens the "dataset of datasets" into a dataset of tensors
    flatten = lambda x: x.batch(seq_length, drop_remainder=True)
    sequences = windows.flat_map(flatten)

    # Normalize note pitch
    def scale_pitch(x):
        x = x/[vocab_size,1.0,1.0]
        return x

    # Split the labels
    def split_labels(sequences):
        inputs = sequences[:-1]
        labels_dense = sequences[-1]
        labels = {key:labels_dense[i] for i,key in enumerate(key_order)}

        return scale_pitch(inputs), labels

    return sequences.map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)

Okay, now we’re finally ready to put everything together and prepare the final training dataset. In the following step, we’ll also batch, cache, and prefetch it.

seq_length = 25
vocab_size = 128
seq_ds = create_sequences(notes_ds, seq_length, vocab_size)

batch_size = 128
train_ds = (seq_ds
            .batch(batch_size, drop_remainder=True)
            .cache()
            .prefetch(tf.data.AUTOTUNE))
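
As another quick sanity check, we can pull a single batch from the training dataset and print the shapes. The inputs should come out as (batch, sequence length, 3), and the labels as a dictionary keyed by pitch, step, and duration.

for seq_batch, label_batch in train_ds.take(1):
    print('inputs:', seq_batch.shape)                   # (128, 25, 3)
    print('pitch labels:', label_batch['pitch'].shape)  # (128,)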

Putting together the AI model

In order for our AI model to generate something that at least resembles music, we need to train it. But before we do that, we need to set it up first.

This process entails defining the loss functions and the model, compiling the model, and training it. Therefore, let’s start with the loss function.

def mse_with_positive_pressure(y_true: tf.Tensor, y_pred: tf.Tensor):
    mse = (y_true - y_pred) ** 2
    positive_pressure = 10 * tf.maximum(-y_pred, 0.0)
    return tf.reduce_mean(mse + positive_pressure)
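
To see what the extra term does, here is a tiny hand-made example with made-up values: a negative prediction gets an additional penalty on top of its squared error, while positive predictions are scored by the squared error alone.

y_true = tf.constant([0.5, 0.5])
y_pred = tf.constant([0.4, -0.1])

# Squared errors are 0.01 and 0.36; the negative prediction adds a 10 * 0.1 penalty,
# so the mean comes out to roughly 0.685
print(mse_with_positive_pressure(y_true, y_pred).numpy())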

Our model will output 3 different values (pitch, step, and duration), so we’ll need a loss function for each, since the model learns to predict each value individually. The custom loss above is a mean squared error with an extra penalty that pushes the model toward non-negative outputs, which is what we want for step and duration.

In the following part of this guide, we’ll define the model and everything else we need to compile it.

input_shape = (seq_length, 3)
learning_rate = 0.005

inputs = tf.keras.Input(input_shape)
x = tf.keras.layers.LSTM(256)(inputs)

outputs = {
  'pitch': tf.keras.layers.Dense(128, name='pitch')(x),
  'step': tf.keras.layers.Dense(1, name='step')(x),
  'duration': tf.keras.layers.Dense(1, name='duration')(x),
}

model = tf.keras.Model(inputs, outputs)

loss = {
      'pitch': tf.keras.losses.SparseCategoricalCrossentropy(
          from_logits=True),
      'step': mse_with_positive_pressure,
      'duration': mse_with_positive_pressure,
}

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

Because we have multiple loss values on different scales, the total loss can become imbalanced, with one term dominating the others. Therefore, we’ll use the loss_weights argument of model.compile to balance them.

model.compile(
    loss=loss,
    loss_weights={
        'pitch': 0.05,
        'step': 1.0,
        'duration': 1.0,
    },
    optimizer=optimizer,
)
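
Before we move on, it doesn’t hurt to print the model summary and confirm that the LSTM layer and the three output heads are wired up the way we intended.

model.summary()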

Okay, before we get to training the model, let’s act professional and include a couple of callbacks to make our lives easier. To explain, callbacks are hooks that Keras runs during training, for example at the end of each epoch, and they can do things like saving model weights or checking whether the loss is still improving.

We’ll add two of them: a model checkpoint callback and an early stopping callback.

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        filepath='./training_checkpoints/ckpt_{epoch}',
        save_weights_only=True),
    tf.keras.callbacks.EarlyStopping(
        monitor='loss',
        patience=5,
        verbose=1,
        restore_best_weights=True),
]

Now we’re finally ready to start training our model. Not gonna lie, the amount of work that leads up to this point can feel quite overwhelming. However, once you’ve built a few of these kinds of models, you’ll notice that the same pattern repeats itself.

So without further ado, let’s start the training process.

epochs = 100

history = model.fit(
    train_ds,
    epochs=epochs,
    callbacks=callbacks,
)

This part of the process might take a little while, depending on how powerful a GPU you’re using. However, we only need to do it once.

Training the AI model to generate music
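
If you’d like to see how training went, one simple option is to plot the total loss per epoch from the history object. This assumes you have matplotlib installed; it isn’t imported at the top of this post.

import matplotlib.pyplot as plt

plt.plot(history.epoch, history.history['loss'], label='total loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()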

Let’s generate music with our trained AI model

Finally, we’ve come to the part where we can actually see what our model is capable of. However, we’ll need to apply a couple of transformations to our data first. You can also think of this part as the reverse of what we did when we were preparing the dataset.

So the first thing we’ll need is a function that predicts the next note. When we generate music, we’ll call this function repeatedly.

def predict_next_note(
    notes: np.ndarray, 
    model: tf.keras.Model, 
    temperature: float = 1.0) -> tuple[int, float, float]:

    assert temperature > 0

    # Add batch dimension
    inputs = tf.expand_dims(notes, 0)

    predictions = model.predict(inputs)
    pitch_logits = predictions['pitch']
    step = predictions['step']
    duration = predictions['duration']

    pitch_logits /= temperature
    pitch = tf.random.categorical(pitch_logits, num_samples=1)
    pitch = tf.squeeze(pitch, axis=-1)
    duration = tf.squeeze(duration, axis=-1)
    step = tf.squeeze(step, axis=-1)

    # `step` and `duration` values should be non-negative
    step = tf.maximum(0, step)
    duration = tf.maximum(0, duration)

    return int(pitch), float(step), float(duration)

The next step in this process is to create a pandas DataFrame with the generated note data. As we said, we’ll call the function above repeatedly until we have enough notes for a reasonably long piece.

temperature = 2.0
num_predictions = 120

input_notes = (
    np.stack([all_notes[key] for key in key_order], axis=1)[:seq_length]
    / np.array([vocab_size, 1, 1]))
generated_notes = []
prev_start = 0
for _ in range(num_predictions):
  pitch, step, duration = predict_next_note(input_notes, model, temperature)
  start = prev_start + step
  end = start + duration
  input_note = (pitch, step, duration)
  generated_notes.append((*input_note, start, end))
  input_notes = np.delete(input_notes, 0, axis=0)
  input_notes = np.append(input_notes, np.expand_dims(input_note, 0), axis=0)
  prev_start = start

generated_notes = pd.DataFrame(
    generated_notes, columns=(*key_order, 'start', 'end'))
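
Before converting anything, it’s worth peeking at the first few generated rows to make sure the values look sensible: an integer pitch, plus non-negative step and duration values.

print(generated_notes.head())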

Now that we have the generated notes ready, all we need to do is convert them into a MIDI file. For this part, we’ll define a function that does just that.

def notes_to_midi(
    notes: pd.DataFrame,
    out_file: str, 
    instrument_name: str,
    velocity: int = 100,  # note loudness
    ) -> pretty_midi.PrettyMIDI:

    pm = pretty_midi.PrettyMIDI()
    instrument = pretty_midi.Instrument(
        program=pretty_midi.instrument_name_to_program(
            instrument_name))

    prev_start = 0
    for i, note in notes.iterrows():
        start = float(prev_start + note['step'])
        end = float(start + note['duration'])
        note = pretty_midi.Note(
            velocity=velocity,
            pitch=int(note['pitch']),
            start=start,
            end=end,
        )
        instrument.notes.append(note)
        prev_start = start

    pm.instruments.append(instrument)
    pm.write(out_file)
    return pm

Alright! For the final step in this process, we need to call it.

out_file = 'output.mid'
out_pm = notes_to_midi(
    generated_notes, out_file=out_file, instrument_name='Acoustic Grand Piano')
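
If you don’t have a MIDI player handy, one extra option that isn’t part of the main workflow is to render the result to a WAV file using pretty_midi’s simple built-in synthesizer together with SciPy (assuming SciPy is installed).

import scipy.io.wavfile

# Render the MIDI with a basic sine-wave synthesizer and save it as a WAV file
audio = out_pm.synthesize(fs=16000)
scipy.io.wavfile.write('output.wav', 16000, audio)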

Here you go! If you put everything together, you should find an output.mid file with your freshly generated masterpiece in your working directory.

Conclusion

To conclude, we made a simple AI model that can generate music in MIDI format. I had a lot of fun playing with it and I hope you will too.

I also hope this guide helped you gain a better understanding of how recurrent neural networks work and how we can put them to use.
