How To Generate Music With AI Using TensorFlow
In this post, we’ll be making a simple AI model that will generate MIDI music for us. To clarify, MIDI files only contain the notes, which a device then plays back. Therefore, these files may sound different from device to device.
In our example, we’ll be working with a collection of piano MIDI files from the MAESTRO dataset. However, we’re only going to use a handful of them in order to speed up the whole process.
Getting started
As you may notice from the title of this post, we’re going to work with TensorFlow. To explain, TensorFlow is Google’s ML library, which gives us the tools to create state-of-the-art algorithms.
However, we’ll keep things simple in this tutorial and use a fairly basic model. It will mainly consist of LSTM units. In case you’re not familiar with LSTMs, I suggest you check out a post I made about recurrent neural networks.
Before we begin, we need to set up and import all the libraries and tools we’ll need for this project.
import collections
import glob
import pathlib

import numpy as np
import pandas as pd
import pretty_midi
import tensorflow as tf
from tqdm import tqdm
Preparing the dataset
Okay, now that we have all the tools at the ready, we can begin with dataset preparation. The purpose of this process is to convert the MIDI data into an AI-model-friendly format. Basically, we’ll change it into a table of numerical values.
But before we can do that, we first need to download the dataset.
data_dir = pathlib.Path('data/maestro-v2.0.0')
if not data_dir.exists():
    tf.keras.utils.get_file(
        'maestro-v2.0.0-midi.zip',
        origin='https://storage.googleapis.com/magentadata/datasets/maestro/v2.0.0/maestro-v2.0.0-midi.zip',
        extract=True,
        cache_dir='.', cache_subdir='data',
    )
filenames = glob.glob(str(data_dir/'**/*.mid*'))
print('Number of files:', len(filenames))
In the following part of this process, we’re going to define a function that converts a MIDI file into a pandas dataframe, which will hold the values of each of its notes.
def midi_to_notes(midi_file: str) -> pd.DataFrame:
    pm = pretty_midi.PrettyMIDI(midi_file)
    instrument = pm.instruments[0]
    notes = collections.defaultdict(list)

    # Sort the notes by start time
    sorted_notes = sorted(instrument.notes, key=lambda note: note.start)
    prev_start = sorted_notes[0].start

    for note in sorted_notes:
        start = note.start
        end = note.end
        notes['pitch'].append(note.pitch)
        notes['start'].append(start)
        notes['end'].append(end)
        notes['step'].append(start - prev_start)
        notes['duration'].append(end - start)
        prev_start = start

    return pd.DataFrame({name: np.array(value) for name, value in notes.items()})
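To make sure the conversion works, you can optionally run the function on a single file and inspect the first few rows of the resulting dataframe. This is just a quick sanity check and isn’t required for the rest of the tutorial.

sample_file = filenames[0]
sample_notes = midi_to_notes(sample_file)
# Each row is one note with its pitch, timing, and duration
print(sample_notes.head())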
What we’ll do next is join the note data from all MIDI files into one pandas dataframe. We’re also going to keep only the most important columns: pitch, step, and duration.
all_notes = []
for f in tqdm(filenames[:10]):
    notes = midi_to_notes(f)
    all_notes.append(notes)
all_notes = pd.concat(all_notes)
n_notes = len(all_notes)
print('Number of notes parsed:', n_notes)
key_order = ['pitch', 'step', 'duration']
train_notes = np.stack([all_notes[key] for key in key_order], axis=1)
notes_ds = tf.data.Dataset.from_tensor_slices(train_notes)
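As another optional check, you can inspect the element spec of the dataset to confirm that each element is a vector of three values (pitch, step, and duration):

# Each element should be a tensor of shape (3,)
print(notes_ds.element_spec)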
Since we’ll be working with LSTMs, we need to split the dataset into sequences of equal length. We’ll define a function that does just that and also normalizes the pitch values.
def create_sequences(
    dataset: tf.data.Dataset,
    seq_length: int,
    vocab_size=128,
) -> tf.data.Dataset:
    # Take 1 extra for the labels
    seq_length = seq_length + 1

    windows = dataset.window(seq_length, shift=1, stride=1,
                             drop_remainder=True)

    # `flat_map` flattens the "dataset of datasets" into a dataset of tensors
    flatten = lambda x: x.batch(seq_length, drop_remainder=True)
    sequences = windows.flat_map(flatten)

    # Normalize note pitch
    def scale_pitch(x):
        x = x / [vocab_size, 1.0, 1.0]
        return x

    # Split the labels
    def split_labels(sequences):
        inputs = sequences[:-1]
        labels_dense = sequences[-1]
        labels = {key: labels_dense[i] for i, key in enumerate(key_order)}
        return scale_pitch(inputs), labels

    return sequences.map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)
Okay, now we’re finally ready to put everything together and build the final training dataset. In the following step, we’ll also batch, cache, and prefetch it.
seq_length = 25
vocab_size = 128
seq_ds = create_sequences(notes_ds, seq_length, vocab_size)
batch_size = 128
train_ds = (seq_ds
            .batch(batch_size, drop_remainder=True)
            .cache()
            .prefetch(tf.data.experimental.AUTOTUNE))
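Before moving on, it doesn’t hurt to peek at a single example from the sequence dataset to confirm the shapes look right. This is a minimal, optional check:

for seq, target in seq_ds.take(1):
    print('sequence shape:', seq.shape)         # should be (seq_length, 3)
    print('target keys:', list(target.keys()))  # ['pitch', 'step', 'duration']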
Putting together the AI model
In order for our AI model to be able to generate something that at least resembles music, we need to train it. But before we do that, we need to set it up first.
This process entails defining the loss function and the model, compiling the model, and training it. Let’s start with the loss function.
def mse_with_positive_pressure(y_true: tf.Tensor, y_pred: tf.Tensor):
    mse = (y_true - y_pred) ** 2
    positive_pressure = 10 * tf.maximum(-y_pred, 0.0)
    return tf.reduce_mean(mse + positive_pressure)
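To get a feel for what the extra term does, here’s a tiny, optional example: a negative prediction is penalized much harder than a positive one that is equally far from the target.

y_true = tf.constant([0.5, 0.5])
# Plain squared error territory: small loss
print(mse_with_positive_pressure(y_true, tf.constant([0.6, 0.6])).numpy())
# Negative predictions trigger the extra penalty: much larger loss
print(mse_with_positive_pressure(y_true, tf.constant([-0.6, -0.6])).numpy())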
We’ll make our model output 3 different values (pitch, step, and duration), so we’ll need a loss function for each. That is because the model learns to predict each value individually.
In the following part of this guide, we’ll define the model and everything else we need to compile it.
input_shape = (seq_length, 3)
learning_rate = 0.005
inputs = tf.keras.Input(input_shape)
x = tf.keras.layers.LSTM(256)(inputs)
outputs = {
    'pitch': tf.keras.layers.Dense(128, name='pitch')(x),
    'step': tf.keras.layers.Dense(1, name='step')(x),
    'duration': tf.keras.layers.Dense(1, name='duration')(x),
}

model = tf.keras.Model(inputs, outputs)

loss = {
    'pitch': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    'step': mse_with_positive_pressure,
    'duration': mse_with_positive_pressure,
}
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
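At this point, you can optionally print a summary of the model to double-check the layers and output shapes:

model.summary()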
Because we have multiple loss values, we also get an imbalance when it comes to calculating the total loss. Therefore, we use the loss_weights argument to deal with this problem.
model.compile(
    loss=loss,
    loss_weights={
        'pitch': 0.05,
        'step': 1.0,
        'duration': 1.0,
    },
    optimizer=optimizer,
)
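If you’d like a baseline before any training happens, you can optionally evaluate the untrained model on the training set; the loss values should drop noticeably once training starts.

baseline_losses = model.evaluate(train_ds, return_dict=True)
print(baseline_losses)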
Okay, before we get to training the model, let’s act professional and include a couple of callbacks to make our lives easier. To explain, callbacks are functions that run after each epoch and handle tasks like saving model weights or checking whether the model is still improving.
Here are the ones we’ll be adding: a model checkpoint callback and an early stopping callback.
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        filepath='./training_checkpoints/ckpt_{epoch}',
        save_weights_only=True),
    tf.keras.callbacks.EarlyStopping(
        monitor='loss',
        patience=5,
        verbose=1,
        restore_best_weights=True),
]
Now we’re finally ready to start training our model. Not gonna lie, the amount of stuff that leads up to this part can sometimes feel quite overwhelming. However, once you’ve built a few of these models, you’ll start to see a repeating pattern.
So without further ado, let’s start the training process.
epochs = 100

history = model.fit(
    train_ds,
    epochs=epochs,
    callbacks=callbacks,
)
This part of the process might take a little while, depending on how powerful a GPU you’re using. However, we only need to do it once.
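If you want to see how the training went, you can plot the total loss from the history object. This assumes you have matplotlib installed, which isn’t strictly needed for the rest of the tutorial.

import matplotlib.pyplot as plt

plt.plot(history.epoch, history.history['loss'], label='total loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()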
Let’s generate music with our trained AI model
Finally, we’ve come to the part where we can actually see what our model is capable of. However, we’ll need to apply a couple of changes to our data. You can also look at this part as the reverse of what we did to the data when we were preparing the dataset.
So the first thing we’ll need is a function that will predict the next note. We’ll call this function repeatedly when generating music.
def predict_next_note(
        notes: np.ndarray,
        model: tf.keras.Model,
        temperature: float = 1.0) -> tuple[int, float, float]:
    assert temperature > 0

    # Add batch dimension
    inputs = tf.expand_dims(notes, 0)

    predictions = model.predict(inputs)
    pitch_logits = predictions['pitch']
    step = predictions['step']
    duration = predictions['duration']

    pitch_logits /= temperature
    pitch = tf.random.categorical(pitch_logits, num_samples=1)
    pitch = tf.squeeze(pitch, axis=-1)
    duration = tf.squeeze(duration, axis=-1)
    step = tf.squeeze(step, axis=-1)

    # `step` and `duration` values should be non-negative
    step = tf.maximum(0, step)
    duration = tf.maximum(0, duration)

    return int(pitch), float(step), float(duration)
The next step in this process is to create a pandas dataframe with the generated note data. As we said, we’ll call the function above repeatedly until we get enough notes for a reasonably long piece of music.
temperature = 2.0
num_predictions = 120

input_notes = (np.stack([all_notes[key] for key in key_order], axis=1)[:seq_length]
               / np.array([vocab_size, 1, 1]))

generated_notes = []
prev_start = 0
for _ in range(num_predictions):
    pitch, step, duration = predict_next_note(input_notes, model, temperature)
    start = prev_start + step
    end = start + duration
    input_note = (pitch, step, duration)
    generated_notes.append((*input_note, start, end))
    input_notes = np.delete(input_notes, 0, axis=0)
    input_notes = np.append(input_notes, np.expand_dims(input_note, 0), axis=0)
    prev_start = start

generated_notes = pd.DataFrame(
    generated_notes, columns=(*key_order, 'start', 'end'))
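Before converting anything, you can optionally take a quick look at the generated notes, for example by adding human-readable note names via pretty_midi. This extra column is only for inspection and isn’t used later.

# Map MIDI pitch numbers (0-127) to names like 'C4'
generated_notes['note_name'] = generated_notes['pitch'].apply(
    pretty_midi.note_number_to_name)
print(generated_notes.head())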
Now that we have the generated notes ready, all we need to do is convert them into a MIDI file. For this part, we’ll define a function that will do just that.
def notes_to_midi(
        notes: pd.DataFrame,
        out_file: str,
        instrument_name: str,
        velocity: int = 100,  # note loudness
) -> pretty_midi.PrettyMIDI:
    pm = pretty_midi.PrettyMIDI()
    instrument = pretty_midi.Instrument(
        program=pretty_midi.instrument_name_to_program(instrument_name))

    prev_start = 0
    for i, note in notes.iterrows():
        start = float(prev_start + note['step'])
        end = float(start + note['duration'])
        note = pretty_midi.Note(
            velocity=velocity,
            pitch=int(note['pitch']),
            start=start,
            end=end,
        )
        instrument.notes.append(note)
        prev_start = start

    pm.instruments.append(instrument)
    pm.write(out_file)
    return pm
Alright! For the final step in this process, we need to call the function we just defined.
out_file = 'output.mid'
out_pm = notes_to_midi(
    generated_notes, out_file=out_file, instrument_name='Acoustic Grand Piano')
And there you go: if you put everything together, you should find a freshly generated masterpiece in your local directory.
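If you’d rather hear the result right away, pretty_midi can also synthesize the notes into a raw waveform using simple sine waves. The snippet below is just one possible way to do this and assumes scipy is installed, which is an extra dependency for this tutorial.

from scipy.io import wavfile

# Sine-wave synthesis; no external synthesizer required
audio = out_pm.synthesize(fs=16000)
wavfile.write('output.wav', 16000, audio.astype(np.float32))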
Conclusion
To conclude, we made a simple AI model that can generate music in MIDI format. I had a lot of fun playing with it and I hope you will too.
I also hope this guide helped you gain a better understanding of how recurrent neural networks work and how we can put them to use.