How To Generate Song Recommendations Based On Spotify Data
In this post, we’re going to build a Python algorithm that generates song recommendations from a Spotify dataset. We’ll also use the K-means clustering algorithm to group similar songs in the dataset.
The dataset we’ll be using contains many different features for each song, which lets us compare songs and recommend similar ones. It also contains data from more than 34,000 artists, making it useful for recommending songs across all sorts of genres.
We’re also going to connect to the Spotify API. This will allow us to look up data for songs that might not be in the dataset.
Setting up APIs and dataset
Before we begin writing code, you’ll need to set up accounts with Kaggle and Spotify, if you haven’t already. We’ll use Kaggle to download the Spotify dataset, and we’ll need a Spotify account to access the Spotify API.
You can follow the instructions from a blog post about downloading Kaggle datasets. You’ll also find the steps to set up Kaggle API credentials there.
To set up the Spotify API, you’ll first need an account, and then you’ll need to create an app that will access the API. You can create it on the developer portal under Dashboard, where you’ll also find your app’s client ID and client secret in the app’s settings.
You’ll need both, so copy and save them into a separate file. I’d also suggest you don’t share these values with anyone, because they give direct access to the API on behalf of your account.
Authenticating APIs and downloading dataset within Python script
Alright, now let’s get to the coding part.
First of all, you’ll need to install and import all the necessary libraries, which include numpy, pandas, scikit-learn, kaggle, spotipy, and scipy.
You can install them using pip with the following commands.
pip install numpy
pip install pandas
pip install -U scikit-learn
pip install kaggle
pip install spotipy
pip install scipy
import os
import json
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from scipy.spatial.distance import cdist
from kaggle.api.kaggle_api_extended import KaggleApi
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from collections import defaultdict
from pprint import pprint
Next, we’ll need a function that fetches the Spotify app’s client ID and secret from the JSON file where we saved them earlier.
ROOT = os.path.dirname(__file__)

def get_tokens(api):
    # Load the credentials stored under the given API name in auth.json
    with open(os.path.join(ROOT, 'auth.json'), 'r') as auth_file:
        auth_data = json.load(auth_file)
    tokens = auth_data[api]
    return tokens
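The function above implies a particular layout for auth.json: one top-level key per API, each holding that API’s credentials. Here is a minimal sketch of that layout with placeholder values. The file contents and the `load_tokens` helper are assumptions for illustration, inferred from how `get_tokens('spotify')` is used in this post.

```python
import json
import os
import tempfile

# Hypothetical contents of auth.json; replace the placeholders with
# the client ID and secret from your Spotify app's settings.
EXAMPLE_AUTH = {
    "spotify": {
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
    }
}


def load_tokens(path, api):
    """Read the credentials stored under the given API name."""
    with open(path, "r") as auth_file:
        return json.load(auth_file)[api]


# Write the example to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    auth_path = os.path.join(tmp_dir, "auth.json")
    with open(auth_path, "w") as auth_file:
        json.dump(EXAMPLE_AUTH, auth_file, indent=2)
    tokens = load_tokens(auth_path, "spotify")

print(sorted(tokens))
```

Keeping the credentials in a separate file like this also makes it easy to exclude them from version control.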
Now, we need to create a client instance for both the Kaggle and Spotify APIs and authenticate each with our credentials.
spotify_tokens = get_tokens('spotify')
spotify_api = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id=spotify_tokens['client_id'],
    client_secret=spotify_tokens['client_secret']
))

kaggle_api = KaggleApi()
kaggle_api.authenticate()
Alright, we’re ready for the next step, where we download the Spotify dataset and prepare it for our K-means algorithm.
kaggle_api.dataset_download_files(
    'vatsalmavani/spotify-dataset',
    path=ROOT,
    unzip=True
)
data = pd.read_csv(os.path.join(ROOT, 'data', 'data.csv'))
X = data.select_dtypes(np.number)
number_cols = list(X.columns)
Coding song recommendations generator using the Spotify dataset
The first thing we need to do is run the clustering algorithm to assign each song in the dataset to a group.
song_cluster_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=20, verbose=False, n_init='auto'))
], verbose=False)
song_cluster_pipeline.fit(X)
song_cluster_labels = song_cluster_pipeline.predict(X)
data['cluster_label'] = song_cluster_labels
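The `StandardScaler` step matters because K-means relies on Euclidean distances, so a feature measured in large units (such as `duration_ms`) would otherwise dominate features that live between 0 and 1 (such as `danceability`). Here is a rough NumPy-only sketch of the standardization the scaler performs with its default settings. The numbers are made up for illustration.

```python
import numpy as np

# Made-up feature rows: [duration_ms, danceability]
features = np.array([
    [180_000.0, 0.20],
    [240_000.0, 0.80],
    [200_000.0, 0.50],
    [300_000.0, 0.90],
])

# Standardize each column to zero mean and unit variance, which is
# what StandardScaler does with its default settings.
mean = features.mean(axis=0)
std = features.std(axis=0)
scaled = (features - mean) / std

# Before scaling, the distance between the first two songs is driven
# almost entirely by duration_ms; after scaling both features count.
raw_distance = np.linalg.norm(features[0] - features[1])
scaled_distance = np.linalg.norm(scaled[0] - scaled[1])

print(raw_distance)
print(scaled_distance)
```

The raw distance is roughly 60,000, almost all of it coming from the duration column, while the scaled distance reflects both columns.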
Next, we need to write a function that fetches data for songs that aren’t in the dataset. This function uses the Spotify API client to search for the song on Spotify. It’s necessary because the dataset we downloaded from Kaggle might be outdated.
def find_song(name, year):
    song_data = defaultdict()
    # Spotify search filters take the form field:value, with no space after the colon
    results = spotify_api.search(q=f'track:{name} year:{year}', limit=1)
    if results['tracks']['items'] == []:
        return None

    results = results['tracks']['items'][0]
    track_id = results['id']
    audio_features = spotify_api.audio_features(track_id)[0]

    song_data['name'] = [name]
    song_data['year'] = [year]
    song_data['explicit'] = [int(results['explicit'])]
    song_data['duration_ms'] = [results['duration_ms']]
    song_data['popularity'] = [results['popularity']]

    # Copy every audio feature returned by the API into the row
    for key, value in audio_features.items():
        song_data[key] = value

    return pd.DataFrame(song_data)
Alright, now we’ll write a function that will get the data for a song we want.
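def get_song_data(song, spotify_data):
    try:
        # Look for an exact name and year match in the dataset first
        song_data = spotify_data[(spotify_data['name'] == song['name'])
                                 & (spotify_data['year'] == song['year'])].iloc[0]
        return song_data
    except IndexError:
        # Fall back to the Spotify API when the song isn't in the dataset
        return find_song(song['name'], song['year'])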
In the following step, we’ll write a couple of utility functions: one computes the mean feature vector of the input songs, and the other reshapes the input list into the form the later code expects.
def get_mean_vector(song_list, spotify_data):
    song_vectors = []
    for song in song_list:
        song_data = get_song_data(song, spotify_data)
        if song_data is None:
            # Skip songs found neither in the dataset nor on Spotify
            continue
        song_vector = song_data[number_cols].values
        song_vectors.append(song_vector)

    song_matrix = np.array(list(song_vectors), dtype=object)
    return np.mean(song_matrix, axis=0)

def flatten_dict_list(dict_list):
    flattened_dict = defaultdict()
    for key in dict_list[0].keys():
        flattened_dict[key] = []

    for dictionary in dict_list:
        for key, value in dictionary.items():
            flattened_dict[key].append(value)

    return flattened_dict
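To make the reshaping concrete, here is a standalone variant of `flatten_dict_list` applied to a short input list. It uses `defaultdict(list)` so the keys don’t have to be seeded up front, which is a small departure from the version in this post. The `seed_songs` name is just for illustration.

```python
from collections import defaultdict


def flatten_dict_list(dict_list):
    """Turn a list of dicts into a dict of lists, keyed by field."""
    flattened_dict = defaultdict(list)
    for dictionary in dict_list:
        for key, value in dictionary.items():
            flattened_dict[key].append(value)
    return dict(flattened_dict)


seed_songs = [
    {"name": "Come As You Are", "year": 1991},
    {"name": "Lithium", "year": 1992},
]
flattened = flatten_dict_list(seed_songs)
print(flattened)
# {'name': ['Come As You Are', 'Lithium'], 'year': [1991, 1992]}
```

The resulting list of names is what makes it possible to filter the input songs out of the recommendations later on.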
And for the last step, we need to write a function that returns a list of recommended songs from the Spotify dataset we downloaded, based on the songs we provide. It ranks every song in the dataset by cosine distance to the mean vector of the input songs and keeps the closest ones.
def recommend_songs(song_list, spotify_data, n_songs=10):
    metadata_cols = ['name', 'year', 'artists']
    song_dict = flatten_dict_list(song_list)

    song_center = get_mean_vector(song_list, spotify_data)
    # Reuse the scaler fitted inside the clustering pipeline
    scaler = song_cluster_pipeline.steps[0][1]
    scaled_data = scaler.transform(spotify_data[number_cols])
    scaled_song_center = scaler.transform(song_center.reshape(1, -1))
    distances = cdist(scaled_song_center, scaled_data, 'cosine')
    index = list(np.argsort(distances)[:, :n_songs][0])

    rec_songs = spotify_data.iloc[index]
    # Drop any of the input songs from the results
    rec_songs = rec_songs[~rec_songs['name'].isin(song_dict['name'])]
    return rec_songs[metadata_cols].to_dict(orient='records')
songs = recommend_songs([{'name': 'Come As You Are', 'year': 1991},
                         {'name': 'Smells Like Teen Spirit', 'year': 1991},
                         {'name': 'Lithium', 'year': 1992},
                         {'name': 'All Apologies', 'year': 1993},
                         {'name': 'Stay Away', 'year': 1993}], data)
pprint(songs)
[{'artists': "['Rascal Flatts']", 'name': 'Life is a Highway - From "Cars"', 'year': 2009},
 {'artists': "['Metallica']", 'name': 'Of Wolf And Man', 'year': 1991},
 {'artists': "['Keith Urban']", 'name': 'Somebody Like You', 'year': 2002},
 {'artists': "['Marillion']", 'name': 'Kayleigh', 'year': 1992},
 {'artists': "['Los Fugitivos']", 'name': 'Corazón Mágico', 'year': 1995},
 {'artists': "['Passion Pit']", 'name': 'Little Secrets', 'year': 2009},
 {'artists': "['Alice In Chains']", 'name': 'No Excuses', 'year': 1994},
 {'artists': "['Nickelback']", 'name': 'If Today Was Your Last Day', 'year': 2008},
 {'artists': "['Def Leppard']", 'name': "Let's Get Rocked", 'year': 1992},
 {'artists': "['Avril Lavigne']", 'name': "Things I'll Never Say", 'year': 2002}]
You can play with the settings and parameters if you want more or fewer songs, or if you want to cluster them into more groups.
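The ranking step inside `recommend_songs` can also be looked at in isolation. Below is a small NumPy-only sketch of the same idea: average the seed vectors, compute the cosine distance to every row of a catalog, and keep the closest rows. The `cosine_distances` helper and the tiny made-up catalog are assumptions for illustration. The post itself uses `cdist` with the `'cosine'` metric.

```python
import numpy as np


def cosine_distances(center, rows):
    """Cosine distance (1 - cosine similarity) from center to each row."""
    center_norm = np.linalg.norm(center)
    row_norms = np.linalg.norm(rows, axis=1)
    return 1.0 - (rows @ center) / (row_norms * center_norm)


# Made-up, already-scaled feature vectors for a tiny catalog.
catalog = np.array([
    [1.0, 0.0],    # index 0
    [0.9, 0.1],    # index 1
    [0.0, 1.0],    # index 2
    [-1.0, 0.0],   # index 3
])

# Mean vector of two hypothetical seed songs.
seeds = np.array([[1.0, 0.2], [0.8, 0.0]])
center = seeds.mean(axis=0)

distances = cosine_distances(center, catalog)
nearest = np.argsort(distances)[:2]
print(nearest.tolist())
# [1, 0]
```

Cosine distance compares the direction of the vectors rather than their length, so two songs with the same feature profile rank as close even if one has larger values overall. Note also that `recommend_songs` removes the input songs after picking the `n_songs` closest rows, so it can return fewer than `n_songs` results.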
Song recommendations generator based on Spotify dataset
Here is the entire code for this project.
import os
import json
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from scipy.spatial.distance import cdist
from kaggle.api.kaggle_api_extended import KaggleApi
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from collections import defaultdict
from pprint import pprint

ROOT = os.path.dirname(__file__)

def get_tokens(api):
    # Load the credentials stored under the given API name in auth.json
    with open(os.path.join(ROOT, 'auth.json'), 'r') as auth_file:
        auth_data = json.load(auth_file)
    tokens = auth_data[api]
    return tokens

spotify_tokens = get_tokens('spotify')
spotify_api = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id=spotify_tokens['client_id'],
    client_secret=spotify_tokens['client_secret']
))

kaggle_api = KaggleApi()
kaggle_api.authenticate()

kaggle_api.dataset_download_files(
    'vatsalmavani/spotify-dataset',
    path=ROOT,
    unzip=True
)

data = pd.read_csv(os.path.join(ROOT, 'data', 'data.csv'))
X = data.select_dtypes(np.number)
number_cols = list(X.columns)

song_cluster_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=20, verbose=False, n_init='auto'))
], verbose=False)

song_cluster_pipeline.fit(X)
song_cluster_labels = song_cluster_pipeline.predict(X)
data['cluster_label'] = song_cluster_labels

def find_song(name, year):
    song_data = defaultdict()
    # Spotify search filters take the form field:value, with no space after the colon
    results = spotify_api.search(q=f'track:{name} year:{year}', limit=1)
    if results['tracks']['items'] == []:
        return None

    results = results['tracks']['items'][0]
    track_id = results['id']
    audio_features = spotify_api.audio_features(track_id)[0]

    song_data['name'] = [name]
    song_data['year'] = [year]
    song_data['explicit'] = [int(results['explicit'])]
    song_data['duration_ms'] = [results['duration_ms']]
    song_data['popularity'] = [results['popularity']]

    # Copy every audio feature returned by the API into the row
    for key, value in audio_features.items():
        song_data[key] = value

    return pd.DataFrame(song_data)

def get_song_data(song, spotify_data):
    try:
        # Look for an exact name and year match in the dataset first
        song_data = spotify_data[(spotify_data['name'] == song['name'])
                                 & (spotify_data['year'] == song['year'])].iloc[0]
        return song_data
    except IndexError:
        # Fall back to the Spotify API when the song isn't in the dataset
        return find_song(song['name'], song['year'])

def get_mean_vector(song_list, spotify_data):
    song_vectors = []
    for song in song_list:
        song_data = get_song_data(song, spotify_data)
        if song_data is None:
            # Skip songs found neither in the dataset nor on Spotify
            continue
        song_vector = song_data[number_cols].values
        song_vectors.append(song_vector)

    song_matrix = np.array(list(song_vectors), dtype=object)
    return np.mean(song_matrix, axis=0)

def flatten_dict_list(dict_list):
    flattened_dict = defaultdict()
    for key in dict_list[0].keys():
        flattened_dict[key] = []

    for dictionary in dict_list:
        for key, value in dictionary.items():
            flattened_dict[key].append(value)

    return flattened_dict

def recommend_songs(song_list, spotify_data, n_songs=10):
    metadata_cols = ['name', 'year', 'artists']
    song_dict = flatten_dict_list(song_list)

    song_center = get_mean_vector(song_list, spotify_data)
    # Reuse the scaler fitted inside the clustering pipeline
    scaler = song_cluster_pipeline.steps[0][1]
    scaled_data = scaler.transform(spotify_data[number_cols])
    scaled_song_center = scaler.transform(song_center.reshape(1, -1))
    distances = cdist(scaled_song_center, scaled_data, 'cosine')
    index = list(np.argsort(distances)[:, :n_songs][0])

    rec_songs = spotify_data.iloc[index]
    # Drop any of the input songs from the results
    rec_songs = rec_songs[~rec_songs['name'].isin(song_dict['name'])]
    return rec_songs[metadata_cols].to_dict(orient='records')

songs = recommend_songs([{'name': 'Come As You Are', 'year': 1991},
                         {'name': 'Smells Like Teen Spirit', 'year': 1991},
                         {'name': 'Lithium', 'year': 1992},
                         {'name': 'All Apologies', 'year': 1993},
                         {'name': 'Stay Away', 'year': 1993}], data)
pprint(songs)
Conclusion
To conclude, we made a simple algorithm that generates song recommendations by pulling them out of a Spotify dataset. I learned a lot while working on this project, and I hope it proves helpful to you as well.