How To Generate Song Recommendations Based On Spotify Data

In this post, we’re going to build a Python algorithm that generates song recommendations from a Spotify dataset. We’ll also use the K-means clustering algorithm to group similar songs in the dataset.

The dataset we’ll be using contains many different features for each song, which lets us measure how similar two songs are. It also contains data from more than 34,000 artists, which makes it useful for recommending songs across all sorts of genres.

We’re also going to connect to the Spotify API. This will allow us to look up data for songs that might not be in the dataset.

Setting up APIs and dataset

Before we begin writing code, you’ll need to set up your accounts with Kaggle and Spotify, if you haven’t already. To clarify, we’ll use Kaggle to download the Spotify dataset, and we’ll need a Spotify account to access the Spotify API.

You can follow the instructions from a blog post about downloading Kaggle datasets. You’ll also find the steps to set up Kaggle API credentials there.

To set up the Spotify API, you’ll first need an account, and then you’ll need to create an app that will access the API. You can create the app in the developer portal under the dashboard, where you’ll also find the app’s client ID and client secret inside the app’s settings.

You’ll need both values, so copy and save them into a separate file. I’d also suggest you don’t share these values with anyone, because they give access to the API on behalf of your account.

Authenticating APIs and downloading dataset within Python script

Alright, now let’s get to the coding part.

First of all, you’ll need to install and import the necessary libraries: numpy, pandas, scikit-learn, kaggle, spotipy, and scipy.

You can install them using pip with the following commands. The import statements for the script follow right after.

pip install numpy
pip install pandas
pip install -U scikit-learn
pip install kaggle
pip install spotipy
pip install scipy

import os
import json
import numpy as np
import pandas as pd

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from scipy.spatial.distance import cdist

from kaggle.api.kaggle_api_extended import KaggleApi

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from collections import defaultdict
from pprint import pprint

Next, we’ll need a function that fetches the Spotify app’s client ID and secret from the JSON file where we saved them earlier.

ROOT = os.path.dirname(__file__)

def get_tokens(api):
    with open(os.path.join(ROOT, 'auth.json'), 'r') as auth_file:
        auth_data = json.load(auth_file)
        tokens = auth_data[api]
        return tokens
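
The function above expects a file named auth.json next to the script, with one entry per API. The post doesn’t show the file itself, so the layout below is inferred from how get_tokens('spotify') and the client_id and client_secret lookups are used; the credential values are placeholders. A minimal sketch that writes such a file to a temporary folder and reads it back the same way:

```python
import json
import os
import tempfile

# Hypothetical contents of auth.json; replace the placeholders with your own values
example_auth = {
    'spotify': {
        'client_id': 'YOUR_CLIENT_ID',
        'client_secret': 'YOUR_CLIENT_SECRET',
    }
}

def load_tokens(path, api):
    # Same lookup as get_tokens, but with the file path passed in explicitly
    with open(path, 'r') as auth_file:
        return json.load(auth_file)[api]

with tempfile.TemporaryDirectory() as folder:
    auth_path = os.path.join(folder, 'auth.json')
    with open(auth_path, 'w') as auth_file:
        json.dump(example_auth, auth_file, indent=4)
    tokens = load_tokens(auth_path, 'spotify')

print(sorted(tokens))
```

If the file is missing the 'spotify' key, the lookup raises a KeyError, which is usually the first thing to check when authentication fails.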

Now, we need to create a client instance for both the Kaggle API and the Spotify API, and authenticate each of them with our credentials.

spotify_tokens = get_tokens('spotify')

spotify_api = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id=spotify_tokens['client_id'],
    client_secret=spotify_tokens['client_secret']
))

kaggle_api = KaggleApi()
kaggle_api.authenticate()

Alright, we’re ready for the next step, where we download the Spotify dataset and prepare it for our K-means algorithm.

kaggle_api.dataset_download_files(
    'vatsalmavani/spotify-dataset',
    path=ROOT,
    unzip=True
)

data = pd.read_csv(os.path.join(ROOT, 'data', 'data.csv'))

X = data.select_dtypes(np.number)
number_cols = list(X.columns)

Coding song recommendations generator using the Spotify dataset

The first thing we need to do is run the clustering algorithm to assign each song in the dataset to a group. The pipeline standardizes the numeric features before passing them to K-means.

song_cluster_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=20, verbose=False, n_init='auto'))
], verbose=False)

song_cluster_pipeline.fit(X)
song_cluster_labels = song_cluster_pipeline.predict(X)
data['cluster_label'] = song_cluster_labels
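
To see why the pipeline scales the features before clustering, here is a small standard-library sketch. The four songs and their values are made up for illustration. It standardizes each column with a z-score, which is what StandardScaler does per feature, and shows that the nearest neighbour of a song changes once tempo no longer dominates the distance.

```python
import math
import statistics

# Hypothetical songs: (tempo in BPM, danceability between 0 and 1)
songs = {
    'a': (120.0, 0.90),
    'b': (122.0, 0.10),
    'c': (130.0, 0.88),
    'd': (180.0, 0.15),
}

def standardize(rows):
    # Z-score each column: subtract the mean, divide by the population std
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest(target, vectors):
    # Name of the closest other song by Euclidean distance
    others = [name for name in vectors if name != target]
    return min(others, key=lambda name: math.dist(vectors[target], vectors[name]))

scaled = dict(zip(songs, standardize(list(songs.values()))))

print(nearest('a', songs))   # raw features: tempo dominates, prints b
print(nearest('a', scaled))  # standardized features: prints c
```

On the raw values, song b looks closest to song a because their tempos are almost equal, even though their danceability is very different. After scaling, song c wins because both features carry comparable weight.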

Next, we need to write a function that fetches data for songs that aren’t in the dataset. It uses the Spotify API client to search for the song on Spotify and then requests its audio features. We need this function because the dataset we downloaded from Kaggle might be outdated.

def find_song(name, year):
    # Spotify field filters are written as field:value, with no space after the colon
    results = spotify_api.search(q=f'track:{name} year:{year}', type='track', limit=1)
    items = results['tracks']['items']
    if not items:
        return None

    track = items[0]
    audio_features = spotify_api.audio_features(track['id'])[0]
    if audio_features is None:
        # Spotify returned no audio features for this track
        return None

    song_data = {
        'name': name,
        'year': year,
        'explicit': int(track['explicit']),
        'duration_ms': track['duration_ms'],
        'popularity': track['popularity'],
    }
    song_data.update(audio_features)

    # Return a Series so the result has the same shape as a row of the dataset
    return pd.Series(song_data)
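
Spotify’s search endpoint accepts field filters such as track and year inside the query string, written as field:value with no space after the colon. A small sketch of building that string; build_search_query is a hypothetical helper for illustration, not part of spotipy:

```python
def build_search_query(name, year=None):
    # Hypothetical helper: joins Spotify field filters into one query string
    parts = [f'track:{name.strip()}']
    if year is not None:
        parts.append(f'year:{year}')
    return ' '.join(parts)

query = build_search_query('Come As You Are', 1991)
print(query)  # track:Come As You Are year:1991
```

The resulting string can be passed as the q argument of spotify_api.search.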

Alright, now we’ll write a function that gets the data for a given song. It looks for the song in the dataset first and falls back to the Spotify API if the song isn’t there.

def get_song_data(song, spotify_data):
    # Match on both name and year, since different songs can share a name
    matches = spotify_data[(spotify_data['name'] == song['name'])
                           & (spotify_data['year'] == song['year'])]
    if matches.empty:
        # Not in the dataset, so ask the Spotify API instead
        return find_song(song['name'], song['year'])
    return matches.iloc[0]

In the following step, we’ll write a couple of utility functions. The first one computes the mean feature vector of the songs we provide, which serves as the reference point for the recommendations. The second one reshapes a list of dictionaries into a dictionary of lists.

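def get_mean_vector(song_list, spotify_data):
    song_vectors = []

    for song in song_list:
        song_data = get_song_data(song, spotify_data)
        if song_data is None:
            # Skip songs found neither in the dataset nor on Spotify
            continue
        # Flatten so dataset rows and API results end up with the same shape
        song_vector = np.asarray(song_data[number_cols], dtype=float).ravel()
        song_vectors.append(song_vector)

    if not song_vectors:
        raise ValueError('None of the given songs could be found')

    return np.mean(np.array(song_vectors), axis=0)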

def flatten_dict_list(dict_list):
    # Turn a list of dicts into a dict of lists, one list per key
    flattened_dict = defaultdict(list)

    for dictionary in dict_list:
        for key, value in dictionary.items():
            flattened_dict[key].append(value)

    return flattened_dict
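
To make the two helpers concrete, here is a self-contained sketch on toy data. The feature values are made up for illustration, and the functions are simplified stand-ins that skip the dataset lookup.

```python
from collections import defaultdict

def flatten(dict_list):
    # List of dicts -> dict of lists, one list per key
    flattened = defaultdict(list)
    for dictionary in dict_list:
        for key, value in dictionary.items():
            flattened[key].append(value)
    return dict(flattened)

def mean_vector(vectors):
    # Column-wise average of equally long feature vectors
    return [sum(column) / len(column) for column in zip(*vectors)]

song_list = [
    {'name': 'Come As You Are', 'year': 1991},
    {'name': 'Lithium', 'year': 1992},
]
# Hypothetical (energy, valence) values for the two songs
vectors = [[0.75, 0.5], [0.25, 0.0]]

flat = flatten(song_list)
center = mean_vector(vectors)
print(flat)    # {'name': ['Come As You Are', 'Lithium'], 'year': [1991, 1992]}
print(center)  # [0.5, 0.25]
```

The flattened dictionary makes it easy to check whether a recommended song’s name is one of the input names, and the mean vector is the single point that recommendations are measured against.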

For the last step, we need to write a function that takes the songs we provide and returns a list of recommended songs from the Spotify dataset we downloaded. It ranks every song in the dataset by cosine distance to the mean vector of our songs.

def recommend_songs(song_list, spotify_data, n_songs=10):
    metadata_cols = ['name', 'year', 'artists']
    song_dict = flatten_dict_list(song_list)

    song_center = get_mean_vector(song_list, spotify_data)
    # Reuse the scaler fitted in the pipeline so both sides share one scale
    scaler = song_cluster_pipeline.steps[0][1]
    scaled_data = scaler.transform(spotify_data[number_cols])
    scaled_song_center = scaler.transform(song_center.reshape(1, -1))
    # Cosine distance between the mean vector and every song in the dataset
    distances = cdist(scaled_song_center, scaled_data, 'cosine')
    index = np.argsort(distances[0])

    # Drop the input songs first, then keep the n_songs closest ones
    rec_songs = spotify_data.iloc[index]
    rec_songs = rec_songs[~rec_songs['name'].isin(song_dict['name'])]
    return rec_songs[metadata_cols].head(n_songs).to_dict(orient='records')

songs = recommend_songs([{'name': 'Come As You Are', 'year':1991},
                {'name': 'Smells Like Teen Spirit', 'year': 1991},
                {'name': 'Lithium', 'year': 1992},
                {'name': 'All Apologies', 'year': 1993},
                {'name': 'Stay Away', 'year': 1993}],  data)

pprint(songs)

[{'artists': "['Rascal Flatts']",
  'name': 'Life is a Highway - From "Cars"',
  'year': 2009},
 {'artists': "['Metallica']", 'name': 'Of Wolf And Man', 'year': 1991},
 {'artists': "['Keith Urban']", 'name': 'Somebody Like You', 'year': 2002},
 {'artists': "['Marillion']", 'name': 'Kayleigh', 'year': 1992},
 {'artists': "['Los Fugitivos']", 'name': 'Corazón Mágico', 'year': 1995},
 {'artists': "['Passion Pit']", 'name': 'Little Secrets', 'year': 2009},
 {'artists': "['Alice In Chains']", 'name': 'No Excuses', 'year': 1994},
 {'artists': "['Nickelback']",
  'name': 'If Today Was Your Last Day',
  'year': 2008},
 {'artists': "['Def Leppard']", 'name': "Let's Get Rocked", 'year': 1992},
 {'artists': "['Avril Lavigne']",
  'name': "Things I'll Never Say",
  'year': 2002}]

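You can play with the settings and parameters if you want more or fewer songs, or if you want to cluster the songs into more groups. Keep in mind that recommend_songs ranks songs by cosine distance alone. The cluster labels are stored in the cluster_label column but aren’t used for ranking, so changing n_clusters won’t change the recommendations.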
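
The ranking step can be illustrated without the dataset. The sketch below ranks a few toy vectors by cosine distance to a reference vector. It is a variant that removes the input songs before cutting the list to n, so the requested number of results is returned whenever enough candidates exist. Song names and values are hypothetical.

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity, the quantity cdist(..., 'cosine') computes per pair
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank_songs(center, catalog, exclude, n_songs):
    # Sort every candidate by distance, skipping excluded names, then cut to n_songs
    candidates = [name for name in catalog if name not in exclude]
    candidates.sort(key=lambda name: cosine_distance(center, catalog[name]))
    return candidates[:n_songs]

# Hypothetical standardized feature vectors
catalog = {
    'input song': (1.0, 1.0),
    'close match': (2.0, 1.0),
    'orthogonal': (-1.0, 1.0),
    'opposite': (-1.0, -1.0),
}

top = rank_songs((1.0, 1.0), catalog, exclude={'input song'}, n_songs=2)
print(top)  # ['close match', 'orthogonal']
```

Cosine distance compares the direction of two vectors rather than their length, so a song whose features are all proportionally larger than the reference still counts as close.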

Song recommendations generator based on Spotify dataset

Here is the entire code for this project.

import os
import json
import numpy as np
import pandas as pd

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from scipy.spatial.distance import cdist

from kaggle.api.kaggle_api_extended import KaggleApi

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from collections import defaultdict
from pprint import pprint

ROOT = os.path.dirname(__file__)

def get_tokens(api):
    with open(os.path.join(ROOT, 'auth.json'), 'r') as auth_file:
        auth_data = json.load(auth_file)
        tokens = auth_data[api]
        return tokens
    
spotify_tokens = get_tokens('spotify')

spotify_api = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id=spotify_tokens['client_id'],
    client_secret=spotify_tokens['client_secret']
))

kaggle_api = KaggleApi()
kaggle_api.authenticate()

kaggle_api.dataset_download_files(
    'vatsalmavani/spotify-dataset',
    path=ROOT,
    unzip=True
)

data = pd.read_csv(os.path.join(ROOT, 'data', 'data.csv'))

X = data.select_dtypes(np.number)
number_cols = list(X.columns)

song_cluster_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=20, verbose=False, n_init='auto'))
], verbose=False)

song_cluster_pipeline.fit(X)
song_cluster_labels = song_cluster_pipeline.predict(X)
data['cluster_label'] = song_cluster_labels

def find_song(name, year):
    # Spotify field filters are written as field:value, with no space after the colon
    results = spotify_api.search(q=f'track:{name} year:{year}', type='track', limit=1)
    items = results['tracks']['items']
    if not items:
        return None

    track = items[0]
    audio_features = spotify_api.audio_features(track['id'])[0]
    if audio_features is None:
        # Spotify returned no audio features for this track
        return None

    song_data = {
        'name': name,
        'year': year,
        'explicit': int(track['explicit']),
        'duration_ms': track['duration_ms'],
        'popularity': track['popularity'],
    }
    song_data.update(audio_features)

    # Return a Series so the result has the same shape as a row of the dataset
    return pd.Series(song_data)

def get_song_data(song, spotify_data):
    # Match on both name and year, since different songs can share a name
    matches = spotify_data[(spotify_data['name'] == song['name'])
                           & (spotify_data['year'] == song['year'])]
    if matches.empty:
        # Not in the dataset, so ask the Spotify API instead
        return find_song(song['name'], song['year'])
    return matches.iloc[0]

def get_mean_vector(song_list, spotify_data):
    song_vectors = []

    for song in song_list:
        song_data = get_song_data(song, spotify_data)
        if song_data is None:
            # Skip songs found neither in the dataset nor on Spotify
            continue
        # Flatten so dataset rows and API results end up with the same shape
        song_vector = np.asarray(song_data[number_cols], dtype=float).ravel()
        song_vectors.append(song_vector)

    if not song_vectors:
        raise ValueError('None of the given songs could be found')

    return np.mean(np.array(song_vectors), axis=0)

def flatten_dict_list(dict_list):
    # Turn a list of dicts into a dict of lists, one list per key
    flattened_dict = defaultdict(list)

    for dictionary in dict_list:
        for key, value in dictionary.items():
            flattened_dict[key].append(value)

    return flattened_dict

def recommend_songs(song_list, spotify_data, n_songs=10):
    metadata_cols = ['name', 'year', 'artists']
    song_dict = flatten_dict_list(song_list)

    song_center = get_mean_vector(song_list, spotify_data)
    # Reuse the scaler fitted in the pipeline so both sides share one scale
    scaler = song_cluster_pipeline.steps[0][1]
    scaled_data = scaler.transform(spotify_data[number_cols])
    scaled_song_center = scaler.transform(song_center.reshape(1, -1))
    # Cosine distance between the mean vector and every song in the dataset
    distances = cdist(scaled_song_center, scaled_data, 'cosine')
    index = np.argsort(distances[0])

    # Drop the input songs first, then keep the n_songs closest ones
    rec_songs = spotify_data.iloc[index]
    rec_songs = rec_songs[~rec_songs['name'].isin(song_dict['name'])]
    return rec_songs[metadata_cols].head(n_songs).to_dict(orient='records')

songs = recommend_songs([{'name': 'Come As You Are', 'year':1991},
                {'name': 'Smells Like Teen Spirit', 'year': 1991},
                {'name': 'Lithium', 'year': 1992},
                {'name': 'All Apologies', 'year': 1993},
                {'name': 'Stay Away', 'year': 1993}],  data)

pprint(songs)

Conclusion

To conclude, we made a simple algorithm that generates song recommendations by ranking the songs in a Spotify dataset by their similarity to the songs we provide. I learned a lot while working on this project, and I hope it proves helpful to you as well.