Making a Movie Recommendation System With AI – Part 1
In this post, we’ll build a movie recommendation system using TensorFlow. This is the first part of the series.
The whole recommendation system picks movies in 2 steps, called retrieval and ranking. Retrieval selects an initial set of candidate movies, and ranking then fine-tunes that selection. In this part, we’ll focus on the retrieval process.
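To make the two steps concrete, here is a tiny framework-free sketch. The vectors, the ratings, and the helper names `retrieve` and `rank` are all made up for illustration and are not part of TensorFlow Recommenders; the point is only that retrieval cheaply narrows the catalogue and ranking reorders the few survivors.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_vec: List[float], movie_vecs: Dict[str, List[float]], k: int) -> List[str]:
    """Stage 1: cheaply narrow the full catalogue down to k candidates."""
    ordered = sorted(movie_vecs, key=lambda t: dot(user_vec, movie_vecs[t]), reverse=True)
    return ordered[:k]


def rank(candidates: List[str], predicted_rating: Dict[str, float]) -> List[str]:
    """Stage 2: reorder the few candidates using a finer (here: made-up) score."""
    return sorted(candidates, key=lambda t: predicted_rating[t], reverse=True)


# Hypothetical toy data, not taken from MovieLens.
toy_user = [1.0, 0.0]
toy_movies = {"A": [0.9, 0.1], "B": [0.8, 0.3], "C": [-0.5, 1.0], "D": [0.1, 0.9]}
toy_ratings = {"A": 3.5, "B": 4.5, "C": 2.0, "D": 4.0}

candidates = retrieve(toy_user, toy_movies, k=2)
final = rank(candidates, toy_ratings)
print(candidates, final)
```

Retrieval keeps "A" and "B" because their vectors point the same way as the user vector, and ranking then puts "B" first because of its higher made-up rating.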
In this tutorial, we’ll be working with the MovieLens 100K dataset, which contains 100,000 movie ratings along with information about the rated movies. We’ll load it from TensorFlow Datasets, since it’s readily available to us in just a few lines of code.
Prerequisites
As with any other Python project, the first thing we need to take care of is importing all the necessary libraries. Among them is TensorFlow Recommenders, which includes specialized functionality for building recommendation system components.
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
from typing import Dict, Text
Data preprocessing
To begin this part of the process, we first need to load the dataset.
# Ratings data.
ratings = tfds.load("movielens/100k-ratings", split="train")
# Features of all the available movies.
movies = tfds.load("movielens/100k-movies", split="train")
We’ll need to simplify this dataset because, by default, each sample comes with a whole bunch of features. For the ratings dataset, we’ll only use movie titles and user ids, while for the movies dataset, movie titles will be enough.
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
})
movies = movies.map(lambda x: x["movie_title"])
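If the `map` calls feel abstract, the same idea can be shown on an ordinary Python dictionary. This is only an illustration: the record below is hypothetical (the values are invented, and real rows carry more fields than shown), and `keep_features` is a helper name introduced here.

```python
def keep_features(example: dict) -> dict:
    """Keep only the two fields the retrieval model needs."""
    return {
        "movie_title": example["movie_title"],
        "user_id": example["user_id"],
    }


# Hypothetical record with invented values.
raw_example = {
    "movie_title": b"Toy Story (1995)",
    "user_id": b"42",
    "user_rating": 4.0,
    "timestamp": 879024327,
}

slim_example = keep_features(raw_example)
print(slim_example)
```

Everything except the title and the user id is dropped, which is what the lambda passed to `ratings.map` does for every element of the dataset.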
In the following steps, we’ll fully prepare the dataset by shuffling it, splitting it into training and testing sets, and batching and caching both sets. We’re also going to store the unique movie titles and user ids in NumPy arrays.
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)
train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()
movie_titles = movies.batch(1_000)
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])
unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))
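The logic of this step (shuffle once with a fixed seed, cut 80/20, collect unique values) can be sketched without TensorFlow. The data and the helper name `split` below are hypothetical, and `sorted(set(...))` is used as a plain-Python stand-in for `np.unique`.

```python
import random

# Hypothetical stand-in for the ratings: 10 (user_id, movie_title) pairs.
pairs = [(str(i % 3), f"Movie {i % 4}") for i in range(10)]


def split(records, train_fraction, seed):
    """Shuffle once with a fixed seed, then cut into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_pairs, test_pairs = split(pairs, train_fraction=0.8, seed=42)

# Sorted unique values, analogous to what np.unique returns.
unique_users = sorted({user for user, _ in pairs})
unique_titles = sorted({title for _, title in pairs})

print(len(train_pairs), len(test_pairs), unique_users, unique_titles)
```

Fixing the seed plays the same role as `seed=42` together with `reshuffle_each_iteration=False` in the code above: the split stays the same on every pass, so test examples never leak into the training set.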
Building the model for our movie recommendation system
Our model consists of 2 separate sub-models, one for user id data and one for movie title data. Each of them turns its input into an embedding vector of the same size. We’ll use a fairly simple architecture for each, but you can experiment and make it more complicated.
So, without further ado, let’s define them.
embedding_dimension = 32
user_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(
        vocabulary=unique_user_ids, mask_token=None),
    tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
])
movie_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(
        vocabulary=unique_movie_titles, mask_token=None),
    tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dimension)
])
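To see what each of these two-layer models does, and why the embedding table gets one extra row, here is a plain-Python sketch. It assumes the lookup reserves index 0 for unknown strings and numbers the vocabulary from 1; the helper names and the toy vocabulary are invented for illustration.

```python
import random


def build_lookup(vocabulary):
    """Map each known string to an index starting at 1; 0 is kept for unknowns."""
    return {token: i + 1 for i, token in enumerate(vocabulary)}


def make_embedding_table(num_rows, dim, seed=0):
    """One randomly initialised vector per index (training would adjust these)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(num_rows)]


def embed(token, lookup, table):
    """String -> integer index -> row of the embedding table."""
    return table[lookup.get(token, 0)]


vocab = ["1", "2", "42"]  # hypothetical user ids
lookup = build_lookup(vocab)
table = make_embedding_table(len(vocab) + 1, dim=4)  # +1 row for unknown ids

known_vector = embed("42", lookup, table)
unknown_vector = embed("999", lookup, table)
print(lookup, len(table), known_vector, unknown_vector)
```

A user id that never appeared in the vocabulary still gets a vector (row 0), so the model does not crash on unseen ids, although that shared vector carries no user-specific information.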
Next, we need to define a Retrieval task object, which is essentially a loss function bundled with metric computation.
metrics = tfrs.metrics.FactorizedTopK(
    candidates=movies.batch(128).map(movie_model)
)
task = tfrs.tasks.Retrieval(
    metrics=metrics
)
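For intuition about the loss part, here is a rough framework-free sketch. It assumes the task scores every user in a batch against every movie in that batch with a dot product, and treats the movie the user actually watched as the correct class in a softmax cross-entropy. That is my understanding of the default behaviour rather than something this post verifies, and the function name and toy vectors are invented.

```python
import math


def dot(a, b):
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_softmax_loss(user_vecs, movie_vecs):
    """Sum over rows i of -log softmax(scores_i)[i], where scores_i[j] = u_i . m_j."""
    total = 0.0
    for i, u in enumerate(user_vecs):
        scores = [dot(u, m) for m in movie_vecs]
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[i]
    return total


# Hypothetical 2-dimensional embeddings for a batch of two (user, movie) pairs.
toy_users = [[2.0, 0.0], [0.0, 2.0]]
aligned_movies = [[2.0, 0.0], [0.0, 2.0]]   # each user matches their own movie
swapped_movies = [[0.0, 2.0], [2.0, 0.0]]   # each user matches the other movie

good = in_batch_softmax_loss(toy_users, aligned_movies)
bad = in_batch_softmax_loss(toy_users, swapped_movies)
print(good, bad)
```

The loss is small when each user embedding lines up with the embedding of the movie they watched, and large when it lines up with someone else's movie, which is the signal that pulls matching pairs together during training.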
Using all these components, we can create a class for the full model for our movie recommendation system.
class MovielensModel(tfrs.Model):

    def __init__(self, user_model, movie_model):
        super().__init__()
        self.movie_model: tf.keras.Model = movie_model
        self.user_model: tf.keras.Model = user_model
        # Reuses the `task` object defined above.
        self.task: tf.keras.layers.Layer = task

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # We pick out the user features and pass them into the user model.
        user_embeddings = self.user_model(features["user_id"])
        # And pick out the movie features and pass them into the movie model,
        # getting embeddings back.
        positive_movie_embeddings = self.movie_model(features["movie_title"])
        # The task computes the loss and the metrics.
        return self.task(user_embeddings, positive_movie_embeddings)
Training our movie recommendation system
The following part is really simple, since we’ve already prepared everything. All we need to do is create the model and call the functions for compiling, fitting, and evaluating it.
model = MovielensModel(user_model, movie_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))
model.fit(cached_train, epochs=5)
model.evaluate(cached_test, return_dict=True)
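The evaluation reports top-K accuracy values, that is, how often the movie a user actually watched shows up among the K highest-scoring candidates. A minimal sketch of that metric, using invented scores and a helper name introduced only for this example:

```python
def top_k_accuracy(score_rows, true_indices, k):
    """Fraction of rows whose true candidate is among the k highest scores."""
    hits = 0
    for scores, true_idx in zip(score_rows, true_indices):
        best = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        hits += true_idx in best
    return hits / len(score_rows)


# Hypothetical scores: 3 test interactions scored against 4 candidate movies.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # true movie 0 is ranked 1st
    [0.2, 0.4, 0.8, 0.1],  # true movie 1 is ranked 2nd
    [0.5, 0.6, 0.7, 0.0],  # true movie 3 is ranked last
]
toy_truth = [0, 1, 3]

acc_at_1 = top_k_accuracy(toy_scores, toy_truth, k=1)
acc_at_2 = top_k_accuracy(toy_scores, toy_truth, k=2)
print(acc_at_1, acc_at_2)
```

With many candidate movies, low values for small K are expected, so it is usually more informative to compare the larger-K numbers between training and test data.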
Predicting recommended movies for a specific user id
Finally, for the last step of this tutorial, we’ll predict 3 movies to recommend to a specific user.
# Create a brute-force index that takes in raw query features (user ids)
# and embeds them with the trained user model.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
# Index every movie as a (title, embedding) pair, so that the index
# recommends movies out of the entire movies dataset.
index.index_from_dataset(
    tf.data.Dataset.zip((movies.batch(100), movies.batch(100).map(model.movie_model)))
)
# Get recommendations.
_, titles = index(tf.constant(["42"]))
print(f"Recommendations for user 42: {titles[0, :3]}")
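Conceptually, a brute-force index scores the query against every single candidate and keeps the best ones. Here is a small sketch of that idea in plain Python; the function name, titles, and vectors are hypothetical, and it returns scores and titles in the same order as the `_, titles = index(...)` call above.

```python
import heapq


def brute_force_top_k(query_vec, titles, candidate_vecs, k):
    """Score every candidate against the query and return the k best."""
    scored = [
        (sum(q * c for q, c in zip(query_vec, vec)), title)
        for title, vec in zip(titles, candidate_vecs)
    ]
    best = heapq.nlargest(k, scored)
    return [score for score, _ in best], [title for _, title in best]


# Hypothetical movie embeddings and a hypothetical user embedding.
toy_titles = ["A", "B", "C", "D"]
toy_vecs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-1.0, 0.0]]

top_scores, top_titles = brute_force_top_k([1.0, 0.2], toy_titles, toy_vecs, k=3)
print(top_titles)
```

Scanning every candidate is perfectly fine for a catalogue of this size; for much larger catalogues, an approximate nearest-neighbour index is the usual replacement.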
Conclusion
To conclude, we made a simple movie recommendation system using TensorFlow. However, this is only Part 1, covering only the retrieval process. As a follow-up to this tutorial, you should also check out Part 2, where we further fine-tune the selection process.
I hope this tutorial helped you gain a better understanding of how recommendation systems work.