[1/3] Recommendation: Vanilla pipeline for Recommender Systems

DataScienceUB
11 min read · Jun 21, 2022


by Paula Gómez Duran, PhD student at the University of Barcelona

Nowadays, there is more multimedia content available than ever before, so it can be difficult to determine what might actually interest a given user. Recommender systems (RS) are a type of artificial intelligence that helps users find and select items (e.g., books, movies, restaurants, products) from the huge number available on the web or in other electronic information sources. RS are currently used to predict users' preferences in order to recommend items they could be interested in.

There are mainly three approaches that can be followed in order to build a recommender system:

  1. Statistics. The most basic approach to building a recommender system is to use statistics, such as correlation or any other measure of similarity among users and items. The idea consists in grouping users who, for example, watched the same films, and then assuming that those users share the same tastes. In that way, other films watched by any user in the group can be recommended to all the remaining members. Here you have an example of how to build a RS using statistics and pandas.
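As a tiny illustration of this statistical approach, the sketch below uses pandas to find the user whose tastes correlate most with a target user; the ratings and film names are made up for the example:

```python
import pandas as pd

# Toy ratings table: rows are users, columns are films (names are illustrative).
ratings = pd.DataFrame(
    {"Film A": [5, 4, 1], "Film B": [4, 5, 2], "Film C": [1, 2, 5]},
    index=["user_0", "user_1", "user_2"],
)

# Pearson correlation between users measures taste similarity:
# transpose so users become columns, then correlate column-wise.
similarity = ratings.T.corr()

# user_0 and user_1 rated the same films similarly, so they correlate highly,
# while user_2 has opposite tastes.
most_similar = similarity["user_0"].drop("user_0").idxmax()
print(most_similar)  # user_1
```

Films watched by `user_1` but not yet by `user_0` would then become candidate recommendations.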
  2. Matrix Factorization. Decomposing a matrix A into two lower-rank matrices (U, I) is a well-known technique used by several methods such as SVD, clustering or QZ decomposition. It reveals whether a single matrix encodes two different transformations, thus allowing us to find data patterns in the two lower-dimensionality matrices U and I. By applying the decomposition, we can estimate the taste score of a given user U₁₇ for a chosen item I₉ (supposing that U₁₇ did not previously consume I₉ but we want to decide whether to recommend it or not).
  3. AI models. Machine Learning and Deep Learning models open up the opportunity of treating any problem as a classification task. The main benefit of doing so is that, once a model is trained to solve a specific task (how to recommend items), it is then capable of generalizing to other kinds of situations.
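To make point 2 concrete, here is a minimal sketch of matrix factorization via truncated SVD on a toy interaction matrix; the matrix values and the choice of k = 2 latent factors are illustrative only:

```python
import numpy as np

# Toy implicit interaction matrix A (rows = users, columns = items).
A = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])

# Truncated SVD with k latent factors yields a user matrix U and an item matrix I.
k = 2
u, s, vt = np.linalg.svd(A, full_matrices=False)
U = u[:, :k] * s[:k]   # user representations in the latent space
I = vt[:k, :]          # item representations in the latent space

# The predicted taste score of user 0 for item 2 (never consumed) is the
# dot product of their latent vectors.
score = U[0] @ I[:, 2]
print(round(float(score), 3))
```

The product U @ I is the best rank-k reconstruction of A, and its entries at unobserved positions act as predicted preference scores.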

In this post, we focus on the third technique. Using a basic AI model, we intend to show how a Deep Learning pipeline needs to be built and how we can solve a recommendation problem from a classification perspective.

Deep learning RS can be divided in many different ways, but the most common division is by whether they use collaborative filtering (CF), content-based filtering (CBF), or a hybrid combination. In this post we aim to provide an overview of how to train a vanilla RS, focusing more on defining a useful pipeline than on deeply exploring which model is best. For that reason we will stick to CF systems, as they are proven to be effective and at the same time require less data pre-processing (i.e., collecting side information).

If you want more information about CF vs CBF models you can review this post.

Hence, we define four sections that are worth following in order to understand how a RS should be built and evaluated:

  1. Dataset: Training data and Test data
  2. Model: What do we expect from a RS?
  3. Inference: How to rank?
  4. Evaluation: Metrics

Building a Recommender System pipeline

1. Dataset: Training data and Test data

The first step required to build a recommender system is data collection. Once the data has been collected and pre-processed (cleaned up and normalized if required) it needs to be split into data that we will use for the training stage (training data) and data that we will use to evaluate the model (test data).

In order to exemplify how to build a vanilla pipeline for recommendation, we choose the most popular dataset in this research area: the MovieLens dataset.

The ML-100K dataset consists of 100,000 ratings from 943 users on 1,682 movies. There are other MovieLens datasets which are much bigger, but we chose this one to keep the problem manageable in terms of computational resources.

PRE-PROCESSING

Just below we show a sample of the data we will be working with. Note that the ratings in the original dataset consisted of values from 1 to 5, which would be useful for an explicit feedback problem; here we have changed all ratings to 1 in order to work in an implicit feedback scenario.

In case you are not familiar with the difference between explicit and implicit feedback, we provide a brief explanation just below. If you already know the difference, you can skip the next bullet points.

  • Explicit feedback refers to information collected by the system by explicitly asking users to rate how relevant some items are (whether through a number, a comment or other input). Even though this leads to more precise information about users' tastes, it is not usually promoted, since it requires direct participation from the user and can negatively affect the user experience.
  • Implicit feedback refers to information gathered without requiring any user participation. The system automatically tracks users' preferences by monitoring their actions, such as which items they visited, where they clicked, which items they purchased, or how long they stayed on a web page.
Sample from the ML-100K data.
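The conversion to implicit feedback described above can be sketched in a few lines of pandas; the column names and values here are assumptions for illustration, not the exact ML-100K schema:

```python
import pandas as pd

# Toy slice of a ratings file (columns assumed: user, item, rating, timestamp).
df = pd.DataFrame({
    "user": [0, 0, 1],
    "item": [10, 25, 10],
    "rating": [4, 2, 5],
    "timestamp": [874965758, 874965982, 875071561],
})

# Implicit feedback: every observed interaction becomes a positive (1),
# regardless of the original 1-5 rating.
df["rating"] = 1
print(df)
```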

SPLIT

A random 80/20 train-test split is often used when training and evaluating deep learning models. However, a random split would not be fair in RS, as we could end up using a user's recent interactions for training and past ones for testing. This would introduce data leakage with a look-ahead bias, so the performance of the trained model would not generalize to real-world performance. Hence, the strategy that best mimics a realistic scenario is the time-leave-one-out methodology (TLOO), which consists in holding out the most recent transaction of each user for testing (and, if desired, one more for validation) and treating the remaining data as the training set.

Note that the evaluation performed in this post is entirely offline evaluation. Online evaluation, when available, could lead to very different results, although it could of course be conducted by following the same strategy.

The code below splits the dataset interactions into train and test sets using the time-leave-one-out methodology.
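A minimal sketch of such a TLOO split, assuming the interactions live in a pandas DataFrame with user, item and timestamp columns (the notebook's exact implementation may differ):

```python
import pandas as pd

def tloo_split(df):
    """Time-leave-one-out: hold out each user's most recent interaction for testing."""
    df = df.sort_values("timestamp")
    # The last interaction per user (by timestamp) goes to the test set.
    test = df.groupby("user").tail(1)
    train = df.drop(test.index)
    return train[["user", "item"]], test[["user", "item"]]

# Toy interactions for two users.
interactions = pd.DataFrame({
    "user": [0, 0, 0, 1, 1],
    "item": [5, 7, 9, 5, 8],
    "timestamp": [100, 200, 300, 150, 250],
})
train_x, test_x = tloo_split(interactions)
print(test_x)  # the most recent item per user: 9 for user 0, 8 for user 1
```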

Just before splitting the data, we have also re-indexed the user and item IDs so that each entity has a unique identifier. Hence, users have IDs in the range [0, 942] while items have IDs in the range [943, 2624]. Once split, we obtain the following shape for each set:

  • Train shape: (99057, 2)
  • Test shape: (943, 2)

Notice that the 943 samples in the test set correspond to a single sample per user. Thus, adding them to the training samples (99,057) gives the 100k total samples of the original dataset.

NEGATIVE SAMPLING

However, in order to train a classifier we need to feed the model not only positive data but also negative data. Since the dataset only records observed interactions, we need to perform negative sampling to generate the negative samples we are missing. To do so, we first build the adjacency matrix, as it is a very helpful way to organize interactions:
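A possible sketch of this helper (the exact implementation in the notebook may differ) builds a symmetric binary matrix over all re-indexed user and item IDs:

```python
import numpy as np

def build_adj_mat(train_x, dims):
    """Binary user-item adjacency matrix built from the training interactions only.
    `dims` is the total number of entities (users + items) after re-indexing."""
    adj = np.zeros((dims, dims), dtype=np.float32)
    for user, item in train_x:
        adj[user, item] = 1.0
        adj[item, user] = 1.0  # symmetric: the interaction links both entities
    return adj

# Toy example: 2 users (IDs 0-1) and 2 items (IDs 2-3), so dims = 4.
train_x = [(0, 2), (1, 3)]
adj = build_adj_mat(train_x, dims=4)
print(adj[0, 2], adj[0, 3])  # 1.0 0.0
```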

After the adjacency matrix function is created (it must be built from the training data only), we are ready to generate negative samples. As is conventionally done in many RS papers, we draw 4 negative samples for each positive one.
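A sketch of what such a sampling function might look like; unlike the call shown below, this version also receives the number of users explicitly so it knows the item ID range, and the details may differ from the notebook's implementation:

```python
import numpy as np

def ng_sample(train_x, dims, n_users, num_ng=4):
    """For every positive <user, item> pair, draw `num_ng` items the user has
    not interacted with and label them 0. Item IDs live in [n_users, dims)."""
    rating_mat = np.zeros((dims, dims), dtype=np.float32)
    for u, i in train_x:
        rating_mat[u, i] = rating_mat[i, u] = 1.0

    rng = np.random.default_rng(0)  # fixed seed for reproducibility
    full = []
    for u, i in train_x:
        full.append([u, i, 1])                    # the observed positive
        for _ in range(num_ng):
            j = rng.integers(n_users, dims)
            while rating_mat[u, j] == 1:          # resample items already consumed
                j = rng.integers(n_users, dims)
            full.append([u, j, 0])                # an unobserved negative
    return np.array(full), rating_mat

# Toy run: 2 users (IDs 0-1), 4 items (IDs 2-5).
train, rating_mat = ng_sample([(0, 2), (1, 5)], dims=6, n_users=2)
print(train.shape)  # (10, 3): each positive row plus 4 negative rows
```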

So, after calling the ng_sample() function with train, rating_mat = ng_sample(train_x, dims), our complete training set looks like:

where we can observe the user ID in the first column (here we just show samples for user 0), the item IDs in the second column, and the targets/labels in the third column (which indicate whether the <user, item> interaction did occur (1) or not (0)).

Note that for each positive sample (label = 1) we have sampled 4 negative samples (label = 0).

To wrap up the training data and allow it to be fed to the model in batches, we can write a PyTorch Dataset class that returns the samples and targets and can be passed to a PyTorch DataLoader, which batches the data and prepares it for the training stage.
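A minimal version of such a Dataset class (the class name and toy data are illustrative) could be:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class InteractionsDataset(Dataset):
    """Wraps the labelled <user, item, label> array produced by negative sampling."""
    def __init__(self, data):
        self.samples = torch.as_tensor(data[:, :2], dtype=torch.long)    # <user, item>
        self.targets = torch.as_tensor(data[:, 2], dtype=torch.float32)  # 1 or 0

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx], self.targets[idx]

# Toy labelled data: user 0 consumed item 2 (label 1) but not item 3 (label 0).
data = np.array([[0, 2, 1], [0, 3, 0]])
loader = DataLoader(InteractionsDataset(data), batch_size=2, shuffle=True)
batch_x, batch_y = next(iter(loader))
print(batch_x.shape, batch_y.shape)  # torch.Size([2, 2]) torch.Size([2])
```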

Nowadays there are many different types of deep learning models within the collaborative filtering area, each with its own strengths and weaknesses depending on the type of problem you are trying to solve. Choosing an appropriate algorithm is essential for getting good recommendations. Nevertheless, this topic is not going to be covered in this post, as we intend to first explain the basic pipeline of how a RS should be built.

2. Model: What do we expect from a RS?

We just mentioned that the aim of this post is not to dig into different RS models to choose the best algorithm. However, we would like to provide at least an insight into what we should expect from a RS and what properties it should have, specifically in terms of its general architecture.

Most RS use algorithms that analyse data and make suggestions based on users' interests. In fact, they seek to predict the preferences of a given user in order to recommend products that are potentially interesting. But how do they do it?

What we actually need is a way to represent user preferences as well as item descriptions. Embeddings are a good way to represent entities using a continuous high-dimensional vector. With proper embeddings characterizing users and items respectively, we can infer (e.g., by applying the dot product between the user and item embeddings) a score that represents how interested a user is in a given item. The score is high if there is a potential preference between the user and the item, and low if the user is not interested in it.
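A toy numeric example of this scoring idea, with hand-picked 3-dimensional embeddings (real embeddings are learned, not chosen by hand):

```python
import numpy as np

# Illustrative embeddings of dimension 3.
user_emb = np.array([0.9, 0.1, -0.3])   # a user who likes action films
item_A   = np.array([1.0, 0.0, -0.2])   # an action film: aligned with the user
item_B   = np.array([-0.8, 0.2, 0.5])   # a romance film: opposed to the user

# The dot product gives a preference score: aligned vectors score high.
score_A = user_emb @ item_A   # 0.96
score_B = user_emb @ item_B   # -0.85
print(score_A > score_B)  # True
```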

Image representing an insight of how do RS work.

Model-based collaborative approaches rely only on user-item interaction information and assume a latent model that explains these interactions. For example, Matrix Factorization (MF) algorithms consist in decomposing the huge and sparse user-item interaction matrix into a product of two smaller, dense matrices: a user-factor matrix (containing user representations) and a factor-item matrix (containing item representations). The main assumption behind MF is that there exists a fairly low-dimensional latent space of features in which we can represent both users and items, and that the interaction between a user and an item can be obtained by computing the dot product of the corresponding dense vectors in that space.

Factorization Machine (FM) models are at the cutting edge of Machine Learning techniques for RS, as they have proven to be an extremely powerful tool with enough expressive capacity to generalize the MF method.

Factorization Machines equation from the original paper.

Thinking about how deep learning models work, we can analogously say that the embedding matrices (representing users and items) start with random representations and, during each training epoch, become more and more accurate at representing users and items respectively. This is done using a cost function that compares the model's output score with ones or zeros, depending on whether the interaction <user, item> did occur or not.

Image representing RS pipeline from Intro to RS blog.

In the colab notebook we explain in more detail how to program the FM model following the equation term by term.
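For reference, a PyTorch sketch of a second-order FM restricted to <user, item> index pairs (not necessarily the notebook's exact implementation): with exactly one active feature per field, the pairwise term of Rendle's equation reduces to a dot product of the two factor vectors.

```python
import torch
import torch.nn as nn

class FactorizationMachine(nn.Module):
    """Second-order FM over <user, item> index pairs."""
    def __init__(self, dims, k=8):
        super().__init__()
        self.w0 = nn.Parameter(torch.zeros(1))   # global bias
        self.w = nn.Embedding(dims, 1)           # first-order (linear) weights
        self.v = nn.Embedding(dims, k)           # latent factor vectors
        nn.init.normal_(self.v.weight, std=0.01)

    def forward(self, x):                        # x: (batch, 2) <user, item> IDs
        linear = self.w0 + self.w(x).sum(dim=1).squeeze(-1)
        factors = self.v(x)                      # (batch, 2, k)
        pairwise = (factors[:, 0] * factors[:, 1]).sum(dim=-1)
        return torch.sigmoid(linear + pairwise)  # probability of interaction

model = FactorizationMachine(dims=2625)          # 943 users + 1682 items
scores = model(torch.tensor([[0, 943], [1, 950]]))
print(scores.shape)  # torch.Size([2])
```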

3. Inference: How to rank?

Once we have split the data, created the full training set (with positive and negative interactions) and defined the model architecture, it is time to think about how we must evaluate in recommendation. Just as with the splitting strategy, where we aimed to simulate a real scenario, we do the same for the inference stage. Thus, when we think about what the desired output of a RS should be, we immediately come up with a ranking.

Credit: Ranking growing on Pixabay

RS are focused on providing a ranking that reflects the preferences of a user, where the most preferred item occupies the first position and the least liked item occupies the last. Even so, note that we will probably only show the first 5 or 10 positions of this ranking, as the main goal of a RS is to help users quickly find the items or products that best suit their tastes. This means that the least liked items will never show up in the results we display to users, giving relevance only to the scores of the most liked items/products.

Hence, we will proceed by computing, for each user, the score for each item in the database (excluding the items the user has already consumed). Recalling what we previously explained, our test set contains a single interaction per user, this interaction being the ground-truth (GT) sample that holds the next item to be consumed. Thus, when we produce the ranking for a given user, we expect this GT sample to (hopefully) land in the first ranking position. So, taking advantage of the previously built adjacency matrix, we collect all the items the user did not interact with and append them to that user's test set (letting the GT sample, the only positive sample in the test set, always be in the first position).

Therefore, for each user we no longer have just one sample in test_x (the GT sample held out originally), but a set of different items over which the ranking should be computed (where we know that the GT sample is stored in the first position).
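A sketch of how the per-user candidate list could be assembled; the helper name and toy IDs are illustrative:

```python
import numpy as np

def build_user_test(user, gt_item, rating_mat, n_users, dims):
    """Test candidates for one user: the ground-truth item first, followed by
    every item the user never interacted with during training."""
    unseen = [i for i in range(n_users, dims)
              if rating_mat[user, i] == 0 and i != gt_item]
    return np.array([gt_item] + unseen)

# Toy setup: 1 user (ID 0), 4 items (IDs 1-4); the user trained on item 1,
# and item 3 is the held-out ground truth.
dims, n_users = 5, 1
rating_mat = np.zeros((dims, dims))
rating_mat[0, 1] = rating_mat[1, 0] = 1
candidates = build_user_test(0, gt_item=3, rating_mat=rating_mat,
                             n_users=n_users, dims=dims)
print(candidates)  # [3 2 4]: GT first, then the unseen items
```

The model then scores every candidate, and the position of the GT item after sorting by score is what the metrics below evaluate.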

Note that the whole procedure is available in the colab notebook.

4. Evaluation: Metrics

Last but not least, we need to define the metrics that we will use to evaluate the RS. In this work we have chosen two of the most widely used metrics for these types of models: Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG). The first metric counts the number of users for whom we hit the GT sample in the top@K (K stands for 5, 10, 15… depending on how many items we display as the ranking), and the second measures how well the ranking is sorted (higher if we hit in the top positions of the ranking and lower if we hit in the tail positions).

Example of how the test set of a user would look, with the respective score per interaction.

You can find a more detailed explanation of the metrics in this post.
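With a single relevant item per user, both metrics reduce to very short functions; the sketch below assumes rank_pos is the zero-based position of the GT item in the score-sorted ranking:

```python
import numpy as np

def hit_ratio(rank_pos, k):
    """HR@K: 1 if the ground-truth item appears in the top-K, else 0."""
    return int(rank_pos < k)

def ndcg(rank_pos, k):
    """NDCG@K for a single relevant item: rewards hits near the top."""
    return 1.0 / np.log2(rank_pos + 2) if rank_pos < k else 0.0

# A GT item ranked first scores the maximum on both metrics;
# a hit further down still counts for HR but yields a lower NDCG.
print(hit_ratio(0, 10), round(ndcg(0, 10), 3))  # 1 1.0
print(hit_ratio(4, 10), round(ndcg(4, 10), 3))  # 1 0.387
```

Averaging these per-user values over all 943 test users gives the final HR@K and NDCG@K of the system.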

SUMMARY

To sum up, we provide a brief description of each of the sections developed above.

  1. Dataset: we explained how to generate negative data for implicit feedback scenarios, as well as which split procedure should be followed to mimic a realistic use case.
  2. Model: we discussed an essential ingredient of recommendation, embeddings that represent user/item profiles, and gave an insight into the MF and FM models.
  3. Inference: we illustrated how the ranking should be produced in order to deliver useful recommendations to users.
  4. Evaluation: we described the metrics used to assess the performance of a recommender system.

Now we know how to train a recommender system. In the next posts we will dig into other kinds of models, such as Graph Convolutional Networks, and into related issues such as exposure bias or popularity bias, among others.
