Collaborative Filtering Deep Dive

Collaborative filtering is a technique used by recommender systems. We will take a look at a movie recommendation model built on the MovieLens dataset.

A First Look at the Data

from fastai.collab import *
from fastai.tabular.all import *

path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])
ratings.head()
user movie rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596

Let's simulate

Below we simulate the recommendation model. Here we assume we already know the latent factors, but in reality we don't know them and need to learn them.

                        #Sci-fi, action, old
last_skywalker = np.array([0.98,0.9,-0.9])
user1 = np.array([0.9,0.8,-0.6])
(user1*last_skywalker).sum()
2.1420000000000003

A positive value means the user will probably like the movie.

casablanca = np.array([-0.99,-0.3,0.8])
(user1*casablanca).sum() 
-1.611

A negative value means the user probably won't like the movie.

movies = pd.read_csv(path/'u.item',  delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)
movies.head() 
movie title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 3 Four Rooms (1995)
3 4 Get Shorty (1995)
4 5 Copycat (1995)

Let's merge our two tables

ratings = ratings.merge(movies)
ratings.head()
user movie rating timestamp title
0 196 242 3 881250949 Kolya (1996)
1 63 242 3 875747190 Kolya (1996)
2 226 242 5 883888671 Kolya (1996)
3 154 242 3 879138235 Kolya (1996)
4 306 242 5 876503793 Kolya (1996)

Creating the DataLoaders

dls = CollabDataLoaders.from_df(ratings, user_name = 'user', item_name='title', bs=64) #must pass the correct columns
dls.show_batch()
user title rating
0 542 My Left Foot (1989) 4
1 422 Event Horizon (1997) 3
2 311 African Queen, The (1951) 4
3 595 Face/Off (1997) 4
4 617 Evil Dead II (1987) 1
5 158 Jurassic Park (1993) 5
6 836 Chasing Amy (1997) 3
7 474 Emma (1996) 3
8 466 Jackie Chan's First Strike (1996) 3
9 554 Scream (1996) 3
dls.classes #contains the user and title classes
len(dls.classes['user'])
944

Initialize Parameters

n_users  = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5 #Number of latent factors

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

Sidebar: Indexing

It turns out we can represent an index lookup as a matrix product: multiplying the matrix by a one-hot-encoded vector selects the corresponding row. See below.

one_hot_3 = one_hot(3, n_users).float()
one_hot_3[:10] 
tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])
user_factors[3] #the latent factor values at this index
tensor([-0.4586, -0.9915, -0.4052, -0.3621, -0.5908])
user_factors.t() @ one_hot_3
tensor([ 0.4286,  0.8374, -0.5413, -1.6935,  0.1618])

Indexing directly and multiplying by the one-hot-encoded vector give the same values: an index lookup is just a matrix product with a one-hot vector. (The two outputs above differ, most likely because user_factors was re-initialized between runs.)
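This equivalence is exactly what an embedding layer exploits: it stores the matrix and does the index lookup directly (while still tracking gradients), instead of ever building one-hot vectors. A minimal check, assuming the n_users and n_factors defined above (plain PyTorch, imports included for completeness):

import torch
import torch.nn.functional as F

emb = torch.nn.Embedding(n_users, n_factors)   #an embedding layer stores the lookup matrix as its weight
idx = torch.tensor([3])
one_hot_vec = F.one_hot(idx, n_users).float()  #shape (1, n_users)
lookup = emb(idx)                              #direct index lookup, shape (1, n_factors)
matmul = one_hot_vec @ emb.weight              #one-hot matrix product, same values
assert torch.allclose(lookup, matmul)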

End Sidebar

Collaborative Filtering from Scratch

Let's put what we did above into a class. This class initializes the parameters for us and defines the forward pass.

class DotProduct(Module): #extends fastai's Module class
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        
    def forward(self, x): #called automatically whenever the model is applied to a batch
        users = self.user_factors(x[:,0]) #user IDs
        movies = self.movie_factors(x[:,1]) #movie IDs
        return (users * movies).sum(dim=1) #dim=0 is the batch dimension; sum over the factors (dim=1)
x,y = dls.one_batch()
x.shape
torch.Size([64, 2])
x[:3] #user ID, movie ID
tensor([[655, 256],
        [298, 329],
        [862, 185]], device='cuda:0')
y[:3] #ratings
tensor([[5],
        [4],
        [4]], device='cuda:0', dtype=torch.int8)

Training

model = DotProduct(n_users, n_movies, 50) #our model

learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch train_loss valid_loss time
0 1.325196 1.266774 00:11
1 1.029445 1.041569 00:11
2 0.941498 0.949599 00:11
3 0.823250 0.879281 00:11
4 0.758273 0.859028 00:10

Not bad, but we can do better!

Improving the model

Since this is a regression model, we can improve it by constraining its predictions to the valid range of ratings. Using sigmoid_range with (0, 5.5) squeezes every prediction between 0 and 5.5 (we use 5.5 rather than 5 because a sigmoid never quite reaches the top of its range, and some movies really are rated 5).
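For reference, here is a minimal sketch of what sigmoid_range does (it mirrors fastai's helper; sigmoid_range_sketch is just an illustrative name):

import torch

def sigmoid_range_sketch(x, lo, hi):
    "Squash activations into the interval (lo, hi)"
    return torch.sigmoid(x) * (hi - lo) + lo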

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch train_loss valid_loss time
0 1.004179 0.974668 00:10
1 0.874107 0.902177 00:11
2 0.700155 0.859749 00:10
3 0.500587 0.870938 00:10
4 0.378323 0.876955 00:10

It didn't really improve, but that's OK.

Further improving the model

We should also add bias terms: some users tend to rate everything higher or lower than average, and some movies are simply better or worse than others, independent of their latent factors.

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        
        self.user_bias = Embedding(n_users, 1)
        self.movie_bias = Embedding(n_movies, 1)
        
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        
        #bias
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch train_loss valid_loss time
0 0.956606 0.929338 00:13
1 0.799096 0.850351 00:13
2 0.607060 0.845230 00:15
3 0.403307 0.869548 00:14
4 0.289079 0.877028 00:14

Loss not improving

It seems like our loss is not improving despite these changes. But take another look: the validation loss is best in the earlier epochs (2 or 3) and then gets worse, which means the model is overfitting. So how can we train for more epochs without overfitting? This is where weight regularization comes in.

Weight Decay

Weight decay, or L2 regularization, is a regularization technique that adds the sum of the squared weights to the loss. This penalizes large weights, which keeps the learned function smoother and prevents the model from overfitting during training.
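A minimal illustration of the idea (the tensors and wd value below are made up purely for demonstration):

import torch

params = torch.randn(3, requires_grad=True)
wd = 0.1
base_loss = (params * torch.tensor([1.0, 2.0, 3.0])).sum()  #stand-in for the real loss
loss_with_wd = base_loss + wd * (params**2).sum()           #weight decay term added to the loss
loss_with_wd.backward()
#params.grad now contains an extra 2*wd*params contribution, nudging each weight toward zero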

x = np.linspace(-2,2,100)
a_s = [1,2,5,10,50] 
ys = [a * x**2 for a in a_s]
_,ax = plt.subplots(figsize=(8,6))
for a,y in zip(a_s,ys): ax.plot(x,y, label=f'a={a}')
ax.set_ylim([0,5])
ax.legend();

The graphic above plots y = a*x**2 for increasing values of a: the larger the coefficient, the sharper the parabola. In the same way, larger weights let the model fit sharper, more complex functions, which is why penalizing them helps generalization.

Train with weight decay

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1) #Pass wd
epoch train_loss valid_loss time
0 0.959248 0.956658 00:10
1 0.870590 0.876975 00:09
2 0.738598 0.837762 00:09
3 0.593487 0.822684 00:10
4 0.483328 0.823074 00:09

Nice, our validation loss dropped to 0.82! Also notice that the training loss is higher than before; this is because weight decay is preventing the model from overfitting.

Using FastAI ToolKit

We can achieve the same results using the fastai toolkit. Notice we switched from Learner to collab_learner.

learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.931780 0.960767 00:10
1 0.881490 0.876014 00:09
2 0.762739 0.831577 00:09
3 0.590236 0.823882 00:09
4 0.493440 0.824700 00:09
movie_bias = learn.model.i_bias.weight.squeeze() #collab_learner's model stores the item (movie) bias as i_bias
idxs = movie_bias.argsort(descending=True)[:5] #the five highest-bias movies
[dls.classes['title'][i] for i in idxs]

Sidebar: Creating Our Own Embedding Module

So far we have been using fastai's predefined Embedding class, so why don't we create our own?

class T(Module):
    def __init__(self): self.a = torch.ones(3) #a plain tensor is not registered as a trainable parameter

L(T().parameters()) #parameters() is inherited from the Module class
(#0) []
class T(Module):
    def __init__(self): self.a = nn.Parameter(torch.ones(3)) #Must wrap it with nn.Parameter()

L(T().parameters())
(#1) [Parameter containing:
tensor([1., 1., 1.], requires_grad=True)]
class T(Module):
    def __init__(self): self.a = nn.Linear(1, 3, bias=False)

t = T()
L(t.parameters())
(#1) [Parameter containing:
tensor([[-0.4927],
        [ 0.4325],
        [ 0.5283]], requires_grad=True)]
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
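A quick check that this behaves like the trainable parameters above (the shape here is just for illustration):

p = create_params([3, 2])
p.shape, p.requires_grad #-> (torch.Size([3, 2]), True)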

This is all we need to create our own embedding

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.960358 0.956795 00:10
1 0.869042 0.874685 00:10
2 0.737840 0.839419 00:10
3 0.589841 0.823726 00:10
4 0.472334 0.824282 00:10

Notice the similar performance.

End Sidebar

Looking inside the model

We can take a look inside our model by calling learn.model.

movie_bias = learn.model.movie_bias.squeeze() #grab the movie bias parameters

idxs = movie_bias.argsort()[:5] #Sort by least bias
[dls.classes['title'][i] for i in idxs]
['Children of the Corn: The Gathering (1996)',
 'Robocop 3 (1993)',
 'Lawnmower Man 2: Beyond Cyberspace (1996)',
 'Amityville 3-D (1983)',
 'Mortal Kombat: Annihilation (1997)']

These are the movies with the lowest bias: even when matched to users' tastes through the latent factors, they tend to be rated poorly.

idxs = movie_bias.argsort(descending=True)[:5] #Sort by most bias
[dls.classes['title'][i] for i in idxs]
['Titanic (1997)',
 'L.A. Confidential (1997)',
 'Silence of the Lambs, The (1991)',
 'Shawshank Redemption, The (1994)',
 'Star Wars (1977)']

These are the movies with the highest bias: they tend to be rated highly regardless of a user's latent factors.

g = ratings.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
movie_w = learn.model.movie_factors[top_idxs].cpu().detach()
movie_pca = movie_w.pca(3)
fac0,fac1,fac2 = movie_pca.t()
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]
plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
    plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()

Notice that similar movies have been clustered together.

Embedding Distance

We can also use the distance between embeddings (here, cosine similarity) to find similar movies.

movie_factors = learn.model.movie_factors
idx = dls.classes['title'].o2i['Forrest Gump (1994)']
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[1]
dls.classes['title'][idx]
'Affair to Remember, An (1957)'
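For reference, cosine similarity is just the dot product of two vectors divided by the product of their norms. A quick check with made-up vectors (the values below are illustrative, not from the model):

import torch
import torch.nn as nn

a = torch.tensor([0.98, 0.9, -0.9])
b = torch.tensor([0.9, 0.8, -0.6])
cos = (a * b).sum() / (a.norm() * b.norm())  #dot product scaled by the vector lengths
assert torch.isclose(cos, nn.CosineSimilarity(dim=0)(a, b))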

Sidebar: A Neural Network Approach to Collaborative Filtering

Another approach is to use a neural network: instead of taking the dot product of the user and movie embeddings, we concatenate them and pass the result through linear layers.

embs = get_emb_sz(dls)
embs
[(944, 74), (1665, 102)]
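get_emb_sz picks an embedding size for each categorical variable based on its cardinality. A sketch of the rule of thumb fastai uses, as I understand it (emb_sz_rule_sketch is my own name for it):

def emb_sz_rule_sketch(n_cat):
    "Heuristic embedding size for a categorical variable with n_cat levels"
    return min(600, round(1.6 * n_cat**0.56))

emb_sz_rule_sketch(944), emb_sz_rule_sketch(1665) #-> (74, 102), matching the output above
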
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        
        self.layers = nn.Sequential( 
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range
        
    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)
model = CollabNN(*embs)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)
epoch train_loss valid_loss time
0 0.919071 0.944800 00:11
1 0.920309 0.907606 00:10
2 0.844579 0.880101 00:10
3 0.810155 0.865898 00:10
4 0.746803 0.869486 00:10

Using FastAI ToolKit

We can achieve the same results with the fastai toolkit. Just pass use_nn=True to collab_learner.

learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100,50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.966049 0.983382 00:13
1 0.891749 0.926404 00:13
2 0.863655 0.885933 00:13
3 0.825138 0.864230 00:13
4 0.740462 0.860209 00:13

Notice the similar results.

type(learn.model)
fastai.collab.EmbeddingNN

EmbeddingNN is just a thin wrapper around fastai's TabularModel:
@delegates(TabularModel)
class EmbeddingNN(TabularModel):
    def __init__(self, emb_szs, layers, **kwargs):
        super().__init__(emb_szs, layers=layers, n_cont=0, out_sz=1, **kwargs)

End Sidebar

Conclusion

Overall, I hope you learned how to create a recommendation model. We covered some very important concepts along the way, such as latent factors and weight decay.

Questionnaire

  1. What problem does collaborative filtering solve?
    It predicts which items a user is likely to enjoy, based on the ratings of many other users with similar tastes.
  2. How does it solve it?
    It learns latent factors for each user and item via gradient descent; users and items with similar factors end up grouped together.
  3. Why might a collaborative filtering predictive model fail to be a very useful recommendation system?
    If there is not enough data about users and items, it cannot provide useful recommendations.
  4. What does a crosstab representation of collaborative filtering data look like?
    A crosstab has users along one axis and items along the other, with each cell filled in with that user's rating of that item (most cells are empty).
  5. Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!).
    See the crosstab sketch after this questionnaire.

  6. What is a latent factor? Why is it "latent"?
    Latent factors are the underlying factors the model uses to make a prediction. They are "latent" because they are learned from the data, NOT given to the model.

  7. What is a dot product? Calculate a dot product manually using pure Python with lists.
    The dot product of two vectors is the sum of the products of their corresponding elements.

    a = [1,2,3]
    b = [1,2,3]

    sum(i[0]*i[1] for i in zip(a,b)) #-> 14
    
  8. What does pandas.DataFrame.merge do?
    It merges two DataFrames together on their common columns.
  9. What is an embedding matrix?
    It is the matrix of latent factors, with one row per category level; multiplying a one-hot-encoded vector by it is equivalent to indexing into it.
  10. What is the relationship between an embedding and a matrix of one-hot-encoded vectors?
    Indexing into an embedding gives the same result as multiplying the embedding matrix by a one-hot-encoded vector, but without ever building the one-hot vectors, so it is much more computationally efficient.
  11. Why do we need Embedding if we could use one-hot-encoded vectors for the same thing?
    Indexing directly is faster and uses far less memory than creating and multiplying one-hot-encoded vectors.
  12. What does an embedding contain before we start training (assuming we're not using a pretrained model)?
    Randomly initialized values.
  13. Create a class (without peeking, if possible!) and use it.

    class Name:
        def __init__(self):
            pass

        def func_name(self):
            pass
    
  14. What does x[:,0] return?
    The user IDs (the first column of every row in the minibatch).
  15. Rewrite the DotProduct class (without peeking, if possible!) and train a model with it.

    class DotProduct(Module):
        def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
            self.user_factors = Embedding(n_users, n_factors)
            self.movie_factors = Embedding(n_movies, n_factors)
            self.y_range = y_range

        def forward(self, x):
            users = self.user_factors(x[:,0])
            movies = self.movie_factors(x[:,1])
            return sigmoid_range((users * movies).sum(dim=1), *self.y_range)
    
  16. What is a good loss function to use for MovieLens? Why?
    Mean squared error (MSE): ratings are continuous values, so it makes sense to measure how far each prediction is from the true rating.
  17. What would happen if we used cross-entropy loss with MovieLens? How would we need to change the model?
    The model would need to output one activation per possible rating; only then could we pass its output to cross-entropy loss.
  18. What is the use of bias in a dot product model?
    Some users rate everything higher or lower than average, and some movies are generally better or worse; biases let the model capture this independently of the latent factors.
  19. What is another name for weight decay?
    L2 regularization
  20. Write the equation for weight decay (without peeking!).
    loss_with_wd = loss + wd * (parameters**2).sum()
  21. Write the equation for the gradient of weight decay. Why does it help reduce weights?
    It adds 2 * wd * parameters to the gradients, so every optimizer step shrinks the weights toward zero, keeping any single weight from growing too large.
  22. Why does reducing weights lead to better generalization?
    Smaller weights produce a smoother function with less sharp surfaces, so the model fits the noise in the training data less and generalizes better.
  23. What does argsort do in PyTorch?
    It returns the indices that would sort the tensor.
  24. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why/why not?
    No. The bias reflects how much a movie is liked beyond what its latent factors explain, so it is not the same as simply averaging the ratings.
  25. How do you print the names and details of the layers in a model?
    learn.model
  26. What is the "bootstrapping problem" in collaborative filtering?
    The model cannot make any recommendations without enough data
  27. How could you deal with the bootstrapping problem for new users? For new movies?
    Have new users complete a questionnaire about their tastes to build an initial profile; for new movies, use metadata or an average embedding until real ratings arrive.
  28. How can feedback loops impact collaborative filtering systems?
    A small group of very active users can dominate the recommendations, which then attract more users like them, amplifying the bias over time.
  29. When using a neural network in collaborative filtering, why can we have different numbers of factors for movies and users?
    Because we are not taking a dot product; we concatenate the embeddings instead, so the user and movie embeddings do not need the same number of factors.
  30. Why is there an nn.Sequential in the CollabNN model?
    It stacks the layers (linear, ReLU, linear) so they are called in sequence, which is what gives the model its nonlinearity.
  31. What kind of model should we use if we want to add metadata about users and items, or information such as date and time, to a collaborative filtering model?
    Tabular model
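
Crosstab sketch for question 5: one way to do it, assuming the merged ratings DataFrame from earlier (most cells will be NaN, since most users have not rated most movies):

import pandas as pd

crosstab = pd.crosstab(index=ratings['user'], columns=ratings['title'],
                       values=ratings['rating'], aggfunc='mean')
crosstab.head()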

Further Research

  1. Take a look at all the differences between the Embedding version of DotProductBias and the create_params version, and try to understand why each of those changes is required. If you're not sure, try reverting each change to see what happens. (NB: even the type of brackets used in forward has changed!)

  2. Find three other areas where collaborative filtering is being used, and find out what the pros and cons of this approach are in those areas.

  3. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book's website and the fast.ai forum for ideas. Note that there are more columns in the full dataset—see if you can use those too (the next chapter might give you ideas).

  4. Create a model for MovieLens that works with cross-entropy loss, and compare it to the model in this chapter.
    Completed; see here: https://usama280.github.io/PasteBlogs/