Collaborative Filtering Deep Dive

Collaborative filtering is a technique used by recommender systems. We will take a look at a movie recommendation model built on the MovieLens dataset.

A First Look at the Data

from fastai.collab import *
from fastai.tabular.all import *

path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])
ratings.head()
user movie rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596

Let's simulate

Below we simulate the recommendation model. Here we assume we already know the latent factors, but in reality we don't know them and need to learn them.

                        #Sci-fi, action, old
last_skywalker = np.array([0.98,0.9,-0.9])
user1 = np.array([0.9,0.8,-0.6])
(user1*last_skywalker).sum()
2.1420000000000003

A positive value means the user will probably like the movie.

casablanca = np.array([-0.99,-0.3,0.8])
(user1*casablanca).sum() 
-1.611

A negative value means the user probably won't like the movie.

movies = pd.read_csv(path/'u.item',  delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)
movies.head() 
movie title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 3 Four Rooms (1995)
3 4 Get Shorty (1995)
4 5 Copycat (1995)

Let's merge our two tables

ratings = ratings.merge(movies)
ratings.head()
user movie rating timestamp title
0 196 242 3 881250949 Kolya (1996)
1 63 242 3 875747190 Kolya (1996)
2 226 242 5 883888671 Kolya (1996)
3 154 242 3 879138235 Kolya (1996)
4 306 242 5 876503793 Kolya (1996)

Creating the DataLoaders

dls = CollabDataLoaders.from_df(ratings, user_name = 'user', item_name='title', bs=64) #must pass the correct columns
dls.show_batch()
user title rating
0 542 My Left Foot (1989) 4
1 422 Event Horizon (1997) 3
2 311 African Queen, The (1951) 4
3 595 Face/Off (1997) 4
4 617 Evil Dead II (1987) 1
5 158 Jurassic Park (1993) 5
6 836 Chasing Amy (1997) 3
7 474 Emma (1996) 3
8 466 Jackie Chan's First Strike (1996) 3
9 554 Scream (1996) 3
dls.classes #contains the user and title classes
len(dls.classes['user'])
944

Initialize Parameters

n_users  = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5 #Number of latent factors

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

Sidebar: Indexing

It turns out we can represent an index lookup as a matrix product: multiplying the matrix by a one-hot-encoded vector selects the corresponding row. See below.

one_hot_3 = one_hot(3, n_users).float()
one_hot_3[:10] 
tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])
user_factors[3] #the latent factor values at this index
tensor([-0.4586, -0.9915, -0.4052, -0.3621, -0.5908])
user_factors.t() @ one_hot_3
tensor([ 0.4286,  0.8374, -0.5413, -1.6935,  0.1618])

Indexing directly and multiplying by the one-hot-encoded vector give the same values: an index lookup is just a matrix product with a one-hot vector. (The two outputs above differ, most likely because user_factors was re-initialized between runs.)
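This equivalence is exactly what an embedding layer exploits: it stores the matrix and does the index lookup directly (while still tracking gradients), instead of ever building one-hot vectors. A minimal check, assuming the n_users and n_factors defined above (plain PyTorch, imports included for completeness):

import torch
import torch.nn.functional as F

emb = torch.nn.Embedding(n_users, n_factors)   #an embedding layer stores the lookup matrix as its weight
idx = torch.tensor([3])
one_hot_vec = F.one_hot(idx, n_users).float()  #shape (1, n_users)
lookup = emb(idx)                              #direct index lookup, shape (1, n_factors)
matmul = one_hot_vec @ emb.weight              #one-hot matrix product, same values
assert torch.allclose(lookup, matmul)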

End Sidebar

Collaborative Filtering from Scratch

Let's put what we did above into a class. This class initializes the parameters for us and defines the forward pass.

class DotProduct(Module): #extends fastai's Module class
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        
    def forward(self, x): #called automatically whenever the model is applied to a batch
        users = self.user_factors(x[:,0]) #user IDs
        movies = self.movie_factors(x[:,1]) #movie IDs
        return (users * movies).sum(dim=1) #dim=0 is the batch dimension; sum over the factors (dim=1)
x,y = dls.one_batch()
x.shape
torch.Size([64, 2])
x[:3] #user ID, movie ID
tensor([[655, 256],
        [298, 329],
        [862, 185]], device='cuda:0')
y[:3] #ratings
tensor([[5],
        [4],
        [4]], device='cuda:0', dtype=torch.int8)

Training

model = DotProduct(n_users, n_movies, 50) #our model

learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch train_loss valid_loss time
0 1.325196 1.266774 00:11
1 1.029445 1.041569 00:11
2 0.941498 0.949599 00:11
3 0.823250 0.879281 00:11
4 0.758273 0.859028 00:10

Not bad, but we can do better!

Improving the model

Since this is a regression model, we can improve it by constraining its predictions to the valid range of ratings. Using sigmoid_range with (0, 5.5) squeezes every prediction between 0 and 5.5 (we use 5.5 rather than 5 because a sigmoid never quite reaches the top of its range, and some movies really are rated 5).
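For reference, here is a minimal sketch of what sigmoid_range does (it mirrors fastai's helper; sigmoid_range_sketch is just an illustrative name):

import torch

def sigmoid_range_sketch(x, lo, hi):
    "Squash activations into the interval (lo, hi)"
    return torch.sigmoid(x) * (hi - lo) + lo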

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch train_loss valid_loss time
0 1.004179 0.974668 00:10
1 0.874107 0.902177 00:11
2 0.700155 0.859749 00:10
3 0.500587 0.870938 00:10
4 0.378323 0.876955 00:10

It didn't really improve, but that's OK.

Further improving the model

We should also add bias terms: some users tend to rate everything higher or lower than average, and some movies are simply better or worse than others, independent of their latent factors.

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        
        self.user_bias = Embedding(n_users, 1)
        self.movie_bias = Embedding(n_movies, 1)
        
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        
        #bias
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch train_loss valid_loss time
0 0.956606 0.929338 00:13
1 0.799096 0.850351 00:13
2 0.607060 0.845230 00:15
3 0.403307 0.869548 00:14
4 0.289079 0.877028 00:14

Loss not improving

It seems like our loss is not improving despite these changes. But take another look: the validation loss is best in the earlier epochs (2 or 3) and then gets worse, which means the model is overfitting. So how can we train for more epochs without overfitting? This is where weight regularization comes in.

Weight Decay

Weight decay, or L2 regularization, is a regularization technique that adds the sum of the squared weights to the loss. This penalizes large weights, which keeps the learned function smoother and prevents the model from overfitting during training.
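A minimal illustration of the idea (the tensors and wd value below are made up purely for demonstration):

import torch

params = torch.randn(3, requires_grad=True)
wd = 0.1
base_loss = (params * torch.tensor([1.0, 2.0, 3.0])).sum()  #stand-in for the real loss
loss_with_wd = base_loss + wd * (params**2).sum()           #weight decay term added to the loss
loss_with_wd.backward()
#params.grad now contains an extra 2*wd*params contribution, nudging each weight toward zero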

x = np.linspace(-2,2,100)
a_s = [1,2,5,10,50] 
ys = [a * x**2 for a in a_s]
_,ax = plt.subplots(figsize=(8,6))
for a,y in zip(a_s,ys): ax.plot(x,y, label=f'a={a}')
ax.set_ylim([0,5])
ax.legend();

The graphic above plots y = a*x**2 for increasing values of a: the larger the coefficient, the sharper the parabola. In the same way, larger weights let the model fit sharper, more complex functions, which is why penalizing them helps generalization.

Train with weight decay

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1) #Pass wd
epoch train_loss valid_loss time
0 0.959248 0.956658 00:10
1 0.870590 0.876975 00:09
2 0.738598 0.837762 00:09
3 0.593487 0.822684 00:10
4 0.483328 0.823074 00:09

Nice, our validation loss dropped to 0.82! Also notice that the training loss is higher than before; this is because weight decay is preventing the model from overfitting.

Using FastAI ToolKit

We can achieve the same results using the fastai toolkit. Notice we switched from Learner to collab_learner.

learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.931780 0.960767 00:10
1 0.881490 0.876014 00:09
2 0.762739 0.831577 00:09
3 0.590236 0.823882 00:09
4 0.493440 0.824700 00:09
movie_bias = learn.model.i_bias.weight.squeeze() #collab_learner's model stores the item (movie) bias as i_bias
idxs = movie_bias.argsort(descending=True)[:5] #the five highest-bias movies
[dls.classes['title'][i] for i in idxs]

Sidebar: Creating Our Own Embedding Module

So far we have been using fastai's predefined Embedding class, so why don't we create our own?

class T(Module):
    def __init__(self): self.a = torch.ones(3) #a plain tensor is not registered as a trainable parameter

L(T().parameters()) #parameters() is inherited from the Module class
(#0) []
class T(Module):
    def __init__(self): self.a = nn.Parameter(torch.ones(3)) #Must wrap it with nn.Parameter()

L(T().parameters())
(#1) [Parameter containing:
tensor([1., 1., 1.], requires_grad=True)]
class T(Module):
    def __init__(self): self.a = nn.Linear(1, 3, bias=False)

t = T()
L(t.parameters())
(#1) [Parameter containing:
tensor([[-0.4927],
        [ 0.4325],
        [ 0.5283]], requires_grad=True)]
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
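A quick check that this behaves like the trainable parameters above (the shape here is just for illustration):

p = create_params([3, 2])
p.shape, p.requires_grad #-> (torch.Size([3, 2]), True)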

This is all we need to create our own embedding

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.960358 0.956795 00:10
1 0.869042 0.874685 00:10
2 0.737840 0.839419 00:10
3 0.589841 0.823726 00:10
4 0.472334 0.824282 00:10

Notice the similar performance.

End Sidebar

Looking inside the model

We can take a look inside our model by calling learn.model.

movie_bias = learn.model.movie_bias.squeeze() #grab the movie bias parameters

idxs = movie_bias.argsort()[:5] #Sort by least bias
[dls.classes['title'][i] for i in idxs]
['Children of the Corn: The Gathering (1996)',
 'Robocop 3 (1993)',
 'Lawnmower Man 2: Beyond Cyberspace (1996)',
 'Amityville 3-D (1983)',
 'Mortal Kombat: Annihilation (1997)']

These are the movies with the lowest bias: even when matched to users' tastes through the latent factors, they tend to be rated poorly.

idxs = movie_bias.argsort(descending=True)[:5] #Sort by most bias
[dls.classes['title'][i] for i in idxs]
['Titanic (1997)',
 'L.A. Confidential (1997)',
 'Silence of the Lambs, The (1991)',
 'Shawshank Redemption, The (1994)',
 'Star Wars (1977)']

These are the movies with the highest bias: they tend to be rated highly regardless of a user's latent factors.

g = ratings.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
movie_w = learn.model.movie_factors[top_idxs].cpu().detach()
movie_pca = movie_w.pca(3)
fac0,fac1,fac2 = movie_pca.t()
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]
plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
    plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()

Notice that similar movies have been clustered together.

Embedding Distance

We can also use the distance between embeddings (here, cosine similarity) to find similar movies.

movie_factors = learn.model.movie_factors
idx = dls.classes['title'].o2i['Forrest Gump (1994)']
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[1]
dls.classes['title'][idx]
'Affair to Remember, An (1957)'
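For reference, cosine similarity is just the dot product of two vectors divided by the product of their norms. A quick check with made-up vectors (the values below are illustrative, not from the model):

import torch
import torch.nn as nn

a = torch.tensor([0.98, 0.9, -0.9])
b = torch.tensor([0.9, 0.8, -0.6])
cos = (a * b).sum() / (a.norm() * b.norm())  #dot product scaled by the vector lengths
assert torch.isclose(cos, nn.CosineSimilarity(dim=0)(a, b))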

Sidebar: A Neural Network Approach to Collaborative Filtering

Another approach is to use a neural network: instead of taking the dot product of the user and movie embeddings, we concatenate them and pass the result through linear layers.

embs = get_emb_sz(dls)
embs
[(944, 74), (1665, 102)]
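get_emb_sz picks an embedding size for each categorical variable based on its cardinality. A sketch of the rule of thumb fastai uses, as I understand it (emb_sz_rule_sketch is my own name for it):

def emb_sz_rule_sketch(n_cat):
    "Heuristic embedding size for a categorical variable with n_cat levels"
    return min(600, round(1.6 * n_cat**0.56))

emb_sz_rule_sketch(944), emb_sz_rule_sketch(1665) #-> (74, 102), matching the output above
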
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        
        self.layers = nn.Sequential( 
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range
        
    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)
model = CollabNN(*embs)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)
epoch train_loss valid_loss time
0 0.919071 0.944800 00:11
1 0.920309 0.907606 00:10
2 0.844579 0.880101 00:10
3 0.810155 0.865898 00:10
4 0.746803 0.869486 00:10

Using FastAI ToolKit

We can achieve the same results with the fastai toolkit. Just pass use_nn=True to collab_learner.

learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100,50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.966049 0.983382 00:13
1 0.891749 0.926404 00:13
2 0.863655 0.885933 00:13
3 0.825138 0.864230 00:13
4 0.740462 0.860209 00:13

Notice the similar results.

type(learn.model)
fastai.collab.EmbeddingNN

EmbeddingNN is just a thin wrapper around fastai's TabularModel:
@delegates(TabularModel)
class EmbeddingNN(TabularModel):
    def __init__(self, emb_szs, layers, **kwargs):
        super().__init__(emb_szs, layers=layers, n_cont=0, out_sz=1, **kwargs)

End Sidebar

Conclusion

Overall, I hope you learned how to create a recommendation model. We covered some very important concepts along the way, such as latent factors and weight decay.

Questionnaire

  1. What problem does collaborative filtering solve?
    It predicts which items a user is likely to enjoy, based on the ratings of many other users with similar tastes.
  2. How does it solve it?
    It learns latent factors for each user and item via gradient descent; users and items with similar factors end up grouped together.
  3. Why might a collaborative filtering predictive model fail to be a very useful recommendation system?
    If there is not enough data about users and items, it cannot provide useful recommendations.
  4. What does a crosstab representation of collaborative filtering data look like?
    A crosstab has users along one axis and items along the other, with each cell filled in with that user's rating of that item (most cells are empty).
  5. Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!).
    See the crosstab sketch after this questionnaire.

  6. What is a latent factor? Why is it "latent"?
    Latent factors are the underlying factors the model uses to make a prediction. They are "latent" because they are learned from the data, NOT given to the model.

  7. What is a dot product? Calculate a dot product manually using pure Python with lists.
    The dot product of two vectors is the sum of the products of their corresponding elements.

    a = [1,2,3]
    b = [1,2,3]

    sum(i[0]*i[1] for i in zip(a,b)) #-> 14
    
  8. What does pandas.DataFrame.merge do?
    It merges two DataFrames together on their common columns.
  9. What is an embedding matrix?
    It is the matrix of latent factors, with one row per category level; multiplying a one-hot-encoded vector by it is equivalent to indexing into it.
  10. What is the relationship between an embedding and a matrix of one-hot-encoded vectors?
    Indexing into an embedding gives the same result as multiplying the embedding matrix by a one-hot-encoded vector, but without ever building the one-hot vectors, so it is much more computationally efficient.
  11. Why do we need Embedding if we could use one-hot-encoded vectors for the same thing?
    Indexing directly is faster and uses far less memory than creating and multiplying one-hot-encoded vectors.
  12. What does an embedding contain before we start training (assuming we're not using a pretrained model)?
    Randomly initialized values.
  13. Create a class (without peeking, if possible!) and use it.

    class Name:
        def __init__(self):
            pass

        def func_name(self):
            pass
    
  14. What does x[:,0] return?
    The user IDs (the first column of every row in the minibatch).
  15. Rewrite the DotProduct class (without peeking, if possible!) and train a model with it.

    class DotProduct(Module):
        def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
            self.user_factors = Embedding(n_users, n_factors)
            self.movie_factors = Embedding(n_movies, n_factors)
            self.y_range = y_range

        def forward(self, x):
            users = self.user_factors(x[:,0])
            movies = self.movie_factors(x[:,1])
            return sigmoid_range((users * movies).sum(dim=1), *self.y_range)
    
  16. What is a good loss function to use for MovieLens? Why?
    Mean squared error (MSE): ratings are continuous values, so it makes sense to measure how far each prediction is from the true rating.
  17. What would happen if we used cross-entropy loss with MovieLens? How would we need to change the model?
    The model would need to output one activation per possible rating; only then could we pass its output to cross-entropy loss.
  18. What is the use of bias in a dot product model?
    Some users rate everything higher or lower than average, and some movies are generally better or worse; biases let the model capture this independently of the latent factors.
  19. What is another name for weight decay?
    L2 regularization
  20. Write the equation for weight decay (without peeking!).
    loss_with_wd = loss + wd * (parameters**2).sum()
  21. Write the equation for the gradient of weight decay. Why does it help reduce weights?
    It adds 2 * wd * parameters to the gradients, so every optimizer step shrinks the weights toward zero, keeping any single weight from growing too large.
  22. Why does reducing weights lead to better generalization?
    Smaller weights produce a smoother function with less sharp surfaces, so the model fits the noise in the training data less and generalizes better.
  23. What does argsort do in PyTorch?
    It returns the indices that would sort the tensor.
  24. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why/why not?
    No. The bias reflects how much a movie is liked beyond what its latent factors explain, so it is not the same as simply averaging the ratings.
  25. How do you print the names and details of the layers in a model?
    learn.model
  26. What is the "bootstrapping problem" in collaborative filtering?
    The model cannot make any recommendations without enough data
  27. How could you deal with the bootstrapping problem for new users? For new movies?
    Have new users complete a questionnaire about their tastes to build an initial profile; for new movies, use metadata or an average embedding until real ratings arrive.
  28. How can feedback loops impact collaborative filtering systems?
    A small group of very active users can dominate the recommendations, which then attract more users like them, amplifying the bias over time.
  29. When using a neural network in collaborative filtering, why can we have different numbers of factors for movies and users?
    Because we are not taking a dot product; we concatenate the embeddings instead, so the user and movie embeddings do not need the same number of factors.
  30. Why is there an nn.Sequential in the CollabNN model?
    It stacks the layers (linear, ReLU, linear) so they are called in sequence, which is what gives the model its nonlinearity.
  31. What kind of model should we use if we want to add metadata about users and items, or information such as date and time, to a collaborative filtering model?
    Tabular model
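
Crosstab sketch for question 5: one way to do it, assuming the merged ratings DataFrame from earlier (most cells will be NaN, since most users have not rated most movies):

import pandas as pd

crosstab = pd.crosstab(index=ratings['user'], columns=ratings['title'],
                       values=ratings['rating'], aggfunc='mean')
crosstab.head()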

Further Research

  1. Take a look at all the differences between the Embedding version of DotProductBias and the create_params version, and try to understand why each of those changes is required. If you're not sure, try reverting each change to see what happens. (NB: even the type of brackets used in forward has changed!)

  2. Find three other areas where collaborative filtering is being used, and find out what the pros and cons of this approach are in those areas.

  3. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book's website and the fast.ai forum for ideas. Note that there are more columns in the full dataset—see if you can use those too (the next chapter might give you ideas).

  4. Create a model for MovieLens that works with cross-entropy loss, and compare it to the model in this chapter.
    Completed; see here: https://usama280.github.io/PasteBlogs/