Random Forest with Embeddings
So far we have built both a random forest and a neural network for tabular modeling. One interesting property of the neural network is that it learns embeddings for the categorical variables. Why not feed those embeddings into the random forest as features? Will it improve the forest? Let's find out!
from fastai.tabular.all import * #also brings in pd, np, Path, torch, tensor, etc.
from sklearn.ensemble import RandomForestRegressor
import math, zipfile

z = zipfile.ZipFile('bluebook-for-bulldozers.zip') #unzip the Kaggle download first
z.extractall() #extract
df_nn = pd.read_csv(Path()/'TrainAndValid.csv', low_memory=False) #load the data
#Set the ordinal variable's levels in a meaningful order
sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'
df_nn['ProductSize'] = df_nn['ProductSize'].astype('category')
df_nn['ProductSize'] = df_nn['ProductSize'].cat.set_categories(sizes, ordered=True) #inplace=True is deprecated in recent pandas
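We can quickly confirm the ordering took effect:
df_nn['ProductSize'].cat.categories #returns the levels in the order we specified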
dep_var = 'SalePrice'
df_nn[dep_var] = np.log(df_nn[dep_var]) #Kaggle scores this competition on RMSLE, so modeling log(price) lets us optimize plain RMSE
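A quick aside to see why the log helps: the competition's RMSLE is just RMSE computed on log prices (up to the +1 sometimes added inside the log, negligible at these price scales). A toy check with made-up numbers:
p, t = np.array([10_000., 20_000.]), np.array([12_000., 18_000.]) #hypothetical predictions and targets
np.sqrt(((np.log(p) - np.log(t))**2).mean()) #RMSE in log space ~= RMSLE on the raw prices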
df_nn = add_datepart(df_nn, 'saledate') #feature engineering: expand the sale date into many columns
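add_datepart drops the raw saledate column and replaces it with engineered columns (year, month, day of week, elapsed time, and so on). A quick way to see everything it created:
' '.join(c for c in df_nn.columns if c.startswith('sale')) #all the new sale* columns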
cont_nn,cat_nn = cont_cat_split(df_nn, max_card=9000, dep_var=dep_var) #any column with more than 9000 levels is treated as continuous
cont_nn
Notice that saleElapsed is missing from cont_nn. We need to add it, since we want this column treated as continuous (and remove it from the categorical list).
cont_nn.append('saleElapsed')
cont_nn
cat_nn.remove('saleElapsed')
df_nn.dtypes['saleElapsed'] #it's stored as object, which would cause an error later, so we cast it to int
df_nn['saleElapsed'] = df_nn['saleElapsed'].astype('int')
cond = (df_nn.saleYear<2011) | (df_nn.saleMonth<10)
train_idx = np.where(cond)[0] #train on sales before October 2011
valid_idx = np.where(~cond)[0] #validate on the most recent sales, mimicking Kaggle's future test set
splits = (list(train_idx),list(valid_idx))
procs_nn = [Categorify, FillMissing, Normalize]
to_nn = TabularPandas(df_nn, procs_nn, cat_nn, cont_nn, splits=splits, y_names=dep_var)
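As a reminder of what the procs do: Categorify maps each categorical level to an integer code, FillMissing fills missing continuous values with the median (adding an _na indicator column), and Normalize standardizes the continuous columns. For example, we can peek at the vocabulary Categorify learned for ProductSize via the classes attribute:
to_nn.classes['ProductSize'] #category-to-code mapping, with #na# at position 0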
dls = to_nn.dataloaders(1024) #mini-batches of 1024 rows
y = to_nn.train.y
y.min(),y.max() #check the target's range so we can pick y_range for the sigmoid-scaled output
learn = tabular_learner(dls, y_range=(8,12), layers=[500,250],
                        n_out=1, loss_func=F.mse_loss)
learn.lr_find() #find best lr
learn.fit_one_cycle(5, 1e-2) #train
def r_mse(pred,y): return round(math.sqrt(((pred-y)**2).mean()), 6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)
preds,targs = learn.get_preds()
r_mse(preds,targs)
This is a very good result. Now let's train a plain random forest on the same data for comparison.
procs = [Categorify, FillMissing] #no Normalize: tree splits don't care about the scale of the inputs
rf_to = TabularPandas(df_nn, procs, cat_nn, cont_nn, y_names=dep_var, splits=splits)
xs,y = rf_to.train.xs,rf_to.train.y
valid_xs,valid_y = rf_to.valid.xs,rf_to.valid.y
def rf(xs, y, n_estimators=40, max_samples=200_000, max_features=0.5, min_samples_leaf=5, **kwargs):
    #40 trees, each grown on at most 200k bootstrap rows, considering half the features at each split
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
                                 max_samples=max_samples, max_features=max_features,
                                 min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)
m = rf(xs, y) #Fitting
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
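As an aside, since we fit with oob_score=True we can also score the forest on its out-of-bag predictions, an almost-free validation estimate computed from the training data alone:
r_mse(m.oob_prediction_, y) #each row is predicted only by trees that never saw it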
So our random forest performed worse than the NN. Let's try to improve it by adding the NN embeddings!
learn.model.embeds[:5] #the first few embedding layers inside the NN, one per categorical variable
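To make concrete what an embedding layer is, here is a standalone toy example (the sizes are made up, not taken from our model): it's just a learned lookup table from a category code to a vector.
toy_emb = nn.Embedding(num_embeddings=7, embedding_dim=4) #7 levels -> 4-dim vectors
toy_emb(tensor([0, 3])) #rows 0 and 3 of the learned lookup table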
The function below swaps each categorical column for the activations of its corresponding embedding layer, so the random forest sees the learned representation instead of raw category codes.
def embed_features(learner, xs):
    xs = xs.copy()
    for i, feature in enumerate(learner.dls.cat_names): #one embedding layer per categorical column
        emb = learner.model.embeds[i].cpu()
        with torch.no_grad(): #detach from autograd so the activations can go into a DataFrame
            vals = emb(tensor(xs[feature], dtype=torch.int64))
        new_feat = pd.DataFrame(vals, index=xs.index,
                                columns=[f'{feature}_{j}' for j in range(emb.embedding_dim)])
        xs.drop(columns=feature, inplace=True) #replace the raw codes with the embedding columns
        xs = xs.join(new_feat)
    return xs
embedded_xs = embed_features(learn, learn.dls.train.xs)
xs_valid = embed_features(learn, learn.dls.valid.xs)
embedded_xs.shape, xs_valid.shape
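The embedded frame is much wider, since each categorical column became embedding_dim new columns. As a sanity check, the number of continuous columns plus the summed embedding widths should equal the new column count:
n_emb = sum(e.embedding_dim for e in learn.model.embeds) #total width of all embeddings
len(learn.dls.cont_names) + n_emb #should equal embedded_xs.shape[1]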
m = rf(embedded_xs, y) #fit the same forest on the embedded features
m_rmse(m, embedded_xs, y), m_rmse(m, xs_valid, valid_y)
It seems that adding the NN embeddings improves the random forest!
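One follow-up worth exploring: the forest's feature importances are now spread across individual embedding dimensions (ProductSize_0, ProductSize_1, ...), so to interpret them per original column we can sum the pieces back together. A rough sketch, assuming the column naming used by embed_features above:
fi = pd.Series(m.feature_importances_, index=embedded_xs.columns)
orig_col = lambda c: c.rsplit('_', 1)[0] if c.rsplit('_', 1)[-1].isdigit() else c #strip the _j suffix
fi.groupby(orig_col).sum().sort_values(ascending=False).head(10) #importance per original column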