A Language Model from Scratch

We worked with NLP models in the previous lecture and saw the many benefits and capabilities of such models. Now let's try to create our very own model from scratch!

The Data

from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()
(#2) [Path('valid.txt'),Path('train.txt')]
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines
(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

This is what our data looks like right now

text = ' . '.join([l.strip() for l in lines]) #Reformatting
text[:100]
'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'
tokens = text.split(' ') #Now let's tokenize it
tokens[:10]
['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']
vocab = L(*tokens).unique() #let's create our vocab
vocab
(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

This will be our vocab. Let's also numericalize it.

word2idx = {w:i for i,w in enumerate(vocab)} #Dictionary of word:id

nums = L(word2idx[i] for i in tokens) #Numericalization
tokens[:10], nums
(['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.'],
 (#63095) [0,1,2,1,3,1,4,1,5,1...])

Dataloader

L((tokens[i:i+3], tokens[i+3]) for i in range(0,len(tokens)-4,3))
(#21031) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]

Sequences are created so that each input consists of three tokens and the 4th token is the label.

seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3)) #Let's do the above, but this time
seqs                                                                       #using the numericalized form
(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]
bs = 64
cut = int(len(seqs) * 0.8) #80% training set, 20% valid set

dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False) 
dls.one_batch()[0][:2]
tensor([[0, 1, 2],
        [1, 3, 1]])

Our Language Model in PyTorch

Below is our language model. As you see, we have created three layers:

  • The embedding layer (i_h, for input to hidden)
  • The linear layer to create the activations for the next word (h_h, for hidden to hidden)
  • A final linear layer to predict the fourth word (h_o, for hidden to output)
class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)#input
        self.h_h = nn.Linear(n_hidden, n_hidden)#hidden     
        self.h_o = nn.Linear(n_hidden,vocab_sz)#output
        
    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:,0]))) #word 1
        h = h + self.i_h(x[:,1]) #word 2
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2]) #word 3
        h = F.relu(self.h_h(h))
        return self.h_o(h) #pred (word 4)

Train

learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
epoch train_loss valid_loss accuracy time
0 1.794209 2.036811 0.466128 00:03
1 1.384254 1.801755 0.473734 00:03
2 1.404778 1.655324 0.494176 00:04
3 1.369884 1.709227 0.423104 00:03

Awesome, we created our first NLP model!

n,counts = 0,torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n
(tensor(29), 'thousand', 0.15165200855716662)

Our accuracy would have been about 0.15 had we used a naive model that always predicts the most common token ('thousand').

Refining our model - Recurrent Neural Network

Let's refactor the model above.

class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        
    def forward(self, x):
        h = 0
        for i in range(3): #let's use a for loop to apply the layers
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)

Notice that here h is reset to 0 for every batch.

learn = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
epoch train_loss valid_loss accuracy time
0 1.876980 2.084122 0.410744 00:03
1 1.407598 1.821299 0.467316 00:03
2 1.410389 1.680269 0.490373 00:03
3 1.372142 1.709884 0.415498 00:03

Roughly the same results, as expected. However, what we have created this time is actually an RNN!

Improving the RNN

Maintaining the State of an RNN

The first way we can improve our model is by remembering the state of h across batches. Recall that before, we reset h to 0 for every batch.

class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0
        
    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        
        self.h = self.h.detach() #detach throws away the stored gradient history - however, the activation values are kept
        return out
    
    #At the start of each epoch, we should reset our h
    def reset(self): self.h = 0
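
The detach call is what keeps backpropagation from reaching back through every batch we have ever seen. A minimal sketch of its effect on gradient flow (not part of the original notebook):

a = torch.tensor(2.0, requires_grad=True)
b = a * 3
(b + b.detach() * 4).backward()
a.grad #tensor(3.) - the detached branch contributes no gradient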
m = len(seqs)//bs
m,bs,len(seqs)
(328, 64, 21031)

Minibatches

Recall from Lecture 10 how we created the minibatches, where the nth row of each minibatch continues from the nth row of the previous minibatch. The function below does exactly that for us.

def group_chunks(ds, bs):
    m = len(ds) // bs
    new_ds = L()
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds
group_chunks(seqs, bs)[1:4]
(#3) [(tensor([3, 1, 2]), 28),(tensor([28, 24,  2]), 1),(tensor([ 1,  6, 28]), 25)]
seqs[m,m*2,m*3]
(#3) [(tensor([3, 1, 2]), 28),(tensor([28, 24,  2]), 1),(tensor([ 1,  6, 28]), 25)]

Let's recreate our DataLoaders using our improved minibatch format.

cut = int(len(seqs) * 0.8)

dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs), 
    group_chunks(seqs[cut:], bs), 
    bs=bs, drop_last=True, shuffle=False)
learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter) #This will call our reset function
learn.fit_one_cycle(10, 3e-3)
epoch train_loss valid_loss accuracy time
0 1.706451 1.823746 0.443510 00:04
1 1.282585 1.720615 0.455048 00:03
2 1.100162 1.534231 0.531250 00:03
3 1.031388 1.547766 0.532933 00:03
4 0.971291 1.532978 0.558654 00:03
5 0.929672 1.446295 0.571154 00:03
6 0.883135 1.520370 0.588221 00:03
7 0.824741 1.607137 0.599038 00:03
8 0.789257 1.675977 0.594952 00:03
9 0.776834 1.629597 0.596875 00:03

Creating More Signal

Rather than predicting only every 4th word, why don't we predict the next word after every single word? That gives the model much more signal to learn from.

sl = 16
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
         for i in range(0,len(nums)-sl-1,sl))

cut = int(len(seqs) * 0.8)

#dataloader
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs),
                             group_chunks(seqs[cut:], bs),
                             bs=bs, drop_last=True, shuffle=False)
seqs[0]
(tensor([0, 1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1]),
 tensor([1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1, 9]))
[vocab[s] for s in seqs[0]]
[(#16) ['one','.','two','.','three','.','four','.','five','.'...],
 (#16) ['.','two','.','three','.','four','.','five','.','six'...]]
class LMModel4(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0
        
    def forward(self, x):
        outs = [] #list of output
        
        for i in range(sl):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
            
            outs.append(self.h_o(self.h)) #append
            
        self.h = self.h.detach()
        return torch.stack(outs, dim=1) #stack of outputs
    
    def reset(self): self.h = 0
def loss_func(inp, targ):
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))
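
LMModel4 returns one prediction per input token, so its output has shape (bs, sl, vocab_sz) while the targets have shape (bs, sl); F.cross_entropy expects (N, C) and (N,), hence the flattening above. A quick shape check with random tensors (a sketch, not from the notebook):

out = torch.randn(bs, sl, len(vocab))         #(64, 16, 30): one prediction per token
targ = torch.randint(0, len(vocab), (bs, sl)) #(64, 16): one target id per token
loss_func(out, targ)                          #flattens to (1024, 30) and (1024,) internally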
learn = Learner(dls, LMModel4(len(vocab), 64), loss_func=loss_func,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
epoch train_loss valid_loss accuracy time
0 3.208388 2.989674 0.220052 00:01
1 2.304600 1.926858 0.457845 00:01
2 1.737188 1.785296 0.450684 00:01
3 1.462938 1.722541 0.492106 00:01
4 1.269122 1.607646 0.568197 00:01
5 1.122454 1.725385 0.579508 00:01
6 0.989286 1.876261 0.620443 00:01
7 0.877782 2.080590 0.626383 00:01
8 0.779877 2.068581 0.646729 00:01
9 0.702537 2.105229 0.655518 00:01
10 0.648459 2.225554 0.670654 00:01
11 0.602616 2.259415 0.672607 00:01
12 0.571914 2.272124 0.676270 00:01
13 0.552240 2.258376 0.678874 00:01
14 0.540891 2.214495 0.678630 00:01

Better than before!

Multilayer RNNs

Let's create a deeper RNN by stacking multiple RNN layers. What's different about this model is that each layer has its own weight matrix. Let's change our model and see how it performs.

The Model

class LMModel5(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = torch.zeros(n_layers, bs, n_hidden)
        #n_layers: how many RNN layers to stack
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True) #This does what our previous model did:
                                                                          #looping over the sequence for us
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h) #Notice that we can do the loop by calling our RNN
        self.h = h.detach() 
        return self.h_o(res)
    
    def reset(self): self.h.zero_()
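
As a quick sanity check of what nn.RNN gives us (a sketch with dummy tensors, using the same sizes as above): with batch_first=True it takes the embedded batch of shape (bs, sl, n_hidden) plus an initial hidden state of shape (n_layers, bs, n_hidden), and returns the per-timestep outputs along with the final hidden state.

rnn = nn.RNN(64, 64, 2, batch_first=True)
emb = torch.randn(bs, sl, 64) #embedded batch
h0 = torch.zeros(2, bs, 64)   #initial hidden state
res, h = rnn(emb, h0)
res.shape, h.shape            #(torch.Size([64, 16, 64]), torch.Size([2, 64, 64]))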
learn = Learner(dls, LMModel5(len(vocab), 64, 2), 
                loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
epoch train_loss valid_loss accuracy time
0 3.014767 2.582862 0.420003 00:02
1 2.149063 1.779345 0.471354 00:02
2 1.704159 1.854296 0.351156 00:02
3 1.472523 1.680113 0.467692 00:02
4 1.299899 1.845994 0.488200 00:02
5 1.145692 2.308071 0.487874 00:02
6 1.022578 2.543387 0.480794 00:01
7 0.923336 2.659213 0.493815 00:02
8 0.822356 2.721887 0.509277 00:02
9 0.733957 2.826130 0.524740 00:02
10 0.663029 2.933543 0.532878 00:02
11 0.612702 2.961933 0.537842 00:02
12 0.577110 3.006170 0.538493 00:02
13 0.555790 3.018762 0.536133 00:02
14 0.544564 3.017484 0.538574 00:02

Our model did worse. Does that mean multilayer RNNs are bad? No, what most likely happened here is that our gradients have either exploded or vanished.
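
A quick numerical illustration of why this happens (a sketch, not from the notebook): repeatedly multiplying by a number even moderately above or below 1 quickly blows up or shrinks toward zero, and a deep unrolled network repeats matrix multiplications in exactly this way.

x = torch.ones(1)
for _ in range(100): x = x * 1.5
x # ~4e17: exploded
x = torch.ones(1)
for _ in range(100): x = x * 0.5
x # ~8e-31: vanished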

LSTM

We can address the issue of exploding or vanishing gradients by using another type of architecture, the LSTM (long short-term memory).

Building an LSTM from Scratch

class LSTMCell(Module):
    def __init__(self, ni, nh):
        self.forget_gate = nn.Linear(ni + nh, nh)
        self.input_gate  = nn.Linear(ni + nh, nh)
        self.cell_gate   = nn.Linear(ni + nh, nh)
        self.output_gate = nn.Linear(ni + nh, nh)

    def forward(self, input, state):
        h,c = state
        h = torch.cat([h, input], dim=1)
        forget = torch.sigmoid(self.forget_gate(h))
        c = c * forget
        inp = torch.sigmoid(self.input_gate(h))
        cell = torch.tanh(self.cell_gate(h))
        c = c + inp * cell
        out = torch.sigmoid(self.output_gate(h))
        h = out * torch.tanh(c)
        return h, (h,c)

We can refactor the above code

class LSTMCell(Module):
    def __init__(self, ni, nh):
        self.ih = nn.Linear(ni,4*nh)
        self.hh = nn.Linear(nh,4*nh)

    def forward(self, input, state):
        h,c = state
        # One big multiplication for all the gates is better than 4 smaller ones
        gates = (self.ih(input) + self.hh(h)).chunk(4, 1)
        ingate,forgetgate,outgate = map(torch.sigmoid, gates[:3])
        cellgate = gates[3].tanh()

        c = (forgetgate*c) + (ingate*cellgate)
        h = outgate * c.tanh()
        return h, (h,c)
t = torch.arange(0,10); t
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
t.chunk(2)
(tensor([0, 1, 2, 3, 4]), tensor([5, 6, 7, 8, 9]))
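
A quick way to check that the refactored cell behaves like a single LSTM step (a sketch with dummy tensors; the sizes are just for illustration):

cell = LSTMCell(64, 64)
x = torch.randn(bs, 64)                           #one timestep of input
h0, c0 = torch.zeros(bs, 64), torch.zeros(bs, 64) #initial hidden and cell state
out, (h1, c1) = cell(x, (h0, c0))
out.shape, h1.shape, c1.shape                     #all torch.Size([64, 64])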

Training a Language Model Using LSTMs

class LMModel6(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)] #two state tensors, because an LSTM keeps both a hidden state and a cell state
        
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True) #Replace our RNN with an LSTM
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(res)
    
    def reset(self): 
        for h in self.h: h.zero_()
learn = Learner(dls, LMModel6(len(vocab), 64, 2), 
                loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 1e-2)
epoch train_loss valid_loss accuracy time
0 3.012262 2.700189 0.295003 00:02
1 2.168225 1.905885 0.366048 00:03
2 1.620007 1.739222 0.475830 00:03
3 1.364854 2.005928 0.523926 00:02
4 1.133465 2.469636 0.541504 00:03
5 0.916891 2.274207 0.563883 00:03
6 0.715401 2.295597 0.635661 00:03
7 0.535663 2.418355 0.629964 00:03
8 0.387026 2.185029 0.680094 00:03
9 0.284487 2.342169 0.701497 00:03
10 0.212791 2.192696 0.718994 00:03
11 0.153409 2.317826 0.720540 00:03
12 0.115390 2.283189 0.730957 00:03
13 0.092992 2.291156 0.729574 00:03
14 0.082400 2.273516 0.731771 00:03

Our model is doing much better now

Regularizing an LSTM

We will be using some regularization techniques to improve our model, in particular dropout. Dropout randomly zeroes some activations for each minibatch: this forces the model to become more robust, since it must still produce the correct prediction with fewer activations available.

Dropout

class Dropout(Module):
    def __init__(self, p): 
        self.p = p #probability that an activation gets zeroed
        
    def forward(self, x):
        if not self.training:  #NO DROPOUT DURING TESTING (Only occurs during training)
            return x
        
        mask = x.new(*x.shape).bernoulli_(1-self.p) #1's and 0's where 1-p is the prob that we get a 1
        return x * mask.div_(1-self.p)
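
A quick demo of the behaviour (a sketch): in training mode roughly p of the activations get zeroed and the survivors are scaled up by 1/(1-p); in evaluation mode the input passes through unchanged.

dp = Dropout(0.5)
x = torch.ones(2, 8)
dp.train()
dp(x) #roughly half the entries are 0., the rest are 2.
dp.eval()
dp(x) #unchanged: all 1.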

Activation Regularization and Temporal Activation Regularization

Activation regularization (AR) and temporal activation regularization (TAR) are two regularization methods very similar to weight decay, which we have discussed before.

For activation regularization, it's the final activations produced by the LSTM that we will try to make as small as possible, instead of the weights.

loss += alpha * activations.pow(2).mean()

Since the tokens of a sentence are read in order, we expect the LSTM's activations to change only gradually from one time step to the next. Temporal activation regularization encourages this by adding a penalty to the loss that makes the difference between two consecutive activations as small as possible:

loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()
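
A minimal sketch of how these two penalties could be folded into the loss by hand, assuming the model returns its predictions together with the raw and dropped-out LSTM activations (as LMModel7 does below); in practice fastai's RNNRegularizer callback does this for us, and the function name, signature, and alpha/beta values here are just hypothetical examples.

alpha, beta = 2., 1.
def ar_tar_loss(preds, raw, out, targ):
    loss = CrossEntropyLossFlat()(preds, targ)
    loss += alpha * out.pow(2).mean()                      #AR: penalize large (dropped-out) activations
    loss += beta * (raw[:,1:] - raw[:,:-1]).pow(2).mean()  #TAR: penalize jumps between consecutive timesteps
    return loss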

Training a Weight-Tied Regularized LSTM

class LMModel7(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers, p):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        
        self.drop = nn.Dropout(p) #Dropout
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        
        self.h_o.weight = self.i_h.weight #Hidden-to-output weights are set identical to input-to-hidden
        
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]
        
    def forward(self, x):
        raw,h = self.rnn(self.i_h(x), self.h)
        out = self.drop(raw)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(out),raw,out
    
    def reset(self): 
        for h in self.h: h.zero_()

Notice that the hidden-to-output and input-to-hidden layers are tied to the same parameters (weight tying).
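
Because the assignment shares the actual Parameter object, the two layers stay identical throughout training. A quick check (sketch):

m = LMModel7(len(vocab), 64, 2, 0.5)
m.h_o.weight is m.i_h.weight #True - one weight matrix used in both places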

learn = Learner(dls, LMModel7(len(vocab), 64, 2, 0.5),
                loss_func=CrossEntropyLossFlat(), metrics=accuracy,
                cbs=[ModelResetter, RNNRegularizer(alpha=2, beta=1)]) #Although we didn't create our regularizer, we can still
                                                                      #pass it via cbs. 
learn = TextLearner(dls, LMModel7(len(vocab), 64, 2, 0.4),
                    loss_func=CrossEntropyLossFlat(), metrics=accuracy)

Using TextLearner will automatically add ModelResetter and RNNRegularizer(alpha=2, beta=1) for us.

learn.fit_one_cycle(15, 1e-2, wd=0.1)
epoch train_loss valid_loss accuracy time
0 2.797331 2.223156 0.435465 00:03
1 1.981550 1.755357 0.458740 00:03
2 1.272993 0.754981 0.765381 00:03
3 0.728812 0.600991 0.828125 00:03
4 0.439376 0.546993 0.836995 00:03
5 0.298545 0.453179 0.866781 00:03
6 0.224571 0.446198 0.865641 00:03
7 0.184994 0.472140 0.862793 00:03
8 0.159867 0.493649 0.847900 00:03
9 0.143191 0.476974 0.852458 00:03
10 0.131329 0.475887 0.851318 00:03
11 0.122343 0.522000 0.833333 00:02
12 0.115645 0.531508 0.827881 00:03
13 0.111532 0.502286 0.835856 00:04
14 0.108859 0.507470 0.835286 00:03

Conclusion

Overall, we were not only able to create an NLP model from scratch but also to refine it using LSTMs and dropout.

Questionnaire

  1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?
    Create a simple dataset that allows for quick and easy prototyping.
  2. Why do we concatenate the documents in our dataset before creating a language model?
    This allows us to easily split up data into batches.
  3. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make to our model?
    Use the same weight matrix for the three layers.
    Use the first word's embedding as activations to pass to the linear layer, add the second word's embedding to the first layer's output activations, and continue likewise for the rest of the words.
  4. How can we share a weight matrix across multiple layers in PyTorch?
    Define one layer in the PyTorch model class and use it multiple times in the forward pass.
  5. Write a module that predicts the third word given the previous two words of a sentence, without peeking.

    class LMModel1(Module):
     def __init__(self, vocab_sz, n_hidden):
         self.i_h = nn.Embedding(vocab_sz, n_hidden)  
         self.h_h = nn.Linear(n_hidden, n_hidden)     
         self.h_o = nn.Linear(n_hidden,vocab_sz)
    
     def forward(self, x):
         h = 0
         for i in range(2): #only two input words this time
             h = h + self.i_h(x[:,i])
             h = F.relu(self.h_h(h))
         return self.h_o(h)
    
  6. What is a recurrent neural network?
    A refactoring of a multi-layer neural network as a loop.
  7. What is "hidden state"?
    The hidden state is the set of activations that is updated after each step of the RNN.
  8. What is the equivalent of hidden state in LMModel1?
    h
  9. To maintain the state in an RNN, why is it important to pass the text to the model in order?
    Because the state is maintained across batches; the carried-over state only makes sense if the text arrives in order.
  10. What is an "unrolled" representation of an RNN?
    A representation without loops.
  11. Why can maintaining the hidden state in an RNN lead to memory and performance problems? How do we fix this problem?
    Backpropagation would have to compute gradients through all the past calls, which is slow and memory-hungry. This can be avoided using detach().
  12. What is "BPTT"?
    Backpropagation through time: treating the unrolled RNN (one layer per time step) as one big model and backpropagating through it. In practice we truncate it, calculating gradients only over the current sequence and detaching the earlier history (detach()).
  13. Write code to print out the first few batches of the validation set, including converting the token IDs back into English strings, as we showed for batches of IMDb data in <>.
    x,y = dls.valid.one_batch()
    [vocab[s] for s in x[:3]]
  14. What does the ModelResetter callback do? Why do we need it?
    It calls our reset method, which resets our hidden state before every epoch.
  15. What are the downsides of predicting just one output word for each three input words?
    There is a lot of extra information for training the model that is not being used.
  16. Why do we need a custom loss function for LMModel4?
    We have a stacked output, which we need to flatten as CrossEntropyLoss expects flattened tensors.
  17. Why is the training of LMModel4 unstable?
    Because this network is effectively very deep (one layer per token), which leads gradients to explode or vanish.
  18. In the unrolled representation, we can see that a recurrent neural network actually has many layers. So why do we need to stack RNNs to get better results?
    Because in the unrolled representation only one weight matrix is really being reused at every step, which limits what the model can learn. Stacking RNNs adds more weight matrices between input and output.
  19. Draw a representation of a stacked (multilayer) RNN.
  20. Why should we get better results in an RNN if we call detach less often? Why might this not happen in practice with a simple RNN?
  21. Why can a deep network result in very large or very small activations? Why does this matter?
    Numbers that are slightly large or small can lead to the explosion or disappearance of the number after repeated multiplications. In deep networks, we have repeated matrix multiplications, so this is a big problem.
  22. In a computer's floating-point representation of numbers, which numbers are the most precise?
    Small numbers (Not too close to 0 however)
  23. Why do vanishing gradients prevent training?
    No gradients mean no change in weights
  24. Why does it help to have two hidden states in the LSTM architecture? What is the purpose of each one?
    One state remembers what happened earlier in the sentence, and the other predicts the next token.
  25. What are these two states called in an LSTM?
    Cell state (long short-term memory)
    Hidden state (prediction)
  26. What is tanh, and how is it related to sigmoid?
    A sigmoid function rescaled to the range of -1 to 1
  27. What is the purpose of this code in LSTMCell: h = torch.cat([h, input], dim=1)
    Joins the hidden state and the new input.
  28. What does chunk do in PyTorch?
    It splits a tensor into the given number of equal-sized pieces.
  29. Study the refactored version of LSTMCell carefully to ensure you understand how and why it does the same thing as the non-refactored version.
  30. Why can we use a higher learning rate for LMModel6?
    Because now that we are using an LSTM, we have a partial solution to exploding/vanishing gradients.
  31. What are the three regularization techniques used in an AWD-LSTM model?
    Dropout
    Activation regularization
    Temporal activation regularization
  32. What is "dropout"?
    Randomly zeroing out activations (effectively removing neurons) during training.
  33. Why do we scale the weights with dropout? Is this applied during training, inference, or both?
    Zeroing activations changes the expected scale of the summed activations, so a division by (1-p) is applied to correct it. We apply this only during training, though equivalently the rescaling can be done at inference time instead.
  34. What is the purpose of this line from Dropout: if not self.training: return x
    Prevents the usage of dropout during testing.
  35. Experiment with bernoulli_ to understand how it works.
  36. How do you set your model in training mode in PyTorch? In evaluation mode?
    Module.train(), Module.eval()
  37. Write the equation for activation regularization (in math or code, as you prefer). How is it different from weight decay?
    loss += alpha * activations.pow(2).mean()
    
    It's different because here we are penalizing large activations rather than large weights.
  38. Write the equation for temporal activation regularization (in math or code, as you prefer). Why wouldn't we use this for computer vision problems?
    loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()
    
    This encourages the activations of consecutive tokens to be similar. We wouldn't use it for computer vision because the inputs there have no such sequential structure between consecutive activations.
  39. What is "weight tying" in a language model?
    Where weights of hidden-to-output layer is the same as input-to-hidden.

Further Research

  1. In LMModel2, why can forward start with h=0? Why don't we need to say h=torch.zeros(...)?
  2. Write the code for an LSTM from scratch (you may refer to <>).
  3. Search the internet for the GRU architecture and implement it from scratch, and try training a model. See if you can get results similar to those we saw in this chapter. Compare your results to the results of PyTorch's built-in GRU module.
  4. Take a look at the source code for AWD-LSTM in fastai, and try to map each of the lines of code to the concepts shown in this chapter.