A Language Model from Scratch

We worked with NLP models in the previous lecture and saw the many benefits and capabilities of such models. Now let's try to create our very own model from scratch!

The Data

from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()
(#2) [Path('valid.txt'),Path('train.txt')]
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines
(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

This is what our data looks like right now

text = ' . '.join([l.strip() for l in lines]) #Reformatting
text[:100]
'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'
tokens = text.split(' ') #Now let's tokenize it
tokens[:10]
['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']
vocab = L(*tokens).unique() #let's create our vocab
vocab
(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

This will be our vocab. Let's also numericalize it.

word2idx = {w:i for i,w in enumerate(vocab)} #Dictionary of word:id

nums = L(word2idx[i] for i in tokens) #Numericalization
tokens[:10], nums
(['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.'],
 (#63095) [0,1,2,1,3,1,4,1,5,1...])

Dataloader

L((tokens[i:i+3], tokens[i+3]) for i in range(0,len(tokens)-4,3))
(#21031) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]

Sequences are created so that each input consists of three tokens and the 4th token is the label.

seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3)) #Let's do the above, but this time
seqs                                                                       #using the numericalized form
(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]
bs = 64
cut = int(len(seqs) * 0.8) #80% training set, 20% valid set

dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False) 
dls.one_batch()[0][:2]
tensor([[0, 1, 2],
        [1, 3, 1]])

Our Language Model in PyTorch

Below is our language model. As you see, we have created three layers:

  • The embedding layer (i_h, for input to hidden)
  • The linear layer to create the activations for the next word (h_h, for hidden to hidden)
  • A final linear layer to predict the fourth word (h_o, for hidden to output)
class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)#input
        self.h_h = nn.Linear(n_hidden, n_hidden)#hidden     
        self.h_o = nn.Linear(n_hidden,vocab_sz)#output
        
    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:,0]))) #word 1
        h = h + self.i_h(x[:,1]) #word 2
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2]) #word 3
        h = F.relu(self.h_h(h))
        return self.h_o(h) #pred (word 4)

Train

learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
epoch train_loss valid_loss accuracy time
0 1.794209 2.036811 0.466128 00:03
1 1.384254 1.801755 0.473734 00:03
2 1.404778 1.655324 0.494176 00:04
3 1.369884 1.709227 0.423104 00:03

Awesome, we created our first NLP model!

n,counts = 0,torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n
(tensor(29), 'thousand', 0.15165200855716662)

Our accuracy would have been about 0.15 had we used a naive model that always predicts the most common token ('thousand').

Refining our model - Recurrent Neural Network

Let's refactor the model above.

class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        
    def forward(self, x):
        h = 0
        for i in range(3): #let's use a for loop to apply the layers
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)

Notice that here h is reset to 0 for every batch.

learn = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
epoch train_loss valid_loss accuracy time
0 1.876980 2.084122 0.410744 00:03
1 1.407598 1.821299 0.467316 00:03
2 1.410389 1.680269 0.490373 00:03
3 1.372142 1.709884 0.415498 00:03

Roughly the same results, as expected. However, what we have created this time is actually an RNN!

Improving the RNN

Maintaining the State of an RNN

The first way we can improve our model is by remembering the state of h across batches. Recall that before, we reset h to 0 for every batch.

class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0
        
    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        
        self.h = self.h.detach() #detach throws away the stored gradient history - however, the activation values are kept
        return out
    
    #At the start of each epoch, we should reset our h
    def reset(self): self.h = 0
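
The detach call is what keeps backpropagation from reaching back through every batch we have ever seen. A minimal sketch of its effect on gradient flow (not part of the original notebook):

a = torch.tensor(2.0, requires_grad=True)
b = a * 3
(b + b.detach() * 4).backward()
a.grad #tensor(3.) - the detached branch contributes no gradient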
m = len(seqs)//bs
m,bs,len(seqs)
(328, 64, 21031)

Minibatches

Recall from Lecture 10 how we created the minibatches, where the nth row of each minibatch continues from the nth row of the previous minibatch. The function below does exactly that for us.

def group_chunks(ds, bs):
    m = len(ds) // bs
    new_ds = L()
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds
group_chunks(seqs, bs)[1:4]
(#3) [(tensor([3, 1, 2]), 28),(tensor([28, 24,  2]), 1),(tensor([ 1,  6, 28]), 25)]
seqs[m,m*2,m*3]
(#3) [(tensor([3, 1, 2]), 28),(tensor([28, 24,  2]), 1),(tensor([ 1,  6, 28]), 25)]

Let's recreate our DataLoaders using our improved minibatch format.

cut = int(len(seqs) * 0.8)

dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs), 
    group_chunks(seqs[cut:], bs), 
    bs=bs, drop_last=True, shuffle=False)
learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter) #This will call our reset function
learn.fit_one_cycle(10, 3e-3)
epoch train_loss valid_loss accuracy time
0 1.706451 1.823746 0.443510 00:04
1 1.282585 1.720615 0.455048 00:03
2 1.100162 1.534231 0.531250 00:03
3 1.031388 1.547766 0.532933 00:03
4 0.971291 1.532978 0.558654 00:03
5 0.929672 1.446295 0.571154 00:03
6 0.883135 1.520370 0.588221 00:03
7 0.824741 1.607137 0.599038 00:03
8 0.789257 1.675977 0.594952 00:03
9 0.776834 1.629597 0.596875 00:03

Creating More Signal

Rather than predicting only every 4th word, why don't we predict the next word after every single word? That gives the model much more signal to learn from.

sl = 16
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
         for i in range(0,len(nums)-sl-1,sl))

cut = int(len(seqs) * 0.8)

#dataloader
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs),
                             group_chunks(seqs[cut:], bs),
                             bs=bs, drop_last=True, shuffle=False)
seqs[0]
(tensor([0, 1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1]),
 tensor([1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1, 9]))
[vocab[s] for s in seqs[0]]
[(#16) ['one','.','two','.','three','.','four','.','five','.'...],
 (#16) ['.','two','.','three','.','four','.','five','.','six'...]]
class LMModel4(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0
        
    def forward(self, x):
        outs = [] #list of output
        
        for i in range(sl):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
            
            outs.append(self.h_o(self.h)) #append
            
        self.h = self.h.detach()
        return torch.stack(outs, dim=1) #stack of outputs
    
    def reset(self): self.h = 0
def loss_func(inp, targ):
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))
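
LMModel4 returns one prediction per input token, so its output has shape (bs, sl, vocab_sz) while the targets have shape (bs, sl); F.cross_entropy expects (N, C) and (N,), hence the flattening above. A quick shape check with random tensors (a sketch, not from the notebook):

out = torch.randn(bs, sl, len(vocab))         #(64, 16, 30): one prediction per token
targ = torch.randint(0, len(vocab), (bs, sl)) #(64, 16): one target id per token
loss_func(out, targ)                          #flattens to (1024, 30) and (1024,) internally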
learn = Learner(dls, LMModel4(len(vocab), 64), loss_func=loss_func,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
epoch train_loss valid_loss accuracy time
0 3.208388 2.989674 0.220052 00:01
1 2.304600 1.926858 0.457845 00:01
2 1.737188 1.785296 0.450684 00:01
3 1.462938 1.722541 0.492106 00:01
4 1.269122 1.607646 0.568197 00:01
5 1.122454 1.725385 0.579508 00:01
6 0.989286 1.876261 0.620443 00:01
7 0.877782 2.080590 0.626383 00:01
8 0.779877 2.068581 0.646729 00:01
9 0.702537 2.105229 0.655518 00:01
10 0.648459 2.225554 0.670654 00:01
11 0.602616 2.259415 0.672607 00:01
12 0.571914 2.272124 0.676270 00:01
13 0.552240 2.258376 0.678874 00:01
14 0.540891 2.214495 0.678630 00:01

Better than before!

Multilayer RNNs

Let's create a deeper RNN by stacking multiple RNN layers. What's different about this model is that each layer has its own weight matrix. Let's change our model and see how it performs.

The Model

class LMModel5(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = torch.zeros(n_layers, bs, n_hidden)
        #n_layers: how many RNN layers to stack
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True) #This does what our previous model did:
                                                                          #looping over the sequence for us
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h) #Notice that we can do the loop by calling our RNN
        self.h = h.detach() 
        return self.h_o(res)
    
    def reset(self): self.h.zero_()
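
As a quick sanity check of what nn.RNN gives us (a sketch with dummy tensors, using the same sizes as above): with batch_first=True it takes the embedded batch of shape (bs, sl, n_hidden) plus an initial hidden state of shape (n_layers, bs, n_hidden), and returns the per-timestep outputs along with the final hidden state.

rnn = nn.RNN(64, 64, 2, batch_first=True)
emb = torch.randn(bs, sl, 64) #embedded batch
h0 = torch.zeros(2, bs, 64)   #initial hidden state
res, h = rnn(emb, h0)
res.shape, h.shape            #(torch.Size([64, 16, 64]), torch.Size([2, 64, 64]))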
learn = Learner(dls, LMModel5(len(vocab), 64, 2), 
                loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
epoch train_loss valid_loss accuracy time
0 3.014767 2.582862 0.420003 00:02
1 2.149063 1.779345 0.471354 00:02
2 1.704159 1.854296 0.351156 00:02
3 1.472523 1.680113 0.467692 00:02
4 1.299899 1.845994 0.488200 00:02
5 1.145692 2.308071 0.487874 00:02
6 1.022578 2.543387 0.480794 00:01
7 0.923336 2.659213 0.493815 00:02
8 0.822356 2.721887 0.509277 00:02
9 0.733957 2.826130 0.524740 00:02
10 0.663029 2.933543 0.532878 00:02
11 0.612702 2.961933 0.537842 00:02
12 0.577110 3.006170 0.538493 00:02
13 0.555790 3.018762 0.536133 00:02
14 0.544564 3.017484 0.538574 00:02

Our model did worse. Does that mean multilayer RNNs are bad? No, what most likely happened here is that our gradients have either exploded or vanished.
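
A quick numerical illustration of why this happens (a sketch, not from the notebook): repeatedly multiplying by a number even moderately above or below 1 quickly blows up or shrinks toward zero, and a deep unrolled network repeats matrix multiplications in exactly this way.

x = torch.ones(1)
for _ in range(100): x = x * 1.5
x # ~4e17: exploded
x = torch.ones(1)
for _ in range(100): x = x * 0.5
x # ~8e-31: vanished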

LSTM

We can address the issue of exploding or vanishing gradients by using another type of architecture, the LSTM (long short-term memory).

Building an LSTM from Scratch

class LSTMCell(Module):
    def __init__(self, ni, nh):
        self.forget_gate = nn.Linear(ni + nh, nh)
        self.input_gate  = nn.Linear(ni + nh, nh)
        self.cell_gate   = nn.Linear(ni + nh, nh)
        self.output_gate = nn.Linear(ni + nh, nh)

    def forward(self, input, state):
        h,c = state
        h = torch.cat([h, input], dim=1)
        forget = torch.sigmoid(self.forget_gate(h))
        c = c * forget
        inp = torch.sigmoid(self.input_gate(h))
        cell = torch.tanh(self.cell_gate(h))
        c = c + inp * cell
        out = torch.sigmoid(self.output_gate(h))
        h = out * torch.tanh(c)
        return h, (h,c)

We can refactor the above code

class LSTMCell(Module):
    def __init__(self, ni, nh):
        self.ih = nn.Linear(ni,4*nh)
        self.hh = nn.Linear(nh,4*nh)

    def forward(self, input, state):
        h,c = state
        # One big multiplication for all the gates is better than 4 smaller ones
        gates = (self.ih(input) + self.hh(h)).chunk(4, 1)
        ingate,forgetgate,outgate = map(torch.sigmoid, gates[:3])
        cellgate = gates[3].tanh()

        c = (forgetgate*c) + (ingate*cellgate)
        h = outgate * c.tanh()
        return h, (h,c)
t = torch.arange(0,10); t
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
t.chunk(2)
(tensor([0, 1, 2, 3, 4]), tensor([5, 6, 7, 8, 9]))
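
A quick way to check that the refactored cell behaves like a single LSTM step (a sketch with dummy tensors; the sizes are just for illustration):

cell = LSTMCell(64, 64)
x = torch.randn(bs, 64)                           #one timestep of input
h0, c0 = torch.zeros(bs, 64), torch.zeros(bs, 64) #initial hidden and cell state
out, (h1, c1) = cell(x, (h0, c0))
out.shape, h1.shape, c1.shape                     #all torch.Size([64, 64])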

Training a Language Model Using LSTMs

class LMModel6(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)] #two state tensors, because an LSTM keeps both a hidden state and a cell state
        
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True) #Replace our RNN with an LSTM
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(res)
    
    def reset(self): 
        for h in self.h: h.zero_()
learn = Learner(dls, LMModel6(len(vocab), 64, 2), 
                loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 1e-2)
epoch train_loss valid_loss accuracy time
0 3.012262 2.700189 0.295003 00:02
1 2.168225 1.905885 0.366048 00:03
2 1.620007 1.739222 0.475830 00:03
3 1.364854 2.005928 0.523926 00:02
4 1.133465 2.469636 0.541504 00:03
5 0.916891 2.274207 0.563883 00:03
6 0.715401 2.295597 0.635661 00:03
7 0.535663 2.418355 0.629964 00:03
8 0.387026 2.185029 0.680094 00:03
9 0.284487 2.342169 0.701497 00:03
10 0.212791 2.192696 0.718994 00:03
11 0.153409 2.317826 0.720540 00:03
12 0.115390 2.283189 0.730957 00:03
13 0.092992 2.291156 0.729574 00:03
14 0.082400 2.273516 0.731771 00:03

Our model is doing much better now

Regularizing an LSTM

We will be using some regularization techniques to improve our model, in particular dropout. Dropout randomly zeroes some activations for each minibatch: this forces the model to become more robust, since it must still produce the correct prediction with fewer activations available.

Dropout

class Dropout(Module):
    def __init__(self, p): 
        self.p = p #probability that an activation gets zeroed
        
    def forward(self, x):
        if not self.training:  #NO DROPOUT DURING TESTING (Only occurs during training)
            return x
        
        mask = x.new(*x.shape).bernoulli_(1-self.p) #1's and 0's where 1-p is the prob that we get a 1
        return x * mask.div_(1-self.p)
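
A quick demo of the behaviour (a sketch): in training mode roughly p of the activations get zeroed and the survivors are scaled up by 1/(1-p); in evaluation mode the input passes through unchanged.

dp = Dropout(0.5)
x = torch.ones(2, 8)
dp.train()
dp(x) #roughly half the entries are 0., the rest are 2.
dp.eval()
dp(x) #unchanged: all 1.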

Activation Regularization and Temporal Activation Regularization

Activation regularization (AR) and temporal activation regularization (TAR) are two regularization methods very similar to weight decay, which we have discussed before.

For activation regularization, it's the final activations produced by the LSTM that we will try to make as small as possible, instead of the weights.

loss += alpha * activations.pow(2).mean()

Since the tokens of a sentence are read in order, we expect the LSTM's activations to change only gradually from one time step to the next. Temporal activation regularization encourages this by adding a penalty to the loss that makes the difference between two consecutive activations as small as possible:

loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()
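
A minimal sketch of how these two penalties could be folded into the loss by hand, assuming the model returns its predictions together with the raw and dropped-out LSTM activations (as LMModel7 does below); in practice fastai's RNNRegularizer callback does this for us, and the function name, signature, and alpha/beta values here are just hypothetical examples.

alpha, beta = 2., 1.
def ar_tar_loss(preds, raw, out, targ):
    loss = CrossEntropyLossFlat()(preds, targ)
    loss += alpha * out.pow(2).mean()                      #AR: penalize large (dropped-out) activations
    loss += beta * (raw[:,1:] - raw[:,:-1]).pow(2).mean()  #TAR: penalize jumps between consecutive timesteps
    return loss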

Training a Weight-Tied Regularized LSTM

class LMModel7(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers, p):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        
        self.drop = nn.Dropout(p) #Dropout
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        
        self.h_o.weight = self.i_h.weight #Hidden-to-output weights are set identical to input-to-hidden
        
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]
        
    def forward(self, x):
        raw,h = self.rnn(self.i_h(x), self.h)
        out = self.drop(raw)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(out),raw,out
    
    def reset(self): 
        for h in self.h: h.zero_()

Notice that the hidden-to-output and input-to-hidden layers are tied to the same parameters (weight tying).
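
Because the assignment shares the actual Parameter object, the two layers stay identical throughout training. A quick check (sketch):

m = LMModel7(len(vocab), 64, 2, 0.5)
m.h_o.weight is m.i_h.weight #True - one weight matrix used in both places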

learn = Learner(dls, LMModel7(len(vocab), 64, 2, 0.5),
                loss_func=CrossEntropyLossFlat(), metrics=accuracy,
                cbs=[ModelResetter, RNNRegularizer(alpha=2, beta=1)]) #Although we didn't create our regularizer, we can still
                                                                      #pass it via cbs. 
learn = TextLearner(dls, LMModel7(len(vocab), 64, 2, 0.4),
                    loss_func=CrossEntropyLossFlat(), metrics=accuracy)

Using TextLearner will automatically add ModelResetter and RNNRegularizer(alpha=2, beta=1) for us.

learn.fit_one_cycle(15, 1e-2, wd=0.1)
epoch train_loss valid_loss accuracy time
0 2.797331 2.223156 0.435465 00:03
1 1.981550 1.755357 0.458740 00:03
2 1.272993 0.754981 0.765381 00:03
3 0.728812 0.600991 0.828125 00:03
4 0.439376 0.546993 0.836995 00:03
5 0.298545 0.453179 0.866781 00:03
6 0.224571 0.446198 0.865641 00:03
7 0.184994 0.472140 0.862793 00:03
8 0.159867 0.493649 0.847900 00:03
9 0.143191 0.476974 0.852458 00:03
10 0.131329 0.475887 0.851318 00:03
11 0.122343 0.522000 0.833333 00:02
12 0.115645 0.531508 0.827881 00:03
13 0.111532 0.502286 0.835856 00:04
14 0.108859 0.507470 0.835286 00:03

Conclusion

Overall, we were not only able to create an NLP model from scratch but also to refine it using LSTMs and dropout.

Questionnaire

  1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?
    Create a simple dataset that allows for quick and easy prototyping.
  2. Why do we concatenate the documents in our dataset before creating a language model?
    This allows us to easily split up data into batches.
  3. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make to our model?
    Use the same weight matrix for the three layers.
    Use the first word's embedding as activations to pass to the linear layer, add the second word's embedding to the first layer's output activations, and continue likewise for the rest of the words.
  4. How can we share a weight matrix across multiple layers in PyTorch?
    Define one layer in the PyTorch model class and use it multiple times in the forward pass.
  5. Write a module that predicts the third word given the previous two words of a sentence, without peeking.

    class LMModel1(Module):
     def __init__(self, vocab_sz, n_hidden):
         self.i_h = nn.Embedding(vocab_sz, n_hidden)  
         self.h_h = nn.Linear(n_hidden, n_hidden)     
         self.h_o = nn.Linear(n_hidden,vocab_sz)
    
     def forward(self, x):
         h = 0
         for i in range(2): #only two input words this time
             h = h + self.i_h(x[:,i])
             h = F.relu(self.h_h(h))
         return self.h_o(h)
    
  6. What is a recurrent neural network?
    A refactoring of a multi-layer neural network as a loop.
  7. What is "hidden state"?
    The hidden state is the set of activations that is updated after each step of the RNN.
  8. What is the equivalent of hidden state in LMModel1?
    h
  9. To maintain the state in an RNN, why is it important to pass the text to the model in order?
    Because the state is maintained across batches; the carried-over state only makes sense if the text arrives in order.
  10. What is an "unrolled" representation of an RNN?
    A representation without loops.
  11. Why can maintaining the hidden state in an RNN lead to memory and performance problems? How do we fix this problem?
    Backpropagation would have to compute gradients through all the past calls, which is slow and memory-hungry. This can be avoided using detach().
  12. What is "BPTT"?
    Backpropagation through time: treating the unrolled RNN (one layer per time step) as one big model and backpropagating through it. In practice we truncate it, calculating gradients only over the current sequence and detaching the earlier history (detach()).
  13. Write code to print out the first few batches of the validation set, including converting the token IDs back into English strings, as we showed for batches of IMDb data in <>.
    x,y = dls.valid.one_batch()
    [vocab[s] for s in x[:3]]
  14. What does the ModelResetter callback do? Why do we need it?
    It calls our reset method, which resets our hidden state before every epoch.
  15. What are the downsides of predicting just one output word for each three input words?
    There is a lot of extra information for training the model that is not being used.
  16. Why do we need a custom loss function for LMModel4?
    We have a stacked output, which we need to flatten as CrossEntropyLoss expects flattened tensors.
  17. Why is the training of LMModel4 unstable?
    Because this network is effectively very deep (one layer per token), which leads gradients to explode or vanish.
  18. In the unrolled representation, we can see that a recurrent neural network actually has many layers. So why do we need to stack RNNs to get better results?
    Because in the unrolled representation only one weight matrix is really being reused at every step, which limits what the model can learn. Stacking RNNs adds more weight matrices between input and output.
  19. Draw a representation of a stacked (multilayer) RNN.
  20. Why should we get better results in an RNN if we call detach less often? Why might this not happen in practice with a simple RNN?
  21. Why can a deep network result in very large or very small activations? Why does this matter?
    Numbers that are slightly large or small can lead to the explosion or disappearance of the number after repeated multiplications. In deep networks, we have repeated matrix multiplications, so this is a big problem.
  22. In a computer's floating-point representation of numbers, which numbers are the most precise?
    Small numbers (Not too close to 0 however)
  23. Why do vanishing gradients prevent training?
    No gradients mean no change in weights
  24. Why does it help to have two hidden states in the LSTM architecture? What is the purpose of each one?
    One state remembers what happened earlier in the sentence, and the other predicts the next token.
  25. What are these two states called in an LSTM?
    Cell state (long short-term memory)
    Hidden state (prediction)
  26. What is tanh, and how is it related to sigmoid?
    A sigmoid function rescaled to the range of -1 to 1
  27. What is the purpose of this code in LSTMCell: h = torch.cat([h, input], dim=1)
    Joins the hidden state and the new input.
  28. What does chunk do in PyTorch?
    It splits a tensor into the given number of equal-sized pieces.
  29. Study the refactored version of LSTMCell carefully to ensure you understand how and why it does the same thing as the non-refactored version.
  30. Why can we use a higher learning rate for LMModel6?
    Because now that we are using an LSTM, we have a partial solution to exploding/vanishing gradients.
  31. What are the three regularization techniques used in an AWD-LSTM model?
    Dropout
    Activation regularization
    Temporal activation regularization
  32. What is "dropout"?
    Randomly zeroing out activations (effectively removing neurons) during training.
  33. Why do we scale the weights with dropout? Is this applied during training, inference, or both?
    Zeroing activations changes the expected scale of the summed activations, so a division by (1-p) is applied to correct it. We apply this only during training, though equivalently the rescaling can be done at inference time instead.
  34. What is the purpose of this line from Dropout: if not self.training: return x
    Prevents the usage of dropout during testing.
  35. Experiment with bernoulli_ to understand how it works.
  36. How do you set your model in training mode in PyTorch? In evaluation mode?
    Module.train(), Module.eval()
  37. Write the equation for activation regularization (in math or code, as you prefer). How is it different from weight decay?
    loss += alpha * activations.pow(2).mean()
    
    It's different because here we are penalizing large activations rather than large weights.
  38. Write the equation for temporal activation regularization (in math or code, as you prefer). Why wouldn't we use this for computer vision problems?
    loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()
    
    This encourages the activations of consecutive tokens to be similar. We wouldn't use it for computer vision because the inputs there have no such sequential structure between consecutive activations.
  39. What is "weight tying" in a language model?
    Where weights of hidden-to-output layer is the same as input-to-hidden.

Further Research

  1. In LMModel2, why can forward start with h=0? Why don't we need to say h=torch.zeros(...)?
  2. Write the code for an LSTM from scratch (you may refer to <>).
  3. Search the internet for the GRU architecture and implement it from scratch, and try training a model. See if you can get results similar to those we saw in this chapter. Compare your results to the results of PyTorch's built-in GRU module.
  4. Take a look at the source code for AWD-LSTM in fastai, and try to map each of the lines of code to the concepts shown in this chapter.