The MNIST Loss Function

In this lesson we will be recreating our ML model for the MNIST dataset.

Loading the data from the previous part

path = untar_data(URLs.MNIST_SAMPLE)  #path for data
Path.BASE_PATH = path


threes = (path/'train'/'3').ls().sorted() #getting 3's data from path
sevens = (path/'train'/'7').ls().sorted() #getting 7's data from path

seven_tensors = [tensor(Image.open(o)) for o in sevens]
three_tensors = [tensor(Image.open(o)) for o in threes]

stacked_sevens = torch.stack(seven_tensors).float()/255
stacked_threes = torch.stack(three_tensors).float()/255
valid_3_tens = torch.stack([tensor(Image.open(o)) 
                            for o in (path/'valid'/'3').ls()])
valid_3_tens = valid_3_tens.float()/255


valid_7_tens = torch.stack([tensor(Image.open(o)) 
                            for o in (path/'valid'/'7').ls()])
valid_7_tens = valid_7_tens.float()/255

Processing/reshaping data

It is necessary to reshape our data so that all the images are laid out in a single matrix, with one image per row.
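
As a quick, minimal sketch (not part of the original notebook; the names are illustrative), this is what .view(-1, 28*28) does to a stack of images:

fake_imgs = torch.ones(2, 28, 28)   # pretend stack of two 28x28 "images"
flat = fake_imgs.view(-1, 28*28)    # -1 lets PyTorch infer the first dimension
flat.shape                          # torch.Size([2, 784]): one flattened image per row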

train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28) #.view reshapes the tensor so each row holds one image,
                                                                      # with all of its 28x28 = 784 pixel values in that single row
train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)
train_x.shape,train_y.shape
(torch.Size([12396, 784]), torch.Size([12396, 1]))
dset = list(zip(train_x,train_y)) #zip() pairs each x with its corresponding y
x,y = dset[0] 
x.shape,y
(torch.Size([784]), tensor([1]))
valid_x = torch.cat([valid_3_tens, valid_7_tens]).view(-1, 28*28)
valid_y = tensor([1]*len(valid_3_tens) + [0]*len(valid_7_tens)).unsqueeze(1)
valid_dset = list(zip(valid_x,valid_y))
def init_params(size, var=1.0): return (torch.randn(size)*var).requires_grad_()
weights = init_params((28*28,1)) #weights needed for every pixel, hence 28*28
bias = init_params(1) #Need bias because w*p = 0 when p=0 (p = pixel)
(train_x[0]*weights.T).sum() + bias #Must transpose so the shapes line up for the multiplication
tensor([-6.2330], grad_fn=<AddBackward0>)
def linear1(xb): 
    return xb@weights + bias #@ performs matrix multiplication

preds = linear1(train_x)
preds #preds of all images 
tensor([[ -6.2330],
        [-10.6388],
        [-20.8865],
        ...,
        [-15.9176],
        [ -1.6866],
        [-11.3568]], grad_fn=<AddBackward0>)

Sigmoid

Sigmoid is a function often used in ML to squish values into the range 0-1. As you may have noticed, our predictions vary widely. This can be an issue when computing the loss, so we use the sigmoid function to squish these values between 0 and 1.

def sigmoid(x): return 1/(1+torch.exp(-x))
plot_function(torch.sigmoid, title='Sigmoid', min=-4, max=4)
def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid() #squishing predictions between 0-1
    return torch.where(targets==1, 1-predictions, predictions).mean()
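
As a quick hedged example (not in the original notebook; the values are illustrative), here is how sigmoid squishes raw scores and how torch.where picks 1-prediction for targets labeled 1 and the prediction itself for targets labeled 0:

raw = tensor([3., -2., 0.])             # raw model outputs
trg = tensor([1., 0., 1.])              # targets: a 3, a 7, a 3
prd = raw.sigmoid()                     # ≈ tensor([0.9526, 0.1192, 0.5000])
torch.where(trg==1, 1-prd, prd).mean()  # ≈ 0.2222, the same value mnist_loss(raw, trg) gives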

Sidebar: SGD and Mini-Batches

coll = range(15)
dl = DataLoader(coll, batch_size=5, shuffle=True) #creates minibatches
list(dl)
[tensor([ 3, 12,  8, 10,  2]),
 tensor([ 9,  4,  7, 14,  5]),
 tensor([ 1, 13,  0,  6, 11])]
ds = L(enumerate(string.ascii_lowercase)) 
ds
(#26) [(0, 'a'),(1, 'b'),(2, 'c'),(3, 'd'),(4, 'e'),(5, 'f'),(6, 'g'),(7, 'h'),(8, 'i'),(9, 'j')...]
dl = DataLoader(ds, batch_size=6, shuffle=True)  #Works with tuples as well
list(dl)
[(tensor([17, 18, 10, 22,  8, 14]), ('r', 's', 'k', 'w', 'i', 'o')),
 (tensor([20, 15,  9, 13, 21, 12]), ('u', 'p', 'j', 'n', 'v', 'm')),
 (tensor([ 7, 25,  6,  5, 11, 23]), ('h', 'z', 'g', 'f', 'l', 'x')),
 (tensor([ 1,  3,  0, 24, 19, 16]), ('b', 'd', 'a', 'y', 't', 'q')),
 (tensor([2, 4]), ('c', 'e'))]

End Sidebar

Creating and testing with our own batch first

Before we go further, let's simulate what takes place using our own mini-batch.

P0: Get batch

batch = train_x[:4]
batch.shape
torch.Size([4, 784])
linear1??

P1: Initialize parameters

preds = linear1(batch) #Get predictions using the randomly initialized weights
preds
tensor([[11.6180],
        [ 9.0489],
        [-2.4524],
        [-2.5197]], grad_fn=<AddBackward0>)

P2: Calc loss

loss = mnist_loss(preds, train_y[:4])
loss
tensor(0.4616, grad_fn=<MeanBackward0>)
mnist_loss??

P3: Calc grad

loss.backward()
weights.grad.shape, weights.grad.mean(), bias.grad
(torch.Size([784, 1]), tensor(-0.0057), tensor([-0.0355]))

Put the above into a single func

We can take everything we did above and make a simple function encompassing it

def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()
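
One thing worth knowing (a small illustrative check, not in the original write-up): loss.backward() adds the new gradients to whatever is already stored in .grad, so calling calc_grad repeatedly without zeroing keeps accumulating them. That is why the epoch loop below calls p.grad.zero_() after every step.

calc_grad(batch, train_y[:4], linear1)
weights.grad.mean(), bias.grad     # gradients after this backward pass
calc_grad(batch, train_y[:4], linear1)
weights.grad.mean(), bias.grad     # larger again: the new gradients were added on top
weights.grad.zero_(); bias.grad.zero_();   # reset before training for real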

How to do an epoch

To do an epoch, we just need to grab each minibatch from the dataset, call calc_grad, and then step the weights.

def train_epoch(model, lr, params):
    for xb,yb in dl: #get x and y batch
        calc_grad(xb, yb, model) #Calc grad
        
        for p in params: 
            p.data -= p.grad*lr #Update/take a step
            p.grad.zero_() #Set grad to zero
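
The update goes through p.data so that the step itself is not tracked by autograd. An equivalent way to write it (a sketch with a hypothetical name, train_epoch_nograd) is to wrap the update in torch.no_grad():

def train_epoch_nograd(model, lr, params):  # hypothetical alternative; same behaviour as train_epoch
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        with torch.no_grad():               # keep the update step out of the autograd graph
            for p in params:
                p -= p.grad*lr
                p.grad.zero_()
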
(preds>0.0).float() == train_y[:4]
tensor([[ True],
        [ True],
        [False],
        [False]])

How to calc accuracy

Accuracy is another function we need. Using the function below we can measure the performance of the model: a prediction counts as a 3 when its sigmoid output is greater than 0.5 (i.e. when the raw prediction is positive), and it is correct when that matches the target.

def batch_accuracy(xb, yb):
    preds = xb.sigmoid()
    correct = (preds>0.5).float() == yb  #.5 because sigmoid(0) = .5
    return correct.float().mean()
batch_accuracy(linear1(batch), train_y[:4])
tensor(0.5000)
def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb,yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)
validate_epoch(linear1)
0.5484

Putting everything together

Below we initialize the weights, create a dataloader, train, and test

weights = init_params((28*28,1))
bias = init_params(1)
dl = DataLoader(dset, batch_size=256) #create minibatches

#We can grab the first batch and take a look at it
xb,yb = first(dl)
xb.shape,yb.shape
(torch.Size([256, 784]), torch.Size([256, 1]))
valid_dl = DataLoader(valid_dset, batch_size=256) #Create minibatch for validation set
lr = 1.
params = weights,bias
train_epoch(linear1, lr, params)
validate_epoch(linear1)
0.637
for i in range(20):
    train_epoch(linear1, lr, params)
    print(validate_epoch(linear1))
0.791
0.8925
0.9315
0.9476
0.9515
0.9579
0.9637
0.9652
0.9667
0.9672
0.9677
0.9701
0.9706
0.9711
0.9725
0.9735
0.9735
0.974
0.974
0.9745

Congratulations, you have officially created an ML model from scratch!

Let's now optimize what we did above

linear_model = nn.Linear(28*28,1) #Does exactly what our function linear1 does and initializes our parameters for us
w,b = linear_model.parameters()
w.shape,b.shape
(torch.Size([1, 784]), torch.Size([1]))
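
One detail to note: nn.Linear stores its weight with shape (out_features, in_features), i.e. transposed relative to our earlier (784, 1) weights, and its forward pass computes xb @ w.T + b. A quick illustrative check (not in the original run; the variable names are mine):

preds_module = linear_model(batch)          # uses the module's own randomly initialized parameters
preds_manual = batch@w.T + b                # the same computation written out by hand
torch.allclose(preds_module, preds_manual)  # True: nn.Linear does exactly what linear1 did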
class BasicOptim:
    def __init__(self,params,lr): 
        self.params = list(params)
        self.lr = lr
 
    def step(self, *args, **kwargs):
        for p in self.params: 
            p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params: 
            p.grad = None
opt = BasicOptim(linear_model.parameters(), lr)
def train_epoch(model):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        opt.step()
        opt.zero_grad()
validate_epoch(linear_model)
0.3794
def train_model(model, epochs):
    for i in range(epochs):
        train_epoch(model)
        print(validate_epoch(model), end=' ')
train_model(linear_model, 20)
0.4932 0.7476 0.8511 0.9155 0.9346 0.9482 0.9555 0.9629 0.9658 0.9673 0.9707 0.9722 0.9736 0.9751 0.9761 0.9766 0.9775 0.9775 0.9785 0.9785 

We can simplify further by using the fastai toolkit

linear_model = nn.Linear(28*28,1) #fastAI
opt = SGD(linear_model.parameters(), lr) #fastAI
train_model(linear_model, 20)
0.4932 0.7393 0.8613 0.9175 0.9365 0.9497 0.957 0.9634 0.9663 0.9673 0.9702 0.9727 0.9736 0.9751 0.9761 0.9766 0.9775 0.978 0.978 0.979 

Finally, the simplest way

dls = DataLoaders(dl, valid_dl) #NOT DataLoader: this class wraps the training and validation DataLoaders into a single object
learn = Learner(dls, nn.Linear(28*28,1), opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(10, lr=lr)
epoch train_loss valid_loss batch_accuracy time
0 0.636365 0.503465 0.495584 00:00
1 0.521433 0.170448 0.864082 00:00
2 0.191138 0.189521 0.824828 00:00
3 0.083866 0.109406 0.910697 00:00
4 0.044337 0.079229 0.932777 00:00
5 0.028900 0.063259 0.947007 00:00
6 0.022547 0.053348 0.954367 00:00
7 0.019723 0.046764 0.961727 00:00
8 0.018294 0.042152 0.965653 00:00
9 0.017439 0.038766 0.967615 00:00

I hope you now feel comfortable building a model from scratch as well as using the fastai toolkit where possible.

Adding a Nonlinearity

We can improve our model by adding some nonlinearity to it. So far we have been using a simple linear classifier, which is very constrained. To make it perform better, we need to add something nonlinear between two linear classifiers; this is what gives us a neural network.

def simple_net(xb): 
    res = xb@w1 + b1 #Linear func
    res = res.max(tensor(0.0)) #Activation func: ReLU
    res = res@w2 + b2 #Linear func
    return res

This is all we need to change to add nonlinearity. Compare this with the linear1 function we were using before.

w1 = init_params((28*28,30))
b1 = init_params(30)
w2 = init_params((30,1))
b2 = init_params(1)
plot_function(F.relu)
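
The notebook switches to nn.Sequential below, but as a hedged sketch the hand-written simple_net above could be trained with the pieces we already built (calc_grad, BasicOptim, validate_epoch); opt2 and the learning rate are illustrative choices:

params = [w1, b1, w2, b2]
opt2 = BasicOptim(params, lr=0.1)   # 0.1 to match the nn.Sequential run below
for i in range(20):
    for xb,yb in dl:
        calc_grad(xb, yb, simple_net)
        opt2.step()
        opt2.zero_grad()
    print(validate_epoch(simple_net), end=' ')
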
simple_net = nn.Sequential(
    nn.Linear(28*28,30), #first layer: 784 inputs -> 30 activations
    nn.ReLU(),
    nn.Linear(30,1) #second layer: 30 activations -> 1 output
)
learn = Learner(dls, simple_net, opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(40, 0.1)
epoch train_loss valid_loss batch_accuracy time
0 0.324376 0.404988 0.505888 00:00
1 0.150483 0.233413 0.803238 00:00
2 0.083043 0.117188 0.912169 00:00
3 0.054295 0.078788 0.940628 00:00
4 0.040957 0.061365 0.956330 00:00
5 0.034142 0.051610 0.962709 00:00
6 0.030232 0.045451 0.965653 00:00
7 0.027697 0.041252 0.967125 00:00
8 0.025878 0.038205 0.968106 00:00
9 0.024475 0.035891 0.970069 00:00
10 0.023342 0.034062 0.972522 00:00
11 0.022400 0.032572 0.973013 00:00
12 0.021602 0.031321 0.973994 00:00
13 0.020913 0.030250 0.973994 00:00
14 0.020312 0.029316 0.974975 00:00
15 0.019781 0.028492 0.976448 00:00
16 0.019307 0.027757 0.977920 00:00
17 0.018882 0.027094 0.978901 00:00
18 0.018496 0.026495 0.978901 00:00
19 0.018143 0.025950 0.979392 00:00
20 0.017820 0.025452 0.979392 00:00
21 0.017520 0.024996 0.979392 00:00
22 0.017242 0.024576 0.979882 00:00
23 0.016983 0.024189 0.980864 00:00
24 0.016740 0.023832 0.981354 00:00
25 0.016511 0.023501 0.981354 00:00
26 0.016295 0.023195 0.981354 00:00
27 0.016090 0.022910 0.981354 00:00
28 0.015896 0.022645 0.982826 00:00
29 0.015711 0.022398 0.982826 00:00
30 0.015535 0.022167 0.982336 00:00
31 0.015367 0.021952 0.982826 00:00
32 0.015206 0.021750 0.982826 00:00
33 0.015052 0.021560 0.982826 00:00
34 0.014904 0.021382 0.982826 00:00
35 0.014762 0.021215 0.982826 00:00
36 0.014625 0.021057 0.982826 00:00
37 0.014494 0.020908 0.982826 00:00
38 0.014367 0.020767 0.982826 00:00
39 0.014245 0.020634 0.982826 00:00
plt.plot(L(learn.recorder.values).itemgot(2));

Sidebar: Now that we have trained our model, we can view its parameters

m = learn.model
m
Sequential(
  (0): Linear(in_features=784, out_features=30, bias=True)
  (1): ReLU()
  (2): Linear(in_features=30, out_features=1, bias=True)
)
w, b = m[0].parameters()
w.shape
torch.Size([30, 784])
w[0].view(28,28)
show_image(w[2].view(28,28))
<AxesSubplot:>

Seems like this neuron was looking for curves

Going Deeper - Using ResNet

fastai provides various ResNets anyone can use. Let's see how one compares to our learner.

dls = ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, pretrained=False,
                    loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(1, 0.1) 

#That's insane
epoch train_loss valid_loss accuracy time
0 0.132423 0.036220 0.995093 00:17

It outperformed our model in a single epoch! Guess we still have a lot more to learn :)

Questionnaire

  1. How is a grayscale image represented on a computer? How about a color image?
    Images on a computer are represented by number values, where 0 = white, 255 = black, and the values in between are shades of gray.
    A grayscale image is rank 2 (no color channels).
    A color image is rank 3 (has the 3 color channels, RGB).
  2. How are the files and folders in the MNIST_SAMPLE dataset structured? Why?
    Files are split into train and valid folders (plus a labels file). This makes things easier, as the training and validation sets have already been split for us.
  3. Explain how the "pixel similarity" approach to classifying digits works.
    This is similar to a nearest-neighbor approach, where we compare each test image against the training data. Only here, the image being compared against is the average of all the training images of a digit. Then, using a distance metric (such as the mean absolute difference), we measure how far the test image is from each average image and pick the closer one (see the short sketch after this questionnaire).
  4. What is a list comprehension? Create one now that selects odd numbers from a list and doubles them.
    A condensed Python way of building a list from a for-loop (with an optional condition). For example:
l = [i for i in range(20)]
oddList = [i*2 for i in l if i%2 != 0]
  5. What is a "rank-3 tensor"?
    A 3-dimensional tensor (also known as a volume).
  6. What is the difference between tensor rank and shape? How do you get the rank from the shape?
    Rank refers to the number of dimensions in a tensor
    Shape is the size of each dimension of a tensor

    Taking the len(shape) = rank

  7. What are RMSE and L1 norm?
    Two ways of measuring the distance between predictions and targets: the L1 norm is the mean absolute difference, and RMSE (the L2 norm) is the root mean squared error. Both can be used as loss functions.
  8. How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?
    By using vectorized tensor operations (broadcasting), which run in optimized compiled code rather than a Python loop.
  9. Create a 3×3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom-right four numbers.
    t = tensor(list(range(1,10))).view(3,3)
    t = t*2
    t[1:,1:]
    
  10. What is broadcasting?
    A technique where PyTorch automatically expands the tensor with the smaller rank to match the shape of the larger one, so an operation can be applied element-wise between tensors of different shapes (subject to compatibility rules).
  11. Are metrics generally calculated using the training set, or the validation set? Why?
    Validation set as it contains unseen data.
  12. What is SGD?
    Stochastic gradient descent: an optimization algorithm. It is what causes the loss to decrease as it steps/updates the parameters.
  13. Why does SGD use mini-batches?
    Mini-batches are faster and more efficient on the GPU than single items, and the gradient computed over a mini-batch is more stable and accurate than one computed from a single item, while still being much quicker than using the whole dataset.
  14. What are the seven steps in SGD for machine learning?
    Initialize parameters
    Compute predictions
    Calculate the loss
    Calculate the gradients
    Update the weights (step)
    Repeat
    Stop
  15. How do we initialize the weights in a model?
    Randomly
  12. What is "loss"?
    A metric used by the computer to determine its performance
  17. Why can't we always use a high learning rate?
    Stepping too far can cause the loss to increase, or to bounce around and diverge.
  18. What is a "gradient"?
    The slope. It tells us how much we have to change each weight to make our model better.
  19. Do you need to know how to calculate gradients yourself?
    No
  20. Why can't we use accuracy as a loss function?
    A loss function needs to change smoothly as the weights are adjusted. Accuracy only changes when the model's predictions flip, so its gradient is zero almost everywhere and gives no useful signal for updating the weights.
  21. Draw the sigmoid function. What is special about its shape?
    It is a smooth S-shaped curve that squishes any input into a value between 0 and 1, which makes it easy for SGD to work with.
  22. What is the difference between a loss function and a metric?
    The loss function is understood by the computer, while a metric is understood by us humans.
  23. What is the function to calculate new weights using a learning rate?
    The optimizer step function (Ex: SGD).
  24. What does the DataLoader class do?
    Creates minibatches
  25. Write pseudocode showing the basic steps taken in each epoch for SGD.

    for x,y in dl:
        pred = model(x)
        loss = loss_func(pred, y)
        loss.backward()

        for p in params:
            p.data -= p.grad * lr
            p.grad = None
    
  26. Create a function that, if passed two arguments [1,2,3,4] and 'abcd', returns [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]. What is special about that output data structure?
    def func(l1,l2):
        return list(zip(l1,l2))
    The result is a collection of (x, y) tuples, which is exactly the Dataset structure we built earlier with zip.
    
  27. What does view do in PyTorch?
    Reshapes tensor
  28. What are the "bias" parameters in a neural network? Why do we need them?
    Without a bias, the output of a linear layer would always be 0 whenever the input is 0 (w*x = 0 when x = 0); the bias term b in w*x + b gives the model extra flexibility.
  29. What does the @ operator do in Python?
    Matrix multiplication.
  30. What does the backward method do?
    Calculates the gradients of the loss with respect to the parameters (backpropagation).
  31. Why do we have to zero the gradients?
    Because PyTorch accumulates gradients: backward() adds the new gradients to whatever is already stored, so we must zero them between steps.
  32. What information do we have to pass to Learner?
    dataset (DataLoaders), model (Ex: nn.Linear), opt func (Ex: SGD), loss func (Ex: mnist_loss), metric(Optional)
  33. Show Python or pseudocode for the basic steps of a training loop.

    def train_epoch(model, lr, params):
        for x,y in dl:
            calc_grad(x, y, model)
            for p in params:
                p.data -= p.grad * lr
                p.grad = None

    for i in range(epochs):
        train_epoch(model, lr, params)
    
  34. What is "ReLU"? Draw a plot of it for values from -2 to +2.
    The rectified linear unit: an activation function that replaces every negative value with 0, i.e. max(0, x).

  35. What is an "activation function"?
    The purpose of an activation function is to add non-linearity to the model.
  36. What's the difference between F.relu and nn.ReLU?
    F.relu is a plain Python function, while nn.ReLU is a PyTorch module (a class) that does the same thing and can be used inside nn.Sequential.
  37. The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?
    In practice, deeper models (more layers, each with smaller matrices) perform better and train faster than a single very wide layer.
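
As mentioned in the answer to question 3, here is a minimal sketch of the pixel-similarity baseline (function names are illustrative; it reuses the stacked tensors loaded at the top of this post):

mean3 = stacked_threes.mean(0)      # the "ideal" 3: pixelwise average of all training 3s
mean7 = stacked_sevens.mean(0)      # the "ideal" 7

def mnist_distance(a, b):           # L1 norm: mean absolute difference over the pixel dimensions
    return (a-b).abs().mean((-1,-2))

def is_3(x):                        # closer to the average 3 than to the average 7?
    return mnist_distance(x, mean3) < mnist_distance(x, mean7)

accuracy_3s = is_3(valid_3_tens).float().mean()
accuracy_7s = (1 - is_3(valid_7_tens).float()).mean()
accuracy_3s, accuracy_7s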

Further Research

  1. Create your own implementation of Learner from scratch, based on the training loop shown in this chapter.
  2. Complete all the steps in this chapter using the full MNIST datasets (that is, for all digits, not just 3s and 7s). This is a significant project and will take you quite a bit of time to complete! You'll need to do some of your own research to figure out how to overcome some obstacles you'll meet on the way.
    Completed, see here: https://usama280.github.io/PasteBlogs/