Learning the basics of classification

Now that we are comfortable creating models using FastAI's toolkit, let's go back to the basics. Below is a simplified MNIST dataset containing only 3's and 7's. Let's try to classify these digits without using FastAI's toolkit.

Getting and viewing data

path = untar_data(URLs.MNIST_SAMPLE)  #path for data
path.ls()
(#3) [Path('train'),Path('labels.csv'),Path('valid')]
(path/'train').ls()
(#2) [Path('train/3'),Path('train/7')]
threes = (path/'train'/'3').ls().sorted() #getting 3's data from path
sevens = (path/'train'/'7').ls().sorted() #getting 7's data from path
threes
(#6131) [Path('train/3/10.png'),Path('train/3/10000.png'),Path('train/3/10011.png'),Path('train/3/10031.png'),Path('train/3/10034.png'),Path('train/3/10042.png'),Path('train/3/10052.png'),Path('train/3/1007.png'),Path('train/3/10074.png'),Path('train/3/10091.png')...]
im3_path = threes[1]
im3 = Image.open(im3_path) #shows image
im3
array(im3)[4:10,4:10] # NumPy array: a slice of the image's numeric representation
array([[  0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,  29],
       [  0,   0,   0,  48, 166, 224],
       [  0,  93, 244, 249, 253, 187],
       [  0, 107, 253, 253, 230,  48],
       [  0,   3,  20,  20,  15,   0]], dtype=uint8)
tensor(im3)[4:10,4:10] # same as the NumPy array, but PyTorch tensors work better on GPUs (preferred)
tensor([[  0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,  29],
        [  0,   0,   0,  48, 166, 224],
        [  0,  93, 244, 249, 253, 187],
        [  0, 107, 253, 253, 230,  48],
        [  0,   3,  20,  20,  15,   0]], dtype=torch.uint8)
im3.shape
(28, 28)
im3_t = tensor(im3)
df = pd.DataFrame(im3_t[4:15,4:22])
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys') # using pandas styling to shade the pixel values
(Output: an 11x18 grid of pixel values shaded with a grey gradient, where darker cells mark higher values, showing the top portion of the 3.)

First Try: Pixel Similarity

Why don't we begin by creating an "ideal" image of a 3 from the training set? Then we can compare that ideal image against the other images of 3's and 7's to classify them.

seven_tensors = [tensor(Image.open(o)) for o in sevens]
three_tensors = [tensor(Image.open(o)) for o in threes]
len(three_tensors), len(seven_tensors)
(6131, 6265)
show_image(three_tensors[1]);
type(three_tensors)
list

Right now our tensors are stored in a plain Python list. We need to fix this by stacking them into a single rank-3 tensor.

stacked_sevens = torch.stack(seven_tensors).float()/255
stacked_threes = torch.stack(three_tensors).float()/255
stacked_threes.shape
torch.Size([6131, 28, 28])
len(stacked_threes.shape) # returns the rank; a rank of n means the tensor has n dimensions (axes)
3
stacked_threes.ndim
3
mean3 = stacked_threes.mean() # notice that taking the mean over everything gives a single number
mean3
tensor(0.1415)
mean3 = stacked_threes.mean(0) # the 0 is the axis to take the mean across: here axis 0, the image axis of length 6131
show_image(mean3);

Here's our ideal 3

mean7 = stacked_sevens.mean(0)
show_image(mean7);

And here is our ideal 7, which we will also be using for comparison

a_3 = stacked_threes[1]
show_image(a_3);

Difference formula

To determine how different an image is from our ideal image, we need a distance formula. In particular, we will use either the mean absolute difference (L1 norm), mean(|a - b|), or the root mean squared error (RMSE, L2 norm), sqrt(mean((a - b)^2)).
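To make these formulas concrete, here is a tiny toy example of my own (the numbers are arbitrary, chosen just for illustration):

a = tensor([1., 2., 3.]); b = tensor([2., 2., 5.])
(a - b).abs().mean()        # L1: (1 + 0 + 2) / 3 = 1.0
((a - b)**2).mean().sqrt()  # L2: sqrt((1 + 0 + 4) / 3), roughly 1.29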

dist_3_abs = (a_3 - mean3).abs().mean() #L1 norm
dist_3_sqr = ((a_3 - mean3)**2).mean().sqrt() #RMSE or L2 norm
dist_3_abs,dist_3_sqr
(tensor(0.1114), tensor(0.2021))
dist_7_abs = (a_3 - mean7).abs().mean()
dist_7_sqr = ((a_3 - mean7)**2).mean().sqrt()
dist_7_abs,dist_7_sqr
(tensor(0.1586), tensor(0.3021))
F.l1_loss(a_3.float(),mean7),   F.mse_loss(a_3,mean7).sqrt()
(tensor(0.1586), tensor(0.3021))

Sidebar: NumPy Arrays and PyTorch Tensors

Just a quick comparison between PyTorch tensors and NumPy arrays

data = [[1,2,3],[4,5,6]]
arr = array(data)
tns = tensor(data)
arr  # numpy
array([[1, 2, 3],
       [4, 5, 6]])
tns  # pytorch
tensor([[1, 2, 3],
        [4, 5, 6]])
tns[1]
tensor([4, 5, 6])
tns[:,1]
tensor([2, 5])
tns[1,1:3]
tensor([5, 6])
tns+1
tensor([[2, 3, 4],
        [5, 6, 7]])
tns.type()
'torch.LongTensor'
tns*1.5
tensor([[1.5000, 3.0000, 4.5000],
        [6.0000, 7.5000, 9.0000]])

End Sidebar

Validation set

As before, let's stack all the images into tensors, this time for the validation set, which we will use to measure how well our approach works.

valid_3_tens = torch.stack([tensor(Image.open(o)) 
                            for o in (path/'valid'/'3').ls()])
valid_3_tens = valid_3_tens.float()/255

valid_7_tens = torch.stack([tensor(Image.open(o)) 
                            for o in (path/'valid'/'7').ls()])
valid_7_tens = valid_7_tens.float()/255


valid_3_tens.shape,   valid_7_tens.shape
(torch.Size([1010, 28, 28]), torch.Size([1028, 28, 28]))

Loss function

This is our chosen distance function. We will use it to decide whether an image is a 3 or a 7.

def mnist_distance(a,b): 
  return (a-b).abs().mean((-1,-2)) # mean over the last two axes, the 28x28 pixel dimensions
mnist_distance(a_3, mean3)
tensor(0.1114)

Computing Metrics Using Broadcasting

Broadcasting is a technique for vectorizing array operations so that the looping happens in optimized C code rather than in Python, which makes it far faster. Below you can see broadcasting in action across the whole validation set, along with a simpler example.
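Before applying it to our data, here is a tiny standalone illustration of my own of how the rule works: when the shapes differ, PyTorch acts as if the smaller tensor were copied along the missing axes, without actually allocating those copies.

m = tensor([[1., 2., 3.],
            [4., 5., 6.]])   # shape (2, 3)
v = tensor([10., 20., 30.])  # shape (3,)
m + v                        # v is broadcast across both rows: tensor([[11., 22., 33.], [14., 25., 36.]])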

valid_3_dist = mnist_distance(valid_3_tens, mean3) # mnist_distance is broadcast across the entire validation set at once
valid_3_dist, valid_3_dist.shape
(tensor([0.1787, 0.1422, 0.1412,  ..., 0.1358, 0.1301, 0.1110]),
 torch.Size([1010]))
tensor([1,2,3]) + tensor([1]) #broadcasting example
tensor([2, 3, 4])
(valid_3_tens-mean3).shape
torch.Size([1010, 28, 28])
def is_3(x): return mnist_distance(x,mean3) < mnist_distance(x,mean7)
a_7 = stacked_sevens[1]
a_7
show_image(a_7)
<AxesSubplot:>
is_3(a_7), is_3(a_7).float() # as a float, False becomes 0.
(tensor(False), tensor(0.))

Nice, it's working!

is_3(valid_3_tens) #Broadcasting
tensor([True, True, True,  ..., True, True, True])

Testing

Now let's test it and view our accuracy!

accuracy_3s =      is_3(valid_3_tens).float() .mean()
accuracy_7s = (1 - is_3(valid_7_tens).float()).mean()

accuracy_3s,accuracy_7s,(accuracy_3s+accuracy_7s)/2
(tensor(0.9168), tensor(0.9854), tensor(0.9511))

Wow, 95% accuracy! You just created your very first model from scratch!

Let's test our model using an image of our own

uploader = widgets.FileUpload()
uploader
img = PILImage.create(uploader.data[0])
img.to_thumb(40)

I drew this in Paint

img = img.resize((28,28)) #Resizing img
t_3 = tensor(img) #converting to tensor
 
t_3.shape
torch.Size([28, 28, 3])
t_3 = t_3[:,:,0] # keep just one of the three color channels
show_image(t_3) #Awesome now it looks like an image from our dataset
<AxesSubplot:>
t_3.shape
torch.Size([28, 28])
is_3(t_3) #Nice, it got it right!
tensor(True)

Conclusion

I recommend you play around with the model we created and try to spot some of its limitations. Although this counts as a machine learning model, it is far from an ideal one: it has no parameters and no way to improve itself from data.
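For example, one quick experiment you could try (a sketch of my own; I haven't run it here, so treat the outcome as a guess) is sliding a 3 a few pixels sideways with torch.roll and checking whether is_3 still gets it right: pixel similarity has no notion of where the digit sits in the frame.

shifted_3 = torch.roll(a_3, shifts=4, dims=1)  # slide the digit 4 pixels to the right
show_image(shifted_3)
is_3(shifted_3)  # quite possibly False now, since the pixels no longer line up with mean3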

Stochastic Gradient Descent (SGD)

What allows machine learning models to truly learn is stochastic gradient descent (SGD). Below you will see how SGD is used to train a model.
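To preview the pattern before we walk through it step by step, here is a minimal, self-contained toy loop of my own that fits a single weight w so that w*xs matches 2*xs; the real example below follows exactly the same steps.

xs = tensor([1., 2., 3., 4.])
ys = 2 * xs                          # the "true" relationship we want to recover
w = torch.randn(1).requires_grad_()  # step 1: initialize a random parameter
for _ in range(100):
    preds = w * xs                   # step 2: predictions
    loss = ((preds - ys)**2).mean()  # step 3: loss
    loss.backward()                  # step 4: gradients
    w.data -= 0.01 * w.grad.data     # step 5: step the weight
    w.grad = None
w                                    # step 6: after repeating, w ends up close to 2.0

The rest of this section builds exactly this loop, one step at a time, on a slightly richer model.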

An End-to-End SGD Example

time = torch.arange(0,20).float(); time
tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19.])
speed = torch.randn(20)*3 + 0.75*(time-9.5)**2 + 1
plt.scatter(time,speed);

Let's assume we are given this data

def f(t, params):
    a,b,c = params
    return a*(t**2) + (b*t) + c
def mse(preds, targets): 
    return ((preds-targets)**2).mean().sqrt() # root mean squared error (despite the function name)

Step 1: Initialize the parameters

All ML models have parameters: values that determine how much influence each part of the input has on the output. We initialize them randomly and often refer to them as weights.

params = torch.randn(3).requires_grad_() #Getting random weights and requiring gradient
params
tensor([0.2815, 0.0562, 0.5227], requires_grad=True)

We require gradients because the gradients are what we will use to improve our weights.
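As a quick aside, here is a minimal toy sketch of my own (not part of our model) showing what requires_grad_ sets up: once a tensor requires gradients, calling backward() on something computed from it fills in .grad with the derivative.

x = tensor(3.).requires_grad_()  # track operations on x
y = x**2                         # a simple function of x
y.backward()                     # compute dy/dx
x.grad                           # tensor(6.), since dy/dx = 2*x = 6 at x = 3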

Step 2: Calculate the predictions

Let's get predictions using the random weights we initialized above.

preds = f(time, params)
def show_preds(preds, ax=None):
    if ax is None: ax=plt.subplots()[1]
    ax.scatter(time, speed)
    ax.scatter(time, to_np(preds), color='red')
    ax.set_ylim(-300,100)
show_preds(preds) #Red is our predictions, Blue is the labels

Not bad for an initial prediction

Step 3: Calculate the loss

Let's now calculate the loss. The loss is a number that measures how badly the model is performing; training will try to make it as small as possible.

loss = mse(preds, speed) #Current loss
loss
tensor(35.6327, grad_fn=<SqrtBackward>)

Step 4: Calculate the gradients

To improve the loss, we need the gradients of the loss with respect to the weights. Why the gradients? Because the gradient gives us the slope of the loss, which tells us in which direction (and roughly how far) to step each weight.

loss.backward()
params.grad
tensor([121.4830,   7.8875,   0.3013])
params.grad * 1e-5      # multiplying by a small learning rate (here 1e-5) to see how big the steps would be
tensor([1.2148e-03, 7.8875e-05, 3.0131e-06])

Step 5: Step the weights

Now that we have everything, let's update our weights

lr = 1e-4
params.data -= lr * params.grad.data
params.grad = None
preds = f(time,params)
mse(preds, speed) #loss improved
tensor(34.1798, grad_fn=<SqrtBackward>)

Notice that our loss has improved with these new weights

show_preds(preds)

Putting all of the above into a simple function

Let's gather all the steps we did above and put them into one function for simplicity.

def apply_step(params, prn=True):
    preds = f(time, params)
    loss = mse(preds, speed)
    loss.backward()
    params.data -= lr * params.grad.data
    params.grad = None
    if prn: print(loss.item())
    return preds

Step 6: Repeat the process (Training)

Now we just need to train our model.

for i in range(10): apply_step(params) # running 10 steps (notice the loss improving)
34.17982482910156
32.844730377197266
31.63145637512207
30.54193687438965
29.575712203979492
28.729764938354492
27.998600006103516
27.374542236328125
26.84825325012207
26.409330368041992
_,axs = plt.subplots(1,4,figsize=(12,3))
for ax in axs: show_preds(apply_step(params, False), ax)
plt.tight_layout()

Conclusion

Compare this approach to the pixel similarity example. You will notice that here the model truly seems to be learning: this is the approach we will use from now on, as it much better fits what we mean by a machine learning model.

I have chosen to split this lesson into two parts. In the next lesson we will apply what we have learned to the MNIST dataset once again, this time using this more appropriate ML approach.