Lesson 10 - FastAI
from fastai.text.all import *
path = untar_data(URLs.IMDB) #our data path
Path.BASE_PATH = path
path.ls()
(path/'train/pos').ls() #the path consists of text files
files = get_text_files(path, folders = ['train', 'test', 'unsup']) #let's grab the files from these folders
txt = files[0].open().read()
txt[:75]
spacy = WordTokenizer() #our tokenizer
toks = first(spacy([txt])) #tokenize our sentence and grab the first result
print(coll_repr(toks, 30))
first(spacy(['The U.S. dollar $1 is $1.00.']))
Notice that U.S. and 1.00 are not split apart: this is one reason why a proper tokenizer is useful.
tkn = Tokenizer(spacy) #Wrapper that designates special tokens
print(coll_repr(tkn(txt), 31))
xxbos: beginning of text > xxmaj: the next word was capitalized > xxunk: unknown word > xxrep: a character repeated n times
Just know that anything of the form xx___ is a special token
defaults.text_proc_rules #some more rules
coll_repr(tkn('© Fast.ai www.fast.ai/INDEX'), 31)
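These preprocessing rules are just plain functions, so we can try one on its own. A minimal sketch (the exact spacing of the output may differ between fastai versions):
#replace_rep collapses a run of repeated characters into 'xxrep <count> <char>'
replace_rep('cccc') #roughly ' xxrep 4 c '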
txts = L(o.open().read() for o in files[:2000])
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts) #trains the subword tokenizer on the most commonly occurring character sequences
    return ' '.join(first(sp([txt]))[:40])
subword(1000)
Here, common words such as 'the' and 'pretty' end up as single tokens, with no splits inside them.
subword(200) #let's try a much smaller vocab
Notice that with only 200 subwords, many words get broken into fragments or single characters (unlike common words such as 'film' and 'with', which keep their own tokens).
subword(10000) #let's train a much larger subword vocab
Now, on to numericalization. An example below:
toks200 = txts[:200].map(tkn)
toks200[0]
num = Numericalize() #maps tokens to ints; the vocab is ordered by frequency
num.setup(toks200)
coll_repr(num.vocab,20)
toks200.map(num)[0][:10]
Now let's do it on our text:
tkn = Tokenizer(WordTokenizer())
toks = tkn(txt)
print(coll_repr(tkn(txt), 31))
nums = num(toks)[:20] #numericalization
nums
Notice that these numbers are indices into the vocab. Let's decode them below:
' '.join(num.vocab[o] for o in nums) #we can decode by looking each index up in the vocab
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)
bs,seq_len = 6,15
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
This batch is too big for our model, so let's adjust it: keep the batch size at 6 but split each row into chunks of 5 tokens, giving three successive mini-batches.
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
nums200 = toks200.map(num)
dl = LMDataLoader(nums200)
This dataloader takes care of creating the appropriate mini-batches for us
x,y = first(dl)
x.shape,y.shape
64 is the batch size, 72 is the sequence length
' '.join(num.vocab[o] for o in x[0][:20])
' '.join(num.vocab[o] for o in y[0][:20])
Notice that the targets are the inputs offset by one token: this is exactly what we want for predicting the next word.
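A quick sanity check of that offset (a minimal sketch using the x and y tensors from above): the targets should simply be the inputs shifted left by one token.
#every target token should equal the next input token
(x[:, 1:] == y[:, :-1]).all() #expected: tensor(True)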
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dblock = DataBlock(
blocks=TextBlock.from_folder(path, is_lm=True),
get_items=get_imdb,
splitter=RandomSplitter(0.1)
)
dls_lm = dblock.dataloaders(path, path=path, bs=128, seq_len=80)
dls_lm.show_batch(max_n=2)
show_batch decodes (denumericalizes) the tokens for display, but what the model actually sees is the numericalized version. See below:
dls_lm.one_batch()[0]
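We can also decode a batch by hand, which is exactly what show_batch does for us. A minimal sketch, assuming dls_lm.vocab is the vocab used for numericalization:
xb, yb = dls_lm.one_batch()
#look each index up in the vocab to get the text back
' '.join(dls_lm.vocab[o] for o in xb[0][:20])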
learn = language_model_learner(
    dls_lm, AWD_LSTM, #AWD_LSTM is a predefined recurrent architecture that comes with pretrained weights
    drop_mult=0.3, metrics=[accuracy, Perplexity()]).to_fp16()
learn.fit_one_cycle(1, 2e-2)
~30% accuracy at predicting the very next word is actually not bad
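Perplexity, the other metric reported, is just the exponential of the (cross-entropy) loss. A minimal sketch with a hypothetical loss value; read the real one off the fit_one_cycle output:
import math
valid_loss = 4.1 #hypothetical validation loss
math.exp(valid_loss) #perplexity, roughly 60 here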
learn.save('1epoch')
learn = learn.load('1epoch')
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
learn.save_encoder('finetuned') #This saves the model without the final layer
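To see what "without the final layer" means, we can print the model; a minimal sketch (the exact module layout can differ between fastai versions):
learn.model #typically a SequentialRNN: the AWD_LSTM encoder followed by a linear decoder head
#save_encoder keeps only the encoder part, which is what the classifier will reuse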
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2 #let's generate 2 sentences
#temperature controls how much randomness goes into the sampling
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
for _ in range(N_SENTENCES)]
print("\n".join(preds))
dblock = DataBlock(
blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
get_y = parent_label,
get_items=partial(get_text_files, folders=['train', 'test']),
splitter=GrandparentSplitter(valid_name='test') #splits via folder name
)
dls_clas = dblock.dataloaders(path, path=path, bs=128, seq_len=72)
dls_clas.show_batch(max_n=3)
nums_samp = toks200[:10].map(num) #let's grab a few numericalized reviews
nums_samp.map(len)
Notice that they vary in length. This could be a problem for batching, but the fastai DataBlock takes care of it for us by padding.
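Conceptually, padding just extends every sequence in a batch to the length of the longest one using a dedicated pad token (xxpad, index 1 by default in fastai). A hand-rolled sketch of the idea (illustration only; fastai handles the details, including which side gets padded, internally):
import torch
def pad_batch(seqs, pad_idx=1):
    #pad each 1-D token tensor up to the length of the longest one
    max_len = max(len(s) for s in seqs)
    return torch.stack([torch.cat([s, s.new_full((max_len - len(s),), pad_idx)]) for s in seqs])
padded = pad_batch(list(nums_samp[:3]))
padded.shape #(3, length of the longest of the three reviews)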
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
metrics=accuracy).to_fp16()
learn = learn.load_encoder('finetuned') #let's load the encoder we fine-tuned before
learn.fit_one_cycle(1, 2e-2)
Notice how quickly it trained: this is the benefit of starting from a pretrained model and only fitting the final layers.
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
Then we can unfreeze a bit more and continue training (the slice spreads discriminative learning rates across the layer groups, lower for earlier layers):
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
And finally, the whole model!
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
This accuracy is very good!
- What is "self-supervised learning"?
Learning where model has no labels. - What is a "language model"?
A language model is a model that tries to predict the next word in a text. - Why is a language model considered self-supervised?
Because it does not require any labels needed to learn. - What are self-supervised models usually used for?
Often they are used as pre-trained model for transfer learning. - Why do we fine-tune language models?
By finetuning (final layers) we can fit a model to our data. Note, this assumes the data being fit on is similer. - What are the three steps to create a state-of-the-art text classifier?
Train a language model
Finetune language model on classification dataset
Finetune further as classifier - How do the 50,000 unlabeled movie reviews help us create a better text classifier for the IMDb dataset?
It has been trained to predict the next word: To do this, the model understands the language (Ex: sentiment). - What are the three steps to prepare your data for a language model?
Tokenization
Numericalization
DataLoader - What is "tokenization"? Why do we need it?
Tokenization splits words into a list: However, it's not that simple as it is vary of punctuations, syntax, etc. - Name three different approaches to tokenization.
Word-based tokenization
Subword-based tokenization
Character-based tokenization - What is
xxbos
?
Beginning of text - List four rules that fastai applies to text during tokenization.
xxrep, xxbox, xxcap, xxeos - Why are repeated characters replaced with a token showing the number of repetitions and the character that's repeated?
We can expect that repeated characters have special or different meaning than just a single character: Hence, why it is better to use a token to repersent this distinction. - What is "numericalization"?
The mapping of values to vocab - Why might there be words that are replaced with the "unknown word" token?
Such words make the embedding matrix far too large and increase memory usage. - With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer on the book's website.)
Minibatch of the nth row follows the previous minibatches nth row. - Why do we need padding for text classification? Why don't we need it for language modeling?
Padding is needed because each text is of different sizes. It is not required for language modeling as the documents are all concatenated. - What does an embedding matrix for NLP contain? What is its shape?
It contains vector representations of all tokens in the vocabulary. The embedding matrix has the size vocab_size x embedding_size. - What is "perplexity"?
Exponential of the loss. - Why do we have to pass the vocabulary of the language model to the classifier data block?
We need the vocab correspondence of tokens to index to remain the same because we used the pretrained language model. - What is "gradual unfreezing"?
The unfreezing of one layer at a time and fine-tuning. - Why is text generation always likely to be ahead of automatic identification of machine-generated texts?
The text generation model could be made so that it competes with the identification model. Eventually, the text generation will produce text that the identification model cannot identify as being machine-generated.
- See what you can learn about language models and disinformation. What are the best language models today? Take a look at some of their outputs. Do you find them convincing? How could a bad actor best use such a model to create conflict and uncertainty?
- Given the limitation that models are unlikely to be able to consistently recognize machine-generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leverage deep learning?