NLP Deep Dive: RNNs

We are now going to take a look at natural language processing. We're going to build two models: one that predicts the next word of a text (and can therefore generate text), and another that classifies whether a text is positive or negative. Note: we will be using a movie review dataset (IMDb) for both models.

Grab path

from fastai.text.all import *
path = untar_data(URLs.IMDB) #our data path
Path.BASE_PATH = path
path.ls() 
(#7) [Path('train'),Path('imdb.vocab'),Path('tmp_lm'),Path('unsup'),Path('tmp_clas'),Path('README'),Path('test')]
(path/'train/pos').ls() #the path consists of text files
(#12500) [Path('train/pos/5840_7.txt'),Path('train/pos/7429_9.txt'),Path('train/pos/8401_10.txt'),Path('train/pos/4606_7.txt'),Path('train/pos/11152_10.txt'),Path('train/pos/11180_7.txt'),Path('train/pos/11887_8.txt'),Path('train/pos/8072_10.txt'),Path('train/pos/5256_10.txt'),Path('train/pos/6267_10.txt')...]
files = get_text_files(path, folders = ['train', 'test', 'unsup']) #let's grab the text files from these folders
txt = files[0].open().read() 
txt[:75] 
'While the premise of the film is pretty lame (Ollie is diagnosed with "horn'

Word Tokenization with FastAI

To split the text into tokens, we will be using a tokenizer. There are many benefits to using a proper tokenizer, as you will see below.

spacy = WordTokenizer()  #our tokenizer
toks = first(spacy([txt])) #tokenize our sentence and grab the first result
print(coll_repr(toks, 30))
(#365) ['While','the','premise','of','the','film','is','pretty','lame','(','Ollie','is','diagnosed','with','"','hornophobia','"',')',',','the','film','is','an','amiable','and','enjoyable','little','flick','.','It'...]
first(spacy(['The U.S. dollar $1 is $1.00.'])) 
(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']

Notice that 'U.S.' and '1.00' are not split at their periods, while the '$' and the final '.' are separated: this is one reason why tokenizers are useful.
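
For comparison, here is a quick sketch in plain Python (not part of fastai) of what a naive split on spaces does: the dollar amounts and trailing punctuation stay glued to the words.

'The U.S. dollar $1 is $1.00.'.split(' ')
['The', 'U.S.', 'dollar', '$1', 'is', '$1.00.']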

tkn = Tokenizer(spacy) #Wrapper that designates special tokens
print(coll_repr(tkn(txt), 31))
(#403) ['xxbos','xxmaj','while','the','premise','of','the','film','is','pretty','lame','(','ollie','is','diagnosed','with','"','hornophobia','"',')',',','the','film','is','an','amiable','and','enjoyable','little','flick','.'...]

xxbos: beginning of a text > xxmaj: the next word was capitalized > xxunk: the next word is unknown > xxrep: a character was repeated (followed by the count and the character)
Just know that anything of the form xx___ is a special token

defaults.text_proc_rules #some more rules
[<function fastai.text.core.fix_html(x)>,
 <function fastai.text.core.replace_rep(t)>,
 <function fastai.text.core.replace_wrep(t)>,
 <function fastai.text.core.spec_add_spaces(t)>,
 <function fastai.text.core.rm_useless_spaces(t)>,
 <function fastai.text.core.replace_all_caps(t)>,
 <function fastai.text.core.replace_maj(t)>,
 <function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)
"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"

Sidebar: Subword Tokenization

Subword tokenization is a different approach that splits text not on spaces but on the most frequent groups of characters. This makes it more flexible than the WordTokenizer, since it can handle languages that don't separate words with spaces (e.g., Chinese).

txts = L(o.open().read() for o in files[:2000])
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts) #trains the subword tokenizer to find the most commonly occurring character sequences
    
    return ' '.join(first(sp([txt]))[:40])
subword(1000)
'▁Whil e ▁the ▁pre m ise ▁of ▁the ▁film ▁is ▁pretty ▁la me ▁( O ll ie ▁is ▁di ag no s ed ▁with ▁" h or n op ho b ia " ), ▁the ▁film ▁is ▁an ▁a mi'

Here, words that appear as a single piece (no spaces between their letters) are very common in the corpus, for example 'the' and 'pretty'.

subword(200) #let's make the vocab smaller
'▁ W h i le ▁the ▁p re m is e ▁of ▁the ▁film ▁is ▁p re t t y ▁ la m e ▁ ( O ll i e ▁is ▁d i a g n o s ed ▁with'

Notice that with a smaller vocab most words are broken into small pieces (unlike common ones such as 'film' and 'with').

subword(10000) #let's train a much larger vocab
'▁Whil e ▁the ▁premise ▁of ▁the ▁film ▁is ▁pretty ▁lame ▁( O ll ie ▁is ▁diagnos ed ▁with ▁" h or no pho b ia ") , ▁the ▁film ▁is ▁an ▁a mi able ▁and ▁enjoyable ▁little ▁flick . ▁It'

End Sidebar

Numericalization with fastai

Numericalization replaces each token with an integer, so the lists contain only numerical values: each value is an index into the vocab.

Example below

toks200 = txts[:200].map(tkn)
toks200[0]
(#403) ['xxbos','xxmaj','while','the','premise','of','the','film','is','pretty'...]
num = Numericalize() #builds a vocab of tokens ordered by frequency
num.setup(toks200)
coll_repr(num.vocab,20)
"(#2200) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','and','a','of','to','is','in','it','i'...]"
toks200.map(num)[0][:10]
TensorText([  2,   8, 171,   9,   0,  14,   9,  29,  16, 188])

Now let's apply it to our text

tkn = Tokenizer(WordTokenizer())
toks = tkn(txt)
print(coll_repr(tkn(txt), 31))
(#403) ['xxbos','xxmaj','while','the','premise','of','the','film','is','pretty','lame','(','ollie','is','diagnosed','with','"','hornophobia','"',')',',','the','film','is','an','amiable','and','enjoyable','little','flick','.'...]
nums = num(toks)[:20]  #numericalization
nums
TensorText([   2,    8,  171,    9,    0,   14,    9,   29,   16,  188, 1243,   33, 1244,   16,    0,   27,   24,    0,   24,   32])

Notice that these numbers are indices into the vocab. Let's decode them below

' '.join(num.vocab[o] for o in nums) #we can decode doing the following
'xxbos xxmaj while the xxunk of the film is pretty lame ( ollie is xxunk with " xxunk " )'

Creating Batches for the Language Model

stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)
bs,seq_len = 6,15
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
xxbos xxmaj in this chapter , we will go back over the example of classifying
movie reviews we studied in chapter 1 and dig deeper under the surface . xxmaj
first we will look at the processing steps necessary to convert text into numbers and
how to customize it . xxmaj by doing this , we 'll have another example
of the preprocessor used in the data block xxup api . \n xxmaj then we
will study how we build a language model and train it for a while .

This batch's sequences are too long for our model to process in one go, so let's break them into smaller pieces

Minibatches

Below is an example of three mini-batches. Notice that the nth row of each mini-batch continues right where the nth row of the previous mini-batch left off. For example,

M1: xxbos xxmaj in this chapter
M2: , we will go back
M3: over the example of classifying

bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
xxbos xxmaj in this chapter
movie reviews we studied in
first we will look at
how to customize it .
of the preprocessor used in
will study how we build
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
, we will go back
chapter 1 and dig deeper
the processing steps necessary to
xxmaj by doing this ,
the data block xxup api
a language model and train
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
over the example of classifying
under the surface . xxmaj
convert text into numbers and
we 'll have another example
. \n xxmaj then we
it for a while .
nums200 = toks200.map(num)
dl = LMDataLoader(nums200)

This dataloader takes care of creating the appropriate mini-batches for us

x,y = first(dl)
x.shape,y.shape
(torch.Size([64, 72]), torch.Size([64, 72]))

64 is the batch size, 72 is the sequence length
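
Some rough arithmetic (a sketch added here, not from the original notebook) shows where the batches come from: all 200 numericalized reviews are concatenated into one long stream, which is split into 64 sub-streams that are read 72 tokens at a time.

total_tokens = sum(len(o) for o in nums200) #length of the concatenated stream
total_tokens // (64 * 72)                   #roughly how many (x, y) mini-batches one epoch yields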

' '.join(num.vocab[o] for o in x[0][:20])
'xxbos xxmaj while the xxunk of the film is pretty lame ( ollie is xxunk with " xxunk " )'
' '.join(num.vocab[o] for o in y[0][:20])
'xxmaj while the xxunk of the film is pretty lame ( ollie is xxunk with " xxunk " ) ,'

Notice that the label is offset by one word: this is exactly what we want, since the model is trained to predict the next word
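
We can sanity-check this offset (a quick sketch, not from the original notebook): every token of x from the second position onward should equal the token of y one position earlier.

(x[:,1:] == y[:,:-1]).all() #expected: tensor(True), since y is just x shifted left by one token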

Language Model Using DataBlock

get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dblock = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, 
    splitter=RandomSplitter(0.1)
)
dls_lm = dblock.dataloaders(path, path=path, bs=128, seq_len=80)
dls_lm.show_batch(max_n=2)
text text_
0 xxbos i strongly disagree with " xxunk " regarding xxmaj jim xxmaj belushi 's talent . i happen to like xxmaj belushi very much . xxmaj admittedly , i was skeptical when he first appeared on the scene , because i was such a xxup huge fan of his late brother xxmaj john . xxmaj but xxmaj jim has an on - screen charm that has gotten him very far -- and he has developed it well over the years i strongly disagree with " xxunk " regarding xxmaj jim xxmaj belushi 's talent . i happen to like xxmaj belushi very much . xxmaj admittedly , i was skeptical when he first appeared on the scene , because i was such a xxup huge fan of his late brother xxmaj john . xxmaj but xxmaj jim has an on - screen charm that has gotten him very far -- and he has developed it well over the years .
1 is awesome . xxmaj there are some parts where you start to doubt whether the director intended to convey the message that showmanship is highly important thing in the future ( we will do such kind on corny sf things because we xxup can ) or is it simply over combining . xxmaj but the paranoia is there and feeling " out of joint " also . xxmaj good one . xxbos xxmaj first of all , the film is awesome . xxmaj there are some parts where you start to doubt whether the director intended to convey the message that showmanship is highly important thing in the future ( we will do such kind on corny sf things because we xxup can ) or is it simply over combining . xxmaj but the paranoia is there and feeling " out of joint " also . xxmaj good one . xxbos xxmaj first of all , the film is very

show_batch denumericalizes the data for us, but in reality it is stored numericalized. See below

dls_lm.one_batch()[0]
LMTensorText([[    2,     8,   121,  ...,    42,    13,   190],
        [   23,     9,   522,  ...,    13,  9706,   359],
        [35022,    48,   121,  ...,    15,   159,    10],
        ...,
        [ 2202,     8, 22400,  ...,  6995,    13,   650],
        [33649,     8,  2712,  ...,    14,    21,   898],
        [   16,    36,    10,  ...,    28,    45,   734]], device='cuda:0')

Training

learn = language_model_learner(
    dls_lm, AWD_LSTM,  #AWD_LSTM is a precreated architecture
    drop_mult=0.3, metrics=[accuracy, Perplexity()]).to_fp16()
learn.fit_one_cycle(1, 2e-2)
epoch train_loss valid_loss accuracy perplexity time
0 4.120048 3.912788 0.299565 50.038246 11:39

30% accuracy at predicting the next word is actually not bad

Saving and Loading Models

learn.save('1epoch')
learn = learn.load('1epoch')

Further training

learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
epoch train_loss valid_loss accuracy perplexity time
0 3.893486 3.772820 0.317104 43.502548 12:37
1 3.820479 3.717197 0.323790 41.148880 12:30
2 3.735622 3.659760 0.330321 38.851997 12:09
3 3.677086 3.624794 0.333960 37.516987 12:12
4 3.636646 3.601300 0.337017 36.645859 12:05
5 3.553636 3.584241 0.339355 36.026001 12:04
6 3.507634 3.571892 0.341353 35.583862 12:08
7 3.444101 3.565988 0.342194 35.374371 12:08
8 3.398597 3.566283 0.342647 35.384815 12:11
9 3.375563 3.568166 0.342528 35.451500 12:05
learn.save_encoder('finetuned') #This saves the model without the final layer

Testing the Model

We can now test our model. Because it is trained to predict the next word, it can continue any text we give it: the generated text is simply a chain of the model's predictions.

TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2 #let's generate 2 sentences

preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) #temperature controls how random the predictions are
         for _ in range(N_SENTENCES)]
print("\n".join(preds))
i liked this movie because of its story and characters . The story line was very strong , very good for a sci - fi film . The main character , Alucard , was very well developed and brought the whole story
i liked this movie because i like the idea of the premise of the movie , the ( very ) convenient virus ( which , when you have to kill a few people , the " evil " machine has to be used to protect

Text Classifier

Now let's create a text classifier that can predict whether a review is positive or negative. For this we will use the pretrained encoder we saved above.

Classifier DataBlock

dblock = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test') #splits via folder name
)
dls_clas = dblock.dataloaders(path, path=path, bs=128, seq_len=72)
dls_clas.show_batch(max_n=3)
text category
0 xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero pos
1 xxbos * ! ! - xxup spoilers - ! ! * \n\n xxmaj before i begin this , let me say that i have had both the advantages of seeing this movie on the big screen and of having seen the " authorized xxmaj version " of this movie , remade by xxmaj stephen xxmaj king , himself , in 1997 . \n\n xxmaj both advantages made me appreciate this version of " the xxmaj shining , " all the more . \n\n xxmaj also , let me say that xxmaj i 've read xxmaj mr . xxmaj king 's book , " the xxmaj shining " on many occasions over the years , and while i love the book and am a huge fan of his work , xxmaj stanley xxmaj kubrick 's retelling of this story is far more compelling … and xxup scary . \n\n xxmaj kubrick pos
2 xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , steaming bowl of oatmeal . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj vargas became i was always aware that something did n't quite feel right . xxmaj victor xxmaj vargas suffers from a certain overconfidence on the director 's part . xxmaj apparently , the director thought that the ethnic backdrop of a xxmaj latino family on the lower east side , and an idyllic storyline would make the film critic proof . xxmaj he was right , but it did n't fool me . xxmaj raising xxmaj victor xxmaj vargas is neg
nums_samp = toks200[:10].map(num) #let's grab a few numericalized reviews
nums_samp.map(len)
(#10) [403,176,151,63,185,905,417,97,183,397]

Notice that they vary in length. This could be a problem, but the fastai DataBlock takes care of it for us by padding the texts within each batch
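
Under the hood, the shorter texts in a batch are padded so every sequence in the batch ends up the same length (fastai also groups texts of similar length together to minimize wasted padding). Below is a rough illustration of the idea, not the exact fastai internals; pad_batch is a hypothetical helper, and pad_id=1 assumes 'xxpad' sits at index 1 of the vocab, as shown earlier.

import torch

def pad_batch(seqs, pad_id=1): #pad_id=1: the index of 'xxpad' in num.vocab above
    max_len = max(len(s) for s in seqs) #length of the longest review in this batch
    return torch.stack([torch.cat([s, torch.full((max_len - len(s),), pad_id, dtype=s.dtype)])
                        for s in seqs])

pad_batch(list(nums_samp)).shape #every review padded to the longest one: torch.Size([10, 905])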

learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()
learn = learn.load_encoder('finetuned') #let's load the fine-tuned encoder we saved above

Fine-Tuning the Classifier

learn.fit_one_cycle(1, 2e-2)
epoch train_loss valid_loss accuracy time
0 0.347427 0.184480 0.929320 00:33

Notice how quickly it trained: this is the benefit of using a pretrained model and only fitting the final layers.

Refining

Let's refine the model by training some more. For NLP it works better to unfreeze a few layers at a time rather than unfreezing the whole model at once. Below we do this by calling .freeze_to(-2), which freezes everything except the last two parameter groups:

learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
epoch train_loss valid_loss accuracy time
0 0.247763 0.171683 0.934640 00:37

Then we can unfreeze a bit more, and continue training:

learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
epoch train_loss valid_loss accuracy time
0 0.193377 0.156696 0.941200 00:45

And finally, the whole model!

learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
epoch train_loss valid_loss accuracy time
0 0.172888 0.153770 0.943120 01:01
1 0.161492 0.155567 0.942640 00:57

This accuracy is very good!

Conclusion

Overall, natural language models are very powerful and beneficial. Hopefully you learned how to create a model that can generate text and, more importantly, how to reuse it to classify text via transfer learning.

Questionnaire

  1. What is "self-supervised learning"?
    Training where the labels come from the data itself (e.g., predicting the next word) rather than from external annotation.
  2. What is a "language model"?
    A language model is a model that tries to predict the next word in a text.
  3. Why is a language model considered self-supervised?
    Because it does not require external labels: the next word of the text itself serves as the label.
  4. What are self-supervised models usually used for?
    Often they are used as pretrained models for transfer learning.
  5. Why do we fine-tune language models?
    By fine-tuning (especially the final layers) we adapt the pretrained model to our own data. Note that this assumes the new data is reasonably similar to the data the model was pretrained on.
  6. What are the three steps to create a state-of-the-art text classifier?
    Train a language model on a large corpus (or start from a pretrained one)
    Fine-tune the language model on the text of the classification dataset
    Fine-tune it further as a classifier
  7. How do the 50,000 unlabeled movie reviews help us create a better text classifier for the IMDb dataset?
    We can fine-tune the language model on them: learning to predict the next word forces the model to understand the language and the domain (including, for example, sentiment), which then helps the classifier.
  8. What are the three steps to prepare your data for a language model?
    Tokenization
    Numericalization
    DataLoader
  9. What is "tokenization"? Why do we need it?
    Tokenization splits the text into a list of tokens. It's not as simple as splitting on spaces, because it has to handle punctuation, contractions, special syntax, and so on.
  10. Name three different approaches to tokenization.
    Word-based tokenization
    Subword-based tokenization
    Character-based tokenization
  11. What is xxbos?
    Beginning of text
  12. List four rules that fastai applies to text during tokenization.
    Examples: replace_rep (xxrep), replace_wrep (xxwrep), replace_all_caps (xxup), and replace_maj (xxmaj).
  13. Why are repeated characters replaced with a token showing the number of repetitions and the character that's repeated?
    We can expect that repeated characters carry a special or different meaning than a single character: hence it is better to use a token to represent this distinction.
  14. What is "numericalization"?
    The mapping of tokens to integers: each token is replaced by its index in the vocab.
  15. Why might there be words that are replaced with the "unknown word" token?
    Rare words would make the embedding matrix far too large and increase memory usage, so words below a minimum frequency are replaced with the xxunk token.
  16. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer on the book's website.)
    The second row contains the first 64 tokens of the second of the 64 contiguous mini-streams the dataset is split into, not the next 64 tokens of the stream. The first row of the second batch continues exactly where the first row of the first batch left off.
  17. Why do we need padding for text classification? Why don't we need it for language modeling?
    Padding is needed because each text is of different sizes. It is not required for language modeling as the documents are all concatenated.
  18. What does an embedding matrix for NLP contain? What is its shape?
    It contains vector representations of all tokens in the vocabulary. The embedding matrix has the size vocab_size x embedding_size.
  19. What is "perplexity"?
    The exponential of the (cross-entropy) loss; see the short sketch after this questionnaire.
  20. Why do we have to pass the vocabulary of the language model to the classifier data block?
    We need the token-to-index mapping to remain the same as the one used by the pretrained language model; otherwise the embeddings it learned would correspond to the wrong tokens.
  21. What is "gradual unfreezing"?
    Unfreezing a few layers at a time and fine-tuning after each step, rather than unfreezing the whole model at once.
  22. Why is text generation always likely to be ahead of automatic identification of machine-generated texts?
    The text generation model could be made so that it competes with the identification model. Eventually, the text generation will produce text that the identification model cannot identify as being machine-generated.
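
As a small illustration of the answer to question 19 (a sketch, not from the original notebook): perplexity is just the exponential of the cross-entropy loss, so the value reported during training can be recomputed directly from the validation loss.

import torch
torch.exp(torch.tensor(3.912788)) #validation loss of the first epoch above; gives ~50.04, the reported perplexity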

Further Research

  1. See what you can learn about language models and disinformation. What are the best language models today? Take a look at some of their outputs. Do you find them convincing? How could a bad actor best use such a model to create conflict and uncertainty?
  2. Given the limitation that models are unlikely to be able to consistently recognize machine-generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leverage deep learning?