Lesson 10 - FastAI
from fastai.text.all import *
path = untar_data(URLs.IMDB) #our data path
Path.BASE_PATH = path
path.ls()
(path/'train/pos').ls() #the path consists of text files
files = get_text_files(path, folders = ['train', 'test', 'unsup']) #let's grab the files from these folders
txt = files[0].open().read()
txt[:75]
spacy = WordTokenizer() #our tokenizer
toks = first(spacy([txt])) #tokenize our sentence and grab the first result
print(coll_repr(toks, 30))
first(spacy(['The U.S. dollar $1 is $1.00.']))
Notice that U.S. and 1.00 are not split apart: this is one reason why a proper tokenizer is useful.
tkn = Tokenizer(spacy) #Wrapper that designates special tokens
print(coll_repr(tkn(txt), 31))
xxbos: beginning of text > xxmaj: the next word was capitalized > xxunk: unknown word > xxrep: a character repeated n times
Just know that anything of the form xx___ is a special token
defaults.text_proc_rules #some more rules
coll_repr(tkn('© Fast.ai www.fast.ai/INDEX'), 31)
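These preprocessing rules are just plain functions, so we can try one on its own. A minimal sketch (the exact spacing of the output may differ between fastai versions):
#replace_rep collapses a run of repeated characters into 'xxrep <count> <char>'
replace_rep('cccc') #roughly ' xxrep 4 c '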
txts = L(o.open().read() for o in files[:2000])
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts) #trains the subword tokenizer on the most commonly occurring character sequences
    return ' '.join(first(sp([txt]))[:40])
subword(1000)
Here, common words such as 'the' and 'pretty' end up as single tokens, with no splits inside them.
subword(200) #let's try a much smaller vocab
Notice that with only 200 subwords, many words get broken into fragments or single characters (unlike common words such as 'film' and 'with', which keep their own tokens).
subword(10000) #let's train a much larger subword vocab
Now, on to numericalization. An example below:
toks200 = txts[:200].map(tkn)
toks200[0]
num = Numericalize() #maps tokens to ints; the vocab is ordered by frequency
num.setup(toks200)
coll_repr(num.vocab,20)
toks200.map(num)[0][:10]
Now let's do it on our text:
tkn = Tokenizer(WordTokenizer())
toks = tkn(txt)
print(coll_repr(tkn(txt), 31))
nums = num(toks)[:20] #numericalization
nums
Notice that these numbers are indices into the vocab. Let's decode them below:
' '.join(num.vocab[o] for o in nums) #we can decode by looking each index up in the vocab
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)
bs,seq_len = 6,15
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
This batch is too big for our model, so let's adjust it: keep the batch size at 6 but split each row into chunks of 5 tokens, giving three successive mini-batches.
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
nums200 = toks200.map(num)
dl = LMDataLoader(nums200)
This dataloader takes care of creating the appropriate mini-batches for us
x,y = first(dl)
x.shape,y.shape
64 is the batch size, 72 is the sequence length
' '.join(num.vocab[o] for o in x[0][:20])
' '.join(num.vocab[o] for o in y[0][:20])
Notice that the targets are the inputs offset by one token: this is exactly what we want for predicting the next word.
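A quick sanity check of that offset (a minimal sketch using the x and y tensors from above): the targets should simply be the inputs shifted left by one token.
#every target token should equal the next input token
(x[:, 1:] == y[:, :-1]).all() #expected: tensor(True)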
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dblock = DataBlock(
blocks=TextBlock.from_folder(path, is_lm=True),
get_items=get_imdb,
splitter=RandomSplitter(0.1)
)
dls_lm = dblock.dataloaders(path, path=path, bs=128, seq_len=80)
dls_lm.show_batch(max_n=2)
show_batch decodes (denumericalizes) the tokens for display, but what the model actually sees is the numericalized version. See below:
dls_lm.one_batch()[0]
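We can also decode a batch by hand, which is exactly what show_batch does for us. A minimal sketch, assuming dls_lm.vocab is the vocab used for numericalization:
xb, yb = dls_lm.one_batch()
#look each index up in the vocab to get the text back
' '.join(dls_lm.vocab[o] for o in xb[0][:20])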
learn = language_model_learner(
    dls_lm, AWD_LSTM, #AWD_LSTM is a predefined recurrent architecture that comes with pretrained weights
    drop_mult=0.3, metrics=[accuracy, Perplexity()]).to_fp16()
learn.fit_one_cycle(1, 2e-2)
~30% accuracy at predicting the very next word is actually not bad
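Perplexity, the other metric reported, is just the exponential of the (cross-entropy) loss. A minimal sketch with a hypothetical loss value; read the real one off the fit_one_cycle output:
import math
valid_loss = 4.1 #hypothetical validation loss
math.exp(valid_loss) #perplexity, roughly 60 here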
learn.save('1epoch')
learn = learn.load('1epoch')
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
learn.save_encoder('finetuned') #This saves the model without the final layer
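To see what "without the final layer" means, we can print the model; a minimal sketch (the exact module layout can differ between fastai versions):
learn.model #typically a SequentialRNN: the AWD_LSTM encoder followed by a linear decoder head
#save_encoder keeps only the encoder part, which is what the classifier will reuse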
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2 #let's generate 2 sentences
#temperature controls how much randomness goes into the sampling
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
for _ in range(N_SENTENCES)]
print("\n".join(preds))
dblock = DataBlock(
blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
get_y = parent_label,
get_items=partial(get_text_files, folders=['train', 'test']),
splitter=GrandparentSplitter(valid_name='test') #splits via folder name
)
dls_clas = dblock.dataloaders(path, path=path, bs=128, seq_len=72)
dls_clas.show_batch(max_n=3)
nums_samp = toks200[:10].map(num) #let's grab a few numericalized reviews
nums_samp.map(len)
Notice that they vary in length. This could be a problem for batching, but the fastai DataBlock takes care of it for us by padding.
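Conceptually, padding just extends every sequence in a batch to the length of the longest one using a dedicated pad token (xxpad, index 1 by default in fastai). A hand-rolled sketch of the idea (illustration only; fastai handles the details, including which side gets padded, internally):
import torch
def pad_batch(seqs, pad_idx=1):
    #pad each 1-D token tensor up to the length of the longest one
    max_len = max(len(s) for s in seqs)
    return torch.stack([torch.cat([s, s.new_full((max_len - len(s),), pad_idx)]) for s in seqs])
padded = pad_batch(list(nums_samp[:3]))
padded.shape #(3, length of the longest of the three reviews)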
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
metrics=accuracy).to_fp16()
learn = learn.load_encoder('finetuned') #let's load the encoder we fine-tuned before
learn.fit_one_cycle(1, 2e-2)
Notice how quickly it trained: this is the benefit of starting from a pretrained model and only fitting the final layers.
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
Then we can unfreeze a bit more and continue training (the slice spreads discriminative learning rates across the layer groups, lower for earlier layers):
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
And finally, the whole model!
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
This accuracy is very good!
- What is "self-supervised learning"?
Learning where model has no labels. - What is a "language model"?
A language model is a model that tries to predict the next word in a text. - Why is a language model considered self-supervised?
Because it does not require any labels needed to learn. - What are self-supervised models usually used for?
Often they are used as pre-trained model for transfer learning. - Why do we fine-tune language models?
By finetuning (final layers) we can fit a model to our data. Note, this assumes the data being fit on is similer. - What are the three steps to create a state-of-the-art text classifier?
Train a language model
Finetune language model on classification dataset
Finetune further as classifier - How do the 50,000 unlabeled movie reviews help us create a better text classifier for the IMDb dataset?
It has been trained to predict the next word: To do this, the model understands the language (Ex: sentiment). - What are the three steps to prepare your data for a language model?
Tokenization
Numericalization
DataLoader - What is "tokenization"? Why do we need it?
Tokenization splits words into a list: However, it's not that simple as it is vary of punctuations, syntax, etc. - Name three different approaches to tokenization.
Word-based tokenization
Subword-based tokenization
Character-based tokenization - What is
xxbos
?
Beginning of text - List four rules that fastai applies to text during tokenization.
xxrep, xxbox, xxcap, xxeos - Why are repeated characters replaced with a token showing the number of repetitions and the character that's repeated?
We can expect that repeated characters have special or different meaning than just a single character: Hence, why it is better to use a token to repersent this distinction. - What is "numericalization"?
The mapping of values to vocab - Why might there be words that are replaced with the "unknown word" token?
Such words make the embedding matrix far too large and increase memory usage. - With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer on the book's website.)
Minibatch of the nth row follows the previous minibatches nth row. - Why do we need padding for text classification? Why don't we need it for language modeling?
Padding is needed because each text is of different sizes. It is not required for language modeling as the documents are all concatenated. - What does an embedding matrix for NLP contain? What is its shape?
It contains vector representations of all tokens in the vocabulary. The embedding matrix has the size vocab_size x embedding_size. - What is "perplexity"?
Exponential of the loss. - Why do we have to pass the vocabulary of the language model to the classifier data block?
We need the vocab correspondence of tokens to index to remain the same because we used the pretrained language model. - What is "gradual unfreezing"?
The unfreezing of one layer at a time and fine-tuning. - Why is text generation always likely to be ahead of automatic identification of machine-generated texts?
The text generation model could be made so that it competes with the identification model. Eventually, the text generation will produce text that the identification model cannot identify as being machine-generated.
- See what you can learn about language models and disinformation. What are the best language models today? Take a look at some of their outputs. Do you find them convincing? How could a bad actor best use such a model to create conflict and uncertainty?
- Given the limitation that models are unlikely to be able to consistently recognize machine-generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leverage deep learning?