BERT and RoBERTa
- BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model designed to understand language.
- It is considered to capture the context and meaning of language better than earlier models.
- BERT is freely available and is used for a large number of language tasks.
- Google Search uses BERT
Architecture:
- BERT is based on the Transformer architecture.
- A basic Transformer consists of an encoder and a decoder.
- Since BERT's purpose is to produce a language representation model, it only needs the encoder part; BERT is essentially a trained stack of Transformer encoders.
- As an encoder it takes in a sequence of tokens and produces a contextual vector for each token, plus a fixed-size vector representation of the entire sequence (as sketched below).
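To make the encoder role concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is named in these notes): it feeds a sentence through BERT and inspects the per-token and pooled outputs.
```python
# Minimal sketch: BERT as an encoder (assumes Hugging Face "transformers"
# and the "bert-base-uncased" checkpoint).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT encodes whole sentences.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token: shape (batch, seq_len, hidden_size)
print(outputs.last_hidden_state.shape)
# One pooled vector for the whole sequence (derived from the [CLS] token):
# shape (batch, hidden_size), where hidden_size is 768 for bert-base
print(outputs.pooler_output.shape)
```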
Training:
- Masked Language Modeling (MLM): 15% of the tokens in each sequence are replaced with a [MASK] token, and the model tries to predict the original value of the masked words from the surrounding context; this is what allows bidirectional training. Prediction requires adding a classification layer on top of the encoder output, multiplying the output vectors by the embedding matrix to project them into the vocabulary dimension, and calculating the probability of each word with softmax (see the MLM sketch after this list).
- Next Sentence Prediction (NSP): a binary classification loss that predicts whether two segments follow each other in the original text or come from entirely different documents, which teaches the model sentence-level relationships.
- The original input format used segment pairs with the NSP loss, where the combined length was at most 512 tokens.
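The MLM step can be sketched end-to-end with a pretrained masked-language-model head (again assuming Hugging Face transformers and bert-base-uncased, which are not named in the notes): mask a word, let the MLM head score the vocabulary, and take the most likely fill-in.
```python
# Minimal MLM sketch: mask a word, let the MLM head score the vocabulary
# (encoder output x embedding matrix -> softmax), and take the top prediction.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # (batch, seq_len, vocab_size)

# Locate the [MASK] position and turn its vocabulary scores into probabilities
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = torch.softmax(logits[0, mask_pos], dim=-1)
print(tokenizer.decode(probs.argmax(dim=-1)))   # most likely fill-in, e.g. "paris"
```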
Features:
- Bidirectional context: considers both the left and right context of every word, producing rich contextual embeddings (illustrated in the sketch after this list)
- Transformer-based: uses the Transformer's attention mechanism to learn contextual relations between words
- State-of-the-art performance: excels on a wide range of NLP benchmarks
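The bidirectional-context point can be illustrated with a small comparison, a sketch under the same transformers / bert-base-uncased assumptions as above: the same word "bank" gets a different embedding depending on the words around it.
```python
# Sketch: the same word gets different contextual embeddings in different sentences.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("She sat on the river bank.")
v_money = bank_vector("He deposited cash at the bank.")
# Same surface word, different vectors: cosine similarity is well below 1.0
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```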
Disadvantages:
- Resource-intensive: requires significant compute and memory
- Long training time
RoBERTa (Robustly Optimized BERT Pretraining Approach)
- An improved version of BERT with a more carefully optimized and more efficient pretraining procedure
- A powerful and effective language model
- A transformer-based model used for the same kinds of language-understanding tasks as BERT
- Same architecture as BERT
Training:
- Significantly larger training dataset: 160 GB of text (over 10x what BERT used)
- Drops the NSP objective, since removing it was found to match or improve downstream performance
- Larger batch sizes and longer training sequences, combined with higher learning rates
- Dynamic masking during training: each time the same sequence is fed to the model, a new random set of tokens is masked (shown, together with the byte-level tokenizer, in the sketch after this list)
- Byte-level BPE: replaces BERT's character-level BPE vocabulary with a larger byte-level BPE tokenizer
- Trained for longer and with more compute overall
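Two of these changes are easy to see in code. A sketch, assuming the Hugging Face transformers library and the roberta-base checkpoint (neither is given in the notes): the byte-level BPE tokenizer, and dynamic masking done on the fly by a data collator, so the same example is masked differently each time it is batched.
```python
# Sketch: RoBERTa's byte-level BPE tokenizer and dynamic masking via a collator.
from transformers import RobertaTokenizer, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")   # byte-level BPE, ~50k vocab
print(tokenizer.tokenize("Dynamic masking!"))                  # byte-level BPE pieces ('Ġ' marks a leading space)

# The collator masks 15% of tokens, re-drawn at random on every call,
# so repeated passes over the same data see different masks.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
example = [tokenizer("RoBERTa re-masks the same sentence differently each epoch.")]
for _ in range(2):
    print(collator(example)["input_ids"])   # <mask> positions typically differ between calls
```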
Limitations:
- Requires a large amount of GPU compute and memory
- May require domain-specific data to perform well on specialized tasks
- Requires a huge amount of data for pretraining, which limits its use on resource-constrained devices such as phones
- Its inner workings are difficult to interpret, which makes diagnosing errors hard
- Can inherit biases from its training data
- May struggle with content newer than its training data