BERT and RoBERTa
- BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model designed to understand language.
- It is considered to capture the context and meaning of language better than earlier models.
- BERT is freely available and is used for a large number of language tasks.
- Google Search uses BERT
Architecture:
- BERT is based on the Transformer architecture.
- A basic Transformer consists of an encoder and a decoder.
- Since BERT's purpose is to produce a language representation model, it only needs the encoder part; BERT is essentially a trained stack of Transformer encoders.
- As an encoder it takes in a sequence of tokens and produces a contextual vector for each token, plus a fixed-size vector representation of the entire sequence (as sketched below).
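To make the encoder role concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is named in these notes): it feeds a sentence through BERT and inspects the per-token and pooled outputs.
```python
# Minimal sketch: BERT as an encoder (assumes Hugging Face "transformers"
# and the "bert-base-uncased" checkpoint).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT encodes whole sentences.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token: shape (batch, seq_len, hidden_size)
print(outputs.last_hidden_state.shape)
# One pooled vector for the whole sequence (derived from the [CLS] token):
# shape (batch, hidden_size), where hidden_size is 768 for bert-base
print(outputs.pooler_output.shape)
```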
Training:
- Masked Language Modeling (MLM): 15% of the tokens in each sequence are replaced with a [MASK] token, and the model tries to predict the original value of the masked words from the surrounding context; this is what allows bidirectional training. Prediction requires adding a classification layer on top of the encoder output, multiplying the output vectors by the embedding matrix to project them into the vocabulary dimension, and calculating the probability of each word with softmax (see the MLM sketch after this list).
- Next Sentence Prediction (NSP): a binary classification loss that predicts whether two segments follow each other in the original text or come from entirely different documents, which teaches the model sentence-level relationships.
- The original input format used segment pairs with the NSP loss, where the combined length was at most 512 tokens.
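The MLM step can be sketched end-to-end with a pretrained masked-language-model head (again assuming Hugging Face transformers and bert-base-uncased, which are not named in the notes): mask a word, let the MLM head score the vocabulary, and take the most likely fill-in.
```python
# Minimal MLM sketch: mask a word, let the MLM head score the vocabulary
# (encoder output x embedding matrix -> softmax), and take the top prediction.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # (batch, seq_len, vocab_size)

# Locate the [MASK] position and turn its vocabulary scores into probabilities
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = torch.softmax(logits[0, mask_pos], dim=-1)
print(tokenizer.decode(probs.argmax(dim=-1)))   # most likely fill-in, e.g. "paris"
```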
Features:
- Bidirectional context: considers both the left and right context of every word, producing rich contextual embeddings (illustrated in the sketch after this list)
- Transformer-based: uses the Transformer's attention mechanism to learn contextual relations between words
- State-of-the-art performance: excels on a wide range of NLP benchmarks
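The bidirectional-context point can be illustrated with a small comparison, a sketch under the same transformers / bert-base-uncased assumptions as above: the same word "bank" gets a different embedding depending on the words around it.
```python
# Sketch: the same word gets different contextual embeddings in different sentences.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("She sat on the river bank.")
v_money = bank_vector("He deposited cash at the bank.")
# Same surface word, different vectors: cosine similarity is well below 1.0
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```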
Disadvantages:
- Resource-intensive: requires significant compute and memory
- Long training time
RoBERTa (Robustly Optimized BERT Pretraining Approach)
- An improved version of BERT with a more carefully optimized and more efficient pretraining procedure
- A powerful and effective language model
- A transformer-based model used for the same kinds of language-understanding tasks as BERT
- Same architecture as BERT
Training:
- Significantly larger training dataset: 160 GB of text (over 10x what BERT used)
- Drops the NSP objective, since removing it was found to match or improve downstream performance
- Larger batch sizes and longer training sequences, combined with higher learning rates
- Dynamic masking during training: each time the same sequence is fed to the model, a new random set of tokens is masked (shown, together with the byte-level tokenizer, in the sketch after this list)
- Byte-level BPE: replaces BERT's character-level BPE vocabulary with a larger byte-level BPE tokenizer
- Trained for longer and with more compute overall
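Two of these changes are easy to see in code. A sketch, assuming the Hugging Face transformers library and the roberta-base checkpoint (neither is given in the notes): the byte-level BPE tokenizer, and dynamic masking done on the fly by a data collator, so the same example is masked differently each time it is batched.
```python
# Sketch: RoBERTa's byte-level BPE tokenizer and dynamic masking via a collator.
from transformers import RobertaTokenizer, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")   # byte-level BPE, ~50k vocab
print(tokenizer.tokenize("Dynamic masking!"))                  # byte-level BPE pieces ('Ġ' marks a leading space)

# The collator masks 15% of tokens, re-drawn at random on every call,
# so repeated passes over the same data see different masks.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
example = [tokenizer("RoBERTa re-masks the same sentence differently each epoch.")]
for _ in range(2):
    print(collator(example)["input_ids"])   # <mask> positions typically differ between calls
```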
Limitations:
- Requires a large amount of GPU compute and memory
- May require domain-specific data to perform well on specialized tasks
- Requires a huge amount of data for pretraining, which limits its use on resource-constrained devices such as phones
- Its inner workings are difficult to interpret, which makes diagnosing errors hard
- Can inherit biases from its training data
- May struggle with content newer than its training data