These are my learning notes on BERT and NLP from Jay Alammar. I like how he illustrates abstract concepts with clear and detailed explanations. NLP tasks in general interest me a lot because they are a beautiful combination of profound intelligence and omnipresent application.

What is BERT

BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks. reference


  1. It is bidirectional
  2. It combines a Masked Language Model (MLM) objective with Next Sentence Prediction (NSP)
  3. It performs well at understanding context-heavy text

A closer look at model architecture

BERT is basically a trained bidirectional Transformer Encoder stack.

So… What is a Transformer?

The Transformer is a model that uses attention to boost training speed. It consists of an encoding component, a decoding component, and connections between them.

The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word.
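To make the idea of "looking at other words" concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is a simplification and not how BERT is implemented: a real Transformer first maps each input through learned query, key, and value projections and runs several heads in parallel, whereas this sketch assumes the raw input vectors serve as queries, keys, and values. The function names and the toy numbers are made up for illustration.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over one sequence (single head)."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # How strongly this position attends to every position in the sequence.
        weights = softmax([dot(q, k) / scale for k in keys])
        # The output is a weighted average of all value vectors.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs

# Toy 3-token sequence with 2-dimensional vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(x, x, x)
```

Each row of `attended` is the new representation of one token, built as a blend of every token in the sequence, which is what lets the encoder use the surrounding words while encoding a specific word.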

This concept itself is interesting and complicated. More information can be found here

Several Models before BERT

Word Embedding
  • What is it: use a vector (a list of numbers) to properly represent words in a way that captures semantic or meaning-related relationships (e.g. the ability to tell if words are similar, or opposites, or that a pair of words like “Stockholm” and “Sweden” have the same relationship between them as “Cairo” and “Egypt” have between them) as well as syntactic, or grammar-based, relationships (e.g. the relationship between “had” and “has” is the same as that between “was” and “is”).
  • Available as pre-trained models such as Word2Vec or GloVe
  • No context information: a word gets the same vector regardless of the sentence it appears in
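The "same relationship" idea from the first bullet can be sketched with vector arithmetic. The 4-dimensional vectors below are hand-made for illustration (the dimensions loosely stand for "is a city", "Sweden", "Egypt", "France"); real Word2Vec or GloVe vectors are learned from text and have hundreds of dimensions. The `analogy` helper is hypothetical, not part of either library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings, invented for this example.
toy = {
    "Stockholm": [1.0, 1.0, 0.0, 0.0],
    "Sweden":    [0.0, 1.0, 0.0, 0.0],
    "Cairo":     [1.0, 0.0, 1.0, 0.0],
    "Egypt":     [0.0, 0.0, 1.0, 0.0],
    "Paris":     [1.0, 0.0, 0.0, 1.0],
    "France":    [0.0, 0.0, 0.0, 1.0],
}

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# "Stockholm" is to "Sweden" as "Cairo" is to ...?
answer = analogy("Stockholm", "Sweden", "Cairo", toy)
```

With these toy vectors the offset from "Stockholm" to "Sweden" is the same as the offset from "Cairo" to "Egypt", so `answer` comes out as "Egypt".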
ELMo: Context Matters
  • Gives each word an embedding based on the context it is used in
  • ELMo looks at the entire sentence before assigning each word in it an embedding. It uses a bi-directional LSTM trained on a specific task to be able to create those embeddings. ELMo gained its language understanding from being trained to predict the next word in a sequence of words - a task called Language Modeling.
  • Can be used as a pre-trained model
OpenAI Transformer: Pre-training a Transformer Decoder for Language Modeling
  • The decoder is a natural choice for language modeling (predicting the next word) because it is built to mask future tokens, a valuable feature when it is generating a translation word by word.
  • Predicts the next word using massive (unlabeled) datasets.
  • The OpenAI Transformer gave us a fine-tunable pre-trained model based on the Transformer
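"Masking future tokens" can be pictured as a lower-triangular visibility table: position i may look at positions 0 through i and nothing later. The sketch below only builds that table and shows which tokens are visible; in a real decoder the mask is applied to the attention scores before the softmax. The function names are made up for illustration.

```python
def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]

def visible_context(tokens, position):
    """Tokens the model is allowed to see when processing `position`."""
    mask = causal_mask(len(tokens))
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]

tokens = ["the", "cat", "sat", "down"]
context = visible_context(tokens, 1)  # only "the" and "cat" are visible
```

This left-to-right restriction is exactly what BERT gives up in exchange for seeing both sides of a word, which is why it needs a different training objective.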

BERT: From Decoders to Encoders

BERT training

  1. Some tokens from the input sequence are masked and the model learns to predict the original words (masked language model, MLM).
  2. Two “sentences” are fed as input and the model is trained to predict whether the second sentence follows the first (next sentence prediction, NSP).
  3. So we feed BERT two sentences with some tokens masked, and it is trained on both objectives at once: predicting whether the sentences are consecutive, and recovering the original tokens at the masked positions.
So… how to mask?

In each input sequence, 15% of the tokens are selected for prediction and processed as follows:

  • with 0.8 probability the token is replaced by [MASK]
  • with 0.1 probability the token is replaced by another random token
  • with 0.1 probability the token is unchanged
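The masking rule above can be sketched in a few lines. This is a simplified illustration, not BERT's actual preprocessing code: it assumes each token is selected independently with probability 0.15, it works on whole words rather than WordPiece tokens, and `mask_tokens`, `vocab`, and the sample sentence are made up for the example.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    corrupted, labels = [], []
    for tok in tokens:
        # Special tokens are never selected; others are selected with select_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= select_prob:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the token unchanged
    return corrupted, labels

rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat"]
sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(sentence, vocab, rng)
```

Keeping some selected tokens unchanged or swapping in a random token means the model cannot rely on seeing [MASK] to know where a prediction is needed, which matters because [MASK] never appears at fine-tuning time.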
Next sentence prediction

The two sentences A and B are separated with the special token [SEP] and are chosen in such a way that 50% of the time B is the actual next sentence and 50% of the time it is a random sentence.
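A rough sketch of how such training pairs could be built is shown below. It assumes the corpus is a simple list of consecutive sentences and draws the random sentence from that same list; the helper name `make_nsp_pair` and the sample sentences are invented for illustration.

```python
import random

def make_nsp_pair(sentences, index, rng):
    """Build one (sentence_a, sentence_b, is_next) training example.

    `index` must not point at the last sentence, since it needs a successor.
    """
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # Positive example: B really is the sentence that follows A.
        return sentence_a, sentences[index + 1], True
    # Negative example: B is some other sentence, not A or its true successor.
    choices = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return sentence_a, rng.choice(choices), False

corpus = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Penguins are flightless birds.",
    "They live in the southern hemisphere.",
]
rng = random.Random(42)
sentence_a, sentence_b, is_next = make_nsp_pair(corpus, 0, rng)
```

The `is_next` flag is the label the model learns to predict from the pair.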

BERT input

The input sequence of BERT is composed of two sentences with a [SEP] token in between, preceded by the “classification token” [CLS] that will later be used for prediction. Each token has a token embedding, a segment embedding that identifies which sentence it belongs to, and a position embedding that marks its position in the sequence (this plays the same role as the positional encoding in the Transformer paper, although BERT learns its position embeddings rather than using fixed sinusoids). The three embeddings are then summed for each token.
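The layout of the input and the summing of embeddings can be sketched as follows. The embedding tables here are filled with random numbers of size 4 purely for illustration, whereas in BERT they are learned during pre-training and have size hidden_size. The sketch also appends a closing [SEP] after the second sentence, as the BERT paper does; the helper names are hypothetical.

```python
import random

def build_inputs(tokens_a, tokens_b):
    """Lay out [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 the rest.
    segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    positions = list(range(len(tokens)))
    return tokens, segments, positions

def embed(tokens, segments, positions, token_table, segment_table, position_table):
    """Sum token, segment and position embeddings element-wise per token."""
    return [
        [t + s + p for t, s, p in
         zip(token_table[tok], segment_table[seg], position_table[pos])]
        for tok, seg, pos in zip(tokens, segments, positions)
    ]

rng = random.Random(0)
dim = 4  # toy size; BERT Base uses 768

def random_vector():
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

tokens, segments, positions = build_inputs(["my", "dog"], ["he", "barks"])
token_table = {tok: random_vector() for tok in sorted(set(tokens))}
segment_table = [random_vector() for _ in range(2)]
position_table = [random_vector() for _ in range(len(tokens))]
inputs = embed(tokens, segments, positions,
               token_table, segment_table, position_table)
```

Both [SEP] tokens share one token embedding, yet their summed input vectors differ because their segment and position embeddings differ.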

BERT Outputs

Each position outputs a vector of size hidden_size (768 in BERT Base). For a sentence classification example, we focus on the output of only the first position (the one we passed the special [CLS] token to).
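As a rough sketch of what "focusing on the first position" means, the snippet below takes the first output vector and passes it through a single linear layer followed by a softmax. The output vectors, weights, and bias are made-up toy numbers with hidden size 4 instead of 768; in practice the classifier weights are learned during fine-tuning, and `classify_from_cls` is a hypothetical helper.

```python
import math

def classify_from_cls(sequence_output, weights, bias):
    """Linear layer plus softmax applied to the first ([CLS]) output vector."""
    cls_vector = sequence_output[0]  # all other positions are ignored
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy encoder output: 3 positions, hidden size 4.
sequence_output = [
    [0.5, -1.0, 0.25, 2.0],  # position 0, the [CLS] token
    [0.1, 0.1, 0.1, 0.1],
    [0.3, 0.3, 0.3, 0.3],
]
weights = [
    [1.0, 0.0, 0.0, 0.0],  # class 0, e.g. "not spam"
    [0.0, 0.0, 0.0, 1.0],  # class 1, e.g. "spam"
]
bias = [0.0, 0.0]
probs = classify_from_cls(sequence_output, weights, bias)
```

Here `probs` holds one probability per class, and only the [CLS] vector influences it.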

Use pre-trained fine-tuned models for specific tasks

A pre-trained BERT model can be fine-tuned, usually by adding a small task-specific output layer, to approach other NLP tasks (summarization, question answering, etc.).

Use BERT for feature extraction

Next step

Give BERT a try! I plan to take notes on this topic in my next post, based on this post.