BERT Deep Dive

Bidirectional understanding

BERT's Key Insight

BERT stands for Bidirectional Encoder Representations from Transformers. The name tells you nearly everything: this is an encoder-only model that reads text in both directions simultaneously.

Unlike the original transformer—which used both an encoder and decoder—BERT throws away the decoder entirely. There is no autoregressive generation here. BERT does not predict the next word. Instead, it builds a deep, contextualized understanding of text.

The breakthrough insight: every token sees every other token.

BERT Architecture

[Figure: BERT's encoder-only architecture. Input tokens ("The", "cat", "sat", "on", "mat") pass through token + position embeddings into a transformer encoder (12 layers of bidirectional self-attention), producing one contextual hidden state per token (h1–h5). Each token attends to all other tokens simultaneously, both left and right.]

When BERT processes the sentence "The cat sat on the mat," every word attends to every other word. The word "sat" sees both "cat" (to its left) and "mat" (to its right). This bidirectional attention allows BERT to capture context from the entire sentence at once.

This is fundamentally different from language models that read left-to-right. A left-to-right model processing "The bank" cannot yet know whether we mean a financial institution or a riverbank. BERT sees the whole sentence before making that decision.
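The difference between the two regimes comes down to the attention mask. Here is a minimal numpy sketch; the sequence length and token positions are illustrative:

```python
import numpy as np

seq_len = 6  # e.g. "The cat sat on the mat"

# Bidirectional (BERT-style): every position may attend to every other.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

# Causal (left-to-right LM): position i may attend only to positions <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# "sat" (position 2) sees all 6 tokens under BERT's mask...
print(int(bidirectional_mask[2].sum()))  # 6
# ...but only the 3 positions up to and including itself under a causal mask.
print(int(causal_mask[2].sum()))         # 3
```

In a real encoder, these boolean masks gate which attention scores survive before the softmax; BERT's mask is all ones, which is exactly what "every token sees every other token" means.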

Masked Language Modeling

If BERT sees everything at once, how do you train it? You cannot use next-token prediction because the model already sees all the tokens.

The solution: masked language modeling (MLM).

During training, 15% of input tokens are randomly selected as prediction targets. Of those, 80% are replaced with a special [MASK] token, 10% are swapped for a random token, and 10% are left unchanged. The model's task is to predict the original token at each selected position.

"The cat [MASK] on the mat" → "sat"

For the input "The cat [MASK] on the mat," a trained BERT's top predictions for the masked position might look like this:

  • sat (72%)
  • was (11%)
  • lay (8%)
  • slept (5%)

The key point: to predict a masked word, BERT must understand context from both directions. To guess that [MASK] should be "sat," the model needs to understand that cats sit, that mats are things you sit on, and that the surrounding grammar calls for a past-tense verb.

This simple training objective teaches the model grammar, facts, common sense, and nuanced word meanings—all without any labeled data.
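The masking procedure itself fits in a few lines of Python. This is an illustrative toy over word strings (real BERT operates on WordPiece token ids; `mask_tokens` and the tiny vocabulary here are made up for the example):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Select ~15% of tokens as targets; 80% -> [MASK], 10% -> random, 10% kept."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this original token
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")           # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random token
            else:
                masked.append(tok)                # 10%: keep unchanged
        else:
            masked.append(tok)
            labels.append(None)  # not a prediction target; no loss here
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, vocab=["dog", "ran", "hat"])
print(masked)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`; the 10%-random and 10%-unchanged cases keep the model from relying on [MASK] ever appearing at inference time.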

Next Sentence Prediction

BERT's training includes a second objective: next sentence prediction (NSP).

Given two sentences, the model predicts whether they were originally consecutive in the source text. Half the training examples are real consecutive sentences; half are random pairings.

  • Sentence A: "The weather is nice today."
  • Sentence B: "I think I'll go for a walk."
  • Label: IsNext ✓

This task was designed to help BERT understand document-level relationships and multi-sentence reasoning.
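Constructing NSP training pairs is straightforward. A sketch under one simplification (`make_nsp_pairs` is a made-up helper; real BERT draws the random second sentence from a different document, so a "NotNext" pair cannot accidentally be a true continuation):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Label ~half the pairs IsNext (true continuation), half NotNext (random)."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # Real BERT samples this from a *different* document.
            pairs.append((sentences[i], rng.choice(sentences), "NotNext"))
    return pairs

docs = [
    "The weather is nice today.",
    "I think I'll go for a walk.",
    "Transformers use self-attention.",
]
for a, b, label in make_nsp_pairs(docs):
    print(label, "|", a, "->", b)
```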

Later research (RoBERTa, 2019) showed that NSP might not be as important as originally thought. Some BERT variants drop it entirely with no loss in performance. But the original BERT paper included it, and it remains part of the canonical BERT training recipe.

Pre-training and Fine-tuning

BERT introduced a paradigm that transformed NLP: pre-train once, fine-tune many times.

[Figure: The pre-train/fine-tune flow. A BERT model pre-trained on unlabeled text (Wikipedia, books) gains a small task head during fine-tuning on labeled examples, producing a task-specific model ready for new inputs.]

Stage 1: Pre-training

  • Data: billions of tokens of unlabeled text (Wikipedia, books)
  • Time: days to weeks
  • Task: masked language modeling

Stage 2: Fine-tuning

  • Data: a relatively small labeled dataset
  • Time: minutes to hours
  • Task: the target task (classification, QA, NER, and so on)

Pre-training is expensive but happens once. Fine-tuning is fast and cheap—the same BERT model can be adapted to countless tasks.

Pre-training happens once. The model trains on massive unlabeled text—Wikipedia, books, web pages—learning to predict masked tokens. This phase takes days or weeks on expensive hardware, but the resulting model understands language deeply.

Fine-tuning adapts the pre-trained model to specific tasks. You add a small task-specific layer on top of BERT and train on a labeled dataset. Fine-tuning is fast—often minutes to hours—and requires relatively little data.
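Concretely, for classification the "small task-specific layer" is typically just a linear layer plus softmax over BERT's [CLS] output. A numpy sketch, with a random vector standing in for the real [CLS] embedding and made-up placeholder weights rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, num_classes = 768, 2  # BERT-base hidden size; binary task

# Stand-in for BERT's output at the [CLS] token for one input sequence.
cls_embedding = rng.standard_normal(hidden_size)

# The task head: a learned linear projection to class logits.
W = rng.standard_normal((num_classes, hidden_size)) * 0.02
b = np.zeros(num_classes)

logits = W @ cls_embedding + b

# Softmax (shifted by the max for numerical stability).
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)  # class probabilities, e.g. [negative, positive]
```

During fine-tuning, both the head (W, b) and BERT's own weights are updated by backpropagating the classification loss; only the head starts from scratch.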

The result: a model pre-trained on general text can be fine-tuned to excel at tasks it was never explicitly trained for. Sentiment analysis, question answering, named entity recognition—all become possible with the same foundation.

When BERT was released in 2018, it set new state-of-the-art results on 11 NLP tasks simultaneously. The pre-train/fine-tune paradigm quickly became the default approach across the field.

BERT Variants

The original BERT came in two sizes:

Model        Layers   Hidden Size   Params
BERT-base    12       768           110M
BERT-large   24       1024          340M
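A back-of-envelope check of those parameter counts, assuming the standard transformer layer breakdown (roughly 4h² weights for the attention projections and 8h² for the feed-forward block with intermediate size 4h) and BERT's 30,522-token WordPiece vocabulary. Position/segment embeddings, biases, and layer norms are omitted, which is why the totals land slightly under the official figures:

```python
def approx_params(layers, hidden, vocab=30522):
    # Per encoder layer: ~4h^2 (Q, K, V, output projections)
    # plus ~8h^2 (feed-forward: h -> 4h -> h).
    per_layer = 4 * hidden**2 + 8 * hidden**2
    embeddings = vocab * hidden  # token embedding table
    return layers * per_layer + embeddings

print(f"BERT-base:  ~{approx_params(12, 768) / 1e6:.0f}M")   # ~108M
print(f"BERT-large: ~{approx_params(24, 1024) / 1e6:.0f}M")  # ~333M
```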

The success of BERT spawned many variants:

RoBERTa (2019) trained longer on more data, removed NSP, and used dynamic masking. It consistently outperformed the original BERT.

ALBERT (2019) reduced parameters through cross-layer sharing and factorized embeddings, achieving strong performance with far fewer parameters.

DistilBERT (2019) used knowledge distillation to compress BERT into a model 40% smaller and 60% faster, while retaining 97% of its performance.

XLNet, ELECTRA, DeBERTa, and others introduced further innovations. But all trace their lineage back to BERT's core insight: bidirectional attention trained with masked language modeling.

Use Cases

BERT excels at tasks that require understanding text, not generating it:

Text Classification: Given a document, predict its category. Spam detection, sentiment analysis, topic classification. BERT's contextual embeddings capture nuance that bag-of-words models miss.

Named Entity Recognition (NER): Identify people, places, organizations, and other entities in text. BERT's bidirectional context helps disambiguate entities based on surrounding words.

Question Answering: Given a passage and a question, find the answer span in the passage. BERT processes both simultaneously, attending between the question and relevant parts of the passage.

Semantic Similarity: Determine how similar two sentences are in meaning. BERT embeddings can be compared directly or fine-tuned for similarity scoring.
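The simplest version of that direct comparison is cosine similarity between sentence vectors. A numpy sketch with tiny made-up 4-dimensional vectors standing in for real 768-dimensional BERT embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1 = same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "sentence embeddings"; the values are invented for illustration.
a = np.array([0.9, 0.1, 0.3, 0.0])  # "The cat sat on the mat."
b = np.array([0.8, 0.2, 0.4, 0.1])  # "A cat was sitting on a rug."
c = np.array([0.0, 0.9, 0.0, 0.8])  # "Stock prices fell sharply."

print(cosine_similarity(a, b))  # high: similar meaning
print(cosine_similarity(a, c))  # low: unrelated
```

In practice, fine-tuned variants (such as sentence-embedding models built on BERT) produce vectors whose cosine similarity tracks semantic similarity much better than raw BERT outputs do.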

BERT fundamentally changed what was possible in NLP. Tasks that once required years of feature engineering could now be solved by fine-tuning a pre-trained model. The next chapter explores GPT—BERT's decoder-only counterpart that took a different path to even larger scales.

Key Takeaways

  • BERT is an encoder-only transformer that processes text bidirectionally
  • Masked language modeling trains the model to predict randomly hidden tokens
  • The pre-train/fine-tune paradigm allows one model to excel at many tasks
  • BERT excels at understanding tasks: classification, NER, QA, similarity
  • BERT spawned numerous variants (RoBERTa, ALBERT, DistilBERT) that built on its core ideas