Training Transformers

From random to intelligent

A freshly initialized transformer is just random noise. Every weight, every connection—just numbers drawn from a probability distribution. It knows nothing about language, about the world, about anything. Yet somehow, after seeing enough text, it becomes capable of coherent generation.

How does a pile of random numbers become something that can write poetry, explain code, or hold a conversation?

The Training Objective

Training begins with a simple question: what should the model learn to do? For transformers, two objectives dominate.

Causal Language Modeling (CLM) asks the model to predict the next token given all previous tokens. This is how GPT models learn. Given "The cat sat on the," predict "mat." Given "def fibonacci(n):", predict what comes next in the code.

Masked Language Modeling (MLM) takes a different approach. Hide some tokens in the middle of a sentence and ask the model to fill them in. This is BERT's training objective. Given "The [MASK] sat on the mat," predict that the hidden word is "cat."
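A minimal sketch of how the two objectives turn a token sequence into (input, target) pairs. The token IDs, the `MASK_ID` value, and the helper names are illustrative, not from any real tokenizer or library:

```python
MASK_ID = 0  # hypothetical id reserved for the [MASK] token

def clm_pairs(tokens):
    """Causal LM: at each position, the target is simply the next token."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return inputs, targets

def mlm_pairs(tokens, mask_positions):
    """Masked LM: hide chosen tokens; the targets are the originals there."""
    inputs = [MASK_ID if i in mask_positions else t
              for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in mask_positions}
    return inputs, targets

tokens = [17, 42, 99, 7, 23]               # "The cat sat on the" as toy ids
clm_in, clm_tgt = clm_pairs(tokens)        # predict each next token
mlm_in, mlm_tgt = mlm_pairs(tokens, {1})   # hide position 1 ("cat")
```

Note how CLM produces a training signal at every position, while MLM only supervises the masked positions.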

Both approaches share a remarkable property: they are self-supervised. No human needs to label the data. The text itself provides the supervision. Every book, every webpage, every code repository becomes training material automatically.

The Simplicity of Next-Token Prediction

Here is the insight that underlies modern language models: next-token prediction, repeated billions of times, teaches everything.

Think about what you need to know to predict the next word accurately. "The capital of France is..." To predict "Paris," you need geographical knowledge. "The derivative of x² is..." To predict "2x," you need calculus. "She felt a wave of..." To predict "relief" or "sadness," you need emotional understanding.

The model does not set out to learn grammar. It does not try to memorize facts. It simply optimizes for prediction. But along the way, grammar helps prediction. Facts help prediction. Reasoning helps prediction. So the model learns all of these, not as explicit goals, but as useful tools for its actual objective.

Interactive: Next Token Prediction

Context

The capital of France is|

Next Token Probabilities

The model outputs a probability distribution over all possible next tokens, and the highest-probability token becomes the prediction. During training, the model learns to assign high probability to correct continuations; over billions of examples, it learns which continuations are most likely in which contexts.
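In practice, the model emits a raw score (a logit) per vocabulary token, and a softmax turns those scores into a probability distribution. The vocabulary and the scores below are made up for illustration:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores after the context "The capital of France is"
vocab = ["Paris", "London", "a", "the"]
logits = [5.0, 2.0, 1.0, 0.5]

probs = softmax(logits)
prediction = vocab[probs.index(max(probs))]  # highest probability wins
```

A real model does the same thing, just over a vocabulary of tens of thousands of tokens.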

The Loss Function

The model outputs a probability distribution over all possible next tokens. The loss function measures how surprised the model is by the actual next token.

This is cross-entropy loss:

\mathcal{L} = -\log P(\text{true token})

If the model assigns 90% probability to the correct next token, the loss is low: $-\log(0.9) \approx 0.1$. If it assigns only 1% probability, the loss is high: $-\log(0.01) \approx 4.6$.

The training objective is to minimize this surprise across all tokens in all training documents. Make the model less surprised by correct continuations, more confident in its predictions.
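The numbers above can be checked directly. A one-line sketch of the per-token loss, taking the probability the model assigned to the true token:

```python
import math

def cross_entropy(prob_true_token):
    """Surprise at the true next token: low when confident and correct."""
    return -math.log(prob_true_token)

confident = cross_entropy(0.9)    # low loss: model was barely surprised
surprised = cross_entropy(0.01)   # high loss: model badly mispredicted
```

Averaging this quantity over every token in the training corpus gives the value the optimizer actually minimizes.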

Interactive: Loss Curve

Step 0 / 100

Watch the loss decrease over training. Each step nudges the parameters slightly, making the model a bit less surprised by real text. Early in training, progress is rapid. Later, improvements become incremental as the model has already learned the easy patterns.

Training at Scale

Training a large language model is an engineering marathon. Consider GPT-3's training:

  • 175 billion parameters to update
  • 300 billion tokens of training data
  • Thousands of GPUs working in parallel
  • Months of compute time

Several techniques make this possible:

Batch Size and Learning Rate work together. Larger batches provide more stable gradient estimates, allowing higher learning rates, but batches that are too large can hurt generalization. Modern training often uses batch sizes of hundreds of thousands of tokens or more per step.

Learning Rate Schedules vary the step size over training. A common pattern: start small (warmup), increase to peak, then slowly decay. This helps the model settle into good solutions rather than bouncing around.
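The warmup-then-decay pattern can be sketched in a few lines. The specific constants and the choice of cosine decay below are illustrative; real training runs pick these values per model:

```python
import math

def lr_schedule(step, total_steps, peak_lr=3e-4, warmup_steps=100):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        # Warmup: ramp linearly from near zero up to the peak
        return peak_lr * (step + 1) / warmup_steps
    # Decay: follow a half-cosine from the peak down toward zero
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

early = lr_schedule(0, 1000)     # tiny steps while gradients are noisy
peak = lr_schedule(99, 1000)     # full speed once training stabilizes
late = lr_schedule(900, 1000)    # small steps to settle into a solution
```

The shape matters more than the exact formula: small steps while the random initialization produces chaotic gradients, large steps through the bulk of training, small steps again at the end.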

Data Parallelism splits each batch across multiple GPUs. Each GPU computes gradients on its slice, then all GPUs synchronize to update the shared parameters.

Model Parallelism becomes necessary when a model does not fit on one GPU. Different layers live on different devices, passing activations between them.
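The core of data parallelism is gradient averaging. The sketch below simulates it sequentially with plain Python lists; real systems run the per-shard work on separate GPUs and average via an all-reduce. The `grad_fn` interface and the toy MSE model are assumptions for illustration:

```python
def data_parallel_step(params, shards, grad_fn, lr=0.1):
    """Each 'GPU' computes gradients on its shard; gradients are averaged
    (the all-reduce), then every replica applies the same update."""
    per_worker = [grad_fn(params, shard) for shard in shards]  # parallel in reality
    avg = [sum(gs) / len(per_worker) for gs in zip(*per_worker)]
    return [p - lr * g for p, g in zip(params, avg)]

def mse_grad(params, shard):
    """Gradient of mean squared error for the toy model y = w * x."""
    w = params[0]
    return [sum(2 * (w * x - y) * x for x, y in shard) / len(shard)]

# Two shards of (x, y) pairs, all generated by the true rule y = 2x
shards = [[(1.0, 2.0), (2.0, 4.0)],
          [(3.0, 6.0), (4.0, 8.0)]]
new_params = data_parallel_step([0.0], shards, mse_grad)
```

Because every replica applies the same averaged gradient, the parameters stay identical across devices, which is what makes the scheme equivalent to training on one large batch.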

The training data itself matters enormously. Modern models train on curated mixes of:

  • Web text (filtered for quality)
  • Books and literature
  • Scientific papers
  • Code repositories
  • Conversation data

The exact mix and filtering are closely guarded secrets. Data quality often matters more than quantity.

The Emergence of Capabilities

Something remarkable happens during training. The model starts as random noise. Early in training, it learns basic patterns: common words, simple grammar. Then it picks up syntax, learns that verbs follow subjects, that sentences have structure.

As training continues, more sophisticated capabilities emerge. The model learns facts about the world. It learns to follow instructions. It learns to reason, at least in ways that look like reasoning.

None of these capabilities were explicitly programmed. They emerged from the simple pressure to predict the next token better than before.

Key Takeaways

  • Transformers learn through simple objectives: predict the next token (CLM) or fill in masked tokens (MLM)
  • These are self-supervised—the text itself provides the training signal, no human labels needed
  • Cross-entropy loss measures how "surprised" the model is by the true next token
  • Training at scale requires careful orchestration: batch sizes, learning rate schedules, distributed computing
  • Capabilities emerge from the pressure to predict well—grammar, facts, reasoning all help prediction