From Words to Numbers

Tokenization and embeddings

Why We Can't Just Number Words

Neural networks speak the language of numbers. They perform matrix multiplications, compute gradients, and apply activation functions—all operations that require numerical inputs. But our data is text: strings of characters forming words forming sentences.

The naive approach might be to assign each word a number. "The" is 1, "cat" is 2, "sat" is 3. Simple, right?

The problem is that numbers carry meaning. The number 2 sits between 1 and 3. It's closer to 1 than to 100. But does "cat" sit between "the" and "sat"? Is "cat" closer to "the" than to "elephant"? These numerical relationships are completely arbitrary and misleading.

When you feed numbers into a neural network, it performs operations like averaging. The average of 2 and 6 is 4. If "cat" is 2 and "dog" is 6, is word 4 somehow the "average" of a cat and dog? Of course not. The numbering scheme imposes false structure.

We need a representation that lets similar things be numerically similar—where the distance between "cat" and "dog" is small, while the distance between "cat" and "democracy" is large.
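A tiny sketch makes the contrast concrete. The scalar IDs and the 3-dimensional vectors below are made-up values for illustration, not from any real model:

```python
# Toy illustration: scalar word IDs impose false structure.
# These ID assignments are arbitrary, as in the text above.
ids = {"the": 1, "cat": 2, "sat": 3, "dog": 6}

# Averaging IDs produces a meaningless "word": (2 + 6) / 2 = 4,
# which is not the "average" of a cat and a dog.
avg = (ids["cat"] + ids["dog"]) / 2

# With vectors, distance can reflect meaning. These 3-d vectors
# are invented purely for illustration.
vecs = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "democracy": [0.0, 0.1, 0.9],
}

def dist(a, b):
    # Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# "cat" ends up nearer to "dog" than to "democracy".
assert dist(vecs["cat"], vecs["dog"]) < dist(vecs["cat"], vecs["democracy"])
```

The vector version is exactly what embeddings provide, as we'll see below.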

Tokenization

Before we can represent words as numbers, we need to decide what counts as a "word." This process is called tokenization—breaking text into discrete units called tokens.

The simplest approach is word-level tokenization: split on spaces and punctuation. "The cat sat." becomes ["The", "cat", "sat", "."]. This works, but has a major flaw: your vocabulary explodes. Every misspelling, every rare proper noun, every technical term needs its own entry. You end up with millions of tokens, most appearing rarely.

At the other extreme, character-level tokenization uses individual characters as tokens. "The" becomes ["T", "h", "e"]. Now your vocabulary is tiny (just letters and punctuation), but sequences become very long, and the model must learn to assemble characters into meaningful units from scratch.
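Both extremes can be sketched in a few lines. This is a deliberately naive word splitter (real tokenizers handle punctuation far more carefully):

```python
text = "The cat sat."

# Word-level: split on whitespace, peeling trailing punctuation
# off as its own token. A crude sketch, not a production tokenizer.
words = []
for chunk in text.split():
    if chunk[-1] in ".,!?":
        words.extend([chunk[:-1], chunk[-1]])
    else:
        words.append(chunk)
assert words == ["The", "cat", "sat", "."]

# Character-level: every character is a token. Tiny vocabulary,
# but sequences get long and carry no built-in word structure.
chars = list("The")
assert chars == ["T", "h", "e"]
```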

Interactive: How Text Becomes Tokens

Example input: "The transformers are playing" → 4 tokens (each word is one token).

Word-level: Simple but creates huge vocabularies. Every typo, rare word, and name needs its own token.


The winning approach is subword tokenization, with methods like Byte Pair Encoding (BPE). The idea is clever: start with characters, then iteratively merge the most frequent pairs.

Consider "unhappiness." Instead of treating it as one rare word or seven characters, BPE might split it into ["un", "happi", "ness"]. Each piece is common enough to appear frequently in the training data. The model can recognize that "un" often means negation, "ness" often forms nouns, and combine these learned patterns.

This handles rare words gracefully. Even if the model has never seen "cryptocurrency," it might tokenize it as ["cry", "pt", "o", "currency"], reusing knowledge about "currency" and common letter patterns.

Modern models like GPT use vocabularies of 50,000 to 100,000 subword tokens—enough to represent almost any text while keeping sequences manageable.

Embeddings

Now comes the key insight: instead of representing each token as a single number, we represent it as a vector—a list of many numbers.

Imagine each token as a point in high-dimensional space. Similar tokens cluster near each other. "Cat" and "dog" are close because they're both animals. "Run" and "sprint" are close because they're both verbs of motion. "King" and "queen" are close because they're both royalty.

Interactive: Words in Embedding Space

Each word is a point in embedding space. Similar meanings cluster together. Hover over words to see their nearest neighbors.

This vector representation is called an embedding. When the model processes the word "cat," it doesn't see the number 2—it sees a list of perhaps 768 numbers that position "cat" in semantic space.

Here's the remarkable thing: these embeddings are learned. We don't manually decide that "cat" should be at position [0.3, -0.7, 0.2, ...]. Instead, the model learns embeddings during training by observing how words are used. Words that appear in similar contexts end up with similar vectors.

This is the distributional hypothesis at work: "You shall know a word by the company it keeps." The words around "cat" (pet, furry, meows, scratches) are similar to the words around "dog" (pet, furry, barks, wags), so their embeddings become similar.
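The distributional hypothesis can be demonstrated with raw co-occurrence counts; the three-sentence corpus below is made up, and shared-context counting is a crude stand-in for learned embeddings:

```python
from collections import Counter

# Tiny invented corpus.
sentences = [
    "the cat is a furry pet that meows",
    "the dog is a furry pet that barks",
    "democracy is a system of government",
]

def context_vector(word):
    # Count every other word that shares a sentence with `word`.
    ctx = Counter()
    for s in sentences:
        tokens = s.split()
        if word in tokens:
            ctx.update(t for t in tokens if t != word)
    return ctx

def overlap(a, b):
    # Number of shared context words: a crude similarity measure.
    return len(set(context_vector(a)) & set(context_vector(b)))

# "cat" and "dog" keep almost the same company;
# "cat" and "democracy" barely overlap.
assert overlap("cat", "dog") > overlap("cat", "democracy")
```

Real models learn dense vectors rather than raw counts, but the signal they exploit is the same: shared contexts.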

Semantic Arithmetic

The embedding space has a useful property: it captures relationships as directions. The vector from "man" to "woman" points in roughly the same direction as the vector from "king" to "queen." Both capture the concept of gender.

This enables semantic arithmetic:

\text{king} - \text{man} + \text{woman} \approx \text{queen}

Take the embedding of "king," subtract the embedding of "man" (removing "maleness"), add the embedding of "woman" (adding "femaleness"), and you land near "queen." The geometry of the space encodes meaning.
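Here is the arithmetic on toy vectors. The 3-dimensional embeddings below are hand-picked for illustration (dimensions loosely standing for royalty, maleness, and person-ness); real embeddings are learned and have hundreds of dimensions:

```python
# Made-up toy embeddings, purely illustrative.
emb = {
    "king":  [0.9, 0.9, 1.0],
    "queen": [0.9, 0.1, 1.0],
    "man":   [0.1, 0.9, 1.0],
    "woman": [0.1, 0.1, 1.0],
}

# king - man + woman, computed dimension by dimension.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

def dist(a, b):
    # Euclidean distance.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# The nearest word to the result is "queen".
nearest = min(emb, key=lambda w: dist(emb[w], target))
assert nearest == "queen"
```

In practice this lookup is done with cosine similarity over a full vocabulary, and the analogy holds approximately rather than exactly.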

Interactive: Semantic Arithmetic

king − man + woman ≈ queen

The vector from man to king captures the "royalty" relationship. Adding this vector to woman lands near queen.

These relationships emerge automatically from training—the model learns that analogous word pairs should differ by similar vectors.

This isn't just a parlor trick. It reveals that embeddings capture compositional structure. Relationships like capital-of, past-tense, and comparative forms often correspond to consistent directions in the embedding space.

The famous example "king − man + woman ≈ queen" works because the embedding space has learned that the difference between king and queen is analogous to the difference between man and woman—both capturing gender while preserving the concept of royalty.

The Embedding Dimension

How many numbers should be in each embedding vector? This is the embedding dimension, and it's a crucial hyperparameter.

GPT-3 uses embeddings with 12,288 dimensions. Each token is represented by 12,288 numbers. That sounds like a lot—and it is. But consider what those dimensions must encode: every nuance of meaning, every grammatical property, every contextual possibility.

What do these dimensions represent? Unlike traditional features (height, weight, color), embedding dimensions don't have predefined meanings. They're learned abstractions. One dimension might loosely correspond to "animacy." Another might capture "formality." Most are complex combinations that defy simple labels.

The embedding layer is typically the first thing a token passes through. Raw token IDs (integers from 0 to vocabulary size) get transformed into dense embedding vectors. These vectors are what flow through the rest of the transformer, getting transformed and combined as the model builds understanding.

\text{token ID} \xrightarrow{\text{embedding lookup}} \text{vector of dimension } d

Think of it as a translation layer: converting the discrete, symbolic world of text into the continuous, numerical world where neural networks operate.
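The lookup itself is nothing more than row indexing into a matrix. In this sketch the table is filled with random numbers as a stand-in for learned values, and the vocabulary size and dimension are arbitrarily small:

```python
import random

# A minimal embedding table: a (vocab_size x d) matrix where row i
# is the vector for token ID i. Random values stand in for what
# training would learn.
random.seed(0)
vocab_size, d = 8, 4
embedding_table = [
    [random.uniform(-1, 1) for _ in range(d)] for _ in range(vocab_size)
]

def embed(token_ids):
    # The "embedding lookup": each integer ID selects one row.
    return [embedding_table[i] for i in token_ids]

vectors = embed([3, 0, 5])
assert len(vectors) == 3 and all(len(v) == d for v in vectors)
```

In a deep-learning framework this table is a trainable parameter (e.g. an embedding layer), updated by gradient descent like any other weight.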

Looking Ahead

We've solved the first challenge: turning text into numbers that preserve semantic relationships. But we haven't addressed word order yet. The embeddings for "The cat sat on the mat" and "The mat sat on the cat" would be identical so far—just different arrangements of the same vectors.

In the next chapter, we'll explore how transformers encode position information, allowing the model to distinguish where each token appears in the sequence.

Key Takeaways

  • Simple word numbering fails because numerical relationships (ordering, distance) don't match semantic relationships
  • Tokenization breaks text into discrete units—subword tokenization (like BPE) balances vocabulary size with sequence length
  • Embeddings represent tokens as vectors in high-dimensional space where similar meanings cluster together
  • Embeddings are learned from data, capturing the distributional structure of how words are used
  • Semantic arithmetic works because relationships like gender or tense correspond to consistent directions in embedding space
  • Modern models use embedding dimensions in the thousands, allowing rich representations of meaning