Tokenization Internals
BPE, WordPiece, and how subword tokenization affects embeddings
Before text becomes embeddings, it becomes tokens. This transformation is more consequential than it appears. The tokenization strategy affects what the model can represent, how it handles rare words, and even how much computation it uses.
Why Not Words?
Word-level tokenization seems natural: split on whitespace and punctuation. But it has serious problems.
Vocabulary explosion. English has 170,000+ words in common use. Add technical terms, names, and misspellings, and the vocabulary is unbounded.
Out-of-vocabulary (OOV) problem. Words not seen during training have no representation. "ChatGPT" in 2020 would be unknown to any model trained before it.
No morphological insight. "Walk," "walks," "walked," "walking" are separate tokens. The model must independently learn they are related.
Subword tokenization solves these issues. Instead of words, we use pieces—characters, common substrings, and full words for frequent terms.
Byte Pair Encoding (BPE)
BPE, originally a compression algorithm, is the foundation of modern tokenization. The idea: iteratively merge the most frequent pairs of symbols.
Building the Vocabulary
Start with a vocabulary of individual characters. Then:
- Count all pairs of adjacent symbols in the corpus
- Find the most frequent pair (e.g., "t" + "h")
- Merge that pair into a new symbol ("th")
- Repeat until vocabulary reaches target size
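The merge loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production trainer; real implementations add byte-level handling, pre-tokenization, and caching:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from a list of words (minimal sketch)."""
    # Represent each word as a tuple of symbols, weighted by frequency.
    word_freqs = Counter(corpus)
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # 1. Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # 2. Pick the most frequent pair, and 3. merge it everywhere.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges
```

On a toy corpus like `["the", "the", "them", "that"]`, the first merge is ("t", "h"), the most frequent adjacent pair, followed by ("th", "e").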
After 30,000 merges on a large corpus, you get a vocabulary where:
- Common words are single tokens ("the", "and", "of")
- Moderately common words are 2-3 tokens ("running" → "run" + "ning")
- Rare words decompose to characters ("xyzzy" → "x" + "y" + "z" + "z" + "y")
Tokenizing New Text
To tokenize with a trained BPE vocabulary:
- Start with characters
- Apply merges in order learned
- Stop when no more merges apply
"unbelievable" might become: ["un", "believ", "able"]
The model sees familiar pieces even for novel words.
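The steps above can be sketched as follows, assuming `merges` is the ordered list a BPE trainer produced (earliest merges first):

```python
def bpe_tokenize(word, merges):
    """Tokenize a word by applying learned BPE merges in training order.

    A minimal sketch: each merge is applied everywhere it occurs before
    moving on to the next one.
    """
    symbols = list(word)  # start with individual characters
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the pair in place
            else:
                i += 1
    return symbols
```

With the merges ("t", "h") and ("th", "e"), the unseen word "them" becomes ["the", "m"]: familiar pieces even for a word the trainer never saw whole.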
WordPiece
WordPiece, used by BERT, is similar to BPE but uses a different merge criterion.
Instead of raw frequency, WordPiece merges the pair that most increases the likelihood of the training corpus under its language model. The score for a candidate pair is:

score(x, y) = count(xy) / (count(x) × count(y))

Pairs whose parts rarely occur apart score highest and get merged first. This produces slightly different vocabularies than BPE, often with more linguistically meaningful units.
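To make the criterion concrete, here is a sketch that scores pairs as count(xy) / (count(x) × count(y)) over a toy corpus of symbol sequences. Note how it can prefer a less frequent pair whose parts rarely occur apart, where BPE would simply pick the highest raw count:

```python
from collections import Counter

def wordpiece_scores(corpus_symbols):
    """Score adjacent pairs the WordPiece way (sketch).

    BPE would merge the pair with the highest raw count; WordPiece
    divides that count by the counts of the individual symbols.
    """
    singles, pairs = Counter(), Counter()
    for symbols in corpus_symbols:
        singles.update(symbols)
        pairs.update(zip(symbols, symbols[1:]))
    return {p: c / (singles[p[0]] * singles[p[1]]) for p, c in pairs.items()}
```

In a corpus where ("x", "y") occurs twice but "x" is common elsewhere, while ("w", "z") occurs once and "w" appears nowhere else, WordPiece scores ("w", "z") higher even though BPE would merge ("x", "y") first.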
WordPiece tokens are marked with "##" when they continue a word:
"embedding" → ["em", "##bed", "##ding"]
The "##" tells the model this is not word-initial.
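At inference time, WordPiece segments a word by greedy longest-match-first lookup. A minimal sketch, using a small hypothetical vocabulary in which continuation pieces carry the "##" prefix:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece segmentation (sketch)."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark non-word-initial pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece fits at this position
        tokens.append(piece)
        start = end
    return tokens
```

With the toy vocabulary {"em", "##bed", "##ding"}, "embedding" segments exactly as in the example above: ["em", "##bed", "##ding"].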
SentencePiece
SentencePiece treats the input as a raw character stream rather than pre-tokenized words; whitespace becomes an ordinary symbol (rendered as "▁"). This enables:
- Language-agnostic tokenization
- Handling any Unicode
- No whitespace assumptions
Models such as T5 and LLaMA use SentencePiece for this flexibility. GPT models achieve similar robustness with byte-level BPE, which operates on raw bytes so that no input is ever out of vocabulary.
Vocabulary Size Trade-offs
Larger vocabularies mean fewer tokens per word but more memory and risk of undertrained tokens.
Smaller vocabulary (8K-16K):
- More tokens per text (slower)
- Better OOV handling
- Less memory for embeddings
- May split common words awkwardly
Larger vocabulary (32K-100K):
- Fewer tokens (faster)
- More words as single tokens
- More memory for embedding matrix
- Rare tokens have few training examples
Many models use vocabularies of 30K-50K tokens, though recent large language models have pushed to 100K and beyond. The choice balances coverage against embedding quality.
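The memory side of this trade-off is easy to estimate: the embedding matrix alone is vocab_size × embedding_dim parameters. A quick back-of-envelope helper (float32 assumed; the output softmax often mirrors this cost):

```python
def embedding_memory_mb(vocab_size, embedding_dim, bytes_per_param=4):
    """Memory of the embedding matrix alone, in MiB (float32 default)."""
    return vocab_size * embedding_dim * bytes_per_param / (1024 ** 2)

# A 32K vocabulary with 768-dimensional embeddings costs ~94 MiB;
# doubling the vocabulary doubles it, while each rare token sees
# fewer training examples.
```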
How Tokenization Affects Embeddings
Tokenization has downstream effects:
Sequence length. More tokens = longer sequences = more computation. A 512-token limit might cover 400 words or 250 words depending on tokenization.
Semantic granularity. Subword tokens learn meaning from their occurrences. "un-" appearing in "unhappy," "uncertain," "undo" learns negation. This is compositional understanding.
Rare word handling. A rare word decomposed into common subwords inherits meaning from those pieces. "Defenestration" → "de" + "fen" + "est" + "ration" has partial meaning from parts.
Key insight: Subword tokenization never truly fails—unknown words decompose into known pieces. The model can infer partial meaning from familiar subwords like "un-", "-able", "-tion".
Tokenization Gotchas
Numbers. "1234" might be one token, or four, or something in between. Models handle numbers inconsistently.
Whitespace. Leading spaces are often included in tokens. " hello" differs from "hello" in many tokenizers.
Case. Some models are case-sensitive, others lowercase everything. "Bank" and "bank" might be same or different tokens.
Language. Tokenizers trained on English struggle with other scripts. Chinese might produce one token per character. Arabic gets broken oddly.
Practical Implications
For semantic search:
Chunk wisely. Understand your tokenizer's behavior when setting chunk sizes. A 512-token chunk might be 300-600 words.
Test edge cases. Rare terms, technical jargon, code, and non-English text might tokenize unexpectedly.
Consider the model's tokenizer. When choosing embedding models, their tokenizer affects what text they handle well.
Measure actual token counts. Don't assume. Run text through the tokenizer to see actual token counts.
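As a concrete version of "measure, don't assume," here is a sketch of token-aware chunking. `count_tokens` is a stand-in for whatever your embedding model's tokenizer actually provides; the word-count lambda below is only a toy substitute:

```python
def chunk_by_tokens(text, max_tokens, count_tokens):
    """Split text into chunks of at most max_tokens, measured by the
    real tokenizer rather than a word-count guess (sketch).

    count_tokens: callable mapping a string to its token count, e.g.
    a wrapper around your model tokenizer's encode method.
    """
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(" ".join(current))  # close the full chunk
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Toy usage with a word-count stand-in for a real tokenizer:
count_tokens = lambda s: len(s.split())
```

This greedily packs words until the next one would exceed the budget; a single word longer than the budget still becomes its own chunk, which is worth checking for with real tokenizers.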
Key Takeaways
- Subword tokenization (BPE, WordPiece) solves vocabulary explosion and OOV problems
- BPE iteratively merges frequent pairs; vocabulary is learned from corpus statistics
- Common words become single tokens; rare words decompose to subwords
- Subword units can learn compositional meaning ("un-" learns negation)
- Vocabulary size trades off sequence length, memory, and embedding quality
- Tokenization affects sequence length, which affects compute and context limits
- Test your tokenizer on actual data—edge cases can surprise you