Tokenization Internals
BPE, WordPiece, and how subword tokenization affects embeddings
Before text becomes embeddings, it becomes tokens. This transformation is more consequential than it appears. The tokenization strategy affects what the model can represent, how it handles rare words, and even how much computation it uses.
Why Not Words?
Word-level tokenization seems natural: split on whitespace and punctuation. But it has serious problems.
Vocabulary explosion. English has 170,000+ words in common use. Add technical terms, names, and misspellings, and the vocabulary is unbounded.
Out-of-vocabulary (OOV) problem. Words not seen during training have no representation. "ChatGPT" in 2020 would be unknown to any model trained before it.
No morphological insight. "Walk," "walks," "walked," "walking" are separate tokens. The model must independently learn they are related.
Subword tokenization solves these issues. Instead of words, we use pieces—characters, common substrings, and full words for frequent terms.
Byte Pair Encoding (BPE)
BPE, originally a compression algorithm, is the foundation of modern tokenization. The idea: iteratively merge the most frequent pairs of symbols.
Building the Vocabulary
Start with a vocabulary of individual characters. Then:
- Count all pairs of adjacent symbols in the corpus
- Find the most frequent pair (e.g., "t" + "h")
- Merge that pair into a new symbol ("th")
- Repeat until vocabulary reaches target size
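The merge loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production trainer; real implementations add byte-level handling, pre-tokenization, and caching:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from a list of words (minimal sketch)."""
    # Represent each word as a tuple of symbols, weighted by frequency.
    word_freqs = Counter(corpus)
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # 1. Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # 2. Pick the most frequent pair, and 3. merge it everywhere.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges
```

On a toy corpus like `["the", "the", "them", "that"]`, the first merge is ("t", "h"), the most frequent adjacent pair, followed by ("th", "e").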
After 30,000 merges on a large corpus, you get a vocabulary where:
- Common words are single tokens ("the", "and", "of")
- Moderately common words are 2-3 tokens ("running" → "run" + "ning")
- Rare words decompose to characters ("xyzzy" → "x" + "y" + "z" + "z" + "y")
Tokenizing New Text
To tokenize with a trained BPE vocabulary:
- Start with characters
- Apply merges in order learned
- Stop when no more merges apply
"unbelievable" might become: ["un", "believ", "able"]
The model sees familiar pieces even for novel words.
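The steps above can be sketched as follows, assuming `merges` is the ordered list a BPE trainer produced (earliest merges first):

```python
def bpe_tokenize(word, merges):
    """Tokenize a word by applying learned BPE merges in training order.

    A minimal sketch: each merge is applied everywhere it occurs before
    moving on to the next one.
    """
    symbols = list(word)  # start with individual characters
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the pair in place
            else:
                i += 1
    return symbols
```

With the merges ("t", "h") and ("th", "e"), the unseen word "them" becomes ["the", "m"]: familiar pieces even for a word the trainer never saw whole.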
WordPiece
WordPiece, used by BERT, is similar to BPE but uses a different merge criterion.
Instead of raw frequency, WordPiece merges the pair that most increases the likelihood of the training corpus under its language model. The score for a candidate pair is:

score(x, y) = count(xy) / (count(x) × count(y))

Pairs whose parts rarely occur apart score highest and get merged first. This produces slightly different vocabularies than BPE, often with more linguistically meaningful units.
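To make the criterion concrete, here is a sketch that scores pairs as count(xy) / (count(x) × count(y)) over a toy corpus of symbol sequences. Note how it can prefer a less frequent pair whose parts rarely occur apart, where BPE would simply pick the highest raw count:

```python
from collections import Counter

def wordpiece_scores(corpus_symbols):
    """Score adjacent pairs the WordPiece way (sketch).

    BPE would merge the pair with the highest raw count; WordPiece
    divides that count by the counts of the individual symbols.
    """
    singles, pairs = Counter(), Counter()
    for symbols in corpus_symbols:
        singles.update(symbols)
        pairs.update(zip(symbols, symbols[1:]))
    return {p: c / (singles[p[0]] * singles[p[1]]) for p, c in pairs.items()}
```

In a corpus where ("x", "y") occurs twice but "x" is common elsewhere, while ("w", "z") occurs once and "w" appears nowhere else, WordPiece scores ("w", "z") higher even though BPE would merge ("x", "y") first.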
WordPiece tokens are marked with "##" when they continue a word:
"embedding" → ["em", "##bed", "##ding"]
The "##" tells the model this is not word-initial.
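At inference time, WordPiece segments a word by greedy longest-match-first lookup. A minimal sketch, using a small hypothetical vocabulary in which continuation pieces carry the "##" prefix:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece segmentation (sketch)."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark non-word-initial pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece fits at this position
        tokens.append(piece)
        start = end
    return tokens
```

With the toy vocabulary {"em", "##bed", "##ding"}, "embedding" segments exactly as in the example above: ["em", "##bed", "##ding"].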
SentencePiece
SentencePiece treats the input as a raw character stream rather than pre-tokenized words; whitespace becomes an ordinary symbol (rendered as "▁"). This enables:
- Language-agnostic tokenization
- Handling any Unicode
- No whitespace assumptions
Models such as T5 and LLaMA use SentencePiece for this flexibility. GPT models achieve similar robustness with byte-level BPE, which operates on raw bytes so that no input is ever out of vocabulary.
Vocabulary Size Trade-offs
Larger vocabularies mean fewer tokens per word but more memory and risk of undertrained tokens.
Smaller vocabulary (8K-16K):
- More tokens per text (slower)
- Better OOV handling
- Less memory for embeddings
- May split common words awkwardly
Larger vocabulary (32K-100K):
- Fewer tokens (faster)
- More words as single tokens
- More memory for embedding matrix
- Rare tokens have few training examples
Many models use vocabularies of 30K-50K tokens, though recent large language models have pushed to 100K and beyond. The choice balances coverage against embedding quality.
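The memory side of this trade-off is easy to estimate: the embedding matrix alone is vocab_size × embedding_dim parameters. A quick back-of-envelope helper (float32 assumed; the output softmax often mirrors this cost):

```python
def embedding_memory_mb(vocab_size, embedding_dim, bytes_per_param=4):
    """Memory of the embedding matrix alone, in MiB (float32 default)."""
    return vocab_size * embedding_dim * bytes_per_param / (1024 ** 2)

# A 32K vocabulary with 768-dimensional embeddings costs ~94 MiB;
# doubling the vocabulary doubles it, while each rare token sees
# fewer training examples.
```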
How Tokenization Affects Embeddings
Tokenization has downstream effects:
Sequence length. More tokens = longer sequences = more computation. A 512-token limit might cover 400 words or 250 words depending on tokenization.
Semantic granularity. Subword tokens learn meaning from their occurrences. "un-" appearing in "unhappy," "uncertain," "undo" learns negation. This is compositional understanding.
Rare word handling. A rare word decomposed into common subwords inherits meaning from those pieces. "Defenestration" → "de" + "fen" + "est" + "ration" has partial meaning from parts.
Key insight: Subword tokenization never truly fails—unknown words decompose into known pieces. The model can infer partial meaning from familiar subwords like "un-", "-able", "-tion".
Tokenization Gotchas
Numbers. "1234" might be one token, or four, or something in between. Models handle numbers inconsistently.
Whitespace. Leading spaces are often included in tokens. " hello" differs from "hello" in many tokenizers.
Case. Some models are case-sensitive, others lowercase everything. "Bank" and "bank" might be same or different tokens.
Language. Tokenizers trained on English struggle with other scripts. Chinese might produce one token per character. Arabic gets broken oddly.
Practical Implications
For semantic search:
Chunk wisely. Understand your tokenizer's behavior when setting chunk sizes. A 512-token chunk might be 300-600 words.
Test edge cases. Rare terms, technical jargon, code, and non-English text might tokenize unexpectedly.
Consider the model's tokenizer. When choosing embedding models, their tokenizer affects what text they handle well.
Measure actual token counts. Don't assume. Run text through the tokenizer to see actual token counts.
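As a concrete version of "measure, don't assume," here is a sketch of token-aware chunking. `count_tokens` is a stand-in for whatever your embedding model's tokenizer actually provides; the word-count lambda below is only a toy substitute:

```python
def chunk_by_tokens(text, max_tokens, count_tokens):
    """Split text into chunks of at most max_tokens, measured by the
    real tokenizer rather than a word-count guess (sketch).

    count_tokens: callable mapping a string to its token count, e.g.
    a wrapper around your model tokenizer's encode method.
    """
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(" ".join(current))  # close the full chunk
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Toy usage with a word-count stand-in for a real tokenizer:
count_tokens = lambda s: len(s.split())
```

This greedily packs words until the next one would exceed the budget; a single word longer than the budget still becomes its own chunk, which is worth checking for with real tokenizers.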
Key Takeaways
- Subword tokenization (BPE, WordPiece) solves vocabulary explosion and OOV problems
- BPE iteratively merges frequent pairs; vocabulary is learned from corpus statistics
- Common words become single tokens; rare words decompose to subwords
- Subword units can learn compositional meaning ("un-" learns negation)
- Vocabulary size trades off sequence length, memory, and embedding quality
- Tokenization affects sequence length, which affects compute and context limits
- Test your tokenizer on actual data—edge cases can surprise you