Sentence Embeddings
Pooling strategies, contrastive learning, and what makes a good retrieval embedding
Transformers produce one embedding per token. For semantic search, we need one embedding per passage. This transformation—from token-level to sequence-level—is called pooling. But pooling alone is not enough. The model must be trained specifically for the task of measuring semantic similarity.
Pooling Strategies
A transformer processing "The quick brown fox" produces four token embeddings. We need to reduce them to one vector.
Example token embeddings for "The quick brown fox" (four dimensions shown), with their mean:
| Token | Dim 1 | Dim 2 | Dim 3 | Dim 4 |
|---|---|---|---|---|
| The | 0.20 | 0.80 | 0.30 | 0.50 |
| quick | 0.90 | 0.30 | 0.70 | 0.20 |
| brown | 0.40 | 0.60 | 0.80 | 0.40 |
| fox | 0.70 | 0.40 | 0.50 | 0.90 |
| Mean | 0.55 | 0.53 | 0.57 | 0.50 |
Mean pooling averages all tokens equally; it is the most common strategy in sentence transformer models.
CLS Pooling
BERT adds a special [CLS] token at the start of every sequence. After processing, the [CLS] embedding is intended to summarize the whole sequence.
In practice, [CLS] was designed for classification, not similarity. Without fine-tuning, [CLS] embeddings are poor for retrieval: during pre-training the token is optimized for next sentence prediction, and during fine-tuning for class labels, not for encoding semantic content that can be compared directly.
Mean Pooling
Average all token embeddings:
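A minimal NumPy sketch, using the token values from the table above (the attention-mask handling is an assumption for how padded batches are typically treated):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, skipping padding positions."""
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # (dim,)
    n_tokens = max(mask.sum(), 1e-9)                # avoid division by zero
    return summed / n_tokens

# The four token embeddings from the table above
tokens = np.array([
    [0.20, 0.80, 0.30, 0.50],  # The
    [0.90, 0.30, 0.70, 0.20],  # quick
    [0.40, 0.60, 0.80, 0.40],  # brown
    [0.70, 0.40, 0.50, 0.90],  # fox
])
sentence_embedding = mean_pool(tokens, np.ones(4))
# per-dimension means: 0.55, 0.525, 0.575, 0.5
```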
Simple, effective, and what most sentence transformer models use. Every token contributes equally, capturing information from across the sequence.
Max Pooling
Take the element-wise maximum:
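On the same token matrix, max pooling is a one-liner:

```python
import numpy as np

tokens = np.array([
    [0.20, 0.80, 0.30, 0.50],  # The
    [0.90, 0.30, 0.70, 0.20],  # quick
    [0.40, 0.60, 0.80, 0.40],  # brown
    [0.70, 0.40, 0.50, 0.90],  # fox
])
# Element-wise maximum across tokens: each dimension keeps its strongest signal
max_pooled = tokens.max(axis=0)
# per-dimension maxima: 0.9, 0.8, 0.8, 0.9
```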
Captures the strongest signal per dimension. Works well when specific words carry key meaning.
Weighted Pooling
Weight tokens by attention or learned importance:
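A sketch with hypothetical weights; in practice the weights would come from an attention head or a learned scoring layer, not be hand-picked as here:

```python
import numpy as np

def weighted_pool(token_embeddings, weights):
    """Weighted average; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return (token_embeddings * w[:, None]).sum(axis=0)

tokens = np.array([
    [0.20, 0.80, 0.30, 0.50],  # The
    [0.90, 0.30, 0.70, 0.20],  # quick
    [0.40, 0.60, 0.80, 0.40],  # brown
    [0.70, 0.40, 0.50, 0.90],  # fox
])
# Hypothetical importance scores favoring the content words
pooled = weighted_pool(tokens, [0.1, 0.3, 0.2, 0.4])
```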
More sophisticated, but requires deciding how to compute weights. Some models use the attention from [CLS] as weights.
Why Pre-trained BERT Fails
You might expect that training BERT on massive corpora and applying mean pooling would suffice. But without fine-tuning for similarity, BERT embeddings often underperform much simpler baselines, such as averaged GloVe vectors, on semantic similarity tasks.
The issue: BERT's training objectives (masked language modeling, next sentence prediction) optimize for reconstruction, not comparison. The embedding space has structure useful for prediction, but semantically similar sentences are not necessarily nearby.
This is why sentence transformer models (SBERT, E5, GTE) add a fine-tuning stage specifically for semantic similarity.
Contrastive Learning
The key innovation: train embeddings such that similar passages are close and dissimilar passages are far apart.
The Training Setup
Training proceeds in batches. Each batch contains queries paired with positive passages—documents we know are relevant to those queries. The batch also includes negative passages, either sampled randomly or chosen as "hard negatives" that are similar but not correct. The model computes embeddings for all of these, then updates its weights to maximize similarity between queries and their positives while minimizing similarity to negatives.
Contrastive Loss Functions
InfoNCE (common):
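A minimal NumPy version for a single query; the cosine similarity function and the default temperature of 0.05 are illustrative choices here, not any particular library's:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce(query, positive, negatives, temperature=0.05):
    """-log( exp(sim(q,p+)/t) / sum of exp(sim/t) over positive + negatives )."""
    sims = np.array([cosine(query, positive)] +
                    [cosine(query, n) for n in negatives])
    logits = sims / temperature
    logits -= logits.max()  # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))
```

When the query matches its positive and the negatives are unrelated, the loss approaches zero; a negative that outscores the positive drives the loss up.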
The temperature controls the sharpness of the softmax over similarities: lower temperature amplifies small similarity gaps, making the model more decisive about separating positives from negatives.
Multiple Negatives Ranking Loss:
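A sketch of the in-batch version; the scale of 20 mirrors a common cosine-similarity scaling but is an assumption here:

```python
import numpy as np

def mnr_loss(query_embs, passage_embs, scale=20.0):
    """Cross-entropy where passage i is the target for query i and every
    other passage in the batch acts as a negative."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    logits = scale * (q @ p.T)                    # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # diagonal = correct pairs
```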
Treats every other positive in the batch as a negative for each query. Efficient: each query gets batch_size - 1 negatives essentially for free.
Hard Negatives
Random negatives are too easy. The model quickly learns to distinguish "The cat sat on the mat" from "Quantum computing applications in cryptography."
Hard negatives are passages that are similar but not the true positive. They might share topic, words, or structure. Training against hard negatives forces the model to make finer distinctions.
Where do hard negatives come from? One source is BM25 retrieval—passages that are lexically similar but semantically different. Another source is previously retrieved false positives from the model itself. A third source is other passages from the same document, which share context but answer different questions.
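As a toy illustration of the first source, simple word overlap can stand in for BM25 scoring (a real pipeline would query an actual BM25 index; the passages below are made up):

```python
def mine_hard_negative(query, positive, corpus):
    """Return the lexically closest passage that is not the true positive."""
    q_words = set(query.lower().split())
    def overlap(passage):
        return len(q_words & set(passage.lower().split()))
    return max((p for p in corpus if p != positive), key=overlap)

corpus = [
    "Rain forms when water vapor condenses in clouds.",  # true positive
    "Acid rain damages forests and lakes.",              # lexically close
    "Quantum computing applications in cryptography.",   # easy negative
]
hard = mine_hard_negative("what causes rain", corpus[0], corpus)
# picks the acid-rain passage: shares "rain" but answers a different question
```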
What Makes a Good Retrieval Embedding?
Semantic Coherence
Similar meanings → similar vectors. Paraphrases should be nearby. Unrelated text should be far.
Appropriate Granularity
The embedding should capture information at the right level of detail: sentence embeddings should preserve the distinctions between closely related sentences rather than collapsing them into one point.
Uniform Utilization
Use the full embedding space. If all embeddings cluster in one region, you waste capacity.
Anisotropy Awareness
Pre-trained models often have anisotropic embedding spaces—vectors cluster in a cone rather than distributing uniformly. Fine-tuning for retrieval helps but does not fully solve this.
Asymmetric Retrieval
Queries and documents are fundamentally different: queries are short questions ("What causes rain?") while documents are longer passages with detailed information ("Rain forms when water vapor..."). Should they share the same encoder?
Symmetric Models
Same encoder for queries and documents. Simple, but might not handle the length/style mismatch well.
Asymmetric Models
Separate encoders, or prefixes that signal "this is a query" vs "this is a document." Models like E5 use "query:" and "passage:" prefixes.
The asymmetric approach often improves retrieval. Queries are compressed questions while documents are expanded answers—they have fundamentally different structure. With asymmetric encoding, the embedding space can learn this mapping explicitly rather than forcing both into the same representation. This also handles vocabulary mismatch better: queries use question words while documents use declarative language.
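E5's prefix scheme amounts to string concatenation before encoding. In this sketch, `encode` is a stand-in for whatever embedding call you use (e.g. a sentence-transformers model), not a real API:

```python
def embed_query(text, encode):
    # E5 expects the literal prefix "query: " on search queries
    return encode("query: " + text)

def embed_passage(text, encode):
    # ...and "passage: " on corpus documents
    return encode("passage: " + text)

# Demo with an identity "encoder" that just returns the prefixed string
prefixed = embed_query("What causes rain?", lambda s: s)
# prefixed == "query: What causes rain?"
```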
Training Data Matters
The quality of sentence embeddings depends heavily on training data.
Natural Language Inference (NLI) provides pairs of sentences labeled as entailment, contradiction, or neutral. This teaches the model to distinguish semantic similarity from semantic opposition. Paraphrase data—sentences with the same meaning but different words—teaches that surface form differs from meaning. Question-answer pairs, where queries are matched to relevant passages, directly train the retrieval task. Synthetic data generated by models like GPT can create query-passage pairs at massive scale.
Modern embedding models combine multiple data sources. E5 trains on a mixture of 12 or more datasets. GTE uses carefully curated multitask data. The diversity matters: models trained on varied data generalize better to new domains.
Evaluating Sentence Embeddings
Two standard benchmarks dominate evaluation. MTEB (Massive Text Embedding Benchmark) covers 56 datasets across 8 tasks including retrieval, classification, and clustering. BEIR focuses specifically on retrieval with 18 datasets across diverse domains, testing zero-shot transfer to new domains the model has never seen.
Good retrieval performance means relevant passages rank highly (measured by nDCG and MRR), most relevant passages appear in the top results (measured by recall@k), and the model works across domains without domain-specific fine-tuning.
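Two of these metrics are small enough to write out; this sketch assumes a single query with a known relevant set (document IDs are made up):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant result, or 0 if none retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

ranking = ["d3", "d7", "d1", "d9"]  # model's ranked output
relevant = {"d1", "d9"}             # ground-truth relevant set
r3 = recall_at_k(ranking, relevant, 3)   # 0.5: only d1 is in the top 3
rr = reciprocal_rank(ranking, relevant)  # 1/3: first hit at rank 3
```

MRR is the mean of the reciprocal rank over all queries; nDCG additionally discounts relevant results by how far down the ranking they appear.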
Key Takeaways
- Pooling transforms token embeddings into a single sequence embedding; mean pooling is most common
- Pre-trained BERT embeddings are poor for similarity without fine-tuning
- Contrastive learning trains embeddings by pushing similar pairs together and dissimilar pairs apart
- Hard negatives force the model to make fine-grained distinctions
- Asymmetric models handle the query-document mismatch with separate encoders or prefixes
- Training data diversity and quality are critical for embedding performance
- Evaluate on embedding benchmarks like MTEB and retrieval-focused suites like BEIR