Quantization
Product quantization and how to compress vectors 32x with minimal accuracy loss
A 768-dimensional float32 embedding uses 3 KB of memory (768 dimensions × 4 bytes). At a billion vectors, that is 3 TB. Memory is expensive, and cache efficiency matters for speed. Quantization compresses vectors, trading precision for space and often improving performance.
Scalar Quantization
The simplest approach: reduce precision per dimension.
float32 → float16: 2x compression, minimal accuracy loss. Widely supported in GPUs and modern CPUs.
float32 → int8: 4x compression. Quantize each dimension to 8 bits using learned scale and zero-point. Accuracy loss is small for well-trained embeddings.
float32 → binary: 32x compression. Each dimension becomes 1 bit (sign only). Significant accuracy loss, but enables extremely fast Hamming distance computations.
Scalar quantization is orthogonal to index structure—you can combine it with HNSW or any other algorithm.
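These scalar schemes are simple enough to sketch directly. The numpy example below is illustrative only: it uses a symmetric scale with no zero-point (a simplification of the learned scale/zero-point scheme described above), and the function names are made up for this sketch.

```python
import numpy as np

def int8_quantize(x: np.ndarray):
    """Symmetric int8 quantization: map the largest |value| to 127 (no zero-point)."""
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def int8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction: each int8 code scaled back to float."""
    return q.astype(np.float32) * scale

def binary_quantize(x: np.ndarray) -> np.ndarray:
    """1 bit per dimension: keep only the sign."""
    return (x > 0).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two sign vectors."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
v = rng.standard_normal(768).astype(np.float32)
q, s = int8_quantize(v)
# Rounding error per dimension is bounded by half the scale step.
err = np.abs(int8_dequantize(q, s) - v).max()
```

The int8 reconstruction error is bounded by half a quantization step per dimension, which is why accuracy loss stays small when embedding values are well spread.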
Product Quantization
Product quantization (PQ) achieves higher compression by exploiting structure in the embedding space.
The idea:
- Split the 768-dimensional vector into M subvectors (e.g., 96 subvectors of 8 dimensions each)
- For each subvector, learn a codebook of K centroids (e.g., K=256)
- Replace each subvector with the index of its nearest centroid
With M=96 subspaces and K=256 centroids:
- Original size: 768 × 4 = 3072 bytes
- Compressed size: 96 × 1 = 96 bytes (each index fits in 1 byte)
- Compression ratio: 32x
At query time, distances are computed approximately using precomputed distance tables.
How PQ Distance Computation Works
For each query:
- Compute distances from the query's subvectors to all centroids in each codebook
- Store these in a lookup table (M × K values)
- For each database vector, sum up the precomputed distances for its centroid indices
This is surprisingly fast:
- The lookup table fits in L1 cache
- Distance computation becomes M table lookups and additions
- SIMD can parallelize across database vectors
IVF-PQ: Combining Clustering with Quantization
Most vector databases use IVF-PQ:
- Coarse quantization (IVF): Cluster vectors into buckets. At query time, search only nearby buckets.
- Fine quantization (PQ): Within each bucket, store PQ-compressed residuals (the difference between each vector and its bucket centroid).
This provides both fast candidate filtering (IVF) and memory efficiency (PQ).
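The two stages can be sketched as follows (numpy, illustrative function names; libraries like faiss fuse these steps into a single index). The fine stage would then run PQ over the residuals rather than the raw vectors.

```python
import numpy as np

def ivf_assign(X, coarse_centroids):
    """Coarse stage: assign each vector to its nearest IVF bucket."""
    dist = ((X[:, None, :] - coarse_centroids[None, :, :]) ** 2).sum(-1)
    return dist.argmin(1)

def ivf_residuals(X, coarse_centroids, buckets):
    """Fine-stage input: residual between each vector and its bucket centroid.
    These residuals (smaller and more uniform than raw vectors) are what PQ compresses."""
    return X - coarse_centroids[buckets]

def probe_buckets(query, coarse_centroids, nprobe=4):
    """At query time, search only the nprobe buckets nearest to the query."""
    dist = ((coarse_centroids - query) ** 2).sum(-1)
    return np.argsort(dist)[:nprobe]
```

Encoding residuals rather than raw vectors is what lets a single set of PQ codebooks serve every bucket well: after subtracting the bucket centroid, the remaining variation is similar across buckets.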
Optimized Product Quantization (OPQ)
Standard PQ assumes the dimensions are independent. In practice, they are correlated. OPQ learns a rotation matrix to decorrelate dimensions before quantization.
This improves accuracy at no additional storage cost—the rotation is applied at query time.
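The reason the rotation is "free" is that an orthogonal matrix preserves Euclidean distances, so rotating both database vectors (before quantization) and the query (at search time) changes how dimensions are grouped into subspaces without changing the metric. A minimal numpy sketch, using a random orthogonal matrix as a stand-in for the learned OPQ rotation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# A random orthogonal matrix stands in for the rotation OPQ would learn
# (OPQ alternates between fitting the rotation and refitting the codebooks).
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal(d)  # a database vector
q = rng.standard_normal(d)  # a query
# Rotation preserves distance: quantizing R @ x and querying with R @ q
# leaves the metric unchanged while decorrelating the subspaces.
assert np.isclose(np.linalg.norm(x - q), np.linalg.norm(R @ x - R @ q))
```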
Trade-offs
| Method | Compression | Recall Loss | Use Case |
|---|---|---|---|
| float32 | 1x | 0% | Baseline, small datasets |
| float16 | 2x | <1% | GPU inference |
| int8 | 4x | 1-3% | CPU inference, moderate scale |
| PQ (M=96) | 32x | 5-10% | Billion-scale datasets |
| Binary | 32x | 15-25% | Extreme scale, first-pass filtering |
The choice depends on your scale and accuracy requirements.
Quantization-Aware Retrieval
Modern approaches train embedding models with quantization in mind:
Straight-through estimators: Backpropagate through quantization during training.
Matryoshka embeddings: Train embeddings where truncated versions (first 256 dims, first 128, etc.) remain effective. Use full embeddings for reranking.
Binary embedding models: Train specifically for binary quantization, optimizing for Hamming distance.
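Matryoshka truncation itself is trivial at inference time: slice off the leading dimensions and renormalize. A sketch (the function name is invented; this assumes cosine-style similarity, where renormalization keeps scores comparable across truncation levels):

```python
import numpy as np

def truncate_embedding(x: np.ndarray, dims: int) -> np.ndarray:
    """Matryoshka-style truncation: keep the first `dims` dimensions
    and renormalize to unit length for cosine similarity."""
    t = x[:dims]
    return t / np.linalg.norm(t)

full = np.random.default_rng(0).standard_normal(768)
short = truncate_embedding(full, 256)  # 3x smaller, still usable for first-pass search
```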
Compression Comparison
| Method | Compression | Memory (1B vecs) | Recall Loss |
|---|---|---|---|
| float32 | 1x | 3.0 TB | ~0% |
| float16 | 2x | 1.5 TB | ~0.5% |
| int8 | 4x | 750 GB | ~3% |
| PQ-96 | 32x | 96 GB | ~8% |
| Binary | 32x | 96 GB | ~20% |
Pro tip: Use PQ for first-pass retrieval, then rerank top-100 against full-precision vectors. This recovers most of the recall loss while keeping memory efficient.
For 1 billion 768-dimensional vectors:
| Method | Memory | Recall@10 |
|---|---|---|
| float32 | 3 TB | 100% |
| float16 | 1.5 TB | 99.5% |
| int8 | 750 GB | 97% |
| PQ (32x) | 96 GB | 92% |
| PQ + rerank | 96 GB + 300 GB | 98% |
The PQ + rerank pattern: store PQ-compressed vectors for fast first-pass search, then rerank top candidates against full-precision vectors stored on disk.
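The two-phase pattern can be sketched generically. This assumes a compressed first pass (e.g., the PQ scoring above) has already produced approximate distances for the whole collection; the function name and shortlist size are illustrative.

```python
import numpy as np

def search_with_rerank(query, approx_dists, full_vectors, topk=10, shortlist=100):
    """Phase 1: take the shortlist with the best compressed (approximate) scores.
    Phase 2: rerank only the shortlist against full-precision vectors."""
    candidates = np.argsort(approx_dists)[:shortlist]          # cheap, in-memory
    exact = ((full_vectors[candidates] - query) ** 2).sum(-1)  # few full-precision reads
    return candidates[np.argsort(exact)[:topk]]
```

The key property: only `shortlist` full-precision vectors are ever touched per query, so the expensive storage can sit on disk while the compressed codes stay in RAM.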
Practical Recommendations
- Start with int8: 4x compression with minimal accuracy loss. Widely supported.
- Use PQ for billions of vectors: When memory is the constraint, PQ enables billion-scale search on commodity hardware.
- Combine with reranking: Retrieve 100 candidates with PQ, rerank with full-precision vectors for the final top 10.
- Benchmark on your data: Compression effects vary by embedding model and domain.
- Consider Matryoshka embeddings: If your model supports it, truncation provides a smooth compression/accuracy trade-off.
Key Takeaways
- Scalar quantization reduces precision per dimension: float32 → int8 gives 4x compression
- Product quantization splits vectors into subvectors, replaces each with a codebook index: up to 32x compression
- IVF-PQ combines clustering (fast candidate filtering) with PQ (memory efficiency)
- Trade-off: higher compression means lower recall, in exchange for less memory and faster search
- PQ + reranking recovers accuracy: fast first-pass with compressed vectors, precise rerank with full vectors
- Modern embeddings are trained with quantization in mind (Matryoshka, binary-aware)