Quantization
Product quantization and how to compress vectors 32x with minimal accuracy loss
A 768-dimensional float32 embedding uses 3 KB of memory (768 dimensions × 4 bytes). At a billion vectors, that is 3 TB. Memory is expensive, and cache efficiency matters for speed. Quantization compresses vectors, trading precision for space and often improving performance.
Scalar Quantization
The simplest approach: reduce precision per dimension.
float32 → float16: 2x compression, minimal accuracy loss. Widely supported in GPUs and modern CPUs.
float32 → int8: 4x compression. Quantize each dimension to 8 bits using learned scale and zero-point. Accuracy loss is small for well-trained embeddings.
float32 → binary: 32x compression. Each dimension becomes 1 bit (sign only). Significant accuracy loss, but enables extremely fast Hamming distance computations.
Scalar quantization is orthogonal to index structure—you can combine it with HNSW or any other algorithm.
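These scalar schemes are simple enough to sketch directly. The numpy example below is illustrative only: it uses a symmetric scale with no zero-point (a simplification of the learned scale/zero-point scheme described above), and the function names are made up for this sketch.

```python
import numpy as np

def int8_quantize(x: np.ndarray):
    """Symmetric int8 quantization: map the largest |value| to 127 (no zero-point)."""
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def int8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction: each int8 code scaled back to float."""
    return q.astype(np.float32) * scale

def binary_quantize(x: np.ndarray) -> np.ndarray:
    """1 bit per dimension: keep only the sign."""
    return (x > 0).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two sign vectors."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
v = rng.standard_normal(768).astype(np.float32)
q, s = int8_quantize(v)
# Rounding error per dimension is bounded by half the scale step.
err = np.abs(int8_dequantize(q, s) - v).max()
```

The int8 reconstruction error is bounded by half a quantization step per dimension, which is why accuracy loss stays small when embedding values are well spread.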
Product Quantization
Product quantization (PQ) achieves higher compression by exploiting structure in the embedding space.
The idea:
- Split the 768-dimensional vector into M subvectors (e.g., 96 subvectors of 8 dimensions each)
- For each subvector, learn a codebook of K centroids (e.g., K=256)
- Replace each subvector with the index of its nearest centroid
With M=96 subspaces and K=256 centroids:
- Original size: 768 × 4 = 3072 bytes
- Compressed size: 96 × 1 = 96 bytes (each index fits in 1 byte)
- Compression ratio: 32x
At query time, distances are computed approximately using precomputed distance tables.
How PQ Distance Computation Works
For each query:
- Compute distances from the query's subvectors to all centroids in each codebook
- Store these in a lookup table (M × K values)
- For each database vector, sum up the precomputed distances for its centroid indices
This is surprisingly fast:
- The lookup table fits in L1 cache
- Distance computation becomes M table lookups and additions
- SIMD can parallelize across database vectors
IVF-PQ: Combining Clustering with Quantization
Most vector databases use IVF-PQ:
- Coarse quantization (IVF): Cluster vectors into buckets. At query time, search only nearby buckets.
- Fine quantization (PQ): Within each bucket, store PQ-compressed residuals (the difference between each vector and its bucket centroid).
This provides both fast candidate filtering (IVF) and memory efficiency (PQ).
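The two stages can be sketched as follows (numpy, illustrative function names; libraries like faiss fuse these steps into a single index). The fine stage would then run PQ over the residuals rather than the raw vectors.

```python
import numpy as np

def ivf_assign(X, coarse_centroids):
    """Coarse stage: assign each vector to its nearest IVF bucket."""
    dist = ((X[:, None, :] - coarse_centroids[None, :, :]) ** 2).sum(-1)
    return dist.argmin(1)

def ivf_residuals(X, coarse_centroids, buckets):
    """Fine-stage input: residual between each vector and its bucket centroid.
    These residuals (smaller and more uniform than raw vectors) are what PQ compresses."""
    return X - coarse_centroids[buckets]

def probe_buckets(query, coarse_centroids, nprobe=4):
    """At query time, search only the nprobe buckets nearest to the query."""
    dist = ((coarse_centroids - query) ** 2).sum(-1)
    return np.argsort(dist)[:nprobe]
```

Encoding residuals rather than raw vectors is what lets a single set of PQ codebooks serve every bucket well: after subtracting the bucket centroid, the remaining variation is similar across buckets.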
Optimized Product Quantization (OPQ)
Standard PQ assumes the dimensions are independent. In practice, they are correlated. OPQ learns a rotation matrix to decorrelate dimensions before quantization.
This improves accuracy at no additional storage cost—the rotation is applied at query time.
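The reason the rotation is "free" is that an orthogonal matrix preserves Euclidean distances, so rotating both database vectors (before quantization) and the query (at search time) changes how dimensions are grouped into subspaces without changing the metric. A minimal numpy sketch, using a random orthogonal matrix as a stand-in for the learned OPQ rotation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# A random orthogonal matrix stands in for the rotation OPQ would learn
# (OPQ alternates between fitting the rotation and refitting the codebooks).
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal(d)  # a database vector
q = rng.standard_normal(d)  # a query
# Rotation preserves distance: quantizing R @ x and querying with R @ q
# leaves the metric unchanged while decorrelating the subspaces.
assert np.isclose(np.linalg.norm(x - q), np.linalg.norm(R @ x - R @ q))
```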
Trade-offs
| Method | Compression | Recall Loss | Use Case |
|---|---|---|---|
| float32 | 1x | 0% | Baseline, small datasets |
| float16 | 2x | <1% | GPU inference |
| int8 | 4x | 1-3% | CPU inference, moderate scale |
| PQ (M=96) | 32x | 5-10% | Billion-scale datasets |
| Binary | 32x | 15-25% | Extreme scale, first-pass filtering |
The choice depends on your scale and accuracy requirements.
Quantization-Aware Retrieval
Modern approaches train embedding models with quantization in mind:
Straight-through estimators: Backpropagate through quantization during training.
Matryoshka embeddings: Train embeddings where truncated versions (first 256 dims, first 128, etc.) remain effective. Use full embeddings for reranking.
Binary embedding models: Train specifically for binary quantization, optimizing for Hamming distance.
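Matryoshka truncation itself is trivial at inference time: slice off the leading dimensions and renormalize. A sketch (the function name is invented; this assumes cosine-style similarity, where renormalization keeps scores comparable across truncation levels):

```python
import numpy as np

def truncate_embedding(x: np.ndarray, dims: int) -> np.ndarray:
    """Matryoshka-style truncation: keep the first `dims` dimensions
    and renormalize to unit length for cosine similarity."""
    t = x[:dims]
    return t / np.linalg.norm(t)

full = np.random.default_rng(0).standard_normal(768)
short = truncate_embedding(full, 256)  # 3x smaller, still usable for first-pass search
```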
Compression Comparison
| Method | Compression | Memory (1B vecs) | Recall Loss |
|---|---|---|---|
| float32 | 1x | 3.0 TB | ~0% |
| float16 | 2x | 1.5 TB | ~0.5% |
| int8 | 4x | 750 GB | ~3% |
| PQ-96 | 32x | 96 GB | ~8% |
| Binary | 32x | 96 GB | ~20% |
Pro tip: Use PQ for first-pass retrieval, then rerank top-100 against full-precision vectors. This recovers most of the recall loss while keeping memory efficient.
For 1 billion 768-dimensional vectors:
| Method | Memory | Recall@10 |
|---|---|---|
| float32 | 3 TB | 100% |
| float16 | 1.5 TB | 99.5% |
| int8 | 750 GB | 97% |
| PQ (32x) | 96 GB | 92% |
| PQ + rerank | 96 GB + 300 GB | 98% |
The PQ + rerank pattern: store PQ-compressed vectors for fast first-pass search, then rerank top candidates against full-precision vectors stored on disk.
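The two-phase pattern can be sketched generically. This assumes a compressed first pass (e.g., the PQ scoring above) has already produced approximate distances for the whole collection; the function name and shortlist size are illustrative.

```python
import numpy as np

def search_with_rerank(query, approx_dists, full_vectors, topk=10, shortlist=100):
    """Phase 1: take the shortlist with the best compressed (approximate) scores.
    Phase 2: rerank only the shortlist against full-precision vectors."""
    candidates = np.argsort(approx_dists)[:shortlist]          # cheap, in-memory
    exact = ((full_vectors[candidates] - query) ** 2).sum(-1)  # few full-precision reads
    return candidates[np.argsort(exact)[:topk]]
```

The key property: only `shortlist` full-precision vectors are ever touched per query, so the expensive storage can sit on disk while the compressed codes stay in RAM.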
Practical Recommendations
- Start with int8: 4x compression with minimal accuracy loss. Widely supported.
- Use PQ for billions of vectors: When memory is the constraint, PQ enables billion-scale search on commodity hardware.
- Combine with reranking: Retrieve 100 candidates with PQ, rerank with full-precision vectors for the final top 10.
- Benchmark on your data: Compression effects vary by embedding model and domain.
- Consider Matryoshka embeddings: If your model supports it, truncation provides a smooth compression/accuracy trade-off.
Key Takeaways
- Scalar quantization reduces precision per dimension: float32 → int8 gives 4x compression
- Product quantization splits vectors into subvectors, replaces each with a codebook index: up to 32x compression
- IVF-PQ combines clustering (fast candidate filtering) with PQ (memory efficiency)
- Trade-off: higher compression means lower recall, in exchange for less memory and faster search
- PQ + reranking recovers accuracy: fast first-pass with compressed vectors, precise rerank with full vectors
- Modern embeddings are trained with quantization in mind (Matryoshka, binary-aware)