Locality-Sensitive Hashing

Random projections and hash collisions for approximate search

Locality-Sensitive Hashing (LSH) takes a radically different approach to approximate search. Instead of building a navigable graph, LSH uses hash functions designed so that similar items hash to the same bucket. Query by hashing, then search only within matching buckets.

The Core Idea

Normal hash functions aim for a uniform distribution: even similar inputs should produce dissimilar outputs. LSH inverts this: similar inputs should produce the same hash.

Interactive: LSH concept

LSH: design hash functions so similar items collide with high probability.

If similar vectors hash to the same bucket with high probability, we can:

  1. Hash all vectors during indexing
  2. Hash the query at search time
  3. Search only vectors in matching buckets
  4. Ignore vectors in other buckets

This gives sublinear search time without building an explicit graph structure.
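The bucket-search idea above can be sketched with an ordinary dictionary. The `lsh_hash` argument stands in for any locality-sensitive hash function (concrete constructions follow below); the names here are illustrative, not a particular library's API.

```python
from collections import defaultdict

def build_index(vectors, lsh_hash):
    """Hash every vector once; each bucket holds the ids sharing that hash."""
    buckets = defaultdict(list)
    for i, v in enumerate(vectors):
        buckets[lsh_hash(v)].append(i)
    return buckets

def query(buckets, q, lsh_hash):
    """Only vectors in the query's bucket are candidates; all others are ignored."""
    return buckets.get(lsh_hash(q), [])
```

Search cost is proportional to the size of one bucket rather than the whole collection, which is where the sublinear behavior comes from.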

Random Hyperplane Projections

For cosine similarity, the classic LSH method uses random hyperplanes.

A hyperplane in d dimensions divides the space into two half-spaces. Any vector falls on one side or the other. If we use the hyperplane's normal vector \vec{r}, the sign of \vec{v} \cdot \vec{r} tells us which side \vec{v} falls on.

Interactive: Random projections

Rotate the hyperplane to see how points get classified into side 1 (v·r ≥ 0) or side 0 (v·r < 0). Similar points tend to stay together.

The key property: vectors on the same side of a random hyperplane are more likely to be similar.

More precisely, the probability that two vectors end up on the same side is:

P(\text{same side}) = 1 - \frac{\theta}{\pi}

where \theta is the angle between the vectors. Small angle (similar vectors) → high probability of landing on the same side.
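This collision probability is easy to verify empirically: sample many random hyperplane normals and count how often two fixed vectors land on the same side. A sketch using NumPy, with two unit vectors at a 60° angle (so the predicted probability is 1 − (π/3)/π = 2/3):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unit vectors at a 60-degree angle in the plane.
theta = np.pi / 3
u = np.array([1.0, 0.0])
v = np.array([np.cos(theta), np.sin(theta)])

# Predicted probability that a random hyperplane keeps them together.
predicted = 1 - theta / np.pi  # = 2/3

# Empirical estimate: sample random normals, compare dot-product signs.
normals = rng.normal(size=(100_000, 2))
same_side = np.sign(normals @ u) == np.sign(normals @ v)
empirical = same_side.mean()
```

With 100,000 samples the empirical rate lands within a fraction of a percent of 2/3.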

Building LSH Hash Tables

An LSH hash combines multiple random projections:

  1. Choose k random hyperplanes
  2. For each vector, compute which side of each hyperplane it falls on
  3. Concatenate the k bits to form a hash code
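The three steps above amount to a few lines of NumPy. This is a sketch: Gaussian normals are one standard choice for the random hyperplanes, and the function names are illustrative.

```python
import numpy as np

def make_hyperplanes(k, d, seed=0):
    """k random hyperplanes in d dimensions, represented by their normals."""
    return np.random.default_rng(seed).normal(size=(k, d))

def hash_code(v, hyperplanes):
    """One sign bit per hyperplane, concatenated into a k-bit integer."""
    bits = (hyperplanes @ v) >= 0
    code = 0
    for b in bits:
        code = (code << 1) | int(b)
    return code
```

Note that scaling a vector by a positive constant never changes its sign bits, which is why this construction matches cosine similarity rather than Euclidean distance.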

Interactive: Hash buckets

With 2 hyperplanes, vectors fall into 4 buckets (00, 01, 10, 11), each holding the vectors whose sign bits match that code.

Trade-off: More bits (k) = smaller buckets = faster search, but higher chance that similar items end up in different buckets (false negatives).

With k hyperplanes, we get 2^k possible hash buckets. Similar vectors are likely to land in the same bucket because they agree on most hyperplane sides.

But a single table of k hyperplanes produces false negatives: two similar vectors might disagree on just one hyperplane and end up in different buckets.

Multiple Hash Tables

To reduce false negatives, LSH uses L independent hash tables, each with different random hyperplanes.

At query time:

  1. Hash the query with each of the L hash functions
  2. Retrieve vectors from all matching buckets
  3. Compute exact distances for retrieved candidates
  4. Return the top k
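The whole multi-table pipeline fits in a small class. This is a toy sketch of the scheme described above, not a production implementation: candidates are gathered from all L tables, then scored exactly by cosine similarity.

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """Toy multi-table LSH index for cosine similarity (illustrative sketch)."""

    def __init__(self, vectors, k=8, L=4, seed=0):
        rng = np.random.default_rng(seed)
        self.vectors = np.asarray(vectors, dtype=float)
        d = self.vectors.shape[1]
        # L independent sets of k random hyperplanes, one per table.
        self.planes = [rng.normal(size=(k, d)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for table, P in zip(self.tables, self.planes):
            codes = self.vectors @ P.T >= 0  # (n, k) sign bits
            for i, bits in enumerate(codes):
                table[bits.tobytes()].append(i)

    def query(self, q, topk=5):
        q = np.asarray(q, dtype=float)
        # Steps 1-2: hash the query in every table, union the buckets.
        candidates = set()
        for table, P in zip(self.tables, self.planes):
            candidates.update(table.get((P @ q >= 0).tobytes(), []))
        # Steps 3-4: exact cosine similarity on candidates only, top k.
        qn = np.linalg.norm(q) + 1e-12
        scored = sorted(
            candidates,
            key=lambda i: -np.dot(self.vectors[i], q)
            / (np.linalg.norm(self.vectors[i]) * qn),
        )
        return scored[:topk]
```

Only the candidate set, not the full collection, is ever scored exactly; everything else is filtered out by the hash lookups.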

More tables mean fewer false negatives and better recall, at the cost of more candidates to score.

LSH for Different Metrics

Different similarity measures require different hash functions:

Cosine similarity: Random hyperplane projection (described above)

Euclidean distance: Random projection with bucketing into intervals

Jaccard similarity: MinHash (for set similarity)

The key requirement: the hash function must satisfy the locality-sensitive property for your target metric.
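As one concrete example beyond hyperplanes, here is a minimal MinHash sketch for Jaccard similarity. The seeded-SHA-1 construction is just one simple way to get a family of hash functions; production MinHash implementations use faster hashes.

```python
import hashlib

def minhash(items, num_hashes=64):
    """Signature: for each seed, the minimum hash value over the set's items."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.sha1(f"{seed}:{x}".encode()).hexdigest(), 16)
            for x in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The locality-sensitive property here is that two sets agree on any one MinHash position with probability exactly equal to their Jaccard similarity.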

Trade-offs and Parameters

Interactive: LSH trade-offs

Adjust k and L to see the effects directly: higher k gives smaller buckets and faster search but more false negatives; higher L gives better recall but more tables to check and more memory.

k (bits per hash):

  • Higher k → fewer vectors per bucket → faster search
  • Higher k → more false negatives → lower recall
  • Memory scales as O(L × n × k)

L (number of tables):

  • Higher L → better recall → more buckets to check
  • Query time scales roughly as O(L × bucket_size)
  • Memory scales as O(L × n × k)

Tuning LSH requires balancing these parameters for your recall/latency/memory constraints.
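The standard amplification analysis behind this tuning: if a pair of vectors collides under a single hyperplane with probability p, it shares a bucket in one k-bit table with probability p^k, and is retrieved from at least one of L tables with probability 1 − (1 − p^k)^L. A small sketch (the formula is standard; the parameter values below are illustrative):

```python
def retrieval_prob(p, k, L):
    """Probability that a pair with per-hyperplane collision probability p
    shares a bucket in at least one of L tables of k bits each."""
    return 1 - (1 - p**k) ** L

# Similar pair (p = 0.9) vs dissimilar pair (p = 0.5) at several settings:
for k, L in [(8, 1), (8, 10), (16, 10)]:
    print(k, L, retrieval_prob(0.9, k, L), retrieval_prob(0.5, k, L))
```

Raising k suppresses dissimilar pairs much faster than similar ones; raising L recovers the similar pairs that a single table misses.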

LSH vs HNSW

| Aspect | LSH | HNSW |
| --- | --- | --- |
| Query time | O(L × bucket_size) | O(log n) |
| Build time | O(L × n × k) | O(n log n) |
| Memory | Multiple hash tables | Single graph |
| Updates | Fast (just hash) | Fast (local graph update) |
| Recall at same latency | Lower | Higher |

HNSW generally outperforms LSH in practice for most retrieval workloads. LSH remains relevant for:

  • Streaming data (constant-time insertion)
  • Very high-dimensional sparse data
  • Theoretical guarantees on recall

Practical Considerations

Tuning is hard. Unlike HNSW where efSearch directly controls recall, LSH parameters interact in complex ways.

Memory overhead. Multiple hash tables can exceed the vector storage itself.

Implementation quality. Production LSH requires careful engineering (cache-friendly memory layout, SIMD hash computation).

Not the default choice. Most vector databases use HNSW or IVF as their primary index structure.

When to Consider LSH

  • You need theoretical worst-case guarantees
  • Your data is high-dimensional and sparse
  • You have streaming insertions and cannot rebuild indices
  • Memory is less constrained than latency

For most semantic search workloads, HNSW is the better choice.

Key Takeaways

  • LSH uses hash functions where similar items hash to the same bucket
  • Random hyperplane projections work for cosine similarity: probability of collision increases with similarity
  • Multiple hash tables (L) reduce false negatives at the cost of more bucket lookups
  • Parameters k (bits) and L (tables) control the recall/speed/memory trade-off
  • HNSW generally outperforms LSH for practical retrieval; LSH is more relevant for streaming or sparse data
  • LSH provides theoretical guarantees that graph-based methods lack