Locality-Sensitive Hashing

Random projections and hash collisions for approximate search

Locality-Sensitive Hashing (LSH) takes a radically different approach to approximate search. Instead of building a navigable graph, LSH uses hash functions designed so that similar items hash to the same bucket. Query by hashing, then search only within matching buckets.

The Core Idea

Normal hash functions aim for a uniform distribution: even similar inputs should produce dissimilar outputs. LSH inverts this: similar inputs should produce the same hash.

Interactive: LSH concept

LSH: design hash functions so similar items collide with high probability.

If similar vectors hash to the same bucket with high probability, we can:

  1. Hash all vectors during indexing
  2. Hash the query at search time
  3. Search only vectors in matching buckets
  4. Ignore vectors in other buckets

This gives sublinear search time without building an explicit graph structure.
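The bucket-search idea above can be sketched with an ordinary dictionary. The `lsh_hash` argument stands in for any locality-sensitive hash function (concrete constructions follow below); the names here are illustrative, not a particular library's API.

```python
from collections import defaultdict

def build_index(vectors, lsh_hash):
    """Hash every vector once; each bucket holds the ids sharing that hash."""
    buckets = defaultdict(list)
    for i, v in enumerate(vectors):
        buckets[lsh_hash(v)].append(i)
    return buckets

def query(buckets, q, lsh_hash):
    """Only vectors in the query's bucket are candidates; all others are ignored."""
    return buckets.get(lsh_hash(q), [])
```

Search cost is proportional to the size of one bucket rather than the whole collection, which is where the sublinear behavior comes from.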

Random Hyperplane Projections

For cosine similarity, the classic LSH method uses random hyperplanes.

A hyperplane in d dimensions divides the space into two half-spaces. Any vector falls on one side or the other. If we use the hyperplane's normal vector \vec{r}, the sign of \vec{v} \cdot \vec{r} tells us which side \vec{v} falls on.

Interactive: Random projections

Rotate the hyperplane to see how points get classified into side 1 (v·r ≥ 0) or side 0 (v·r < 0). Similar points tend to stay together.

The key property: vectors on the same side of a random hyperplane are more likely to be similar.

More precisely, the probability that two vectors end up on the same side is:

P(\text{same side}) = 1 - \frac{\theta}{\pi}

where \theta is the angle between the vectors. Small angle (similar vectors) → high probability of landing on the same side.
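This collision probability is easy to verify empirically: sample many random hyperplane normals and count how often two fixed vectors land on the same side. A sketch using NumPy, with two unit vectors at a 60° angle (so the predicted probability is 1 − (π/3)/π = 2/3):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unit vectors at a 60-degree angle in the plane.
theta = np.pi / 3
u = np.array([1.0, 0.0])
v = np.array([np.cos(theta), np.sin(theta)])

# Predicted probability that a random hyperplane keeps them together.
predicted = 1 - theta / np.pi  # = 2/3

# Empirical estimate: sample random normals, compare dot-product signs.
normals = rng.normal(size=(100_000, 2))
same_side = np.sign(normals @ u) == np.sign(normals @ v)
empirical = same_side.mean()
```

With 100,000 samples the empirical rate lands within a fraction of a percent of 2/3.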

Building LSH Hash Tables

An LSH hash combines multiple random projections:

  1. Choose k random hyperplanes
  2. For each vector, compute which side of each hyperplane it falls on
  3. Concatenate the k bits to form a hash code
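The three steps above amount to a few lines of NumPy. This is a sketch: Gaussian normals are one standard choice for the random hyperplanes, and the function names are illustrative.

```python
import numpy as np

def make_hyperplanes(k, d, seed=0):
    """k random hyperplanes in d dimensions, represented by their normals."""
    return np.random.default_rng(seed).normal(size=(k, d))

def hash_code(v, hyperplanes):
    """One sign bit per hyperplane, concatenated into a k-bit integer."""
    bits = (hyperplanes @ v) >= 0
    code = 0
    for b in bits:
        code = (code << 1) | int(b)
    return code
```

Note that scaling a vector by a positive constant never changes its sign bits, which is why this construction matches cosine similarity rather than Euclidean distance.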

Interactive: Hash buckets

With 2 hyperplanes, vectors fall into 4 buckets (00, 01, 10, 11), each holding the vectors whose sign bits match that code.

Trade-off: More bits (k) = smaller buckets = faster search, but higher chance that similar items end up in different buckets (false negatives).

With k hyperplanes, we get 2^k possible hash buckets. Similar vectors are likely to land in the same bucket because they agree on most hyperplane sides.

But a single table of k hyperplanes produces false negatives: two similar vectors might disagree on just one hyperplane and end up in different buckets.

Multiple Hash Tables

To reduce false negatives, LSH uses L independent hash tables, each with different random hyperplanes.

At query time:

  1. Hash the query with each of the L hash functions
  2. Retrieve vectors from all matching buckets
  3. Compute exact distances for retrieved candidates
  4. Return the top k
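The whole multi-table pipeline fits in a small class. This is a toy sketch of the scheme described above, not a production implementation: candidates are gathered from all L tables, then scored exactly by cosine similarity.

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """Toy multi-table LSH index for cosine similarity (illustrative sketch)."""

    def __init__(self, vectors, k=8, L=4, seed=0):
        rng = np.random.default_rng(seed)
        self.vectors = np.asarray(vectors, dtype=float)
        d = self.vectors.shape[1]
        # L independent sets of k random hyperplanes, one per table.
        self.planes = [rng.normal(size=(k, d)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for table, P in zip(self.tables, self.planes):
            codes = self.vectors @ P.T >= 0  # (n, k) sign bits
            for i, bits in enumerate(codes):
                table[bits.tobytes()].append(i)

    def query(self, q, topk=5):
        q = np.asarray(q, dtype=float)
        # Steps 1-2: hash the query in every table, union the buckets.
        candidates = set()
        for table, P in zip(self.tables, self.planes):
            candidates.update(table.get((P @ q >= 0).tobytes(), []))
        # Steps 3-4: exact cosine similarity on candidates only, top k.
        qn = np.linalg.norm(q) + 1e-12
        scored = sorted(
            candidates,
            key=lambda i: -np.dot(self.vectors[i], q)
            / (np.linalg.norm(self.vectors[i]) * qn),
        )
        return scored[:topk]
```

Only the candidate set, not the full collection, is ever scored exactly; everything else is filtered out by the hash lookups.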

More tables mean fewer false negatives and better recall, at the cost of more candidates to score.

LSH for Different Metrics

Different similarity measures require different hash functions:

Cosine similarity: Random hyperplane projection (described above)

Euclidean distance: Random projection with bucketing into intervals

Jaccard similarity: MinHash (for set similarity)

The key requirement: the hash function must satisfy the locality-sensitive property for your target metric.
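As one concrete example beyond hyperplanes, here is a minimal MinHash sketch for Jaccard similarity. The seeded-SHA-1 construction is just one simple way to get a family of hash functions; production MinHash implementations use faster hashes.

```python
import hashlib

def minhash(items, num_hashes=64):
    """Signature: for each seed, the minimum hash value over the set's items."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.sha1(f"{seed}:{x}".encode()).hexdigest(), 16)
            for x in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The locality-sensitive property here is that two sets agree on any one MinHash position with probability exactly equal to their Jaccard similarity.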

Trade-offs and Parameters

Interactive: LSH trade-offs

Adjust k and L to see the effects directly: higher k gives smaller buckets and faster search but more false negatives; higher L gives better recall but more tables to check and more memory.

k (bits per hash):

  • Higher k → fewer vectors per bucket → faster search
  • Higher k → more false negatives → lower recall
  • Memory scales as O(L × n × k)

L (number of tables):

  • Higher L → better recall → more buckets to check
  • Query time scales roughly as O(L × bucket_size)
  • Memory scales as O(L × n × k)

Tuning LSH requires balancing these parameters for your recall/latency/memory constraints.
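The standard amplification analysis behind this tuning: if a pair of vectors collides under a single hyperplane with probability p, it shares a bucket in one k-bit table with probability p^k, and is retrieved from at least one of L tables with probability 1 − (1 − p^k)^L. A small sketch (the formula is standard; the parameter values below are illustrative):

```python
def retrieval_prob(p, k, L):
    """Probability that a pair with per-hyperplane collision probability p
    shares a bucket in at least one of L tables of k bits each."""
    return 1 - (1 - p**k) ** L

# Similar pair (p = 0.9) vs dissimilar pair (p = 0.5) at several settings:
for k, L in [(8, 1), (8, 10), (16, 10)]:
    print(k, L, retrieval_prob(0.9, k, L), retrieval_prob(0.5, k, L))
```

Raising k suppresses dissimilar pairs much faster than similar ones; raising L recovers the similar pairs that a single table misses.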

LSH vs HNSW

| Aspect | LSH | HNSW |
| --- | --- | --- |
| Query time | O(L × bucket_size) | O(log n) |
| Build time | O(L × n × k) | O(n log n) |
| Memory | Multiple hash tables | Single graph |
| Updates | Fast (just hash) | Fast (local graph update) |
| Recall at same latency | Lower | Higher |

HNSW generally outperforms LSH in practice for most retrieval workloads. LSH remains relevant for:

  • Streaming data (constant-time insertion)
  • Very high-dimensional sparse data
  • Theoretical guarantees on recall

Practical Considerations

Tuning is hard. Unlike HNSW where efSearch directly controls recall, LSH parameters interact in complex ways.

Memory overhead. Multiple hash tables can exceed the vector storage itself.

Implementation quality. Production LSH requires careful engineering (cache-friendly memory layout, SIMD hash computation).

Not the default choice. Most vector databases use HNSW or IVF as their primary index structure.

When to Consider LSH

  • You need theoretical worst-case guarantees
  • Your data is high-dimensional and sparse
  • You have streaming insertions and cannot rebuild indices
  • Memory is less constrained than latency

For most semantic search workloads, HNSW is the better choice.

Key Takeaways

  • LSH uses hash functions where similar items hash to the same bucket
  • Random hyperplane projections work for cosine similarity: probability of collision increases with similarity
  • Multiple hash tables (L) reduce false negatives at the cost of more bucket lookups
  • Parameters k (bits) and L (tables) control the recall/speed/memory trade-off
  • HNSW generally outperforms LSH for practical retrieval; LSH is more relevant for streaming or sparse data
  • LSH provides theoretical guarantees that graph-based methods lack