Filtering and Metadata

Pre-filtering, post-filtering, and hybrid approaches

Real-world search is rarely pure similarity. You want documents from the last 30 days, in English, from a specific tenant. These metadata filters interact with vector search in non-obvious ways. The approach matters for both correctness and performance.

The Filtering Problem

Consider: "Find the 10 most similar documents, but only from tenant_id = 'acme'."

The naive approach: retrieve the top 10 by similarity, then filter. But what if the top 100 results are all from other tenants? You return fewer than 10 results—or none.

Interactive: Filtering approaches

Vectors searched
34
Results returned
10/10
Wasted computation
23

Post-filter: With 30% selectivity, most results pass. Works well here.

Vector databases must balance:

  • Correctness: Return exactly k results matching the filter
  • Efficiency: Don't scan the entire database
  • Quality: Return the most similar results

Pre-filtering vs Post-filtering

Pre-filtering: Apply metadata filters first, then search only within matching vectors.

Pros:

  • Guarantees k results (if enough match)
  • Semantically correct

Cons:

  • Must build filter-specific indices
  • Performance degrades with selective filters

Post-filtering: Search all vectors, then filter results.

Pros:

  • Simple to implement
  • Uses standard vector index

Cons:

  • May return fewer than k results
  • Wastes computation on filtered-out results

Filter Selectivity

The key factor is selectivity: what fraction of vectors pass the filter?

Interactive: Filter selectivity effects

Category filter

5 categories, ~20% each

Selectivity
20.00%
Post-filter fetch needed
50 vectors
Feasible

Rule of thumb: If selectivity is below 1%, post-filtering becomes impractical. Pre-filtering or partitioning is required.

High selectivity (>10% pass): Post-filtering works well. Retrieve 10k candidates, filter, keep top k.

Low selectivity (<1% pass): Pre-filtering is essential. Post-filtering would require retrieving far too many candidates.

Very low selectivity (<0.01%): Specialized indices needed (e.g., per-tenant partitions).

Hybrid Filtering

Modern vector databases use hybrid approaches:

Hybrid filtering strategies

Adaptive

Choose pre/post based on estimated selectivity

Pros
  • + Best of both worlds
  • + Automatic optimization
Cons
  • - Requires selectivity estimation
  • - More complex implementation

Adaptive strategy: Choose pre- vs post-filtering based on estimated selectivity.

Filtered HNSW: Modify graph traversal to skip non-matching nodes. Requires filter-aware index.

Partitioned indices: Build separate vector indices per partition (e.g., per tenant). Query only relevant partitions.

Bitmap intersection: Maintain bitmaps for metadata values. Intersect with vector search candidates.

Metadata Index Design

Metadata indexing strategies

FieldIndex TypeCardinalityOperations
tenant_idHash IndexLow= only
categoryB-treeLow-Medium=, IN
timestampRange TreeHigh<, >, BETWEEN
tagsInverted IndexMediumCONTAINS
priceB-treeHigh<, >, =

Index selection guide

  • Equality filters: Hash or B-tree index
  • Range filters: B-tree or specialized range tree
  • Set membership: Inverted index for tags/keywords
  • High cardinality + equality: Consider partitioning instead

Efficient filtering requires indexing metadata:

Scalar fields (tenant_id, category):

  • B-tree or hash indices for equality
  • Fast lookup, supports compound filters

Numeric ranges (timestamp, price):

  • Range tree or sorted index
  • Efficient for "greater than" / "less than"

Text fields (tags, keywords):

  • Inverted index
  • Supports "contains" and keyword matching

High-cardinality fields:

  • Consider partitioning instead of filtering
  • Bitmap indices for low cardinality

Common Patterns

Multi-tenant Search

Each tenant's data should be isolated. Options:

  1. Separate indices per tenant: Strongest isolation, highest overhead
  2. Partition key in index: Single index, query filtered by tenant
  3. Post-filter with tenant check: Simple but wastes computation

Choose based on tenant count and size distribution.

Time-windowed Search

"Documents from the last 7 days" is common. Options:

  1. Timestamp filter: Works if selectivity is reasonable
  2. Rolling partitions: Daily/weekly indices, query recent partitions
  3. TTL with index rebuild: Automatically expire old vectors

Category Filtering

"Only category = 'electronics'" with dozens of categories. Options:

  1. Pre-filter with category bitmap: Fast for any selectivity
  2. Per-category indices: If categories are large and stable
  3. Hybrid: Use vector search, filter, oversample if needed

Performance Considerations

Filter pushdown: Apply filters as early as possible in the query plan.

Index intersection: Combine vector search candidates with metadata filter results.

Caching: Cache frequent filter results (tenant bitmaps, category lists).

Partition pruning: Skip entire index partitions that cannot match.

Over-fetching: Retrieve more candidates than needed to ensure k results after filtering.

When to Partition vs Filter

FactorPartitionFilter
CardinalityLow (<100)Any
Data size per partitionLargeAny
Query patternsUsually single partitionMixed
Update frequencyStableFrequent changes
Isolation requirementsStrict (multi-tenant)Relaxed

Partitioning adds complexity but enables:

  • Independent scaling per partition
  • Strict data isolation
  • Simplified filtering (no runtime check)

Key Takeaways

  • Post-filtering is simple but fails with selective filters (returns fewer than k results)
  • Pre-filtering guarantees k results but requires filter-aware indices
  • Filter selectivity determines which approach works: high selectivity → post-filter, low → pre-filter
  • Hybrid approaches adapt based on estimated selectivity
  • Index metadata fields appropriately: B-trees for scalars, range trees for numbers, inverted indices for text
  • Consider partitioning for multi-tenant or time-series data with clear access patterns
  • Over-fetch to compensate for post-filtering; push filters down early for performance