Manas HNSW Streaming Filters

Tim Koh | Software Engineer, Core Product Serving Infra; George Wu | (former) Software Engineer, Core Product Serving Infra

Introduction

Embedding-based retrieval is a centerpiece of our recommendations engine at Pinterest. We support a myriad of use cases, from retrieval based on content similarity to learned retrieval. It's powered by our in-house search engine, Manas, which provides Approximate Nearest Neighbor (ANN) search as a service, primarily using Hierarchical Navigable Small World (HNSW) graphs.

While traditional token-based search retrieves documents by matching terms combined through a tree of logical connectives like ANDs and ORs, ANN search retrieves documents based on embedding similarity. Oftentimes we'd like to run a hybrid query that combines the two. For example: "find products similar to this pair of shoes that cost less than $100, are rated 4 stars or more, and ship to the UK." This is a common problem, and it is not unsolved, but the existing solutions each come with their own caveats and trade-offs.
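To make the hybrid shape concrete, here is a sketch of what such a request could look like. The field names (price_usd, rating, ships_to) and the request structure are hypothetical and only meant to illustrate combining an ANN clause with a token-based filter tree; this is not the actual Manas request format.

```python
# Hypothetical hybrid request: ANN similarity plus a token-based filter tree.
hybrid_query = {
    "ann": {
        # query embedding of the pair of shoes (truncated for readability)
        "embedding": [0.12, -0.53, 0.08, 0.91],
        "num_candidates": 100,
    },
    # filters expressed as a tree of conjunctions/disjunctions over terms
    "filter": {
        "and": [
            {"range": {"field": "price_usd", "lt": 100}},
            {"range": {"field": "rating", "gte": 4}},
            {"term": {"field": "ships_to", "value": "UK"}},
        ]
    },
}
```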

Existing Solutions

Our previous approach was post-filtering: perform the ANN search first, then run a token-based search restricted to the result set. Post-filtering suffers from poor funnel efficiency, and we used an overfetch to work around this problem. However, this is not scalable: clients need to continuously tune their overfetch, and every request can have a different filter rate.
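For illustration, here is a minimal sketch of post-filtering with overfetch. The ann_search and matches_filter callables are placeholders injected by the caller, not real Manas APIs.

```python
# A minimal sketch of post-filtering with a fixed overfetch factor.
def post_filter_search(ann_search, matches_filter, query_embedding, filter_tree,
                       k, overfetch_factor=3):
    # Overfetch from the ANN index, hoping enough candidates survive the filter.
    candidates = ann_search(query_embedding, k * overfetch_factor)
    passed = [doc_id for doc_id in candidates if matches_filter(doc_id, filter_tree)]
    # If the actual filter rate is higher than the overfetch anticipated,
    # we silently return fewer than k results.
    return passed[:k]
```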

Pre-Filtering

An alternative approach is pre-filtering. First, we figure out the set of docs that match the token-based filters, either at indexing time or by evaluating the token search query first. Then we perform the ANN search while filtering out docs that do not exist in this set. However, the indexing-time approach is hard to generalize to arbitrary filter trees, and pre-evaluating the token search query works well for a simple set of filters or a small corpus of documents, but we have use cases that fall into neither category. Even traditional search without ANN often searches through only a tiny subset of our large corpuses due to early termination.
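In the same sketch style, pre-filtering evaluates the token query up front and constrains the ANN search to the resulting allow-set. Again, both callables are hypothetical placeholders.

```python
# A minimal sketch of pre-filtering: build the allow-set, then constrain the ANN search.
def pre_filter_search(evaluate_token_query, ann_search_with_allowlist,
                      query_embedding, filter_tree, k):
    # Materialize every doc that passes the token-based filters first;
    # this can be expensive for complex filters or a large corpus.
    allowed_doc_ids = evaluate_token_query(filter_tree)
    # Then run the ANN search, skipping any doc not in the allow-set.
    return ann_search_with_allowlist(query_embedding, k, allowed_doc_ids)
```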

Our Solution

Each approach has its merits, and depending on the scenario, one of them might even be the ideal way to solve the problem. But we are a generic platform serving a myriad of use cases at Pinterest, each with a different corpus size, query complexity, and filter conditions. Because of this, we chose a generic approach of applying filters during the HNSW graph traversal in a streaming fashion. We make no assumptions about the use case while still providing a way to build on this framework and apply optimizations as needed (e.g. pre-evaluation can be added as a preprocessing step when constructing filters).

Overview

Before: Query is expressed as a tree with HNSW prefetch performed at the leaf, reducing the hybrid query to a traditional search query. After: HNSW is extracted out of the leaf into an Iterator that can stream results approximately ordered by distance. The rest of the tree is applied as a filter on these results.

The above diagram summarizes how an ANN query is processed in our system before and after the streaming changes. There are a few notable points:

  1. HNSW changes from a batch prefetch in the query parsing stage to a streaming fetch in the query execution stage.
  2. Query execution changes from retrieving docs in doc_id order to retrieving them in approximate distance order. This is a problem to tackle because, as a search engine, our index formats are optimized for doc_id order.
  3. The query structure remains unchanged, providing backward compatibility and a seamless migration.
  4. Lightweight scoring has been decoupled from being executed in the iterator tree. This isn’t important for HNSW streaming, but it is aligned with our direction to generalize scoring away from our tree-based linear combination approach.

There are also a few principles that influenced our design, which could be helpful to call out:

  1. Modularity: ANN retrieval, filtering, and scoring should all be decoupled from each other.
  2. Minimal Changes: Build and launch quickly by reusing existing components as much as possible and optimize later as needed.
  3. Backward Compatibility: Clients should be able to onboard with minimal changes to their request.
  4. Forward Compatibility: Interfaces should be generic, and each component (e.g. filters index format) should be easily upgradable.

Hopefully this section gives a good high-level overview of the system components and why we've structured things this way. To get a deeper understanding of how everything works, there are two black boxes we need to open: 1) the streaming algorithm, and 2) how filters work. Let's dive in!

Streaming Algorithm

The streaming algorithm is actually quite simple: we fetch some candidates, apply our filters, score them, add them to a result heap, and then repeat. The diagram below shows this at a high level.

Fetch Candidates -> Apply Filters -> Score -> Add to Result Heap. This is repeated until we hit the stopping condition.

Here are some things we considered during implementation (a sketch of the resulting loop follows this list):

  1. Initially we designed the streaming process to retrieve one candidate at a time, but we quickly realized that the roundtrip fetch/filter/score would not be efficient, so we switched to using mini-batches. We then needed to decide what mini-batch size to use. HNSW actually stores a list of every node’s neighbors, so we used the list of neighbors as the mini-batch.
  2. In order to continue the stream, we needed to store some state from the internal HNSW algorithm. Since we used the list of neighbors as the mini-batch, we only stored the candidates we already processed (visited list) and the candidates we still needed to process (candidate set).
  3. Lastly, we had to figure out when to stop the streaming search. This deserves its own section, which we’ll address next.
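Putting these pieces together, here is a simplified Python sketch of the streaming loop. It assumes a single-layer, in-memory neighbor-list graph and injected distance, filter, and stopping-condition callables; the real Manas implementation differs in its data structures and batching details, and the stopping condition is covered in the next section.

```python
import heapq

# A simplified sketch of the streaming mini-batch search over a neighbor-list graph.
def streaming_search(graph, entry_point, distance, passes_filter, should_stop, k):
    # State carried across mini-batches: docs we've already seen (visited list)
    # and docs still waiting to be expanded (candidate set, a min-heap by distance).
    visited = {entry_point}
    candidates = [(distance(entry_point), entry_point)]
    results = []  # max-heap of (negated distance, doc_id), capped at k entries

    while not should_stop(results, candidates, k):
        _, node = heapq.heappop(candidates)
        # Mini-batch = this node's stored neighbor list.
        for neighbor in graph[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            d = distance(neighbor)
            heapq.heappush(candidates, (d, neighbor))
            # Filtered-out docs still steer the traversal via the candidate set;
            # only docs that pass the filters enter the result heap.
            if passes_filter(neighbor):
                heapq.heappush(results, (-d, neighbor))
                if len(results) > k:
                    heapq.heappop(results)

    return sorted((-neg_d, doc_id) for neg_d, doc_id in results)
```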

Stopping Conditions

Stepping back a bit: in the original HNSW paper, the algorithm doesn't terminate when it has retrieved enough candidates; rather, it terminates when the candidates it has accumulated are all closer than the closest candidate in the candidate set. The main intuition is to make sure the algorithm retrieves the best (closest) candidates with high probability. We applied the same concept to the streaming search, with the key difference being that we operate on post-filtered candidates only.

Time Budget

In high filter rate scenarios, we could end up traversing the whole graph and still not find enough candidates, causing extremely high latency. Since most clients have a latency requirement, we used a time budget to cap the time the streaming search took. Once we hit the budget, we would return the candidates that we had accumulated.
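As a sketch, here is a stopping predicate compatible with the loop above. It combines the closest-frontier check with a wall-clock time budget; make_should_stop is a hypothetical helper, not the production code.

```python
import time

# Stop when the budget is exhausted, the frontier is empty, or nothing left in
# the candidate set can beat the results we've already accumulated.
def make_should_stop(time_budget_sec):
    start = time.monotonic()

    def should_stop(results, candidates, k):
        if time.monotonic() - start > time_budget_sec:
            return True   # budget exhausted: return whatever we've accumulated
        if not candidates:
            return True   # we've traversed everything reachable in the graph
        closest_frontier_dist = candidates[0][0]
        worst_result_dist = -results[0][0] if results else float("inf")
        # Require k post-filtered results AND no closer frontier candidate.
        return len(results) >= k and closest_frontier_dist > worst_result_dist

    return should_stop
```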

Filters

The way we've designed filtering is highly influenced by two of the principles listed above: modularity and forward compatibility. The simplest way to implement filtering is to add code directly to the HNSW code. In fact, this is already done for deletion markers in the open-source HNSW code. However, this breaks modularity and isn't ideal for the maintainability and forward compatibility of the filter code. This is especially important to us because we serve many clients with many different filter requirements.

We designed the interfaces to not assume any underlying filter structure or storage format. And we implemented support for the main use case, where clients can specify arbitrary filter trees in the request, expressed with conjunction and disjunction connectives.

In the spirit of minimal changes, we reused the inverted index as the filter store. So essentially we have a filter tree backed by posting lists at the leaves, very similar in structure to the iterator tree we use in token-based search. This is convenient for reuse but inefficient, because the inverted index is optimized for doc_id-ordered iteration while HNSW streaming requires non-ordered, point-wise lookups. We worked around this by using bitmap- and array-backed posting lists instead of skiplist-backed posting lists, trading memory efficiency for compute efficiency. This does introduce a clear scalability challenge: with a large number of filters we simply cannot afford the memory cost, but this is not something we need to address in the near term. We have future work planned to upgrade the filter store.
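Here is a minimal sketch of what such a filter tree looks like when it only needs point-wise matches(doc_id) lookups. Plain Python sets stand in for the posting lists; in the real system the leaves are backed by the inverted index's bitmap and array posting lists.

```python
# A sketch of a filter tree evaluated with point-wise doc_id lookups.
class Leaf:
    def __init__(self, posting):
        self.posting = posting            # stand-in for a bitmap/array posting list

    def matches(self, doc_id):
        return doc_id in self.posting     # point-wise lookup, no ordered iteration


class And:
    def __init__(self, children):
        self.children = children

    def matches(self, doc_id):
        return all(child.matches(doc_id) for child in self.children)


class Or:
    def __init__(self, children):
        self.children = children

    def matches(self, doc_id):
        return any(child.matches(doc_id) for child in self.children)


# e.g. "in stock AND (ships to UK OR ships to IE)" with made-up postings
filter_tree = And([Leaf({1, 4, 7}), Or([Leaf({4, 9}), Leaf({2, 7})])])
filter_tree.matches(4)  # True
filter_tree.matches(1)  # False
```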

Optimizations

In some of our client use cases, the filter tree is extremely complex, and the filter stage takes up most of the latency. One optimization is to skip candidates whose distance is worse than everything in our result heap once the heap is full, so we avoid filtering candidates we would not select anyway.
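As a sketch, the check amounts to comparing a candidate's distance against the worst entry in a full result heap before paying the cost of the filter tree (same heap conventions as the loop sketch above).

```python
# Skip the (possibly expensive) filter evaluation for hopeless candidates.
def worth_filtering(results, k, candidate_distance):
    if len(results) < k:
        return True                      # heap not full yet: every candidate matters
    worst_kept_distance = -results[0][0]
    # A candidate farther than everything we'd keep can never enter the heap,
    # so evaluating the filter tree on it would be wasted work.
    return candidate_distance < worst_kept_distance
```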

Init with Batch

Instead of streaming from the very beginning, we first retrieve an initial batch equal in size to the number of candidates the client wants, since we need to retrieve at least that many anyway.

Reorder Filter Tree Nodes

Since streaming does non-ordered, point-wise lookups, the ordering of the filter tree nodes becomes important: evaluating the strictest filters first is more efficient.
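A sketch of that reordering for a conjunction node, assuming a caller-supplied selectivity estimate such as posting-list length (the estimated_matches callable is hypothetical):

```python
# Put the strictest children of an AND node first so mismatches short-circuit early.
def reorder_and_children(and_node, estimated_matches):
    # estimated_matches: callable returning roughly how many docs a child passes,
    # e.g. its posting-list length; ascending order checks the strictest filter first.
    and_node.children.sort(key=estimated_matches)
    return and_node
```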

Future Work

The key thing to note is that the current streaming approach does not actually reduce the number of candidates that need to be retrieved; it only figures out the proper overfetch automatically for each request. Each filtered-out candidate is still a wasted distance computation.

We’re currently experimenting with partitioning the space into separate subgraphs by larger filters, e.g. US or non-US. This works well for our use cases with a few large filters. A more scalable extension could be to label the graph with filters and allow a traversal over a disjunction or conjunction of labels.

Efficient Filter Store

Using the inverted index as a filter store works well for certain scenarios, but it is really optimized for traditional search, not for filtering during graph traversal. We could design a filter store from the ground up that is optimized for graph-based filtering and share it with our other graph-based retrieval systems like Pixie.

Quantization

Extremely high filter-rate scenarios can be solved with brute force, but there is still a spectrum of cases with a very high filter rate where brute force would be too expensive. The bottleneck in these cases is the large amount of wasted distance computation, which quantization can greatly reduce. We could either move to different algorithms like IVF with product quantization (PQ) or introduce PQ into HNSW.

Conclusion

We have implemented streaming filtering, which abstracts away the implementation details of how filtering is executed and relieves clients of the burden of overfetch tuning. From the system point of view, we have a generic filter solution that is flexible enough to support all of our use cases and can accommodate future optimizations like pre-filtering and filter store upgrades. We've already seen big cost savings and quality improvements just by removing imprecise overfetch tuning, and we have identified many opportunities for future optimization.

Stay tuned!

Acknowledgements: The authors would like to thank the following people for their contributions: Ahmed Thabet, Apoorv Sharma, Hari Venkatesan, Haibin Xie, Leo Lu, Michael Mi, Pak Ming Cheung, Peifeng Yin, Pihui Wei, Roger Wang, Sheng Cheng, Supeng Ge, Van Lam, Zheng Liu, Zihao Fan

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To view and apply to open opportunities, visit our Careers page.
