Manas HNSW Realtime: Powering Realtime Embedding-Based Retrieval

Pinterest Engineering
Pinterest Engineering Blog
9 min read · Jan 22, 2021

Tim Koh | Software Engineer, Core Product Serving Infra
George Wu | Software Engineer, Core Product Serving Infra
Michael Mi | Tech Lead, Core Product Serving Infra

In our previous blog post, we introduced our in-house search engine — Manas — and shared how we served term-based search at scale. Since launch, Manas has grown to become one of the key candidate generators at Pinterest, serving many use cases that extend beyond its initial purpose.

In particular, embedding-based retrieval is a key component of Pinterest’s discovery and recommendations engine. Manas has traditionally supported Approximate Nearest Neighbor (ANN) search via Locality-Sensitive Hashing (LSH) over the Inverted Index, a natural extension of a term-based search engine. After the release of new state-of-the-art technologies like Hierarchical Navigable Small World graphs (HNSW), we built a flexible embedding-based retrieval framework within Manas that allows us to easily onboard new ANN technologies. We launched HNSW on our batch indexing clusters (ranging from minutes to days of indexing delay) using the new framework, with large serving cost savings and latency reductions compared to LSH.

The next milestone in our plan was to launch HNSW on our realtime streaming clusters (seconds-level indexing delay). Serving HNSW in realtime at scale is no simple task, in part because we were forging new ground without any open source implementation to rely on.

In this blog, we will share our journey to serving HNSW in realtime — our approach to solving this problem, the challenges we faced, and some optimizations we made to productionize the system.

Manas Realtime

The essence of the project was to build realtime components for HNSW and integrate them into Manas Realtime. To better understand how these components fit into the bigger picture, let’s take a brief look at the high level architecture of Manas Realtime.

Manas Realtime is essentially an LSM engine, which converts random IO writes into sequential IO writes. Instead of exposing a write endpoint, writes are ingested from Kafka, allowing us to simplify the system and rely on Kafka as a WAL. There are three types of writes, and here is how they are handled:

  1. New documents are written to the in-memory Realtime Segment, which is eventually sealed and flushed to an on-disk Static Segment
  2. Deletions are applied using an in-memory marker and filtered out during serving
  3. Updates are done by deleting the old doc and adding a new one
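As a rough illustration, a minimal sketch of a consumer loop dispatching these three write types might look like the following; the `Document`, `RealtimeSegment`, and `Indexer` types are hypothetical stand-ins, not the actual Manas classes:

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical types for illustration only; the real Manas classes differ.
struct Document { std::string id; std::vector<float> embedding; };

class RealtimeSegment {
 public:
  void Add(const Document& doc) { docs_.push_back(doc); }
  bool Full() const { return docs_.size() >= kMaxDocs; }
 private:
  static constexpr std::size_t kMaxDocs = 1'000'000;
  std::vector<Document> docs_;
};

enum class WriteKind { kAdd, kDelete, kUpdate };

class Indexer {
 public:
  // Called for each message consumed from the Kafka WAL.
  void Apply(WriteKind kind, const Document& doc) {
    switch (kind) {
      case WriteKind::kDelete:
        deleted_.insert(doc.id);  // in-memory marker, filtered at serve time
        break;
      case WriteKind::kUpdate:
        deleted_.insert(doc.id);  // delete the old version...
        [[fallthrough]];
      case WriteKind::kAdd:
        segment_->Add(doc);       // ...then add the new one
        if (segment_->Full()) SealAndFlush();
        break;
    }
  }

 private:
  void SealAndFlush() {
    // Write the sealed segment out as an on-disk Static Segment (omitted),
    // then start a fresh in-memory Realtime Segment.
    segment_ = std::make_unique<RealtimeSegment>();
  }
  std::unique_ptr<RealtimeSegment> segment_ =
      std::make_unique<RealtimeSegment>();
  // In the real system the marker is scoped so it hides only older versions
  // of a doc, not the freshly added one; a plain set keeps this sketch short.
  std::unordered_set<std::string> deleted_;
};
```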

A background compaction process occasionally combines various Static Segments to reduce the serving overhead of having too many segments. We also rely on the compaction process to perform the actual deletions by removing the documents from the index.

From a serving perspective, Manas Realtime isn’t too different from Manas Static. We have abstractions on the index so that the storage layer is transparent to the entire retrieval process. As such, with HNSW already launched for Manas Static, most of the serving components already exist. Our work was mainly integrating with the LSM indexing components of Manas Realtime. There are two core components we needed to build and optimize, which we will go over in detail in the sections below:

  1. Realtime HNSW Graph
  2. HNSW Graph Compaction

Realtime HNSW Graph

The Realtime Segment is the only mutable component in the system, hence optimizations in this area are critical to ensuring good concurrent read and write performance.

The HNSW index is essentially a sparse, multi-layered graph. We chose an adjacency list to represent the graph, where the key is the node id and the value is the list of neighbor ids. We started with a lock-based version in which each node owns a lock that is held by both readers and writers before accessing the neighbor list. This was easy to implement and reason about; however, lock contention drove system CPU usage so high that we had no choice but to move to a lock-free technique.
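For concreteness, here is a minimal sketch of that initial lock-based representation, with hypothetical `Node` and `LayerGraph` types (the actual Manas structures differ):

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Lock-based adjacency list for one layer, one lock per node.
struct Node {
  std::mutex mu;                    // held by readers and writers
  std::vector<uint32_t> neighbors;  // ids of connected nodes in this layer
};

class LayerGraph {
 public:
  explicit LayerGraph(std::size_t max_nodes) : nodes_(max_nodes) {}

  std::vector<uint32_t> GetNeighbors(uint32_t id) {
    std::lock_guard<std::mutex> lock(nodes_[id].mu);
    return nodes_[id].neighbors;    // copy out under the lock
  }

  void SetNeighbors(uint32_t id, std::vector<uint32_t> neighbors) {
    std::lock_guard<std::mutex> lock(nodes_[id].mu);
    nodes_[id].neighbors = std::move(neighbors);
  }

 private:
  std::vector<Node> nodes_;         // pre-sized; node id indexes the list
};
```

Every read and write serializes on the node’s lock, which is exactly where the contention came from under concurrent search and ingestion.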

Lock-free Implementation

Let’s take an intuitive look at how we handle writes. HNSW originated from the well-known skip list structure, so the lock-free implementation for HNSW is similar to that of a lock-free skip list. In general, adding a new node to the graph involves two steps for each layer:

  1. Find neighbors for the new node within the layer and connect the new node to selected neighbors
  2. Update selected neighbors to connect to the new node.

Likewise, we add new nodes starting from the base layer and working up to the upper layers of the HNSW graph. This avoids scenarios where a new node gets selected as the entry point in an upper layer before any connections have been built for it in the lower layers, which would lead to empty results.
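Putting the two steps and the bottom-up order together, the insertion loop can be sketched as follows; `SearchLayer`, `SelectNeighbors`, `SetNeighbors`, and `AddNeighbor` are hypothetical helpers standing in for the real routines:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers standing in for the real routines.
struct HnswGraph;  // multi-layer adjacency list
std::vector<uint32_t> SearchLayer(HnswGraph& g, uint32_t node, int level);
std::vector<uint32_t> SelectNeighbors(const std::vector<uint32_t>& candidates,
                                      std::size_t m);
void SetNeighbors(HnswGraph& g, uint32_t node, int level,
                  const std::vector<uint32_t>& neighbors);
void AddNeighbor(HnswGraph& g, uint32_t node, int level, uint32_t neighbor);

void AddNode(HnswGraph& graph, uint32_t node, int node_max_level) {
  // Insert bottom-up: the node may only be chosen as an entry point in an
  // upper layer once its lower-layer connections already exist.
  for (int level = 0; level <= node_max_level; ++level) {
    // Step 1: find neighbors within the layer and connect the new node.
    std::vector<uint32_t> candidates = SearchLayer(graph, node, level);
    std::vector<uint32_t> neighbors = SelectNeighbors(candidates, /*m=*/16);
    SetNeighbors(graph, node, level, neighbors);
    // Step 2: update the selected neighbors to connect back to the new node.
    for (uint32_t n : neighbors) AddNeighbor(graph, n, level, node);
  }
}
```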

For deletions, we avoid the cost and complications of applying them to the graph in place. Instead, we handle them outside the graph with an in-memory deletion marker and filter out deleted nodes during serving.

Some detailed optimizations are worth mentioning briefly:

  • Single writer multiple readers: For simplicity, we continue the tradition of using a single-writer, multiple-readers concurrency model, resulting in neat, easy-to-reason-about code.
  • Pre-allocated graph: Since the realtime graph usually is small with a fixed size, we pre-allocate memory for the graph to avoid complications introduced by resizing.
  • Customized neighbor selection algorithm: With the standard neighbor selection algorithm, an update to a neighbor list has three possible outcomes: adding one new neighbor, reducing the number of neighbors, or replacing one neighbor. In a lock-free implementation, eliminating the ‘reducing neighbors’ scenario by backfilling the closest neighbors simplifies the logic considerably, enabling us to use plain atomic operations.
  • `Atomic` variables: C++ `std::atomic` variables are expensive even with release-acquire ordering. Instead, we use aligned memory to guarantee atomicity and a global atomic variable that serves as a memory barrier, enabling us to explicitly commit all changes for a node exactly once (see the sketch after this list). It is still possible for some partial updates to leak out to reader threads, briefly hurting global connectivity, but since we observed no noticeable recall drop, we treat this as a reasonable trade-off between performance and quality.
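As a sketch of the commit scheme in the last bullet, assume a fixed-size, cache-line-aligned neighbor list and a single writer thread. This is illustrative rather than the actual Manas code, and it relies on aligned word-sized loads and stores being atomic on the target platform, which the C++ standard itself does not guarantee:

```cpp
#include <atomic>
#include <cstdint>

constexpr uint32_t kMaxNeighbors = 32;

// Neighbor list stored in cache-line-aligned memory; fields are plain
// (non-atomic). Illustrative layout only, not the actual Manas structure.
struct alignas(64) NeighborList {
  uint32_t size = 0;
  uint32_t neighbors[kMaxNeighbors];
};

// A single global atomic serves as the memory barrier: the single writer
// commits all pending changes for a node with one release increment.
std::atomic<uint64_t> g_commit{0};

// Writer thread: mutate the node in place with plain stores, then publish.
void WriteNeighbors(NeighborList& node, const uint32_t* ids, uint32_t n) {
  for (uint32_t i = 0; i < n; ++i) node.neighbors[i] = ids[i];
  node.size = n;
  g_commit.fetch_add(1, std::memory_order_release);  // commit once per node
}

// Reader threads: one acquire load pairs with the writer's release. A reader
// may still observe a partial update for a brief window; search tolerates it.
uint32_t ReadNeighbors(const NeighborList& node, uint32_t* out) {
  g_commit.load(std::memory_order_acquire);
  uint32_t n = node.size;
  for (uint32_t i = 0; i < n; ++i) out[i] = node.neighbors[i];
  return n;
}
```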

HNSW Graph Compaction

The main problem we need to address with compaction is speed. As mentioned previously, compaction is how we reduce the total number of segments served simultaneously. At best, a long compaction results in higher CPU usage; at worst, the system halts ingestion, and new updates stop being reflected in serving.

Clean Slate Merger

Our first attempt at a compaction algorithm for HNSW is what we call clean slate: the algorithm builds a completely new graph from the non-deleted embeddings of all input segments. This method proved too slow for some of our use cases, so we needed to optimize it.
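In sketch form, with hypothetical `Segment` and `HnswGraph` interfaces, clean slate is just a full rebuild over the live embeddings:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical segment/graph interfaces for illustration.
struct Embedding { uint32_t id; std::vector<float> values; };
struct Segment {
  std::vector<Embedding> embeddings;
  bool IsDeleted(uint32_t id) const;
};
struct HnswGraph { void Add(const Embedding& e); };

// Clean slate: build a brand-new graph from every live embedding.
HnswGraph CleanSlateMerge(const std::vector<Segment>& inputs) {
  HnswGraph merged;
  for (const Segment& segment : inputs) {
    for (const Embedding& e : segment.embeddings) {
      if (!segment.IsDeleted(e.id)) merged.Add(e);  // skip deleted docs
    }
  }
  return merged;
}
```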

Add-on Merger

Our next strategy is to reuse as much of the index as possible: we choose the largest of the segments to be compacted and convert its index into an in-memory structure we can reuse. The remaining embeddings from the other segments are then added onto the reused graph.
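A sketch of the add-on strategy, reusing the hypothetical types from the clean slate sketch above; `AsMutableGraph()` is an assumed conversion from a segment’s on-disk index to an in-memory graph:

```cpp
#include <algorithm>
#include <vector>

// Assumed conversion of a segment's index into a mutable in-memory graph.
HnswGraph AsMutableGraph(const Segment& segment);

HnswGraph AddOnMerge(const std::vector<Segment>& inputs) {
  // Reuse the largest input so that as much of an existing graph as
  // possible survives the merge.
  auto largest = std::max_element(
      inputs.begin(), inputs.end(), [](const Segment& a, const Segment& b) {
        return a.embeddings.size() < b.embeddings.size();
      });
  HnswGraph merged = AsMutableGraph(*largest);
  for (auto it = inputs.begin(); it != inputs.end(); ++it) {
    if (it == largest) continue;
    for (const Embedding& e : it->embeddings) {
      if (!it->IsDeleted(e.id)) merged.Add(e);  // add on the live embeddings
    }
  }
  return merged;
}
```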

The remaining issue is how to handle the deleted embeddings from the reused segment. We tried two different methods: 1) persisting deletions and reselecting neighbors, and 2) grouping deleted embeddings with nearby alive embeddings. Although both options produce results suitable for clients, the first turned out to be too slow in some scenarios.

Persisting Deletions

We need to maintain the small world property of the graph, and simply removing deleted nodes and their in/out edges could break the connectivity in the graph. To address this, we use a process called neighbor reselection, where nodes potentially connect to deleted nodes’ neighbors to maintain connectivity.
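One way to picture reselection (a simplified sketch, not the exact Manas procedure): each alive node walks through its deleted neighbors and adopts candidates from their neighbor lists, iterating when those are deleted too. A real implementation would also re-apply the neighbor selection heuristic to cap the list size:

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helpers; real Manas code differs.
std::vector<uint32_t> GetNeighbors(uint32_t node);
bool IsDeleted(uint32_t node);
void SetNeighbors(uint32_t node, const std::vector<uint32_t>& neighbors);

void ReselectNeighbors(uint32_t node) {
  std::unordered_set<uint32_t> kept, visited;
  std::vector<uint32_t> frontier = GetNeighbors(node);
  visited.insert(frontier.begin(), frontier.end());
  while (!frontier.empty()) {
    std::vector<uint32_t> next;
    for (uint32_t n : frontier) {
      if (!IsDeleted(n)) { kept.insert(n); continue; }
      // Expand into the deleted neighbor's own neighbors to preserve
      // connectivity; chains of deleted nodes force extra iterations,
      // which is what makes this expensive.
      for (uint32_t nn : GetNeighbors(n)) {
        if (visited.insert(nn).second) next.push_back(nn);
      }
    }
    frontier = std::move(next);
  }
  SetNeighbors(node, std::vector<uint32_t>(kept.begin(), kept.end()));
}
```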

We found that with a large number of deleted nodes, the compaction time actually ended up slower than the clean slate algorithm, which is not ideal.

Grouping Deleted Nodes with their Closest Alive Nodes

There are two reasons why persisting deletions could be slower than using the clean slate algorithm.

  • We are backfilling the distances between the nodes and their neighbors in the reused segment, resulting in numerous expensive distance calculations.
  • The neighbor reselection process could be very expensive, especially if many nodes are deleted. This is because more reselection iterations are needed if a deleted node’s neighbors are also deleted.

Our second optimization is to group deleted nodes with nearby alive nodes, thereby avoiding the expensive reselection process. The original graph remains the same as before, but now multiple nodes map to the same embedding. Since the graph is unchanged, connectivity is maintained. In addition, we lazily compute the distances between nodes and their neighbors instead of proactively backfilling them, avoiding unnecessary distance computations. We also need to add a deduping step in the algorithm since multiple nodes can correspond to the same embedding.
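A small self-contained sketch of the grouping idea: the graph keeps all of its node ids, deleted nodes are remapped to their closest alive embedding, and serving dedupes on embedding id (names are illustrative):

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct GroupedIndex {
  // node id -> embedding id; a deleted node points at a nearby alive
  // embedding, so several nodes may share one embedding while the graph
  // structure (and hence connectivity) stays untouched.
  std::unordered_map<uint32_t, uint32_t> node_to_embedding;

  // Deduping step: candidates from graph traversal can collide on the same
  // embedding, so each embedding id is kept only once.
  std::vector<uint32_t> Dedupe(const std::vector<uint32_t>& node_ids) const {
    std::unordered_set<uint32_t> seen;
    std::vector<uint32_t> results;
    for (uint32_t node : node_ids) {
      uint32_t emb = node_to_embedding.at(node);
      if (seen.insert(emb).second) results.push_back(emb);
    }
    return results;
  }
};
```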

Online Recall Monitoring

Up until now, we’ve focused on how we built and optimized the components in our system. But there is another very important aspect of productionizing a system — quality validation. For HNSW, recall is the metric we use to verify the quality of our index. It is computed by comparing the results of the approximate nearest neighbor (ANN) search to the ideal results returned by an exact nearest neighbor (KNN) search.
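Concretely, recall@k reduces to the overlap between the two result sets; for example:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Recall@k: the fraction of the exact KNN results that the ANN search also
// returned, given two doc-id lists of length k.
double RecallAtK(std::vector<uint32_t> ann, std::vector<uint32_t> knn) {
  std::sort(ann.begin(), ann.end());
  std::sort(knn.begin(), knn.end());
  std::vector<uint32_t> overlap;
  std::set_intersection(ann.begin(), ann.end(), knn.begin(), knn.end(),
                        std::back_inserter(overlap));
  return knn.empty() ? 0.0 : static_cast<double>(overlap.size()) / knn.size();
}
```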

Monitoring recall is also especially important because some optimizations might involve a quality trade-off for better system performance. We need to track these quality degradations to ensure that we are still serving good results to our clients.

With an immutable set of embeddings, it’s relatively easy to calculate the recall for a given query. We can pre-compute the KNN using an offline batch job and compute the ANN by generating the index and issuing queries to it. Since the set of embeddings is constant, the KNN results never change, and we can tweak the index build parameters to optimize for recall.

However, in the realtime scenario, embeddings are constantly being added and deleted, making a pre-computed KNN set unusable. To address this, we developed an online recall tool: we added functionality in our serving cluster to compute both ANN and KNN results, which allows us to calculate recall at a given point in time.

What’s Next

It has been an exciting journey for us to launch HNSW on our batch indexing clusters and to push the boundaries of our capabilities by enabling realtime serving for HNSW. But HNSW is only the first step in our vision for our embedding-based retrieval system.

Efficiency and Experimentation

We’ve built a system that does the heavy lifting for productionizing embedding-based retrieval, allowing our ML engineers to try out a new embedding or a new algorithm without having to build a new production system from scratch. We will continue to iterate on the system, improving serving performance and funnel efficiency, and facilitating easy experimentation.

Streaming Filtering

The current approach to filtering is to pre-fetch K ANN results from the HNSW graph, then apply filters to get our final candidate set. This isn’t very funnel-efficient, and it can be hard to figure out what value of K will give us the number of final candidates we need. We plan to implement the HNSW algorithm in a streaming fashion, where filters are applied during the fetch and the streaming fetch terminates only once we have retrieved the number of candidates we need.
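In sketch form, the planned streaming fetch might look like the following, where `NextNearest()` stands in for a hypothetical iterator that yields graph results in increasing distance order:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical streaming fetch: yields the next-nearest doc id from the
// graph traversal, or nullopt when the graph is exhausted.
std::optional<uint32_t> NextNearest();

// Apply the filter inside the fetch loop and stop as soon as enough
// candidates pass, instead of over-fetching K results and filtering after.
std::vector<uint32_t> StreamingFilteredSearch(
    const std::function<bool(uint32_t)>& filter, std::size_t needed) {
  std::vector<uint32_t> results;
  while (results.size() < needed) {
    std::optional<uint32_t> doc = NextNearest();
    if (!doc) break;  // graph exhausted before we collected enough
    if (filter(*doc)) results.push_back(*doc);
  }
  return results;
}
```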

Stay tuned!

Acknowledgements: The authors would like to thank the following people for their contributions: Zheng Liu, Roger Wang, Haibin Xie, Sheng Cheng, Wangfan Fu, Ning Zhang, Pihui Wei, Pong Eksombatchai, Andrew Zhai, Sam Liu, and Vijai Mohan.
