Manas Two-stage Retrieval — The efficient architecture for hierarchical documents

Pinterest Engineering
Jan 29, 2021


Haibin Xie | Software Engineer, Core Product Serving Infra
Michael Mi | Tech Lead, Core Product Serving Infra

As more use cases are onboarded to Manas, a special scalability and efficiency challenge emerges when serving documents with a hierarchical structure. Manas, as a traditional search engine, was designed and optimized to support flattened documents. As a result, we have to flatten attributes of root documents all the way down to the leaf level regardless of the hierarchical structure, leading to inefficiencies in both the indexing and serving pipelines.

Let’s take a closer look at one concrete scenario for a better understanding of the hierarchical structure. The campaign structure, as shown in the below diagram, is organized in three levels: Campaign, Ad group, and Promoted Pin. Campaigns contain Ad groups, and each Ad group consists of a collection of Promoted Pins.

To serve such a hierarchical dataset, we need to flatten all the properties of the Campaign and Ad group down to the Pin level — the leaf node in the above diagram. In summary, with the flattened index we face two major challenges:

  • High serving cost: The flattened index structure imposes a linear scaling factor between corpus size and serving cost.
  • High indexing update volume: Whenever any Ad group property gets changed, we need to fan out the change to the Pin level, which turns out to be expensive. In addition, this huge spiky update volume in the indexing pipeline introduces a burden to the system’s stability and maintenance cost.

Two-stage query execution was introduced to address those challenges. It supports queries that require multiple round-trips and allows clients to query for a set of matches and then use those matches to construct and execute a new query. You can tell from the name that each request is now executed in two stages, with each stage querying a separate index from a different Manas cluster, as shown in the following diagram.

  1. Targeting match: query Ad group index for targeting match
  2. Item selection: fetch pins belonging to matched groups from separate Pin index
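As a rough sketch, the two stages above can be modeled as two round-trips on the client side (the function names and toy data below are illustrative, not Manas APIs):

```python
# Hypothetical sketch of the client-driven two-stage flow
# (all names and data are illustrative, not Manas APIs).

def two_stage_search(targeting_spec, ad_group_index, pin_index):
    # Stage 1: targeting match against the Ad group index.
    matched_group_ids = {
        gid for gid, attrs in ad_group_index.items()
        if targeting_spec(attrs)
    }
    # Stage 2: item selection — fetch Pins belonging to matched groups
    # from a separate Pin index.
    return [pin for pin, gid in pin_index if gid in matched_group_ids]

# Toy data: two Ad groups, three Promoted Pins.
ad_groups = {"g1": {"region": "US"}, "g2": {"region": "EU"}}
pins = [("p1", "g1"), ("p2", "g1"), ("p3", "g2")]

print(two_stage_search(lambda a: a["region"] == "US", ad_groups, pins))
# → ['p1', 'p2']
```

Each stage queries a separate index and the client stitches the results together, which is exactly the aggregation complexity the rest of this post pushes down to the Manas leaf.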

This optimization turned out to be very successful, resulting in a huge serving cost reduction.

However, the additional overhead of multiple round-trips can’t be neglected. In addition, client code is complex since clients need to deal with messy aggregation and query construction for multiple stages. We are challenging ourselves to further improve funnel efficiency and simplify the serving/indexing stack by performing expensive operations lower in the serving stack and pushing two-stage query execution all the way to Manas leaf.

In this blog, we’ll share how we built the generic two-stage retrieval framework in Manas to support efficient retrieval for hierarchical documents.

Data Model

Let’s start with the data model of Manas, then discuss how we tailored it to support hierarchical documents.

Flattened Index

Documents are the unit of indexing and serving. A document contains a set of fields, where each field represents one property of the document. We build an inverted index for each field and leverage skip lists to allow queries to efficiently jump over documents. When it comes to query execution, we scan documents one by one. Although skip lists let us quickly skip some non-matches, the overall cost still scales with the number of scanned documents, i.e., the corpus size.
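For intuition, here is a minimal sketch of a sorted posting list with a skip-style advance operation, using binary search as a stand-in for actual skip-list pointers (all names are illustrative):

```python
# Sketch: a sorted posting list with an advance(target) operation.
# Binary search stands in for the skip-list pointers a real engine
# uses; the leapfrog intersection shows why sorted ids matter.
import bisect

class PostingList:
    def __init__(self, doc_ids):
        self.doc_ids = sorted(doc_ids)  # posting lists are kept sorted

    def advance(self, target):
        """Return the first doc id >= target, skipping non-matches."""
        i = bisect.bisect_left(self.doc_ids, target)
        return self.doc_ids[i] if i < len(self.doc_ids) else None

def intersect(a, b):
    """Leapfrog intersection of two posting lists."""
    out, cur = [], 0
    while True:
        x = a.advance(cur)
        if x is None:
            return out
        y = b.advance(x)
        if y is None:
            return out
        if x == y:
            out.append(x)
            cur = x + 1
        else:
            cur = y  # jump ahead instead of scanning one by one

print(intersect(PostingList([1, 3, 5, 9]), PostingList([3, 4, 9, 12])))
# → [3, 9]
```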

The dataset has a hierarchical structure, with each document belonging to one group, as shown in the following diagram. In this specific example, field3 and field4 are properties at the group level, while field1 and field2 are properties at the document level.

Index Normalization

It is a waste to duplicate group-level properties. Similar to relational databases, we use data normalization to reduce data redundancy by moving the duplicated fields into a separate index.

As a result, we are able to reduce the index size and simplify the indexing pipeline. However, querying becomes more complex since we need to scan two indexes instead of just one: we need two queries to achieve the same functionality, where the outer query depends on the results of the inner query, as shown in the following example.

Intuitively, the inner query becomes lightweight since it scans far fewer documents in the Group index. Let's dissect why the outer query also becomes cheaper. A posting list is a set of document ids; to efficiently compute intersections and unions, the document ids in a posting list are kept sorted. A skip list is a commonly used data structure that allows O(log n) search within a posting list, letting us quickly skip documents that cannot be candidates. Although skip lists are efficient, the accumulated cost of frequent skip calls over large datasets can't be neglected.

Since we only need to retrieve documents within matched groups, we sort documents by group id so that documents in the same group are placed together. As a result, we can skip directly to the matched groups and scan only the documents within them. This way, the actual query cost is related only to the number of matches rather than the corpus size.
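A minimal sketch of this group-sorted layout, assuming documents are keyed by (group_id, doc_id) and using binary search to jump to a group's contiguous range (illustrative, not the actual on-disk format):

```python
# Sketch: with documents sorted by group id, documents in the same
# group are contiguous, so the outer query can jump straight to each
# matched group's range instead of scanning the whole corpus.
import bisect

# (group_id, doc_id) pairs, stored sorted by group id.
docs = sorted([("g1", "d1"), ("g3", "d4"), ("g1", "d2"),
               ("g2", "d3"), ("g3", "d5")])
group_keys = [g for g, _ in docs]

def docs_in_group(gid):
    # Binary-search to the group's contiguous slice.
    lo = bisect.bisect_left(group_keys, gid)
    hi = bisect.bisect_right(group_keys, gid)
    return [d for _, d in docs[lo:hi]]

# Cost is proportional to the number of matches, not corpus size.
matched = [d for g in ("g1", "g3") for d in docs_in_group(g)]
print(matched)  # → ['d1', 'd2', 'd4', 'd5']
```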

Index Denormalization

When it comes to partitioning, the data model works well for one-to-one relationships, where each document belongs to a single group and we simply place the document into that group. But it falls apart when one document belongs to multiple groups, since we then need to decide which group the document should go to. For instance, as shown in the following example, doc3 belongs to both group1 and group2; putting doc3 into only group1 or only group2 would lead to missing-document issues in the group without doc3. Although querying documents from other groups could resolve the missing-document issues, it introduces complexity and additional overhead.

The solution is index denormalization: adding redundant copies of the document to each group it belongs to. We pay an additional fanout cost at indexing time for better serving-time performance. So far, we have observed that the duplication ratio remains low and affordable for our scenarios. Additionally, we need a different primary key to distinguish copies across groups.
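A small sketch of the denormalization step, assuming a composite (doc_id, group_id) primary key to keep the copies distinct (data and names are illustrative):

```python
# Sketch of index denormalization: a document that belongs to several
# groups is duplicated into each of them, with a composite
# (doc_id, group_id) primary key to keep copies distinct.

doc_to_groups = {"doc1": ["group1"], "doc2": ["group1"],
                 "doc3": ["group1", "group2"]}

# Fan out at indexing time: one copy per (doc, group) pair.
denormalized = [(doc, g) for doc, groups in doc_to_groups.items()
                for g in groups]

print(sorted(denormalized))
# → [('doc1', 'group1'), ('doc2', 'group1'),
#    ('doc3', 'group1'), ('doc3', 'group2')]

# The duplication ratio quantifies the extra fanout cost.
ratio = len(denormalized) / len(doc_to_groups)  # ≈ 1.33 here
```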

Index Partitioning

This data model is generic and can be applied to any use case with hierarchical documents. We follow the pattern of generating indexes with different primary keys for the Group Index and the Doc Index:

  • Group Index primary key: <group_id>
  • Doc Index primary key: <doc_id + group_id>

As long as both indexes are partitioned by group id, each shard is self-contained. During query execution time, we don’t need to communicate with all shards to get complete matches.
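As a sketch, both indexes could be routed by a stable hash of the group id so that a group and its documents always co-locate on one shard (the hashing scheme below is an assumption for illustration, not Manas's actual partitioner):

```python
# Sketch: partition both indexes by group id so each shard is
# self-contained (the hash scheme is illustrative, not Manas's).
import zlib

NUM_SHARDS = 4

def shard_for(group_id):
    # Stable checksum so routing is deterministic across processes.
    return zlib.crc32(group_id.encode()) % NUM_SHARDS

# The Doc Index key is <doc_id + group_id>, but routing uses only the
# group id, so a group row and all its doc rows land on one shard.
shards = {i: [] for i in range(NUM_SHARDS)}
shards[shard_for("g1")].append(("group", "g1"))
for doc_id, gid in [("d1", "g1"), ("d2", "g1")]:
    shards[shard_for(gid)].append(("doc", doc_id))

# Every g1 row is on one shard, so targeting and item selection for
# g1 run locally without cross-shard communication.
print(shards[shard_for("g1")])
# → [('group', 'g1'), ('doc', 'd1'), ('doc', 'd2')]
```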

Detailed Design

Now that we have shared the data model, let’s continue the journey of understanding how we incorporate it into Manas. In this section, we will cover several key technical details.

Namespace

Based on the data model, each Manas leaf needs to serve two indexes: the Group Index and the Doc Index. We employed namespaces to support this notion of multi-tenancy.

A namespace is a logical concept that enables one Manas leaf to serve multiple indexes in a lightweight way. A unique namespace id is assigned to each index. Indexes in different namespaces are logically isolated and can be updated and searched independently without impacting each other. We prefix each token/term in the index with the namespace id, which allows us to efficiently query the subset of the inverted index that belongs to a given namespace.
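A toy sketch of the idea: prefixing every term with its namespace id lets one inverted index logically serve both the Group Index and the Doc Index (the key format and helper names are illustrative):

```python
# Sketch: namespace-prefixed terms in a single inverted index.
# The "ns:term" key format and helper names are illustrative.

inverted_index = {}

def index_term(namespace_id, term, doc_id):
    # Prefix the term with its namespace id at indexing time.
    inverted_index.setdefault(f"{namespace_id}:{term}", []).append(doc_id)

def lookup(namespace_id, term):
    # A query only ever touches keys in its own namespace, so the
    # two logical indexes cannot interfere with each other.
    return inverted_index.get(f"{namespace_id}:{term}", [])

# The same term indexed under two namespaces stays isolated.
index_term("group_ns", "region:US", "g1")
index_term("doc_ns", "region:US", "d7")

print(lookup("group_ns", "region:US"))  # → ['g1']
print(lookup("doc_ns", "region:US"))    # → ['d7']
```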

SQuery Template

In Manas we use SQuery, a tree-like structure, to represent a request. An SQuery defines how to retrieve and score candidates from the index. Without index separation, it is straightforward to represent the query in SQuery format, as shown in the following example.

Taking the two-stage retrieval case as an example, it is trivial to represent the inner query in SQuery. But it is hard to represent the outer query, since the outer query depends on results from the inner query, as shown in the following figure. Without executing the inner query against an actual index, it is impossible to write the actual SQuery for the outer query, since SQuery requires that all terms be known beforehand. The existing SQuery functionality couldn't support this new requirement.

In order to address this challenge, the SQuery template was introduced to enable a dynamic structure. We use template nodes to define SQuery structure together with the feed-in values. During query parsing time, we expand the SQuery based on the feed-in values. In addition, feed-in values can be from another SQuery, which would be executed lazily.

Back in the two-stage retrieval case, the outer query is the top-level SQuery, while the inner query becomes a template node whose feed-in values come from another SQuery, as shown in the following graph.

With an actual index, the above SQuery template node would be expanded to the following SQuery structure.
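A hedged sketch of this expansion step, assuming a simple dict-based query tree; the node shapes and the run_inner executor are illustrative stand-ins for the real SQuery schema:

```python
# Sketch of template expansion: a template node holds a placeholder
# inner query that is executed lazily; at parsing time the node is
# expanded into concrete term nodes, one per matched group id.
# Node shapes here are illustrative, not the real SQuery schema.

def expand(node, run_inner):
    if node.get("type") == "template":
        # Execute the inner query, then expand the template into an
        # OR over one group_id term per match.
        group_ids = run_inner(node["inner_query"])
        return {"type": "or",
                "children": [{"type": "term", "field": "group_id",
                              "value": g} for g in group_ids]}
    if "children" in node:
        return dict(node, children=[expand(c, run_inner)
                                    for c in node["children"]])
    return node

outer = {"type": "and", "children": [
    {"type": "term", "field": "active", "value": "true"},
    {"type": "template", "inner_query": {"targeting": "region:US"}},
]}

# Stub inner-query executor: pretend two Ad groups matched targeting.
expanded = expand(outer, run_inner=lambda q: ["g1", "g7"])
assert expanded["children"][1]["type"] == "or"
```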

Putting it Together

To summarize, SQuery Templates let us construct an outer query based on the output of an inner query. We can use this to perform a two-stage retrieval, with the first stage inner query hitting the group index to narrow down the search to the relevant matched groups, and the second stage outer query leveraging this narrowed search space for huge efficiency gains.

The indexes for the different stages are, of course, logically separated using our namespace concept. And with index denormalization and proper partitioning, each shard is self-contained and can handle queries in parallel without any communication with other shards.

We pieced together many different concepts and moving parts to support two-stage retrieval in Manas, but the efficiency gains we saw made it well worth the investment.

What’s Next

It has been an exciting journey for us to launch the two-stage retrieval framework for the Ads and Home feed scenarios. But it is just a start. We will continue working on the following areas:

Realtime Serving

We’ve built the two-stage retrieval for static clusters. When it comes to Realtime support, a mutable index adds extra difficulty for us. We are working on a solution to support two-stage retrieval for realtime clusters in an efficient, reusable, and extendable way.

Efficiency

Efficiency is always a critical factor for our system. We are continuing the effort by exploring more strategies for partitioning and normalization to improve both indexing and serving efficiency.

Acknowledgements: The authors would like to thank the following people for their contributions: Collins Chung, Tao Yang, Ang Xu, Jiacheng Hong, Dumitru Daniliuc, Zhihuang Chen, Tim Koh, George Wu, Sheng Cheng, Zheng Liu, Roger Wang, Chengcheng Hu and Karen Chau.
