Advancements in Embedding-Based Retrieval at Pinterest Homefeed

Zhibo Fan | Machine Learning Engineer, Homefeed Candidate Generation; Bowen Deng | Machine Learning Engineer, Homefeed Candidate Generation; Hedi Xia | Machine Learning Engineer, Homefeed Candidate Generation; Yuke Yan | Machine Learning Engineer, Homefeed Candidate Generation; Hongtao Lin | Machine Learning Engineer, ATG Applied Science; Haoyu Chen | Machine Learning Engineer, ATG Applied Science; Dafang He | Machine Learning Engineer, Homefeed Relevance; Jay Adams | Principal Engineer, Pinner Curation & Growth; Raymond Hsu | Engineering Manager, Homefeed CG Product Enablement; James Li | Engineering Manager, Homefeed Candidate Generation; Dylan Wang | Engineering Manager, Homefeed Relevance

Introduction

At Pinterest Homefeed, embedding-based retrieval (a.k.a. Learned Retrieval) is a key candidate generator that retrieves highly personalized, engaging, and diverse content to fulfill various user intents and enable actions such as Pin saving and shopping. We have previously introduced this two-tower model, covering its modeling basics and serving details. In this blog, we focus on the improvements we have made to embedding-based retrieval: scaling up with advanced feature crossing and ID embeddings, upgrading the serving corpus, and our ongoing journey to revolutionize retrieval with state-of-the-art modeling.

Feature Crossing

We provide the model with a variety of features, ranging from pretrained embeddings to categorical and numerical features, in the hope that it can uncover the latent patterns behind user engagement. All of these features are converted to dense representations through embedding or multi-layer perceptron (MLP) layers. A well-known prior in recommendation tasks is that incorporating more feature crossing tends to benefit model performance: for example, knowing the combination of a movie's author and genre provides more context than either feature alone.

The common philosophy for two-tower models is modeling simplicity; in practice, however, the simplicity lies in having no user-item feature interaction and using a simple similarity metric such as the dot product. Because the Pin tower is run offline and the user tower is only invoked once per Homefeed request, we can afford a complicated model structure within each tower. All of the following structures are applied to both towers.
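A minimal PyTorch sketch of this two-tower setup (the class and tower modules are hypothetical stand-ins, not our production code):

```python
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    """Minimal two-tower skeleton: each tower may be arbitrarily complex
    internally, but the towers interact only through a dot product, so
    serving reduces to an ANN search over precomputed Pin embeddings."""

    def __init__(self, user_tower: nn.Module, pin_tower: nn.Module):
        super().__init__()
        self.user_tower = user_tower  # invoked once per Homefeed request
        self.pin_tower = pin_tower    # run offline to populate the ANN index

    def forward(self, user_features: torch.Tensor, pin_features: torch.Tensor) -> torch.Tensor:
        user_emb = self.user_tower(user_features)  # [B, D]
        pin_emb = self.pin_tower(pin_features)     # [B, D]
        return (user_emb * pin_emb).sum(dim=-1)    # [B] dot-product similarity
```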

Our first attempt was to upgrade the model with MaskNet[1] for bit-wise feature crossing. Our implementation differs from the original paper: after embedding layer normalization and concatenation, each MaskNet block computes the Hadamard product of the input embedding and a projection of itself via a two-layer MLP, followed by another two-layer MLP to refine the representation. We parallelize four such blocks and merge them with a bottleneck-style MLP. This setup simplifies the model architecture and brings high learnability with extensive feature crossing inside each tower. At Pinterest Homefeed, we measure the impact of recommendation system iterations with engaged sessions, i.e., continuous interaction sessions longer than 60 seconds. This model architecture upgrade improved engaged sessions by 0.15–0.35% across Pinterest.
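A simplified PyTorch sketch of this parallel MaskNet variant (layer sizes, block count defaults, and class names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class MaskBlock(nn.Module):
    """One mask block: the input is gated (Hadamard product) by a two-layer
    MLP projection of itself, then refined by a second two-layer MLP."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.mask_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.refine_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        masked = x * self.mask_mlp(x)   # bit-wise feature crossing
        return self.refine_mlp(masked)

class ParallelMaskNet(nn.Module):
    """Four mask blocks in parallel over the layer-normalized, concatenated
    feature embedding, merged by a bottleneck-style MLP."""

    def __init__(self, dim: int, hidden: int, out_dim: int, num_blocks: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.blocks = nn.ModuleList(MaskBlock(dim, hidden) for _ in range(num_blocks))
        self.bottleneck = nn.Sequential(
            nn.Linear(num_blocks * dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(x)
        return self.bottleneck(torch.cat([blk(x) for blk in self.blocks], dim=-1))
```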

Figure 1. Two-Tower Model with Parallel Mask Net[1] as Feature Crossing

We further scale up the architecture to the DHEN[2] framework, which ensembles multiple feature-crossing layers both serially and in parallel. We place an MLP layer alongside the same parallel MaskNet, then append another layer that juxtaposes an MLP with a transformer encoder[3]. This appended layer enhances field-wise interaction, since attention is applied at the field level, whereas MaskNet's dot-product-based feature crossing operates at the bit level. This scaling up brings another +0.1–0.2% engaged sessions, together with >1% more Homefeed saves and clicks.
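A rough sketch of this DHEN-style stacking, reusing the `ParallelMaskNet` class from the previous sketch (dimensions, head count, and the field reshaping are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DHENStyleTower(nn.Module):
    """Two stacked ensemble layers: layer 1 runs an MLP in parallel with the
    ParallelMaskNet (bit-wise crossing); layer 2 runs an MLP in parallel with
    a transformer encoder over per-field embeddings (field-wise crossing)."""

    def __init__(self, num_fields: int, field_dim: int, hidden: int, nhead: int = 4):
        super().__init__()
        flat = num_fields * field_dim  # field_dim assumed divisible by nhead
        self.num_fields, self.field_dim = num_fields, field_dim
        self.mlp1 = nn.Sequential(nn.Linear(flat, flat), nn.ReLU())
        self.masknet1 = ParallelMaskNet(flat, hidden, flat)  # from the sketch above
        self.proj1 = nn.Linear(2 * flat, flat)
        self.mlp2 = nn.Sequential(nn.Linear(flat, flat), nn.ReLU())
        layer = nn.TransformerEncoderLayer(d_model=field_dim, nhead=nhead, batch_first=True)
        self.transformer2 = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(2 * flat, flat)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, num_fields * field_dim]
        # Layer 1: MLP in parallel with bit-wise MaskNet crossing.
        h = self.proj1(torch.cat([self.mlp1(x), self.masknet1(x)], dim=-1))
        # Layer 2: MLP in parallel with field-wise self-attention.
        fields = h.view(-1, self.num_fields, self.field_dim)   # [B, F, D]
        attn = self.transformer2(fields).flatten(start_dim=1)  # [B, F * D]
        return self.head(torch.cat([self.mlp2(h), attn], dim=-1))
```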

Figure 2. Model Scaling with DHEN[2] Framework and Transformers[3] for Field-wise Crossing

Adopting Pre-trained ID Embeddings

Industry experience with recommendation systems shows the benefit of ID embeddings, which memorize user engagement patterns. At Pinterest, to overcome the well-known ID embedding overfitting issue and to maximize ROI and flexibility for downstream ML models, we pre-train large-scale user and Pin ID embeddings with contrastive learning on sampled negatives, over a cross-surface, large-window dataset with no positive engagement downsampling [7]. This yields high ID coverage and rich semantics tailored to recommendations at Pinterest. We adopt this large ID embedding table in the retrieval model to enhance precision. At training time, we use the recently released torchrec library to implement the large Pin ID table and shard it across GPUs. We serve the CPU model artifact, since offline inference has loose latency requirements.
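A minimal sketch of declaring and sharding a large Pin ID table with torchrec (table size, embedding dimension, and feature names are illustrative, and the snippet assumes it runs inside an already-initialized distributed job, e.g., launched with torchrun):

```python
import torch
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig
from torchrec.distributed.model_parallel import DistributedModelParallel

# Hypothetical table spec; production tables are far larger.
pin_id_table = EmbeddingBagConfig(
    name="pin_id",
    embedding_dim=64,
    num_embeddings=100_000_000,
    feature_names=["pin_id"],
)

# Build the table on the meta device so no memory is allocated up front.
ebc = EmbeddingBagCollection(tables=[pin_id_table], device=torch.device("meta"))

# DistributedModelParallel plans and shards the table across available GPUs.
# Requires torch.distributed to already be initialized.
sharded_ebc = DistributedModelParallel(module=ebc, device=torch.device("cuda"))
```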

However, although the training objectives for the two models are similar (i.e., contrastive learning over sampled negatives), directly fine-tuning the embeddings does not perform well online. We found that the model suffered severely from overfitting. To mitigate this, we first froze the embedding table and applied aggressive dropout with a probability of 0.5 on top of the ID embeddings, which led to decent online gains (a 0.6–1.2% increase in Homefeed repins and clicks). Later, we found that simply using the latest pretrained ID embedding is not optimal, as overlap between the embedding pre-training window and the retrieval model's training window can worsen overfitting. We ended up choosing the latest ID embedding with no such overlap, providing a further 0.25–0.35% increase in Homefeed repins.
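A minimal sketch of consuming the pre-trained table as a frozen lookup with aggressive dropout (the weight tensor and module name are hypothetical):

```python
import torch
import torch.nn as nn

class PretrainedIdEmbedding(nn.Module):
    """Frozen pre-trained ID embedding lookup with aggressive dropout (p=0.5)
    applied before the embedding joins the rest of the tower."""

    def __init__(self, pretrained_weights: torch.Tensor, p: float = 0.5):
        super().__init__()
        # freeze=True keeps the pre-trained table fixed during fine-tuning.
        self.table = nn.Embedding.from_pretrained(pretrained_weights, freeze=True)
        self.dropout = nn.Dropout(p)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.dropout(self.table(ids))
```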

Serving Corpus Upgrade

Apart from model upgrades, we also renovate our serving corpus, as it defines the upper limit of retrieval performance. Our initial corpus setup individualized Pins by their canonical image signature, then included the Pins with the most accumulated engagements over the last 90 days. To better capture trends at Pinterest, instead of directly summing the engagements, we switch to a time-decayed summation to determine the score of a Pin p at date d, down-weighting older engagements.
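A minimal sketch of one plausible time-decayed scoring function; the decay constant and the daily aggregation are illustrative assumptions, not the production formula:

```python
def decayed_score(daily_engagements: dict[int, float], d: int, decay: float = 0.9) -> float:
    """Time-decayed engagement score for a Pin at date d.

    daily_engagements maps a date (days since epoch) to that day's engagement
    count; decay is an illustrative constant in (0, 1), not the production value.
    """
    return sum(
        count * decay ** (d - day)
        for day, count in daily_engagements.items()
        if day <= d
    )
```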

In addition, we identified a discrepancy in image signature granularity between the training data and the serving corpus. The serving corpus operates at a coarser granularity to deduplicate similar content and reduce index size; however, this causes statistical features such as Pin engagements to drift, because the image signature being looked up differs from the one used in the training data. By closing this gap with dedicated image signature remapping logic, combined with the time-decay heuristic above, we achieved +0.1–0.2% engaged sessions without any modeling changes.

Revolutionizing Embedding-Based Retrieval

In this section, we briefly showcase our recent journey of bootstrapping the impact of embedding-based retrieval with state-of-the-art modeling techniques.

Multi-Embedding Retrieval

Unlike other surfaces, Homefeed users arrive with diverse intents, and a single embedding can be inadequate to represent them all. Through extensive experiments, we found that a differentiable clustering module adapted from Capsule Networks[4][5] performs better than other variants such as multi-head attention and pre-clustering based methods. We switched the cluster initialization to maxmin initialization[6] to speed up clustering convergence, and we enforce single-assignment routing, where each history item can contribute to only one cluster's embedding, to enhance diversification. We combine each cluster embedding with other user features to generate multiple user embeddings.
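A simplified, non-differentiable stand-in illustrating maxmin initialization and single-assignment routing over a user's history item embeddings (the production module is a differentiable clustering layer; the function names and fixed iteration count are assumptions):

```python
import torch

def maxmin_init(items: torch.Tensor, k: int) -> torch.Tensor:
    """Farthest-point ("maxmin") initialization: start from the first item,
    then repeatedly add the item farthest from all centers chosen so far.
    items: [N, D] -> returns [K, D]."""
    chosen = [0]
    for _ in range(k - 1):
        dists = torch.cdist(items, items[chosen])            # [N, len(chosen)]
        chosen.append(dists.min(dim=1).values.argmax().item())
    return items[chosen]

def single_assignment_clusters(items: torch.Tensor, k: int, iters: int = 3) -> torch.Tensor:
    """Hard-assignment routing: each history item contributes to exactly one
    cluster (its nearest center), and centers are refreshed from their members."""
    centers = maxmin_init(items, k)
    for _ in range(iters):
        assign = torch.cdist(items, centers).argmin(dim=1)   # [N]
        for j in range(k):
            members = items[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(dim=0)
    return centers  # [K, D] cluster embeddings, one per inferred interest
```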

Figure 3. Left: Multi-Embedding Retrieval Model Structure. Right: Visualization results for a random user; every two Pins belong to the same user embedding.

At serving time, we keep only the first K embeddings and run ANN search with each of them, where K is determined by the length of the user's history. Thanks to the maxmin initialization, the first K embeddings are generally the most representative ones. The results are then combined in a round-robin fashion and passed to the ranking and blending layers. This new user sequence modeling technique not only improves the diversity of the system but also increases users' save actions, indicating that users refine their inspiration on Homefeed.
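A minimal sketch of the round-robin merge over per-embedding ANN results (function name and result format are hypothetical):

```python
from itertools import chain, zip_longest

def round_robin_merge(per_embedding_results: list[list[str]], limit: int) -> list[str]:
    """Interleave ANN results from each of the first K user embeddings,
    dropping duplicates, until `limit` candidates are collected."""
    merged, seen = [], set()
    for pin in chain.from_iterable(zip_longest(*per_embedding_results)):
        if pin is not None and pin not in seen:
            seen.add(pin)
            merged.append(pin)
            if len(merged) >= limit:
                break
    return merged
```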

Conditional Retrieval for Homefeed

At Pinterest, a great source of diversity is the interest feed candidate generator, a token-based search over users' explicitly followed and inferred interests. These interest signals can provide auxiliary information about a user's intentions beyond their engagement history. However, due to the lack of finer-grained personalization among the matched candidates, they tend to have a lower engagement rate.

We utilized conditional retrieval[8], a two-tower model with a conditional input, to boost personalization and engagement: at training time, we embed the target Pin's interest ID and feed it as the condition input to the user tower; at serving time, we feed the user's followed and inferred interests as the conditional input to fetch candidates. The model follows an early-fusion paradigm in which the conditional interest input enters the model at the same layer as all other features. Surprisingly, the model learns to condition its output and produce highly relevant results, even for long-tail interests. We further equipped the ANN search with interest filters to guarantee high relevance between the query interest and the retrieved candidates. Better personalization and engagement at the retrieval stage improves recommendation funnel efficiency and lifts user engagement significantly.
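A minimal sketch of the early-fusion user tower, with the conditioning interest ID embedded and concatenated with the other user features at the input layer (dimensions and names are illustrative):

```python
import torch
import torch.nn as nn

class ConditionalUserTower(nn.Module):
    """Early-fusion user tower: the interest ID condition is embedded and
    concatenated with the other user features at the tower input. At training
    time the interest comes from the target Pin; at serving time it comes from
    the user's followed or inferred interests."""

    def __init__(self, num_interests: int, interest_dim: int, user_feat_dim: int, out_dim: int):
        super().__init__()
        self.interest_emb = nn.Embedding(num_interests, interest_dim)
        self.mlp = nn.Sequential(
            nn.Linear(user_feat_dim + interest_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, user_features: torch.Tensor, interest_id: torch.Tensor) -> torch.Tensor:
        cond = self.interest_emb(interest_id)                       # [B, interest_dim]
        return self.mlp(torch.cat([user_features, cond], dim=-1))   # [B, out_dim]
```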

Figure 4. Left: Conditional Retrieval Model Structure. Right: Visualization results for a random user; the top figure shows retrieved candidates for iced coffee, a popular interest, and the bottom shows retrieved candidates for friendship bracelets, a fine-grained tail interest.

Acknowledgment

This blog represents a variety of workstreams on embedding-based retrieval across many teams at Pinterest. We want to thank them for their valuable support and collaboration.

Home Relevance: Dafang He, Alok Malik

ATG: Yi-Ping Hsu

PADS: Lily Ling

Pinner Curation & Growth: Jay Adams

Reference

[1] Wang, Zhiqiang, Qingyun She, and Junlin Zhang. “Masknet: Introducing feature-wise multiplication to CTR ranking models by instance-guided mask.” arXiv preprint arXiv:2102.07619 (2021).

[2] Zhang, Buyun, et al. “DHEN: A deep and hierarchical ensemble network for large-scale click-through rate prediction.” arXiv preprint arXiv:2203.11014 (2022).

[3] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).

[4] Sabour, Sara, Nicholas Frosst, and Geoffrey E. Hinton. “Dynamic routing between capsules.” Advances in neural information processing systems 30 (2017).

[5] Li, Chao, et al. “Multi-interest network with dynamic routing for recommendation at Tmall.” Proceedings of the 28th ACM international conference on information and knowledge management. 2019.

[6] Arthur, David, and Sergei Vassilvitskii. “k-means++: The advantages of careful seeding.” Technical Report, Stanford, 2006.

[7] Hsu, Yi-Ping, et al. “Taming the One-Epoch Phenomenon in Online Recommendation System by Two-stage Contrastive ID Pre-training.” Proceedings of the 18th ACM Conference on Recommender Systems. 2024.

[8] Lin, Hongtao, et al. “Bootstrapping Conditional Retrieval for User-to-Item Recommendations.” Proceedings of the 18th ACM Conference on Recommender Systems. 2024.
