User Action Sequence Modeling for Pinterest Ads Engagement Modeling

Pinterest Engineering Blog | Mar 5, 2024

Yulin Lei | Senior Machine Learning Engineer; Kaili Zhang | Staff Machine Learning Engineer; Sharare Zahtabian | Machine Learning Engineer II; Randy Carlson | Machine Learning Engineer I; Qifei Shen | Senior Staff Machine Learning Engineer

Introduction

Pinterest strives to deliver high-quality ads and maintain a positive user experience. The platform aims to show ads that align with the user’s interests and intentions, while also providing inspiration and discovery. The Ads Engagement Modeling team at Pinterest plays a crucial role in delivering effective advertising campaigns and helping businesses reach their target audience in a meaningful way. The goal of engagement modeling is to show users the most relevant and engaging ads based on their interests and preferences. To deliver a personalized and enjoyable ad experience, the Engagement Modeling team built deep neural network (DNN) models that continuously learn and adapt to user feedback and behavior, ensuring that the ads shown are highly targeted and valuable to the user.

Personalized recommendation is critical in the ads recommendation system because it can better capture users’ interests, connect users with compelling products, and keep them engaged with the platform. To make ads click-through rate (CTR) predictions more personalized, our team adopted users’ real-time behavior histories and applied deep learning algorithms to recommend appropriate ads to users.

In this blog post, we will mainly discuss how we adopted user sequence features and the follow-up optimizations:

  • Designed the sequence features
  • Leveraged Transformer for sequence modeling
  • Improved serving efficiency with mixed precision inference

We will also share how we improve model stability with Resilient Batch Norm.

Realtime User Sequence Features

To help the engagement models learn users’ feedback and interests, we developed user sequence features, which included users’ real time and historical engagement events and the related information. We defined sequence features from two main aspects: feature types and feature attributes.

Feature Types: Usually users interact with organic content or promoted Pins, both of which indicate users’ intent and interest. Organic Pins reflect users’ general interests, while promoted Pins reflect users’ interest in sales, products, and so on. So we created two user sequence features: one with all engaged Pins, and one with ads only. It turned out that both sequence features had sizable gains in terms of offline model performance. We also developed user search sequence features, which are also very informative and useful, especially for search ads.

Feature Attributes: Besides what sequence features to build, it is also important to focus on what to include in the sequence. A sequence of user activity is a popular design choice, and our user sequence is essentially a sequence of user-engaged event representations including timestamps, item representation, id features, and taxonomy features. At Pinterest, a pre-trained embedding (GraphSage) is commonly used for item representation in many models. We also use it as the item representation in our sequence features.
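To make the feature layout concrete, here is a minimal sketch of what one event in such a sequence could carry. The class and field names (and the use of plain dataclasses) are illustrative assumptions, not Pinterest’s actual schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EngagedEvent:
    """One engaged event in the user sequence (illustrative fields only)."""
    timestamp: int               # when the engagement happened (epoch seconds)
    item_embedding: List[float]  # pre-trained GraphSage embedding of the item
    item_id: int                 # id feature for the engaged item
    taxonomy_ids: List[int]      # taxonomy features of the item

@dataclass
class UserSequence:
    """A user's engagement history, truncated to a maximum length."""
    events: List[EngagedEvent]
```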

Sequence Modeling Architecture

Once we have the user sequences, we explore a range of architectures to develop effective sequence modeling techniques.

Transformer [1]: One widely used approach is the Transformer, which serves as our baseline. We start with a single-layer, single-head Transformer and include position embeddings based on the time delta of each event in the sequence. We find that increasing the number of layers results in improved performance, while increasing the number of heads does not provide additional gains.

Figure 1: Transformer Architecture
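A minimal PyTorch sketch of this baseline is shown below. The dimensions, the log-based time-delta bucketing, and the module names are assumptions for illustration; only the overall shape (a single-layer, single-head encoder plus time-delta position embeddings) follows the description above:

```python
import torch
import torch.nn as nn

class SequenceTransformer(nn.Module):
    """Single-layer, single-head Transformer over the user sequence with
    position embeddings derived from each event's time delta (sketch only)."""
    def __init__(self, event_dim=64, num_time_buckets=32, num_layers=1, num_heads=1):
        super().__init__()
        self.time_embedding = nn.Embedding(num_time_buckets, event_dim)
        layer = nn.TransformerEncoderLayer(d_model=event_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, events, time_deltas):
        # events: [batch, seq_len, event_dim]; time_deltas: [batch, seq_len] in seconds
        buckets = torch.clamp(torch.log1p(time_deltas.float()).long(), min=0,
                              max=self.time_embedding.num_embeddings - 1)
        x = events + self.time_embedding(buckets)   # add time-delta position embedding
        return self.encoder(x)                      # [batch, seq_len, event_dim]
```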

Feature Connection: We also experiment with different methods for connecting features within each event, such as concatenation and sum. Both approaches prove effective in certain scenarios. The advantage of the sum connection is that it allows us to control the dimensionality of each event, making the computation of self-attention in the Transformer faster when using a small fixed dimension.
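As a sketch of the sum connection, each per-event feature can be projected to a shared, small fixed dimension and then summed; the feature dimensions below are illustrative assumptions:

```python
import torch.nn as nn

class SumConnection(nn.Module):
    """Project each per-event feature to one fixed dimension and sum them,
    keeping the Transformer's self-attention cost bounded (sketch only)."""
    def __init__(self, feature_dims=(256, 64, 32), event_dim=64):
        super().__init__()
        self.projections = nn.ModuleList([nn.Linear(d, event_dim) for d in feature_dims])

    def forward(self, features):
        # features: list of tensors, each [batch, seq_len, feature_dims[i]]
        return sum(proj(f) for proj, f in zip(self.projections, features))
```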

More Feature Interaction: A general practice when using a Transformer to model user sequences is to first embed the entire sequence into a vector, then use this vector to interact with other features. However, early-stage feature interaction is essential for ranking models. Thus, we introduce more feature interactions between the entire sequence and the user- and pin-side representations. We calculate the cosine similarity between these additional features and each event and use the similarities as attributes of the events. We also incorporate the user- and pin-side representations directly into the self-attention calculations, as sketched below.
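A sketch of the cosine-similarity part of this interaction follows; the shapes and the exact way the similarity is appended are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def add_similarity_attribute(event_embs, side_emb):
    """Append, to each event, its cosine similarity with a user- or pin-side
    representation (sketch only).
    event_embs: [batch, seq_len, dim]; side_emb: [batch, dim]."""
    sims = F.cosine_similarity(event_embs, side_emb.unsqueeze(1), dim=-1)  # [batch, seq_len]
    return torch.cat([event_embs, sims.unsqueeze(-1)], dim=-1)             # [batch, seq_len, dim + 1]
```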

Sum Pooling: In terms of pooling techniques, we experiment with sum pooling, which is traditionally used in user sequence modeling due to its efficiency. We also develop a new approach called period sum pooling, where we divide the sequence into multiple periods and apply sum pooling to each period. The results are then concatenated to generate the final representation of the sequence. In some scenarios, period sum pooling outperforms the Transformer baseline.

Figure 2: Sum Pooling
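A sketch of period sum pooling, assuming the sequence is padded or truncated so the periods divide evenly (the number of periods is an illustrative choice):

```python
import torch

def period_sum_pooling(sequence, num_periods=4):
    """Split the sequence into equal periods, sum-pool each period, and
    concatenate the pooled vectors (sketch only).
    sequence: [batch, seq_len, dim] with seq_len divisible by num_periods."""
    batch, seq_len, dim = sequence.shape
    periods = sequence.view(batch, num_periods, seq_len // num_periods, dim)
    return periods.sum(dim=2).reshape(batch, num_periods * dim)
```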

Deep Interest Network (DIN) [2]: Although we also explore the DIN, a popular architecture introduced in 2018, we find that it does not surpass the performance of the previously mentioned models.

Long-Short Interest: Recognizing that users’ long-term and short-term interests may differ, we model both aspects separately. The comprehensive sequence represents the long-term interests, while the latest eight events are considered the short-term interests. For the short-term sequences, we apply a lightweight attention mechanism similar to DIN. This allows us to capture users’ latest interest changes while still considering their longer-term patterns.

Figure 3: Long-Short Interest Module
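Below is a sketch of the short-term branch: a lightweight, DIN-style attention over the latest eight events, weighted by relevance to the candidate ad. The scoring MLP, the softmax weighting, and all dimensions are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ShortTermInterest(nn.Module):
    """DIN-style attention over the most recent events (sketch only)."""
    def __init__(self, event_dim=64, hidden_dim=32):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(event_dim * 2, hidden_dim),
                                   nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, recent_events, candidate):
        # recent_events: [batch, 8, event_dim]; candidate: [batch, event_dim]
        cand = candidate.unsqueeze(1).expand_as(recent_events)
        logits = self.score(torch.cat([recent_events, cand], dim=-1)).squeeze(-1)
        weights = torch.softmax(logits, dim=1)                      # attention over events
        return (weights.unsqueeze(-1) * recent_events).sum(dim=1)   # [batch, event_dim]
```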

Overall, by combining different architectures in various online production models, we achieve significant performance improvements in all scenarios.

Serving Efficiency: Mixed Precision Inference

The new architecture has more modules and larger layers, making it more expensive to serve. While there are many opportunities for optimization, one of the most notable is mixed precision inference.

The GPUs we use for serving have tensor cores. Tensor cores are specialized in one thing: fused matrix multiply and add, but only with certain data types. Our current models use the PyTorch default float32 data type, which tensor cores don’t operate on. To get an inference speedup, we need to use a lower-precision data type, of which PyTorch offers two easy options: float16 and bfloat16. Both data types use 16 bits instead of 32 to represent a number, but they make different tradeoffs between range and precision. Float16 has a balanced reduction in both range and precision, whereas bfloat16 has nearly the same range as float32 but much-reduced precision. We had to determine which of these data types performs better in our model and make sure that it is stable.
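The difference in range and precision is easy to see from PyTorch’s dtype metadata (a quick check, not part of the serving code):

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.3e}, eps={info.eps:.3e}")
# float16 overflows past roughly 6.5e4, while bfloat16 keeps almost the full
# float32 range (~3.4e38) but with far fewer mantissa bits (larger eps).
```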

Because both 16-bit types have lower precision, we want to keep as much of our model as possible in float32 so as not to risk prediction quality, while still getting a good reduction in inference time. We found that most of the largest layers had room for improvement, while many of the smaller layers didn’t affect inference time enough to make a difference.

For those larger layers, we tried both data types. The main pitfall of float16 is that due to the reduced range, it’s easy for the model to overflow to “infinity.” We found that one of our main layers, the DCNv2 cross layer, was sometimes overflowing during training with float16. This might be mitigated by tuning some hyperparameters (e.g. weight decay), but a slight risk would still remain, and a failure mode of “complete failure, no score predicted” is not ideal.

The main pitfall of bfloat16 is that due to the reduced precision, the model may have marginally worse predictions. Empirically, we found that our model can handle this just fine; there was no reduction in model accuracy. There is also a benefit of a better failure mode: “degraded prediction” is preferable to “no prediction.” Based on our results, we selected bfloat16 for the large layers of our model.
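One way to approximate this setup in PyTorch is to run inference under autocast, which executes matmul-heavy ops such as the large linear and cross layers in bfloat16 on tensor cores while leaving the rest in float32. This is a minimal sketch, not necessarily the exact serving implementation:

```python
import torch

def predict_bf16(model: torch.nn.Module, features: torch.Tensor) -> torch.Tensor:
    """Run inference with the large matmul-heavy layers in bfloat16 (sketch).
    Assumes model and features live on a GPU whose tensor cores support bfloat16."""
    model.eval()
    with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        return model(features)
```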

Lastly, we benchmarked the change. In offline testing, we found a 30% reduction in model inference time with the same prediction accuracy. This inference time reduction translated well into production, and we obtained a significant reduction in infrastructure costs for our models.

Model Stability: Resilient Batch Norm

Improving the stability and training speed of deep learning models is a crucial task. To tackle this challenge, Batch Normalization (Batch Norm) has become a popular normalization method used by many practitioners. At Pinterest, we leverage Batch Norm in combination with other normalization techniques like minmax clip, log norm, and layer norm to effectively normalize our input data. However, we have encountered cases where Batch Norm itself can introduce model instability.

Let’s take a closer look at the formula for Batch Norm and its underlying process during the forward pass.

Batch Norm has two learnable parameters, namely beta and gamma, along with two non-learnable parameters, mean moving average and variance moving average. Here’s how the Batch Norm layer operates:

  1. Calculate Mean and Variance: For each activation feature, compute the mean and variance across all the values in the mini-batch.
  2. Normalize: Using the corresponding mean and variance, compute the normalized value for each activation feature.
  3. Scale and Shift: Apply a factor, gamma, to the normalized values, and add a factor, beta, to it.
  4. Moving Average: Maintain an Exponential Moving Average of the mean and variance.
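In standard notation (the usual textbook formulation rather than a Pinterest-specific variant), for a mini-batch $x_1, \dots, x_m$ of one activation feature these steps are:

$$
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2
$$

$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta
$$

with the moving averages updated as, for example, $\text{running\_var} \leftarrow (1 - \alpha)\,\text{running\_var} + \alpha\,\sigma_B^2$ for momentum $\alpha$.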

However, a challenge arises when the variance in step 2 becomes extremely small or even zero. In such instances, the normalized value becomes abnormally large, leading to a value explosion within the model. Common reasons behind this extremely small variance include stale or delayed feature values, feature absence, and distribution shifts with low coverage. To address these issues, we typically fill zeroes or use default values in the affected scenarios; consequently, the variance computed in step 1 becomes zero. While increasing the mini-batch size and shuffling at the row level can help mitigate this problem, they don’t fully solve it. To overcome the instability caused by Batch Norm, we at Pinterest have developed a solution called Resilient Batch Norm.

Resilient Batch Norm introduces two crucial hyperparameters: minimal_variance and variance_shift_threshold. The forward pass in Resilient Batch Norm follows these steps:

  1. Calculate Mean and Variance for the mini-batch.
  2. Update the Moving Average, with specific conditions:
      • If a column’s variance is smaller than the minimal_variance hyperparameter, mask out that column from the running variance update.
      • If a column’s variance change ratio exceeds the variance_shift_threshold, mask out that column from the running variance update.
      • Update the remaining running variance and running mean.
  3. Normalize using the running variance and running mean.
  4. Scale and Shift.
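A minimal sketch of how these steps might be implemented on top of PyTorch’s BatchNorm1d is shown below. Pinterest’s actual implementation is not public, so the masking details, the default thresholds, and the [batch, features] input assumption are all illustrative:

```python
import torch
import torch.nn as nn

class ResilientBatchNorm1d(nn.BatchNorm1d):
    """Batch Norm that skips running-stat updates for unstable columns (sketch).
    Assumes input of shape [batch, num_features]."""
    def __init__(self, num_features, minimal_variance=1e-6,
                 variance_shift_threshold=10.0, **kwargs):
        super().__init__(num_features, **kwargs)
        self.minimal_variance = minimal_variance
        self.variance_shift_threshold = variance_shift_threshold

    def forward(self, x):
        if self.training:
            batch_mean = x.mean(dim=0)
            batch_var = x.var(dim=0, unbiased=False)
            # Mask out columns with near-zero variance or a sharp variance shift.
            shift_ratio = (batch_var - self.running_var).abs() / (self.running_var + self.eps)
            keep = (batch_var >= self.minimal_variance) & \
                   (shift_ratio <= self.variance_shift_threshold)
            momentum = self.momentum if self.momentum is not None else 0.1
            with torch.no_grad():
                self.running_mean[keep] = ((1 - momentum) * self.running_mean[keep]
                                           + momentum * batch_mean[keep])
                self.running_var[keep] = ((1 - momentum) * self.running_var[keep]
                                          + momentum * batch_var[keep])
        # Normalize with the running statistics, then scale and shift.
        x_hat = (x - self.running_mean) / torch.sqrt(self.running_var + self.eps)
        return self.weight * x_hat + self.bias
```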

After conducting extensive experiments, we have observed no decrease in performance or training speed. By seamlessly replacing Batch Norm with Resilient Batch Norm, our models gain the ability to address the aforementioned feature problems and similar situations while achieving enhanced stability.

In conclusion, when faced with instability issues due to Batch Norm, adopting Resilient Batch Norm can provide a robust solution and improve the overall efficacy of the models.

Evaluation

In this section, we show some offline and online results for the user action sequence model on different view types (HomeFeed, RelatedPins, Search) and overall. The baseline model is our production model with the DCNv2 [3] architecture and internal training data. Note that a 0.1% offline accuracy improvement in the engagement ranking model is considered significant. Thus, the user action sequence features and modeling techniques improve both online and offline metrics very significantly.

Conclusion

By leveraging realtime user sequence features, employing various modeling techniques such as transformers, feature interaction, feature connections, and pooling, the engagement model at Pinterest has been able to effectively adapt to users’ behavior and feedback, resulting in more personalized and relevant recommendations. The recognition of users’ long-term and short-term interests has been instrumental in achieving this objective. In order to account for both aspects, a comprehensive sequence is utilized to represent long-term interests, while the latest eight events are employed to capture short-term interests. This approach has significantly improved the model’s prediction performance; however, it has come at a considerable cost in terms of the added features and complexity of the models.

To mitigate the impact on serving efficiency and infrastructure costs, we have explored and implemented mixed precision inference techniques, utilizing lower precision (float16, bfloat16). This has effectively improved our serving efficiency while also reducing infrastructure costs. Additionally, we have addressed the challenge of making the model resilient to realtime changes, as we recognized the critical importance of these realtime sequence features. By incorporating a more resilient batch normalization technique, we are able to prevent abnormal value explosions caused by sudden changes in feature coverage or distribution shift.

As a result of these endeavors, Pinterest continues to deliver highly desirable, adaptive, and relevant recommendations that inspire and drive discovery for each unique user.

Acknowledgements

This work is the result of collaboration among the conversion modeling team members and multiple teams across Pinterest.

Engineering Teams:

Ads Ranking: Van Wang, Ke Zeng, Han Sun, Meng Qi

Advanced Technology Group: Yi-Ping Hsu, Pong Eksombatchai, Xiangyi Chen

Ads ML Infra: Shantam Shorewala, Kartik Kapur, Matthew Jin, Yiran Zhao, Dongyong Wang

User Sequence Support: Zefan Fu, Kimmie Hua

Indexing Infra: Kangnan Li, Dumitru Daniliuc

Leadership: Ling Leng, Dongtao Liu, Liangzhe Chen, Haoyang Li, Joey Wang, Shun-ping Chiu, Shu Zhang, Jiajing Xu, Xiaofang Chen, Yang Tang, Behnam Rezaei, Caijie Zhang

References

[1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).

[2] Zhou, Guorui, et al. “Deep interest network for click-through rate prediction.” Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2018.

[3] Wang, Ruoxi, et al. “Dcn v2: Improved deep & cross network and practical lessons for web-scale learning to rank systems.” Proceedings of the web conference 2021. 2021.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.
