
Multi-gate Mixture-of-Experts (MMoE) model architecture and knowledge distillation in Ads Engagement modeling development

Authors: Jiacheng Li | Machine Learning Engineer II, Ads Ranking; Matt Meng | Staff Machine Learning Engineer, Ads Ranking; Kungang Li | Principal Machine Learning Engineer, Ads Performance; Qifei Shen | Senior Staff Machine Learning Engineer, Ads Ranking

Introduction

Multi-gate Mixture-of-Experts (MMoE)[1,2] is an industry-proven neural network architecture that offers several significant benefits. First, it enhances model efficiency by dynamically allocating computational resources to different sub-networks (experts) based on the input data, ensuring that only the most relevant experts are activated for each task. This selective activation reduces computational overhead and improves inference speed. Second, MMoE promotes better generalization and performance by allowing the model to learn specialized features through multiple experts, each focusing on different aspects of the data. This specialization helps capture complex patterns and relationships that a single monolithic model might miss. Additionally, the multi-gate mechanism enables the model to handle multi-task learning more effectively, as it can tailor the contribution of each expert to each task, leading to improved accuracy and robustness across applications. Overall, MMoE provides a flexible, efficient, and powerful approach to building advanced neural network models.
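
As a concrete illustration, the minimal PyTorch sketch below shows the core MMoE computation: a shared pool of experts is mixed per task by a lightweight softmax gate, and each task has its own prediction tower. All module names, dimensions, and expert architectures here are illustrative assumptions, not our production implementation.

```python
import torch
import torch.nn as nn


class MMoE(nn.Module):
    """Minimal Multi-gate Mixture-of-Experts layer (illustrative sketch)."""

    def __init__(self, input_dim, expert_dim, num_experts, num_tasks):
        super().__init__()
        # Shared pool of experts; in practice each expert can use a different
        # architecture (e.g., DCNv2, MaskNet, plain MLP).
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU())
             for _ in range(num_experts)]
        )
        # One lightweight gate per task: a single linear layer + softmax.
        self.gates = nn.ModuleList(
            [nn.Linear(input_dim, num_experts) for _ in range(num_tasks)]
        )
        # One prediction tower per task.
        self.towers = nn.ModuleList(
            [nn.Linear(expert_dim, 1) for _ in range(num_tasks)]
        )

    def forward(self, x):
        # expert_outputs: [batch, num_experts, expert_dim]
        expert_outputs = torch.stack([e(x) for e in self.experts], dim=1)
        logits = []
        for gate, tower in zip(self.gates, self.towers):
            # Per-task mixture weights over the shared experts.
            weights = torch.softmax(gate(x), dim=-1)                 # [batch, num_experts]
            mixed = (weights.unsqueeze(-1) * expert_outputs).sum(dim=1)
            logits.append(tower(mixed))                              # [batch, 1]
        return logits


# Example: two engagement tasks sharing four experts.
model = MMoE(input_dim=128, expert_dim=64, num_experts=4, num_tasks=2)
task_logits = model(torch.randn(32, 128))
```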

On top of the MMoE model architecture, we also propose using knowledge distillation[3] to mitigate the performance gap caused by our short data retention period and to further enhance the new model's performance. More specifically, we distill knowledge from the existing production model to experimental models during the batch training stage so that the experimental models converge to a better state.

Multi-gate Mixture-of-Experts (MMoE) model architecture

A known problem with DCNv2[4]-style recommendation system models is that simply adding more layers and more parameters does not bring proportional metric gains. How to scale the model effectively and obtain sizable, proportional gains is an open question. The MMoE model architecture is a potential solution that helps our engagement model learn more complicated patterns and relationships between users and their ads engagement actions, and ultimately perform better at the ads-user matching task.

We started from our current shared-bottom model architecture with DCNv2. We experimented with various architectures for the experts and their combinations: DCNv2, MaskNet[5], FinalMLP[6], etc. Besides these seemingly advanced architectures, we also noticed that adding MLP-based experts can further lift the offline metrics. Through experiments, we realized that the return on investment (ROI) diminishes as more experts are added, and for our use case DCNv2 has the highest ROI. We therefore chose the most suitable combination of experts through a careful tradeoff between metric gains and infrastructure cost.

Figure 1: (a) Shared-Bottom model. (b) One-gate MoE model. (c) Multi-gate MoE model. From [2]
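
To make the expert-combination discussion above more concrete, here is a simplified sketch of two expert types that could share one MMoE pool: a DCNv2-style cross-network expert (following the cross-layer form x_{l+1} = x_0 ⊙ (W·x_l + b) + x_l) and a plain MLP expert. Layer counts and dimensions are illustrative assumptions, not our production configuration.

```python
import torch
import torch.nn as nn


class CrossNetExpert(nn.Module):
    """Simplified DCNv2-style expert built from stacked cross layers."""

    def __init__(self, dim, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x):
        x0, xl = x, x
        for layer in self.layers:
            xl = x0 * layer(xl) + xl   # explicit feature crossing with a residual term
        return xl


class MLPExpert(nn.Module):
    """Plain MLP expert."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)


# A heterogeneous expert pool: the per-task gates learn how much each task
# relies on cross-style versus MLP-style experts (dimensions are illustrative).
dim = 128
experts = nn.ModuleList([CrossNetExpert(dim), CrossNetExpert(dim), MLPExpert(dim)])
expert_outputs = torch.stack([e(torch.randn(32, dim)) for e in experts], dim=1)
```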

Serving an MMoE-style model brings a noticeable infrastructure cost increase since multiple experts are introduced, so we also explored various techniques to reduce the infrastructure cost while keeping model performance on par. One promising solution is mixed precision inference. Our team had already explored and productionized mixed precision inference [7], so we could apply this technique out of the box. Through experiments, we verified that mixed precision inference has nearly zero impact on offline model performance. With mixed precision inference, we observed a 40% inference latency reduction in benchmarking, which translates into significant infrastructure cost savings.
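
For reference, mixed precision inference can be enabled with a few lines of PyTorch. The sketch below assumes a CUDA GPU and uses a placeholder model; it illustrates the general technique rather than the exact serving setup described in [7].

```python
import torch
import torch.nn as nn

# Placeholder for the ranking model; in practice this would be the MMoE model.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1)).eval().cuda()
features = torch.randn(1024, 128, device="cuda")

# fp32 reference path.
with torch.inference_mode():
    scores_fp32 = model(features)

# Mixed precision path: matmuls run in fp16 under autocast, which reduces
# inference latency while keeping predictions nearly identical to fp32.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    scores_fp16 = model(features)

print((scores_fp32 - scores_fp16.float()).abs().max())
```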

Besides mixed precision inference, we observed through experiments that the gate layers of MMoE only need a very simple architecture and a small number of parameters to achieve performance similar to the sophisticated architectures we use for the experts. Lightweight gate layers bring further infrastructure cost reduction.

Knowledge distillation from production model

In the ads engagement modeling world, the volume of data is huge. A common practice is to retain the data for a relatively short period, usually a few months to a year. When a new modeling idea has been well tested offline and is ready for online experimentation to validate online metric movements, a common issue is that the training data of the current production model is no longer available, making the comparison between the new experimental model and the production model unfair.

To mitigate this issue, we proposed using knowledge distillation from the production model to help the new experimental model “learn” from old data that has been deleted from the training dataset, thus enhancing the performance of the new experimental model. More specifically, on top of the standard cross entropy loss computed with binary labels, we add a new loss that measures the prediction differences between the experimental model and the production model. We experimented with various loss functions and found that a pairwise-style loss not only mitigates the performance gap caused by the missing data, but also further boosts the experimental model’s offline metrics. Applying the knowledge distillation loss in the batch training stage is clearly beneficial, but the story is different for the incremental training stage, where the question becomes harder once the experimental model is promoted to production: should a model distill from itself? Through experiments with our current design, we observed significant overfitting when keeping the distillation loss in incremental training, so we decided to remove the distillation loss in the incremental training stage.
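
Below is a minimal sketch of the combined batch-training loss, assuming a single engagement head and a RankNet-style pairwise distillation term against the frozen production model; the exact formulation, the loss weighting, and all names are illustrative assumptions rather than the production implementation.

```python
import torch
import torch.nn.functional as F


def batch_training_loss(student_logits, teacher_probs, labels, alpha=0.5):
    """Illustrative combined loss for the batch training stage.

    student_logits: experimental-model logits, shape [B]
    teacher_probs:  frozen production-model predictions, shape [B]
    labels:         binary engagement labels, shape [B]
    """
    # Standard cross entropy against the binary engagement labels.
    ce = F.binary_cross_entropy_with_logits(student_logits, labels)

    # Pairwise-style distillation: for every (i, j) pair in the batch, push the
    # student's score difference to agree with the teacher's preference order.
    score_diff = student_logits.unsqueeze(1) - student_logits.unsqueeze(0)        # [B, B]
    teacher_pref = (teacher_probs.unsqueeze(1) > teacher_probs.unsqueeze(0)).float()
    pair_mask = 1.0 - torch.eye(labels.size(0), device=labels.device)             # drop i == j
    pair_loss = F.binary_cross_entropy_with_logits(
        score_diff, teacher_pref, reduction="none"
    )
    distill = (pair_loss * pair_mask).sum() / pair_mask.sum()

    return ce + alpha * distill


# Usage sketch: the production (teacher) model only provides soft targets.
student_logits = torch.randn(256)
with torch.no_grad():
    teacher_probs = torch.rand(256)            # stand-in for teacher predictions
labels = torch.randint(0, 2, (256,)).float()
loss = batch_training_loss(student_logits, teacher_probs, labels)
```

As noted above, this distillation term is dropped (effectively alpha = 0) in the incremental training stage to avoid overfitting.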

Besides model iterations that target metric movements, knowledge distillation also helps in cases with no expected metric movement, such as feature upgrades or new computation graph improvements for serving latency reduction. In such cases, warm starting from a production model checkpoint is no longer possible and retraining the production model is necessary. Retraining a production model also suffers from the data-missing problem mentioned above, and knowledge distillation kicks in here to ensure the retrained production model has on-par offline and online performance.

Evaluation

In this section, we show some offline and online results [8] for the MMoE model on different view types (RelatedPins and Search). The baseline model is our production model with the DCNv2 architecture and internal training data. Note that a 0.1% offline accuracy improvement in the engagement ranking model is considered significant. The MMoE model architecture with knowledge distillation therefore improves both online and offline metrics very significantly.

Conclusion

The MMoE model architecture is capable of modeling more sophisticated tasks, especially in multi-task learning (MTL).

By leveraging the knowledge distillation technique, we successfully mitigate the performance gap caused by the short data retention period and help the new experimental model learn from deleted data. The effectively longer training data window and large data volume help us improve ads matching quality and the Pinner experience.

As a result of these endeavors, Pinterest continues to deliver highly desirable, adaptive, and relevant recommendations that inspire and drive discovery for each unique user.

Acknowledgements

This work is the result of a collaboration among the engagement modeling team members and multiple teams across Pinterest.

Engineering Teams:
Ads Ranking: Duna Zhan, Liangzhe Chen, Dongtao Liu
Ads Infra: Shantam Shorewala, Yiran Zhao, Haoyang Li
Data Science: Milos Curcic, Adriaan ten Kate
Leadership: Ling Leng, Caijie Zhang, Prathibha Deshikachar

References

[1] Jacobs, Robert A., et al. “Adaptive mixtures of local experts.” Neural computation 3.1 (1991): 79–87.
[2] Ma, Jiaqi, et al. “Modeling task relationships in multi-task learning with multi-gate mixture-of-experts.” Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2018.
[3] Hinton, Geoffrey, et al. “Distilling the Knowledge in a Neural Network.” arXiv preprint arXiv:1503.02531 (2015).
[4] Wang, Ruoxi, et al. “DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems.” Proceedings of the Web Conference 2021. 2021.
[5] Wang, Zhiqiang, et al. “MaskNet: Introducing feature-wise multiplication to CTR ranking models by instance-guided mask.” arXiv preprint arXiv:2102.07619 (2021).
[6] Mao, Kelong, et al. “FinalMLP: An enhanced two-stream MLP model for CTR prediction.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37, no. 4. 2023.
[7] Lei, Yulin, et al. “User Action Sequence Modeling for Pinterest Ads Engagement Modeling”. Pinterest Engineering Blog.
[8] Pinterest Internal Data, US, 2024.
