Pinterest如何通过有效的回填加速ML特征迭代

[

[

Pinterest Engineering

](https://medium.com/@Pinterest_Engineering?source=post_page---byline--d67ea125519c---------------------------------------)

](https://medium.com/@Pinterest_Engineering?source=post_page---byline--d67ea125519c---------------------------------------)

Authors: Kartik Kapur, Tech Lead, Sr Software Engineer | Matthew Jin, Sr Software Engineer | Qingxian Lai, Staff Software Engineer

作者:Kartik Kapur,技术负责人,高级软件工程师 | Matthew Jin,高级软件工程师 | Qingxian Lai,员工软件工程师

Context

背景

At Pinterest, our mission is to inspire users to curate a life they love. To achieve this, we rely on state-of-the-art Recommendation and Ads models trained on tens of petabytes of data over the span of many months of engagement logs. These models drive personalized recommendations, showing users content that resonates with their interests. These models show significantly better performance when trained on large datasets with events spanning over many months of events.

在 Pinterest,我们的使命是激励用户策划他们热爱的生活。为了实现这一目标,我们依赖于最先进的推荐和广告模型,这些模型在数十个 PB 的数据上经过数月的参与日志训练。这些模型驱动个性化推荐,向用户展示与他们兴趣相关的内容。当在大数据集上训练,且事件跨越数月时,这些模型的表现显著更好。

Our ML Models are trained on a wide range of features, including Pin, user, advertiser-level, and session-based features. Experimenting with these features is a common task, and the first step in this process is integrating new features into the training dataset.

我们的机器学习模型在广泛的特征上进行训练,包括Pin、用户、广告主级别和基于会话的特征。对这些特征进行实验是一个常见任务,该过程的第一步是将新特征集成到训练数据集中。

The most straightforward method of incorporating features into the training dataset is through Forward Logging: adding the features into our serving logs and waiting for it to accumulate enough data for training.

将特征纳入训练数据集的最直接方法是通过前向日志记录:将特征添加到我们的服务日志中,并等待其积累足够的数据进行训练。

However, this method presents several challenges:

然而,这种方法存在几个挑战:

  • High Calendar Day Cost: Every iteration takes 3~6 months to be hydrated in the training dataset.
  • 高日历天成本: 每次迭代需要3~6个月才能在训练数据集中得到补充。
  • High Development Time Cost: Introducing new features in logging touches multiple systems, including training and serving logic.
  • 高开发时间成本:在日志中引入新功能...
开通本站会员,查看完整译文。

Главная - Вики-сайт
Copyright © 2011-2025 iteam. Current version is 2.143.0. UTC+08:00, 2025-05-31 19:54
浙ICP备14020137号-1 $Гость$