GPU-accelerated ML inference at Pinterest

Pong Eksombatchai | Software Engineer, Advanced Technology Group; Zhiyuan Zhang | Engineering Manager, ML Serving Platforms


[Image: three black computer fans. Source: https://unsplash.com/photos/vWgoeEYdtIY]

We enabled serving 100x larger recommender models at Pinterest by transitioning our machine learning serving from CPU to GPU — increasing Homefeed Pinner engagement by 16% through a step function improvement in model quality. In this blog post, we’ll share our optimizations to achieve this at neutral cost and latency, including optimizing individual ops, consolidating memory transfers, executing static graphs on-device through CUDA Graphs, and rethinking our distributed system setup.
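One of the optimizations named above, consolidating memory transfers, can be illustrated with a small sketch. The code below is a hypothetical, CPU-only illustration (not Pinterest's actual implementation): instead of issuing one host-to-device copy per small feature tensor, many feature buffers are packed into a single contiguous buffer with an offset table, so a single bulk transfer can replace many small ones. The feature names are invented for the example.

```python
import array

# Hypothetical per-request feature tensors; each small buffer here stands in
# for a tensor that would otherwise trigger its own host-to-device copy.
features = {
    "user_embedding": array.array("f", [0.1, 0.2, 0.3]),
    "pin_embedding": array.array("f", [0.4, 0.5]),
    "context": array.array("f", [0.6]),
}

def consolidate(feature_map):
    """Pack all feature buffers into one contiguous float buffer plus an
    offset table, so one bulk copy can replace many small per-feature copies."""
    packed = array.array("f")
    offsets = {}
    for name, buf in feature_map.items():
        offsets[name] = (len(packed), len(buf))  # (start index, length)
        packed.extend(buf)
    return packed, offsets

packed, offsets = consolidate(features)
# A single transfer of `packed` now replaces len(features) separate copies;
# each feature is recovered on the device side from its (offset, length).
```

The same idea applies on GPU: the cost of a copy is dominated by per-call overhead for small tensors, so one large `cudaMemcpy` of the packed buffer amortizes that overhead across all features in the request.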


Background


Pinterest’s mission is to bring everyone the inspiration to create a life they love. To make our mission a reality, one of the key components in all of our product surfaces are various recommender models whose jobs are to predict the right content to show to the right person at the right time. Our recommender models are machine learning models that we trained using advanced algorithms to understand Pinners’ behavior as they spend time on our app. We serve our recommender models using our in-house machine learning model server (Scorpion Model Server, or SMS).


The technical challenges that we deal with for SMS are very difficult, as it has to provide 400+ million Pinners relevant recommendations from a corpus of 300+ billion Pins in milliseconds. SMS performs machine learning inference on CPU and has been heavily optimized over the years to fit our stringent latency and infrastructure cost requirements.
