Lyft's Reinforcement Learning Platform

While there are some fundamental differences between RL and supervised learning, we were able to extend our existing model training and serving systems to accommodate the new technique. The big advantage of this approach is that it lets us reuse many proven platform components.

Architecture

A separate blog post gives a more detailed overview of our model hosting solution, LyftLearn Serving. Here we focus on the modifications required to support RL models, which include:

Providing the action space for every scoring request
Logging the action propensities
Emitting business event data from the application that allows for calculation of the reward, e.g. that a recommendation was clicked on. This step can be skipped if the reward data can be inferred from application metrics that are already logged. However, if those metrics depend on an ETL pipeline, the recency of training data will be limited by that schedule.
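
The three modifications above can be illustrated with a minimal sketch. It is not the LyftLearn Serving API: the function names, the record fields, and the choice of an epsilon-greedy policy are all assumptions made for illustration. The scoring call receives the action space with the request, returns the chosen action together with the propensity of every action, and the application later emits a reward event that can be joined to the log by a request ID.

```python
import json
import random
from dataclasses import dataclass


@dataclass
class ScoringResult:
    action: str
    propensities: dict  # action -> probability that the policy picks it


def score(action_space, value_estimates, epsilon=0.1, rng=None):
    """Hypothetical epsilon-greedy scoring over a per-request action space."""
    actions = list(dict.fromkeys(action_space))  # drop duplicates, keep order
    if not actions:
        raise ValueError("action_space must not be empty")
    rng = rng or random.Random()
    # Greedy action: highest estimated value; unseen actions default to 0.0.
    best = max(actions, key=lambda a: value_estimates.get(a, 0.0))
    # Every action gets epsilon / n; the greedy action gets the remaining mass.
    propensities = {a: epsilon / len(actions) for a in actions}
    propensities[best] += 1.0 - epsilon
    weights = [propensities[a] for a in actions]
    action = rng.choices(actions, weights=weights)[0]
    return ScoringResult(action=action, propensities=propensities)


def log_record(request_id, result):
    """Illustrative log line holding the action and its propensities."""
    return json.dumps({
        "request_id": request_id,
        "action": result.action,
        "propensities": result.propensities,
    })


def reward_event(request_id, clicked):
    """Illustrative business event, e.g. whether a recommendation was clicked."""
    return json.dumps({"request_id": request_id, "reward": 1.0 if clicked else 0.0})


if __name__ == "__main__":
    result = score(["banner_a", "banner_b"], {"banner_a": 0.2}, epsilon=0.1)
    print(log_record("req-1", result))
    print(reward_event("req-1", clicked=True))
```

Logging the full propensity map rather than only the chosen action is a design choice of this sketch; it keeps the record usable for off-policy evaluation methods that need the probability of the logged action.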

Architecture diagram showing how models are registered to the backend, synced to the serving instance where they accept requests from the client. The logged data is fed into a data warehouse from where it’s consumed by the policy update job which registers the updated model to the backend.

RL Platform System Architecture

There are two entry points for adding models to the system. The Experimentation Interface allows for kicking off an experiment with an untrained bandit model. This blank model only starts learning in production, as it observes feedback for the actions it takes, and is typically used for more efficient A/B tests. The Model Development path is more suitable for sophisticated models that are developed in source-controlled repositories and potentially pre-trained offline. This flow is very similar to the existing supervised learning model development and deployment.

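
To make the idea of a blank bandit model concrete, here is a minimal sketch that assumes a Beta-Bernoulli bandit with Thompson sampling. The post does not say which bandit algorithm is used, so the algorithm, the class name, and the simulated click rates are all assumptions. The model starts from a uniform prior, knows nothing about the variants, and shifts traffic toward the better one as feedback arrives, which is why such a model can run an A/B-style experiment more efficiently than a fixed split.

```python
import random


class BetaBernoulliBandit:
    """Hypothetical untrained bandit: one Beta(1, 1) prior per action."""

    def __init__(self, actions, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) is uniform on [0, 1]: no prior knowledge about any action.
        self.alpha = {a: 1.0 for a in actions}
        self.beta = {a: 1.0 for a in actions}

    def choose(self):
        # Thompson sampling: draw one sample per action, play the largest.
        samples = {
            a: self.rng.betavariate(self.alpha[a], self.beta[a])
            for a in self.alpha
        }
        return max(samples, key=samples.get)

    def update(self, action, reward):
        # reward is 1 for success (e.g. a click) and 0 otherwise.
        if reward:
            self.alpha[action] += 1.0
        else:
            self.beta[action] += 1.0


def simulate(true_rates, rounds, seed=0):
    """Run the bandit against made-up click rates; return pulls per action."""
    env = random.Random(seed + 1)
    bandit = BetaBernoulliBandit(list(true_rates), seed=seed)
    pulls = {a: 0 for a in true_rates}
    for _ in range(rounds):
        action = bandit.choose()
        reward = 1 if env.random() < true_rates[action] else 0
        bandit.update(action, reward)
        pulls[action] += 1
    return pulls


if __name__ == "__main__":
    # Simulated rates, chosen only for illustration.
    print(simulate({"control": 0.05, "variant": 0.5}, rounds=2000))
```

In this simulation most pulls end up on the action with the higher simulated rate, whereas a fixed 50/50 split would keep sending half of the traffic to the weaker one.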

The models are registered with the Model Database and loaded into the LyftLear...