Sequential A/B Testing Keeps the World Streaming Netflix, Part 1: Continuous Data

Michael Lindon, Chris Sanden, Vache Shirikian, Yanjun Liu, Minal Mishra, Martin Tingley

Using sequential anytime-valid hypothesis testing procedures to safely release software

1. Spot the Difference

Can you spot any difference between the two data streams below? Each observation is the time interval between a Netflix member hitting the play button and playback commencing, i.e., play-delay. These observations are from a particular type of A/B test that Netflix runs called a software canary or regression-driven experiment. More on that below — for now, what’s important is that we want to quickly and confidently identify any difference in the distribution of play-delay — or conclude that, within some tolerance, there is no difference.

In this blog post, we will develop a statistical procedure to do just that, and describe the impact of these developments at Netflix. The key idea is to switch from a “fixed time horizon” to an “any-time valid” framing of the problem.
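To see why the framing matters, here is a minimal simulation sketch, not the procedure developed in this post. It assumes both streams are Gaussian with known unit variance (real play-delay data is not) and uses an ordinary two-sided z-test; the function names and parameters are illustrative. Under these assumptions, it compares the false positive rate of a test evaluated once at a fixed horizon with the rate when the same test is re-checked repeatedly as data arrives.

```python
import math
import random


def two_sided_p_value(sum_diff, n):
    """p-value of a z-test for equal means after n observations per stream.

    Assumes both streams are Gaussian with known unit variance, so the sum
    of (treatment - control) differences has variance 2 * n under the null.
    """
    z = sum_diff / math.sqrt(2.0 * n)
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate(n_sims=1000, n_max=500, peek_every=10, alpha=0.05, seed=0):
    """Return (fixed_horizon_rate, peeking_rate) of false positives under the null."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        sum_diff = 0.0
        rejected_at_any_peek = False
        for n in range(1, n_max + 1):
            # Both streams share one distribution, so any rejection is a false positive.
            sum_diff += rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)
            if n % peek_every == 0 and two_sided_p_value(sum_diff, n) < alpha:
                rejected_at_any_peek = True
        # Fixed horizon: look exactly once, after all n_max observations.
        final_rejected = two_sided_p_value(sum_diff, n_max) < alpha
        fixed_hits += final_rejected
        # Peeking: stop at the first interim look (or the final one) that rejects.
        peeking_hits += rejected_at_any_peek or final_rejected
    return fixed_hits / n_sims, peeking_hits / n_sims


if __name__ == "__main__":
    fixed_rate, peeking_rate = simulate()
    print(f"fixed horizon false positive rate: {fixed_rate:.3f}")
    print(f"peeking false positive rate:       {peeking_rate:.3f}")
```

In this sketch the fixed-horizon rate stays near the nominal level `alpha`, while the peeking rate is several times larger. An anytime-valid procedure is designed to keep the error rate controlled no matter how often, or when, the data are inspected.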

Figure 1. An example data stream for an A/B test where each observation represents play-delay for the control (left) and treatment (right). Can you spot any differences in the statistical distributions between the two data streams?

2. Safe software deployment, canary testing, and play-delay

Software engineering readers of this blog are likely familiar with unit, integration and load testing, as well as other testing practices that aim to prevent bugs from reaching production systems. Netflix also performs canary tests — software A/B tests between current and newer software versions. To learn more, see our previous blog post on Safe Updates of Client Applications.
