Refactoring and Optimizing High-Traffic APIs at PayPal

By Nisha Bhaskaran and Jeetendra Tiwari

Photo by Jason Olliff on Unsplash

Experimentation is increasingly becoming the standard for enabling data-driven decisions to determine the impact of every product change. It is an integral part of the product lifecycle at PayPal. Experiment Lifecycle Management and Optimization (ELMO), our in-house experimentation platform, is used to iterate and measure the impact of new product features, improved user experiences, marketing campaigns, etc.

Client teams integrate with the experimentation platform (using SDKs) and make a service call (using the evaluation APIs) for real-time experiment evaluation; based on the active experiment configuration, the API evaluates the experiments and returns the evaluated variant. Today, our focus will be on ELMO's evaluation APIs, which form the crux of the problem statement.

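To make the integration concrete, here is a minimal sketch of what such a client-side call could look like; the class, method, and experiment names (EvaluationClient, evaluate, checkout_button_experiment) are illustrative assumptions, not the actual ELMO SDK surface.

```java
import java.util.Map;

// Minimal sketch of how a client flow might call an evaluation API through an SDK.
// All names here (EvaluationClient, evaluate, Variant) are illustrative assumptions,
// not the actual ELMO SDK surface.
public class EvaluationCallSketch {

    // Hypothetical SDK interface: given an identifier and targeting attributes,
    // return the variant chosen by the active experiment configuration.
    interface EvaluationClient {
        Variant evaluate(String experimentName, String identifier, Map<String, String> attributes);
    }

    record Variant(String name, boolean isControl) {}

    static void renderCheckout(EvaluationClient elmo, String accountId) {
        Variant variant = elmo.evaluate(
                "checkout_button_experiment",   // experiment name (illustrative)
                accountId,                      // identifier used as the evaluation key
                Map.of("country", "US"));       // attributes used for population targeting

        if (variant.isControl()) {
            // serve the default experience
        } else {
            // serve the treatment under test
        }
    }
}
```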

The evaluation APIs are critical endpoints that serve billions of requests per day from flows across multiple domains at PayPal and support different channels, such as web and mobile. Earlier this year, we noticed that the experience of using the APIs was sub-optimal, especially for our adjacencies. The SLA did not meet the standards we set for ourselves and was also causing reliability issues for our clients. Therefore, we embarked on a journey to optimize the performance of the APIs by identifying critical bottlenecks in the flow.

Defining Performance

We defined latency as network latency plus application request processing time. With our focus on optimizing the application request processing time, three parameters were chosen to define performance (a small computation sketch follows the list):

  • Average latency
  • 95th percentile
  • 99th percentile
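
For clarity, the sketch below computes these three numbers from a batch of latency samples using the nearest-rank percentile definition; it is an illustration only, not the measurement code used in production.

```java
import java.util.Arrays;

// Illustrative computation of the three performance parameters from latency samples.
// Uses the nearest-rank definition of a percentile; not the production measurement code.
public class LatencyStats {

    static double average(long[] latenciesMs) {
        return Arrays.stream(latenciesMs).average().orElse(0);
    }

    static long percentile(long[] latenciesMs, double p) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);  // nearest-rank
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        long[] samples = {12, 15, 18, 22, 25, 31, 40, 55, 90, 150};  // ms, made-up values
        System.out.printf("avg=%.1f ms, p95=%d ms, p99=%d ms%n",
                average(samples), percentile(samples, 95), percentile(samples, 99));
    }
}
```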

Complexity

To understand the complexity of our evaluations, let us first get a sense of what an experiment setup looks like.

Experiment Setup

Each experiment in ELMO has a control (default behavior) and one or more variations, which are the new experiences being evaluated. Clients define the experiment population by different attributes (for example, country) and define segments or cohorts for an experiment.

To incorporate segments in experiments, ELMO is integrated with an in-house segmentation platform called Real-time Profile Store (RPS). RPS enables users to create segments or cohorts. Clients create or update segments in RPS and enable them for their experiment by adding them as properties in the ELMO experiment configuration.

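A simplified sketch of how such an experiment definition could be modeled is shown below; the field names, traffic splits, and segment ids are illustrative assumptions, not ELMO's actual configuration schema.

```java
import java.util.List;
import java.util.Map;

// Simplified, conceptual sketch of an experiment definition: a control, variations,
// population-targeting attributes, and RPS segments attached as properties.
// Field and type names are illustrative assumptions, not ELMO's actual schema.
public class ExperimentConfigSketch {

    record Treatment(String name, int trafficPercent) {}

    record Experiment(
            String name,
            Treatment control,                          // default behavior
            List<Treatment> variations,                 // new experiences being evaluated
            Map<String, String> populationAttributes,   // e.g. country, used for targeting
            List<String> rpsSegments) {}                // RPS segment ids added as properties

    public static void main(String[] args) {
        Experiment checkout = new Experiment(
                "checkout_button_experiment",
                new Treatment("control", 50),
                List.of(new Treatment("blue_button", 25), new Treatment("green_button", 25)),
                Map.of("country", "US"),
                List.of("Seg1", "Seg2"));               // segments created in RPS, referenced from ELMO
        System.out.println(checkout);
    }
}
```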

Experiment Evaluation

During evaluation of experiments, we use an identifier string as the key. When a client makes a service call, we evaluate all experiments that are associated with this key. For each key, there can be X experiments, and for each experiment, there can be Z treatments. Therefore, during evaluation, each user is evaluated against X * Z combinations.

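Conceptually, the fan-out looks like the sketch below: every experiment associated with the key is considered, and within each experiment every treatment; the types and the placeholder bucketing rule are illustrative, not ELMO's actual logic.

```java
import java.util.List;

// Conceptual sketch of the evaluation fan-out: for one identifier (key), every associated
// experiment and each of its treatments is considered, i.e. X * Z combinations per call.
// Types and the bucketing rule are illustrative assumptions.
public class EvaluationFanOutSketch {

    record Treatment(String name) {}
    record Experiment(String name, List<Treatment> treatments) {}

    static void evaluateAll(String identifier, List<Experiment> experimentsForKey) {
        for (Experiment experiment : experimentsForKey) {            // X experiments for the key
            for (Treatment treatment : experiment.treatments()) {    // Z treatments per experiment
                // Placeholder bucketing check: a real implementation would hash the identifier
                // against the treatment's traffic allocation and targeting rules.
                boolean matches = (((identifier + treatment.name()).hashCode()) & 1) == 0;
                if (matches) {
                    System.out.printf("%s -> %s:%s%n", identifier, experiment.name(), treatment.name());
                }
            }
        }
    }
}
```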

Moreover, for experiments that have a segment or cohort defined, we make a service call to RPS, which conducts real-time evaluations, checking whether the user or account ID is part of that segment. Also, for post-experiment analysis, measurement, and insights, we send events to a data acquisition service.

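The sketch below illustrates these two additional calls under assumed interfaces (RpsClient and DataAcquisitionClient are hypothetical names): a real-time segment membership check against RPS and an event published for post-experiment measurement.

```java
// Sketch of the two side calls described above; the interfaces are illustrative assumptions,
// not the actual service contracts.
public class EvaluationSideCallsSketch {

    interface RpsClient {
        // Real-time check: is the given account currently a member of the segment?
        boolean isMember(String segmentId, String accountId);
    }

    interface DataAcquisitionClient {
        // Publishes an evaluation event for post-experiment analysis, measurement, and insights.
        void publish(String experimentName, String treatmentName, String accountId);
    }

    static boolean qualifiesForSegment(RpsClient rps, String segmentId, String accountId) {
        // Experiments without a segment skip the RPS call entirely.
        return segmentId == null || rps.isMember(segmentId, accountId);
    }

    static void recordEvaluation(DataAcquisitionClient dataAcquisition,
                                 String experimentName, String treatmentName, String accountId) {
        dataAcquisition.publish(experimentName, treatmentName, accountId);
    }
}
```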

High-level evaluation flow. Evaluation API calls are marked in orange.

Baseline Evaluation

As with any optimization process, the first step was to set the baseline for these APIs. Once baselines were set, bottlenecks were identified.

Bottlenecks Identified

  • Sequential load of experiments in case of cache miss: experiments were loaded in a sequential manner for a given identifier if they were not found in the cache.
  • Using RxJava for the complete flow was a good strategy to parallelize tasks, but it resulted in every step being put in a BlockingObservable, which made the flow almost sequential (see the sketch after this list).
  • Redundant service calls to other services.
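
To make the second bottleneck concrete, here is a simplified illustration, assuming RxJava 1.x (where toBlocking() returns a BlockingObservable): because each step is awaited with a blocking call before the next one starts, the latency of the flow becomes the sum of the individual steps. The step names are made up.

```java
import rx.Observable;
import rx.schedulers.Schedulers;

// Simplified illustration of the bottleneck (RxJava 1.x assumed): each step is scheduled
// on an I/O thread, but calling toBlocking() after every step forces the caller to wait
// for it before starting the next one, so the steps effectively run one after another.
public class BlockingStepsSketch {

    static String loadExperiments(String key)    { return "experiments-for-" + key; }
    static String evaluateSegments(String input) { return input + "+segments"; }
    static String publishEvents(String input)    { return input + "+events"; }

    public static void main(String[] args) {
        String key = "account-123";

        // Step 1 blocks until the experiments are loaded...
        String experiments = Observable.fromCallable(() -> loadExperiments(key))
                .subscribeOn(Schedulers.io())
                .toBlocking().single();

        // ...so step 2 cannot start before step 1 finishes, even though it runs "reactively"...
        String withSegments = Observable.fromCallable(() -> evaluateSegments(experiments))
                .subscribeOn(Schedulers.io())
                .toBlocking().single();

        // ...and step 3 waits for step 2: overall latency is the sum of all steps.
        String result = Observable.fromCallable(() -> publishEvents(withSegments))
                .subscribeOn(Schedulers.io())
                .toBlocking().single();

        System.out.println(result);
    }
}
```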

Optimization Techniques

We spent time analyzing the baseline results, debugging, and identifying the bottlenecks. Once those were identified, we applied the following optimization techniques to improve performance:

  1. The flow was broken down into a mixed mode of RxJava + non-RxJava, so that we could parallelize where needed and consolidate I/O calls wherever possible (see the sketch after this list).
  2. The loading of experiments was isolated so that RPS segments could be evaluated in bulk for all experiments associated with an identifier name, for a given combination of user and segment evaluation type. This was done to avoid evaluating the segments for each experiment separately.
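
A minimal sketch of what the reworked flow could look like is shown below, using plain CompletableFuture for the non-RxJava part; the service interfaces and method names are illustrative assumptions, not the production implementation.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Minimal sketch of the reworked flow (illustrative interfaces, not the production code):
// independent I/O calls run in parallel with CompletableFuture instead of chained blocking
// observables, and all RPS segments for the identifier are evaluated with one bulk call.
public class ParallelEvaluationSketch {

    interface ExperimentStore {
        List<String> loadExperiments(String identifier);   // all experiments for the key
    }

    interface RpsClient {
        // One consolidated call evaluating every segment needed by those experiments.
        Set<String> evaluateSegmentsBulk(String accountId, Set<String> segmentIds);
    }

    static void evaluate(ExperimentStore store, RpsClient rps,
                         String identifier, String accountId, Set<String> segmentIds) {

        // Kick off both I/O calls concurrently; neither blocks the other.
        CompletableFuture<List<String>> experimentsF =
                CompletableFuture.supplyAsync(() -> store.loadExperiments(identifier));
        CompletableFuture<Set<String>> memberSegmentsF =
                CompletableFuture.supplyAsync(() -> rps.evaluateSegmentsBulk(accountId, segmentIds));

        // Combine the results once both calls complete.
        experimentsF.thenAcceptBoth(memberSegmentsF, (experiments, memberSegments) ->
                System.out.println("experiments=" + experiments + ", memberSegments=" + memberSegments))
                .join();
    }
}
```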

Segment Evaluation

E1, E2 = experiments, U1 = audience, Seg1, Seg2, Seg3, Seg4, Seg5 = segments in RPS which are added as properties in ELMO for runtime evaluation.

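Using the example above, a bulk evaluation might look like the sketch below; the assignment of specific segments to E1 and E2 is made up for illustration, the point being that the union of segments is sent to RPS in a single call for U1 instead of one call per experiment.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration of bulk segment evaluation: the segments required by E1 and E2 are collected
// into a single de-duplicated set (Seg1..Seg5) and sent to RPS in one call for audience U1,
// instead of one RPS call per experiment. The segment-to-experiment mapping is made up.
public class BulkSegmentEvaluationSketch {

    public static void main(String[] args) {
        Map<String, List<String>> segmentsByExperiment = Map.of(
                "E1", List.of("Seg1", "Seg2", "Seg3"),
                "E2", List.of("Seg3", "Seg4", "Seg5"));

        // Union of all segments needed for this identifier: evaluated once, in bulk.
        Set<String> segmentsToEvaluate = new HashSet<>();
        segmentsByExperiment.values().forEach(segmentsToEvaluate::addAll);

        System.out.println("One RPS call for U1 covering: " + segmentsToEvaluate);
        // The (hypothetical) bulk response would then be shared by E1 and E2 during evaluation.
    }
}
```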

Outcome

The improvement percentages in service call latency.

Conclusion

The optimization outcome shared above reflects the SLA gain based on per-day data. This was a good starting point in our journey to scale our platform. We also have clients performing the experiment evaluation locally, which caches the active experiment configuration and properties. This further helps reduce the number of evaluation API network calls and contributes to our efforts to scale our platform.

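As an illustration of that local-evaluation pattern, the sketch below caches the active configuration and refreshes it periodically; the refresh interval, types, and method names are assumptions, not the actual SDK behavior.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.Function;

// Simplified sketch of local evaluation with a cached configuration: the active experiment
// configuration is fetched periodically and reused between refreshes, so most evaluations
// need no network call. The refresh interval and types are illustrative assumptions.
public class LocalEvaluationSketch {

    private final Function<String, List<String>> configFetcher;  // fetches active experiments per key
    private final Duration refreshInterval = Duration.ofMinutes(5);

    private List<String> cachedConfig;
    private Instant lastRefresh = Instant.EPOCH;

    LocalEvaluationSketch(Function<String, List<String>> configFetcher) {
        this.configFetcher = configFetcher;
    }

    List<String> activeExperiments(String identifierName) {
        // Only hit the configuration endpoint when the cached copy is stale.
        if (Duration.between(lastRefresh, Instant.now()).compareTo(refreshInterval) > 0) {
            cachedConfig = configFetcher.apply(identifierName);
            lastRefresh = Instant.now();
        }
        return cachedConfig;  // evaluated locally against this cached configuration
    }
}
```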
