Debugging high latency due to context leaks

Market-Store is a general-purpose feature store, developed in-house, that serves machine learning (ML) features computed in real time. It has a stringent SLA for latency, throughput, and availability because it powers the ML models used in Dynamic Pricing and Consumer Experience.

Problem

As Grab continued to grow, introducing new ML models and handling more traffic, Market-Store started to experience high latency. Market-Store's SLA states that 99% of transactions should complete within 200 ms, but our latency increased to 2 seconds. This affected the availability and accuracy of the models that rely on Market-Store for real-time features.
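
To make the SLA concrete, here is a minimal Python sketch of a 99th-percentile check against the 200 ms threshold. The function names, the nearest-rank percentile method, and the sample latencies are all assumptions made for illustration; the article does not say how Market-Store computes or monitors its percentiles.

```python
SLA_PERCENTILE = 99      # "99% of transactions" from the SLA
SLA_THRESHOLD_MS = 200   # "within 200 ms" from the SLA


def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Integer ceiling division avoids floating-point rounding in the rank.
    rank = (pct * len(ordered) + 99) // 100  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_sla(latencies_ms):
    """True when the p99 latency is at or below the SLA threshold."""
    return percentile(latencies_ms, SLA_PERCENTILE) <= SLA_THRESHOLD_MS


# Hypothetical samples: one slow request in 100 still satisfies a p99 SLA,
# two slow requests in 100 do not.
healthy = [50] * 99 + [2000]
degraded = [50] * 98 + [2000] * 2
```

With these made-up samples, `meets_sla(healthy)` holds and `meets_sla(degraded)` does not, which shows why a p99 target tolerates rare outliers but not a sustained tail at 2 seconds.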

Latency Issue

We used various metrics and logs to debug the latency issue but could not find any abnormality that correlated directly with the API's performance. We discovered that the problem went away temporarily when we restarted the service. During the next peak period, however, the service began to struggle again, and the problem became more pronounced as Market-Store's queries per second (QPS) increased.

The following graph shows memory usage increasing steadily over 12 hours. Even as the system load receded, memory usage continued to climb.

The continuous increase in memory consumption pointed to a possible memory leak, which occurs when memory is allocated but not released after it is no longer needed. Consumed memory then keeps growing until the service runs out of memory and crashes.
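
The signal described above (memory that keeps rising after load has receded) can be expressed as a simple check. The sketch below is only an illustrative heuristic, not the method the team used: the function name, the `(qps, memory_mb)` sample format, and the numbers are all hypothetical.

```python
def memory_grows_after_peak(samples):
    """samples: list of (qps, memory_mb) tuples in time order.

    Returns True when memory never drops after the highest-load sample
    and ends higher than it was at that peak. Steady growth while load
    falls is a hint of a leak, not proof of one.
    """
    if len(samples) < 2:
        return False
    # Index of the sample with the highest load (first one on ties).
    peak_index = max(range(len(samples)), key=lambda i: samples[i][0])
    after_peak = [mem for _, mem in samples[peak_index:]]
    if len(after_peak) < 2:
        return False  # nothing recorded after the peak yet
    never_drops = all(b >= a for a, b in zip(after_peak, after_peak[1:]))
    return never_drops and after_peak[-1] > after_peak[0]


# Hypothetical readings: load peaks at 400 QPS and then recedes.
leaking = [(100, 500), (400, 800), (300, 950), (150, 1100)]
healthy = [(100, 500), (400, 800), (300, 700), (150, 550)]
```

Here `memory_grows_after_peak(leaking)` is true because memory climbs from 800 MB to 1100 MB while QPS falls, whereas the `healthy` series releases memory as load drops.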

Although we could restar...
