From Static Rate Limiting to Adaptive Traffic Management: The Evolution of Airbnb's Key-Value Store

[Shravan Gaonkar](https://medium.com/@gaonkar?source=post_page---byline--29362764e5c2---------------------------------------)

How Airbnb hardened Mussel, our key-value store, with smarter traffic controls to stay fast and reliable during traffic spikes.

By Shravan Gaonkar, Casey Getz, Wonhee Cho

Introduction

Every request lookup on Airbnb, from stays, experiences, and services search to customer support inquiries, ultimately hits Mussel, our multi-tenant key-value store for derived data. Mussel operates as a proxy service, deployed as a fleet of stateless dispatchers, each running as a Kubernetes pod. On a typical day, this fleet handles millions of predictable point and range reads. During peak events, however, it must absorb several-fold higher volume, terabyte-scale bulk uploads, and sudden bursts from automated bots or DDoS attacks. Its ability to reliably serve this volatile mix of traffic is therefore critical to both the Airbnb user experience and the stability of the many services that power our platform.

Given Mussel’s traffic volume and its role in core Airbnb flows, quality of service (QoS) is one of the product’s defining features. The first-generation QoS system was primarily an isolation tool. It relied on a Redis-backed, client-quota-based rate limiter that checked each caller’s queries per second (QPS) against a configurable fixed quota. The goal was to prevent a single misbehaving client from overwhelming the service and causing a complete outage. For this purpose, it was simple ...
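To make the fixed-quota idea concrete, here is a minimal sketch of a per-client QPS check. It is not Mussel's actual implementation: the fixed one-second window, the class name `FixedQuotaLimiter`, the quota table, and the client name `search` are all assumptions made for illustration. An in-memory dict stands in for the shared Redis counter so the example runs on its own.

```python
import time
from typing import Callable, Dict, Tuple


class FixedQuotaLimiter:
    """Fixed-window QPS limiter; a dict stands in for the shared Redis counter."""

    def __init__(self, quotas: Dict[str, int], default_quota: int = 100,
                 clock: Callable[[], float] = time.time) -> None:
        self.quotas = quotas                # hypothetical per-client QPS quotas
        self.default_quota = default_quota  # assumed fallback for unknown clients
        self.clock = clock                  # injectable so the example is testable
        self.counters: Dict[Tuple[str, int], int] = {}

    def allow(self, client_id: str) -> bool:
        window = int(self.clock())          # one counter per client per second
        key = (client_id, window)
        # With Redis, this step would be an INCR on the key plus a short EXPIRE.
        count = self.counters.get(key, 0) + 1
        self.counters[key] = count
        # Drop counters from earlier windows to keep the dict small.
        for stale in [k for k in self.counters if k[1] < window]:
            del self.counters[stale]
        return count <= self.quotas.get(client_id, self.default_quota)


# Usage with a fake clock: a quota of 2 QPS admits two calls per window.
now = [1000.0]
limiter = FixedQuotaLimiter({"search": 2}, clock=lambda: now[0])
results = [limiter.allow("search") for _ in range(3)]
now[0] += 1.0                               # move to the next one-second window
results.append(limiter.allow("search"))
print(results)                              # [True, True, False, True]
```

The check only compares a request count against a static number. It does not look at how expensive each request is or how loaded the backend currently is, which is why a scheme like this works mainly as an isolation tool.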
