CRISP: Critical Path Analysis for Microservice Architectures
Uber’s backend is an exemplar of microservice architecture. Each microservice is a small, individually deployable program that performs a specific piece of business logic (an operation). Microservice architecture is a style of distributed system that lends itself to independent deployment and scaling of software components, which is why it is widely used across modern service-oriented industries. Uber has a few thousand microservices that interact with one another via remote procedure calls (RPCs).
A service request arriving at an entry point (also called an endpoint) of the Uber backend undergoes multiple “hops” through numerous microservice operations before it is fully serviced. The life of a request therefore involves complex microservice interactions: they are deeply nested, often asynchronous, and each can invoke numerous other downstream operations. Because of this complexity, it is very hard to identify which underlying services contribute to the overall end-to-end latency experienced by a top-level request. Answering this question is critical in many situations, for example:
- Identifying optimization opportunities for a top-level microservice
- Identifying common bottleneck operations affecting many services
- Setting appropriate time-to-live values for downstream RPC calls
- Diagnosing outages and error conditions
- Capacity planning and reduction
Latency is one metric of interest; others, such as time-to-live and error rates, are also in scope.
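To make the attribution question concrete, here is a minimal sketch of one way to charge a request's end-to-end latency to the operations that actually gate it. It assumes a trace is available as a tree of timed spans and that child spans fall within their parent's time range. The `Span` type, the `critical_time` function, and the service names are hypothetical and chosen for illustration; this is a generic backward walk over a span tree, not a description of any particular tool's implementation.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical span record: one operation with start/end times in ms."""
    name: str
    start: int
    end: int
    children: list[Span] = field(default_factory=list)


def critical_time(span: Span, totals: dict | None = None) -> dict:
    """Attribute wall-clock time on the critical path to each operation.

    Walk backward from the end of the span. The child that finishes last
    (at or before the cursor) gates progress, so recurse into it and move
    the cursor to that child's start. Children that were still running at
    the cursor overlap an already chosen child and are skipped.
    """
    if totals is None:
        totals = defaultdict(int)
    cursor = span.end
    for child in sorted(span.children, key=lambda c: c.end, reverse=True):
        if child.end > cursor:
            continue  # ran in parallel with a later-finishing child
        totals[span.name] += cursor - child.end  # parent's own work
        critical_time(child, totals)
        cursor = child.start
    totals[span.name] += max(0, cursor - span.start)
    return dict(totals)


# Illustrative trace (all names and timings are made up).
trace = Span("gateway", 0, 100, [
    Span("auth", 5, 25),
    Span("pricing", 10, 90, [Span("db", 20, 80)]),
    Span("logging", 30, 40),
])

if __name__ == "__main__":
    for name, ms in critical_time(trace).items():
        print(f"{name}: {ms} ms on the critical path")
```

In this made-up trace, `auth` and `logging` run in parallel with `pricing`, so speeding them up would not shorten the request; the time is charged to `gateway`, `pricing`, and `db` instead, and the charged times add up to the request's total duration.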
We have developed a tool, CRISP (the name takes letters from "critical" and "span"), to pinpoint and quantify the underlying services that impact the overall latency...