Debugging the One-in-a-Million Failure: Migrating Pinterest's Search Infrastructure to Kubernetes
Sanson Hu, Shashank Tavildar, Eric Kalkanger, Hunter Gatewood
While migrating Pinterest’s search infrastructure — which powers core experiences for millions of users monthly — to Kubernetes, we faced a challenge in the new environment: one in every million search requests took 100x longer than usual.
This post chronicles our investigation, uncovering an elusive interaction between our memory-intensive search system and a seemingly innocent monitoring process. The journey involves profiling search systems, debugging performance issues, Linux kernel features, and memory management.
Migrating Manas to Kubernetes
At Pinterest, search is a critical component of our recommendation system. When users visit their home feed, type a search query, or view related content, the results likely come from search.
To fulfill these searches at Pinterest scale, we built an in-house search system called Manas. Today, Manas serves dozens of search indices, enabling a wide array of teams within Pinterest to build performant recommendation features, and it is one of the most important services at the company. Underneath it all, Manas manages more than 100 search clusters across thousands of hosts via a custom cluster management solution.
Over the past eight years since its inception in 2017, thi...