Migrating Policy Delivery Engines with (Almost) Nobody Knowing
Jeremy Krach | Staff Security Engineer, Platform Security
Background
Several years ago, Pinterest had a brief incident caused by oversights in the policy delivery engine. This engine is the technology that ensures a policy document written by a developer and checked into source control is fully delivered to the production system evaluating that policy, similar to OPAL. This incident began a multi-year journey for our team to rethink policy delivery and migrate hundreds of policies to a new distribution model. We shared details about our former policy delivery system in a conference talk at KubeCon 2019.
At a high level, there are three architectural decisions we'd like to highlight for this story.
Figure 1: Old policy distribution architecture, using S3 and Zookeeper.
- Pinterest provides a wrapper service around OPA in order to manage policy distribution, agent configuration, metrics, logging, and simplified APIs.
- Policies were fetched automatically via Zookeeper as soon as a new version was published.
- Policies lived in a shared Phabricator repository that was published via a CI workflow.
So where did this go wrong? Essentially, a bad commit to the policy repository caused bad versions of every policy (50+ at the time) to be published simultaneously. These bad versions were published to S3, with new versions registered in Zookeeper and pulled directly into production. This caused many of our internal services to fail simultaneously. Fortunately, a quick re-run of our CI published known good versions that were (again) pulled directly into production.
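To make the failure mode concrete, here is a minimal in-memory sketch of the publish, register, and pull flow described above. It is an illustration under stated assumptions, not Pinterest's implementation: `FakeS3`, `FakeZookeeper`, `Agent`, and `ci_publish` are hypothetical stand-ins, the policy bodies are placeholder strings, and the count of 50 policies is chosen only to echo the "50+" figure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class FakeS3:
    """Hypothetical in-memory stand-in for the bucket of published policies."""
    objects: Dict[Tuple[str, int], str] = field(default_factory=dict)

    def put(self, name: str, version: int, body: str) -> None:
        self.objects[(name, version)] = body

    def get(self, name: str, version: int) -> str:
        return self.objects[(name, version)]


@dataclass
class FakeZookeeper:
    """Hypothetical stand-in for the version registry that agents watch."""
    versions: Dict[str, int] = field(default_factory=dict)
    watchers: Dict[str, List[Callable[[str, int], None]]] = field(default_factory=dict)

    def watch(self, name: str, callback: Callable[[str, int], None]) -> None:
        self.watchers.setdefault(name, []).append(callback)

    def set_version(self, name: str, version: int) -> None:
        self.versions[name] = version
        # Every watcher is notified as soon as the version is registered.
        for callback in self.watchers.get(name, []):
            callback(name, version)


@dataclass
class Agent:
    """Hypothetical production agent that loads whatever version is announced."""
    s3: FakeS3
    active: Dict[str, str] = field(default_factory=dict)

    def on_new_version(self, name: str, version: int) -> None:
        # No validation or staged rollout: the new version goes live at once.
        self.active[name] = self.s3.get(name, version)


def ci_publish(s3: FakeS3, zk: FakeZookeeper, repo: Dict[str, str]) -> None:
    """Publish every policy in the shared repository in a single CI run."""
    for name, body in repo.items():
        version = zk.versions.get(name, 0) + 1
        s3.put(name, version, body)
        zk.set_version(name, version)


s3, zk = FakeS3(), FakeZookeeper()
agent = Agent(s3)
policies = [f"policy_{i}" for i in range(50)]
for name in policies:
    zk.watch(name, agent.on_new_version)

good_repo = {name: "allow = true" for name in policies}  # placeholder body
bad_repo = {name: "<broken>" for name in policies}       # placeholder body

ci_publish(s3, zk, good_repo)  # normal state
ci_publish(s3, zk, bad_repo)   # the bad commit reaches every policy at once
broken_after_bad_commit = sum(b == "<broken>" for b in agent.active.values())

ci_publish(s3, zk, good_repo)  # re-running CI republishes known good versions
broken_after_rerun = sum(b == "<broken>" for b in agent.active.values())
```

In this sketch nothing sits between a publish and production, so one bad CI run breaks every policy the agent serves, and the same immediacy is what lets a CI re-run restore service quickly.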