Migrating Policy Delivery Engines with (almost) Nobody Knowing

Jeremy Krach | Staff Security Engineer, Platform Security

Background

Several years ago, Pinterest had a short incident caused by oversights in the policy delivery engine. This engine is the technology that ensures a policy document written by a developer and checked into source control is fully delivered to the production system evaluating that policy, similar to OPAL. This incident began a multi-year journey for our team to rethink policy delivery and migrate hundreds of policies to a new distribution model. We shared details about our former policy delivery system in a conference talk at KubeCon 2019.

At a high level, there are three important architectural decisions we’d like to bring attention to for this story.

Figure 1: Old policy distribution architecture, using S3 and ZooKeeper.

Pinterest provides a wrapper service around OPA in order to manage policy distribution, agent configuration, metrics, logging, and simplified APIs.
Policies were fetched automatically via ZooKeeper as soon as a new version was published.
Policies lived in a shared Phabricator repository that was published via a CI workflow.

So where did this go wrong? Essentially, a bad commit to the policy repository caused bad versions of every policy (50+ at the time) to be published simultaneously. These bad versions were published to S3, with new versions registered in ZooKeeper and pulled directly into production. This caused many of our internal services to fail simultaneously. Fortunately, a quick re-run of our CI published known good versions that were (again) pulled directly into production.

This incident led multiple teams to begin rethinking global configuration (like OPA policy). Specifically, the Security team and Traffic team at Pinterest began collaborating on a new configuration delivery system that would provide a mechanism to define deployment pipelines for configuration.

This blog post is focused on how the Security team moved hundreds of policies and dozens of customers from the ZooKeeper model to a safer, more reliable, and more configurable config deployment approach.

Technology

The core configuration delivery story here isn’t the Security team’s to tell — Pinterest’s Traffic team worked closely with us to understand our requirements, and that team was ultimately responsible for building out the core technology to enable our integration.

Generally speaking, the new configuration management system works as follows:

Config owners create their configuration in a shared repository.
Configs are grouped by service owners into “artifacts” in a DSL in that repository.
Artifacts are configured with a pipeline, also in a DSL in that repository. This defines which systems receive the artifact and when.

Each pipeline defines a set of steps and a set of delivery scopes for each step. These scopes are generated locally on each system that would like to retrieve a configuration. For example, one might define a simplified pipeline that first delivers to the canary system and then to the production system.
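As a rough illustration only (this is Python, not the actual DSL), the sketch below models the same idea: an ordered list of steps, each with a set of delivery scopes, and a host that receives the artifact once a step matching one of its locally generated scopes has been reached. Every name here (`Step`, `Pipeline`, `host_receives`, the scope strings) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    """One pipeline step and the delivery scopes it targets (hypothetical model)."""
    name: str
    scopes: frozenset


@dataclass
class Pipeline:
    """An ordered list of steps for a single artifact."""
    artifact: str
    steps: list = field(default_factory=list)

    def scopes_released_through(self, step_index: int) -> frozenset:
        """All scopes that have the artifact once steps 0..step_index are done."""
        released = set()
        for step in self.steps[: step_index + 1]:
            released |= step.scopes
        return frozenset(released)


def host_receives(pipeline: Pipeline, step_index: int, host_scopes: set) -> bool:
    """A host gets the artifact if any scope it generated locally was released."""
    return bool(pipeline.scopes_released_through(step_index) & set(host_scopes))


# Canary first, then production; the scope strings are made up for illustration.
pipeline = Pipeline(
    artifact="example-service-policies",
    steps=[
        Step("canary", frozenset({"example-service/canary"})),
        Step("production", frozenset({"example-service/prod"})),
    ],
)

prod_host = {"example-service/prod"}
print(host_receives(pipeline, 0, prod_host))  # False: only canary is released
print(host_receives(pipeline, 1, prod_host))  # True: production step reached
```

Modeling release as "everything up to the current step" keeps canary hosts subscribed after the pipeline moves on, which is one reasonable reading of a staged rollout.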

The DSL also allows for configuration around how pipeline steps are promoted — automatic (within business hours), automatic (24x7), and manual. It also allows for configuration of metric thresholds that must not be exceeded before proceeding to the next step.
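To make the promotion rules concrete, here is a minimal Python sketch of how such a gate might be evaluated. It is an assumption-laden illustration, not the real system: the `Promotion` enum, the `may_promote` function, the 09:00 to 17:00 weekday definition of business hours, and the choice to treat a missing metric as a failure are all invented for this example.

```python
from datetime import datetime
from enum import Enum


class Promotion(Enum):
    AUTO_BUSINESS_HOURS = "auto_business_hours"
    AUTO_ALWAYS = "auto_24x7"
    MANUAL = "manual"


def within_business_hours(now: datetime) -> bool:
    """Assumed definition: Monday to Friday, 09:00 to 17:00 local time."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def may_promote(mode, now, metrics, thresholds, manually_approved=False):
    """Decide whether a pipeline may advance to its next step.

    metrics and thresholds map metric name -> value. A metric that is missing
    is treated as exceeding its threshold, so absent data blocks promotion.
    """
    for name, limit in thresholds.items():
        if metrics.get(name, float("inf")) > limit:
            return False
    if mode is Promotion.MANUAL:
        return manually_approved
    if mode is Promotion.AUTO_BUSINESS_HOURS:
        return within_business_hours(now)
    return True  # only Promotion.AUTO_ALWAYS remains


tuesday_noon = datetime(2024, 4, 16, 12, 0)
saturday_noon = datetime(2024, 4, 20, 12, 0)
limits = {"error_rate": 0.05}

print(may_promote(Promotion.AUTO_BUSINESS_HOURS, tuesday_noon, {"error_rate": 0.01}, limits))
print(may_promote(Promotion.AUTO_BUSINESS_HOURS, saturday_noon, {"error_rate": 0.01}, limits))
```

Checking thresholds before the promotion mode means a manual approval cannot override an unhealthy metric in this sketch; a real system might reasonably choose otherwise.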

The actual distribution technology is not dissimilar to the original architecture. Now, instead of publishing policy in a global CI job, each artifact (group of policy and other configuration) has a dedicated pipeline to define the scope of delivery and the triggers for the delivery. This ensures each policy rollout is isolated to just that system and can have whatever deployment strategy and safety checks that the service owner deems appropriate. A high-level architecture can be seen below.

Figure 2: New policy distribution architecture, using Config server/sidecar and dedicated UI.

Migration

Before we could begin migrating policies from a global, instantaneous deployment model to a targeted, staged deployment model, a lot of information needed to be collected. Specifically, for each policy file in our old configuration repository we needed to identify:

The service and GitHub team associated with the policy
The systems using the policy
The preferred deploy order for the systems using the policy

Fortunately, most of this information was readily available from a handful of data sources at Pinterest. During this first phase of the migration, we developed a script to collect all this metadata about each policy. The script read each policy file to pull the associated service name from a mandatory tag comment, fetched the GitHub team associated with the service from our internal inventory API, retrieved metrics for all systems with traffic for the policy, and grouped those systems into a rough classification based on a few common naming conventions. Once this data was generated, we exported it to Google Sheets so we could annotate it with some manual tweaks: some systems were misattributed to owners due to stale ownership data, and many systems did not follow standard, predictable naming conventions.
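The two purely local steps of that script, reading the tag comment and bucketing systems by name, might look something like the Python sketch below. The tag format (`# service: <name>`) and the naming patterns are assumptions made for illustration; the inventory API and metrics lookups are internal and are left out.

```python
import re
from typing import Optional

# Assumed tag format: a Rego comment such as "# service: my-service".
SERVICE_TAG = re.compile(r"^\s*#\s*service:\s*(\S+)\s*$", re.MULTILINE)

# Assumed naming conventions, checked in order; real conventions will differ.
STAGE_PATTERNS = [
    ("dev", re.compile(r"(^|-)dev(-|$)")),
    ("canary", re.compile(r"(^|-)canary(-|$)")),
    ("prod", re.compile(r"(^|-)prod(-|$)")),
]


def service_from_policy(policy_text: str) -> Optional[str]:
    """Return the service named in the mandatory tag comment, if present."""
    match = SERVICE_TAG.search(policy_text)
    return match.group(1) if match else None


def classify_system(system_name: str) -> str:
    """Bucket a system by naming convention; 'unknown' flags it for manual review."""
    for stage, pattern in STAGE_PATTERNS:
        if pattern.search(system_name):
            return stage
    return "unknown"


def group_systems(system_names):
    """Group systems by stage, giving a rough deploy order to annotate by hand."""
    groups = {}
    for name in sorted(system_names):
        groups.setdefault(classify_system(name), []).append(name)
    return groups


policy = "# service: example-api\npackage example\n\ndefault allow = false\n"
print(service_from_policy(policy))
print(group_systems(["example-api-prod-1", "example-api-canary", "legacy-box-17"]))
```

The explicit "unknown" bucket mirrors the manual annotation pass described above: anything that does not match a convention is surfaced rather than guessed.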

The next piece of tooling we developed was a script that took a few pieces of input: the path to the policy to be migrated, the team names, and the deployment steps. This automatically moved the policy from the old repository to the new one, generated an artifact that included the policy, and defined a deployment pipeline for the associated systems attributed to the service owner.
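A heavily simplified Python sketch of that second script is shown here. It only plans the change as plain data rather than editing any repository, and the directory layout, dictionary keys, and function name `plan_migration` are all invented for this example; the real tool emitted files in the configuration system's own DSL.

```python
from pathlib import PurePosixPath


def plan_migration(policy_path, teams, steps):
    """Describe the changes for migrating one policy, without touching disk.

    steps is an ordered list of (step_name, [system names]) pairs.
    The directory layout and dictionary keys below are invented for this sketch.
    """
    if not steps:
        raise ValueError("at least one deployment step is required")
    policy = PurePosixPath(policy_path)
    artifact_name = policy.stem
    new_path = PurePosixPath("artifacts") / artifact_name / policy.name
    return {
        "move": {"from": str(policy), "to": str(new_path)},
        "artifact": {
            "name": artifact_name,
            "owners": sorted(teams),
            "files": [str(new_path)],
        },
        "pipeline": {
            "artifact": artifact_name,
            "steps": [
                {"name": name, "systems": sorted(systems)}
                for name, systems in steps
            ],
        },
    }


plan = plan_migration(
    "policies/example_api.rego",
    ["team-b", "team-a"],
    [("canary", ["example-api-canary"]), ("prod", ["example-api-prod"])],
)
print(plan["move"]["to"])
```

Producing a plan as data first makes each migration easy to review before anything is moved.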

With all this tooling in hand, we were ready to start testing the migration tooling against some simple examples.

Phase 2: Cutover Logic

Prior to the new policy delivery model, teams would define their policy subscriptions in a config file managed by Telefig. One of our goals for this migration was ensuring a seamless cutover that required minimal or no customer changes. Since the new configuration management provided the concept of scopes and defined the policy subscription in the configuration repository, we could rely purely on the new repository to define where policies were needed. We needed to update our sidecar (the OPA wrapper) to generate subscription scopes locally during start-up based on system attributes. We chose to generate these scopes based on the SPIFFE ID of the system, which allowed us to couple the deployments closely to the service and environment of the host.
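As a sketch of what scope generation from a SPIFFE ID could look like, the Python below parses an ID and emits a broad and a narrow scope. The path layout (`/<service>/<environment>`), the scope string format, and the function name are assumptions for illustration; the source does not describe Pinterest's actual SPIFFE path structure or scope syntax.

```python
from urllib.parse import urlsplit


def scopes_from_spiffe_id(spiffe_id: str) -> list:
    """Derive subscription scopes, broadest first, from a SPIFFE ID.

    Assumes IDs shaped like spiffe://<trust-domain>/<service>/<environment>.
    """
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 2:
        raise ValueError(f"expected /<service>/<environment>, got {parts.path!r}")
    service, environment = segments
    return [f"service={service}", f"service={service},env={environment}"]


print(scopes_from_spiffe_id("spiffe://example.org/example-api/prod"))
```

Because the scopes are derived from the workload's identity rather than from a customer-managed file, a host subscribes to the right policies at start-up without any per-service configuration.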

We also recognized that since the configuration system can deliver arbitrary configs, we could also deliver a configuration telling our OPA wrapper to switch its behavior. We implemented this cutover logic as a hot-reload of configuration in the OPA wrapper. When a new configuration file was created, the OPA wrapper detected it and changed the following properties:

Where the policies are stored on disk (reload of the OPA runtime engine)
How the policies are updated on disk (ZooKeeper subscription defined by customer managed configuration file vs. doing nothing and allowing the configuration sidecar to manage it)
Metric tags, to allow detection of cutover progress

Figure 3: Flowchart of the policy cutover logic.
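The decision in that flow can be sketched in a few lines of Python. This is a hypothetical illustration: the JSON file format, the `mode` key, the directory paths, and the tag values are all invented, and the real wrapper's reload mechanics are not described in enough detail to reproduce.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class WrapperSettings:
    policy_dir: str        # where the OPA runtime loads policies from
    manage_updates: bool   # True: wrapper keeps its own ZooKeeper subscription
    metric_tags: dict      # tags attached to emitted metrics


# Both directory paths are placeholders, not a real layout.
LEGACY = WrapperSettings("/var/policies/legacy", True, {"delivery": "legacy"})
MANAGED = WrapperSettings("/var/policies/managed", False, {"delivery": "config_sidecar"})


def load_cutover_config(path: Path) -> Optional[dict]:
    """Read the cutover file; a missing or malformed file yields None."""
    try:
        config = json.loads(path.read_text())
    except (OSError, ValueError):
        return None
    return config if isinstance(config, dict) else None


def settings_from_config(config: Optional[dict]) -> WrapperSettings:
    """Use the new delivery path only when the file explicitly asks for it.

    Any other content, including an explicit request for legacy mode (which is
    how a rollback would be expressed), keeps the legacy behavior.
    """
    if config and config.get("mode") == "managed":
        return MANAGED
    return LEGACY


print(settings_from_config({"mode": "managed"}).metric_tags)
print(settings_from_config(load_cutover_config(Path("no-such-cutover.json"))).metric_tags)
```

A hot-reload loop would call these two functions whenever the file changes and restart the OPA runtime only if the resulting settings differ from the current ones.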

One benefit of this approach is that reverting the policy distribution mechanism could be done completely in the new system. If a service did not work well with the new deployment system, we could use the new deployment system to update the new configuration file to tell the OPA wrapper to use the legacy behavior. Switching between modes could be done seamlessly with no downtime or impact to customers using policies.

Since both the policy setup and the cutover configuration could happen in a single repository, each policy or service could be migrated with a single pull request without any need for customer input. All files in the new repository could be generated with our previously-built tooling. This set the stage for a long series of migrations with localized impact to only the policy being migrated.

Phase 3: Tons of Pull Requests

At this point, the foundation was laid to begin the migration in earnest. Over the course of a month or two, we auto-generated pull requests, each scoped to a single team or policy. Security and Traffic team members generated and reviewed most of these PRs to ensure the deployments were properly scoped, associated with the correct teams, and rolled out successfully.

As mentioned before, we had hundreds of policies that needed to be migrated, so this was a steady but long process of moving policies in chunks. As we gained confidence in our tooling, we ramped up the number of policies migrated in a given PR from 1–2 to 10–20.

Phase 4: Edge Cases

As with any plan, some unforeseen issues came up as we deployed policies to a more diverse set of systems. We found that some of our older stateful systems were running an older machine image (AMI) that did not support subscription declaration. This presented an immediate roadblock for progress on systems that could not easily be relaunched.

Fortunately, our Continuous Deployment team was actively revising how the Telefig service receives updates. We worked closely with the CD team to ensure that we dynamically upgraded all systems at Pinterest to use the latest version of Telefig. This unblocked our work and allowed us to continue migrating the remaining use cases.

Phase 5: Smooth Landing

Once we resolved the old Telefig version issue, we quickly worked with the few teams that owned the bulk of the remaining policies to get everything moved over into the new configuration deployment model. Below is a rough timeline of the migration:

Figure 4: Timeline of the migration to the new policy framework.

Once the metrics above stabilized at 100%, we began cleaning up the old tooling. This allowed us to delete hundreds of lines of code and greatly simplify the OPA wrapper, as it no longer had to build in policy distribution logic.
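The completion check behind a chart like this can be as simple as the Python sketch below, which computes the share of reporting hosts tagged with the new delivery mode. The tag values (`config_sidecar`, `legacy`) and the input shape are assumptions for illustration, not the actual metric schema.

```python
from collections import Counter


def cutover_progress(host_modes: dict) -> float:
    """Percentage of reporting hosts whose delivery tag is 'config_sidecar'.

    host_modes maps host name -> delivery tag value (hypothetical values).
    """
    if not host_modes:
        return 0.0
    counts = Counter(host_modes.values())
    return 100.0 * counts["config_sidecar"] / len(host_modes)


reported = {
    "host-a": "config_sidecar",
    "host-b": "legacy",
    "host-c": "config_sidecar",
    "host-d": "config_sidecar",
}
print(cutover_progress(reported))  # 75.0
```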

At the end of this process, we now have a safer policy deployment platform that allows our teams to have full control over their deployment pipelines and fully isolate each deployment from policies not included in that deployment.

Conclusion

Migrating things is hard. There’s always resistance to a new workflow, and the more people who have to interact with it, the longer the tail on the migration. The main takeaways from this migration are as follows.

Focus on measurement first. In order to stay on track, you need to know who will be impacted, the scope of what work remains, and what big wins are behind you. Having good measurement also helps justify the project and gives a great set of resources to brag about accomplishments at milestones along the way.

Secondly, migrations generally follow the Pareto Principle. Specifically, 20% of the use cases to be migrated will generally account for 80% of the results. This is seen in the timeline chart above: there are two huge spikes in progress (one in mid-April and one a few weeks later). Each spike corresponds to the migration of a single team, yet together they account for an outsized proportion of the overall progress. Keep this in mind when prioritizing which systems to migrate, as sometimes spending a lot of time just to migrate one team or system could have a disproportionate payoff.

Finally, anticipate issues but be ready to adapt. Spend time early in the process thinking through your edge cases, but leave yourself extra time on the roadmap to account for issues that you could not predict. A little bit of buffer goes a long way for peace of mind, and if you happen to deliver the results early, that’s a great win to celebrate!

Acknowledgements

This work would not have been possible without a huge group of people working together over the past few years to build the best system possible.

Huge thanks to our partners on the Traffic team for building out a robust configuration deployment system and onboarding us as the first large-scale production use case. Specifically, thanks to Tian Zhao who led most of our collaboration and was instrumental in getting our use-case onboarded. Additional thanks to Zhewei Hu, James Fish and Scott Beardsley.

The Security team was also a huge help in reviewing the architecture, migration plans, and pull requests. Specifically, Teagan Todd was a huge help in running many of these migrations. Thanks also to Yuping Li, Kevin Hock and Cedric Staub.

When encountering issues with older systems, Anh Nguyen was a massive help in upgrading systems under the hood.

Finally, thank you to our partners on teams that owned a large number of policies, as they helped us push the migration forward by performing their own migrations: Aneesh Nelavelly, Vivian Huang, James Fraser, Gabriel Raphael Garcia Montoya, Liqi Yi (He Him), Qi LI, Mauricio Rivera and Harekam Singh.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.