Securely Scaling Big Data Access Controls At Pinterest

Pinterest Engineering
Pinterest Engineering Blog
14 min read · Jul 25, 2023

Soam Acharya | Data Engineering Oversight; Keith Regier | Data Privacy Engineering Manager

Background

Businesses collect many different types of data. Each dataset needs to be securely stored, with minimal access granted, to ensure it is used appropriately and can easily be located and disposed of when necessary. As businesses grow, so do the variety of these datasets and the complexity of their handling requirements. Consequently, access control mechanisms must also scale constantly to handle the ever-increasing diversification. Pinterest decided to invest in a new technical framework to implement fine-grained access control (FGAC). The result is a multi-tenant Data Engineering platform that grants users and services access to only the data they require for their work. In this post, we focus on how we enhanced and extended Monarch, Pinterest's Hadoop-based batch processing system, with FGAC capabilities.

FGAC Design Principles

Pinterest stores a significant volume of non-transient data in S3. Our original approach to restricting access to data in S3 used dedicated service instances where different clusters of instances were granted access to specific datasets. Individual Pinterest data users were granted access to each cluster when they needed access to specific data. We started out with one Monarch cluster whose workers had access to existing S3 data. As we built new datasets requiring different access controls, we created new clusters and granted them access to the new datasets.

The Pinterest Data Engineering team provides a breadth of data-processing tools to our data users: Hive MetaStore, Trino, Spark, Flink, Querybook, and Jupyter, to name a few. Every time we created a new restricted dataset, we found ourselves needing to create not just a new Monarch cluster, but new clusters across our Data Engineering platform to ensure Pinterest data users had all of the tools they required to work with these new datasets. Creating this large number of clusters increased hardware and maintenance costs and took considerable time to configure. Fragmenting hardware across multiple clusters also reduced overall resource utilization efficiency, as each cluster was provisioned with excess resources to handle sporadic surges in usage and required a base set of support services. The rate at which we were creating new restricted datasets threatened to outrun the number of clusters we could build and support.

When building an alternative solution, we shifted our focus from a host-centric system to one that focuses on access control on a per-user basis. Where we previously granted users access to EC2 compute instances and those instances were granted access to data via assigned IAM Roles, we sought to directly grant different users access to specific data and run their jobs with their identity on a common set of service clusters. By executing jobs and accessing data as individual users, we could narrowly grant each user access to different data resources without creating large supersets of shared permissions or fragmenting clusters.

We first considered how we might extend our initial implementation of the AWS security framework to achieve this objective and encountered some limitations:

  • The limit on the number of IAM roles per AWS account is less than the number of users needing access to data, and initially Pinterest concentrated many of its analytics data in a small number of accounts, so creating one custom role per user would not be feasible within AWS limits. Additionally, the sheer number of IAM roles created in this manner would be difficult to manage.
  • The AssumeRole API allows users to assume the privileges of a single IAM Role on demand. But we need to be able to grant users many different combinations of access privileges, which quickly becomes difficult to manage. For example, if we have three discrete datasets (A, B, and C), each in its own bucket, some users need access to just A, while others need A and B, and so on. To avoid granting every user access to everything, we would need to cover all seven combinations: A, B, C, A+B, A+C, B+C, and A+B+C. This requires building and maintaining a large number of IAM Roles and a system that lets the right user assume the right role when needed.
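
The combinatorial blow-up above is easy to quantify: n discrete datasets admit 2^n - 1 non-empty access combinations, each of which would need its own IAM Role under a role-per-permission-set scheme. A minimal Python sketch (dataset names are illustrative):

```python
from itertools import combinations

def access_combinations(datasets):
    """Enumerate every non-empty combination of dataset grants.

    Under the role-per-permission-set approach, each combination
    would need its own dedicated IAM Role.
    """
    combos = []
    for r in range(1, len(datasets) + 1):
        combos.extend(combinations(datasets, r))
    return combos

# Three datasets already require seven roles; ten would require 1,023.
print(len(access_combinations(["A", "B", "C"])))     # 7
print(len(access_combinations(list("ABCDEFGHIJ"))))  # 1023
```

With per-account IAM role limits in the low thousands, this growth rate rules out dedicated roles well before the user count does.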

We discussed our project with technical contacts at AWS and brainstormed approaches, looking at alternate ways to grant access to data in S3. We ultimately converged on two options, both using existing AWS access control technology:

  1. Dynamically generating a Security Token Service (STS) token via an AssumeRole call: a broker service can call the API, providing a list of session Managed Policies which can be used to assemble a customized and dynamic set of permissions on-demand
  2. AWS Request Signing: a broker service can authorize specific requests as they’re made by client layers

We chose to build a solution using dynamically generated STS tokens since we knew this could be integrated across most, if not all, of our platforms relatively seamlessly. Our approach allowed us to grant access via the same pre-defined Managed Policies we use for other systems and could plug into every system we had by replacing the existing default AWS credentials provider and STS tokens. These Managed Policies are defined and maintained by the custodians of individual datasets, letting us scale out authorization decisions to experts via delegation. As a core part of our architecture, we created a dedicated service (the Credential Vending Service, or CVS) to securely perform AssumeRole calls which could map users to permissions and Managed Policies. Our data platforms could subsequently be integrated with CVS in order to enhance them with FGAC related capabilities. We provide more details on CVS in the next section.
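
As a sketch of how a broker like CVS might use the first option: the STS AssumeRole API accepts session Managed Policies through its `PolicyArns` parameter, and the vended token's permissions are the intersection of the base role and those policies. The role and policy ARNs below are hypothetical, and the actual `boto3` call appears only in a comment:

```python
def build_assume_role_request(base_role_arn, session_name, policy_arns):
    """Build kwargs for an STS AssumeRole call that attaches session
    Managed Policies, scoping the vended token to the intersection of
    the base role's permissions and the attached policies."""
    if not policy_arns:
        # Never vend the bare base role: on its own it can reach all buckets.
        raise ValueError("at least one session policy is required")
    return {
        "RoleArn": base_role_arn,
        "RoleSessionName": session_name,
        "PolicyArns": [{"arn": arn} for arn in policy_arns],
        "DurationSeconds": 3600,
    }

# A broker service would then issue the call, e.g.:
#   creds = boto3.client("sts").assume_role(**request)["Credentials"]
request = build_assume_role_request(
    "arn:aws:iam::123456789012:role/fgac-base",     # hypothetical base role
    "user1-session",
    ["arn:aws:iam::123456789012:policy/policy-1"],  # hypothetical policy
)
```

Keeping the request assembly separate from the STS call makes the never-vend-the-bare-role invariant easy to enforce and test.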

Credential Vending Service

While working on our new CVS-centered access control framework, we adhered to the following design tenets:

  • Access had to be granted to user or service accounts, as opposed to specific cluster instances, to ensure access control scaled without the need for additional hardware. Ad-hoc queries execute as the user who ran the query, and scheduled processes and services run under their own service accounts; everything has an identity we can authenticate and authorize. The authorization process and outcomes are identical regardless of the service or instance used.
  • We wanted to re-use our existing Lightweight Directory Access Protocol (LDAP) infrastructure as a secure, fast, distributed repository that's integrated with all our existing Authentication and Authorization systems. We achieved this by creating LDAP groups to which we add LDAP user accounts, mapping each user to one or more roles/permissions. Services and scheduled workflows are assigned LDAP service accounts, which are added to the same LDAP groups.
  • Access to S3 resources is always allowed or denied through S3 Managed Policies. Thus, the permissions we grant via FGAC can also be granted to non-FGAC-capable systems, providing legacy and external service support. It also ensures that any form of S3 data access is protected.
  • Authentication (and thus, user identity) is performed via tokens. These are cryptographically signed artifacts created during the authentication process that are used to securely transport user or service “principal” identities across servers. Tokens have built-in expiration dates. The types of tokens we use include:
    i. Access Tokens:
    — AWS STS tokens grant access to AWS services such as S3.
    ii. Authentication Tokens:
    — OAuth tokens are used for human user authentication in web pages or consoles.
    — Hadoop/Hive delegation tokens (DTs) are used to securely pass user identity between Hadoop, Hive, and the Hadoop Distributed File System (HDFS).

Figure 1: How Credential Vending Service Works

Figure 1 shows how CVS handles two different users, granting each access to different datasets in S3.

  1. Each user’s identity is passed through a secure and validatable mechanism (such as authentication tokens) to the CVS
  2. CVS authenticates the user making the request. A variety of authentication protocols are supported, including mTLS, OAuth, and Kerberos.
  3. CVS starts assembling each STS token using the same base IAM Role. This IAM Role on its own has access to all data buckets; however, it is never returned without at least one modifying policy attached.
  4. The user’s LDAP groups are fetched. These LDAP groups assign roles to the user. CVS maps these roles to one or more S3 Managed Policies which grant access for specific actions (e.g., list, read, write) on different S3 endpoints.
    a. User 1 is a member of two FGAC LDAP groups:
    i. LDAP Group A maps to IAM Managed Policy 1
    — This policy grants access to s3://bucket-1
    ii. LDAP Group B maps to IAM Managed Policies 2 and 3
    — Policy 2 grants access to s3://bucket-2
    — Policy 3 grants access to s3://bucket-3
    b. User 2 is a member of two FGAC LDAP groups:
    i. LDAP Group A maps to IAM Managed Policy 1 (as it did for the first user)
    — This policy grants access to s3://bucket-1
    ii. LDAP Group C maps to IAM Managed Policy 4
    — This policy grants access to s3://bucket-4
  5. Each STS token can ONLY access the buckets enumerated in the Managed Policies attached to the token.
    a. The effective permissions in the token are the intersection of the permissions declared in the base role and the permissions enumerated in the attached Managed Policies
    b. We avoid using DENY in Policies. ALLOWs stack to add permissions to new buckets, but a single DENY overrides all ALLOWs stacking on that URI.

CVS will return an error response if the authenticated identity provided is invalid or if the user is not a member of any FGAC recognized LDAP groups. CVS will never return the base IAM role with no Managed Policies attached, so no response will ever get access to all FGAC-controlled data.
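
The mapping logic in Figure 1 can be sketched in a few lines of Python; the group names and policy ARNs below are illustrative stand-ins, not Pinterest's actual identifiers:

```python
# Illustrative LDAP-group-to-Managed-Policy mapping (all names hypothetical).
GROUP_POLICIES = {
    "fgac-group-a": ["arn:aws:iam::123456789012:policy/policy-1"],  # bucket-1
    "fgac-group-b": ["arn:aws:iam::123456789012:policy/policy-2",   # bucket-2
                     "arn:aws:iam::123456789012:policy/policy-3"],  # bucket-3
    "fgac-group-c": ["arn:aws:iam::123456789012:policy/policy-4"],  # bucket-4
}

def policies_for(ldap_groups):
    """Resolve a user's FGAC LDAP groups to session policy ARNs.

    Returns a de-duplicated, ordered list. Raises if the user belongs
    to no FGAC group, mirroring CVS's refusal to vend the bare base role.
    """
    arns = []
    for group in ldap_groups:
        for arn in GROUP_POLICIES.get(group, []):
            if arn not in arns:
                arns.append(arn)
    if not arns:
        raise PermissionError("user is not a member of any FGAC group")
    return arns

print(policies_for(["fgac-group-a", "fgac-group-b"]))  # user 1: policies 1, 2, 3
print(policies_for(["fgac-group-a", "fgac-group-c"]))  # user 2: policies 1 and 4
```

The resolved ARNs would then be attached as session policies on the AssumeRole call, so each vended token can only reach the buckets those policies enumerate.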

In the next section, we elaborate how we integrated CVS into Hadoop to provide FGAC capabilities for our Big Data platform.

Pinterest FGAC Hadoop Platform

Figure 2: Original Pinterest Hadoop Platform

Figure 2 provides a high-level overview of Monarch, the existing Hadoop architecture at Pinterest. As described in an earlier blog post, Monarch consists of more than 30 Hadoop YARN clusters with 17k+ nodes built entirely on top of AWS EC2. Monarch is the primary engine for processing both heavy interactive queries and offline, pre-scheduled batch jobs, and as such is a critical part of the Pinterest data infrastructure, processing petabytes of data and hundreds of thousands of jobs daily. It works in concert with a number of other systems to process these jobs and queries. In brief, jobs enter Monarch in one of two ways:

  • Ad hoc queries are submitted via QueryBook, a collaborative, GUI-based open source tool for big data management developed at Pinterest. QueryBook uses OAuth to authenticate users. It then passes on the query to Apache Livy which is actually responsible for creating and submitting a SparkSQL job to the target Hadoop cluster. Livy keeps track of the submitted job, passing on its status and console output back to QueryBook.
  • Batch jobs are submitted via Spinner, Pinterest’s Airflow-based job scheduling system. Workflows undergo a mandatory set of reviews during the code repository check-in process to ensure correct levels of access. Once a job is managed by Spinner, it uses the Job Submission Service to handle the Hadoop job submission and status-check logic.

In both cases, submitted SparkSQL jobs work in conjunction with the Hive Metastore to launch Hadoop Spark applications which determine and implement the query plan for each job. Once running, all Hadoop jobs (Spark/Scala, PySpark, SparkSQL, MapReduce) read and write S3 data via the S3A implementation of the Hadoop filesystem API.

CVS formed the cornerstone of our approach to extending Monarch with FGAC capabilities. With CVS handling both the mapping of user and service accounts to data permissions and the actual vending of access tokens, we faced the following key challenges when assembling the final system:

  • Authentication: managing user identity securely and transparently across a collection of heterogeneous services
  • Ensuring user multi-tenancy in a safe and secure manner
  • Incorporating credentials dispensed by CVS into existing S3 data access frameworks

To address these issues, we extended existing components with additional functionality but also built new services to fill in gaps when necessary. Figure 3 illustrates the resulting overall FGAC Big Data architecture. We next provide details on these system components, both new and extended, and how we used them to address our challenges.

Figure 3: Pinterest FGAC Hadoop Platform

Authentication

When submitting interactive queries, QueryBook continues to use OAuth for user authentication. QueryBook then passes the OAuth token down the stack to Livy to securely convey the user identity.

All scheduled workflows intended for our FGAC platform must now be linked with a service account. Service accounts are LDAP accounts that do not allow interactive login and instead are impersonated by services. Like user accounts, service accounts are members of various LDAP groups granting them access roles. The service account mechanism decouples workflows from employee identities as employees often only have access to restricted resources for a limited time. Spinner extracts the service account name and passes it to the Job Submission Service (JSS) to launch Monarch applications.

We use the Kerberos protocol for secure user authentication for all systems downstream from QueryBook and Spinner. While we investigated other alternatives, we found Kerberos to be the most suitable and extensible for our needs. This, however, did necessitate extending a number of our existing systems to integrate with Kerberos and building/setting up new services to support Kerberos deployments.

Integrating With Kerberos

We deployed a Key Distribution Center (KDC) as our basic Kerberos foundation. When a client authenticates with the KDC, the KDC will issue a Ticket Granting Ticket (TGT), which the client can use to authenticate itself to other Kerberos clients. TGTs will expire and long running services must periodically authenticate themselves to the KDC. To facilitate this process, services typically use keytab files stored locally to maintain their KDC credentials. The volume of services, instances, and identities requiring keytabs is too large to manually maintain and necessitated the creation of a custom Keytab Management Service. Clients on each service make mTLS calls to fetch keytabs from the Keytab Management Service, which creates and serves them on demand. Keytabs constitute potential security risks that we mitigated as follows:

  • Access to nodes with keytab files is limited to service personnel only
  • mTLS configuration limits the nodes the Keytab Management Service responds to and the keytabs they can fetch
  • All Kerberos authenticated endpoints are restricted to a closed network of Monarch services. External callers use broker services like Apache Knox to convert OAuth outside Monarch to Kerberos auth inside Monarch, so Keytabs have little utility outside Monarch.

We integrated Livy, JSS, and all the other interoperating components such as Hadoop and the Hive Metastore with the KDC, so that user identity could be interchanged transparently across multiple services. While some of these services, like JSS, required custom extensions, others support Kerberos via configuration. We found Hadoop to be a special case. It is a complex set of interconnected services and while it leverages Kerberos extensively as part of its secure mode capabilities, turning it on meant overcoming a set of challenges:

  • Users don’t directly submit jobs to our Hadoop clusters. While both JSS and Livy run under their own Kerberos identity, we configure Hadoop to allow them to impersonate other Kerberos users to submit jobs on behalf of other users and service accounts.
  • Each Hadoop service must be able to access their own keytab file.
  • Both user jobs and Hadoop services must now run under their own Unix accounts. For user jobs, this necessitated:
    — Integrating our clusters with LDAP to create user and service accounts on the Hadoop worker nodes
    — Configuring Hadoop to translate the Kerberos identities of submitted jobs into the matching Unix accounts
    — Ensuring Hadoop DataNodes run on privileged ports
  • The YARN framework uses LinuxContainerExecutor when launching worker tasks. This executor ensures the worker task process is running as the user that submitted the job and restricts users to accessing only their own local files and directories on workers.
  • Kerberos is finicky about fully qualified host and service names, which required a significant amount of debugging and tracing to configure correctly.
  • While Kerberos allows communication over both TCP and UDP, we found mandating TCP usage helped avoid internal network restrictions on UDP traffic.
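
Hadoop's user impersonation, which lets JSS and Livy submit jobs on behalf of other users as described above, is configured through proxyuser settings in `core-site.xml`. A sketch, where the principal name and host value are assumptions rather than our actual configuration:

```xml
<!-- Sketch: allow the Livy service principal to impersonate the users
     and service accounts submitting jobs. Values are illustrative. -->
<property>
  <name>hadoop.proxyuser.livy.hosts</name>
  <value>livy-gateway.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.livy.groups</name>
  <value>*</value>
</property>
```

In practice the `hosts` and `groups` values would be locked down to the narrowest set that still lets the submission services do their job.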

User Multi-tenancy

In secure mode, Hadoop provides a number of protections to enhance isolation between multiple user applications running on the same cluster. These include:

  • Access protections are enforced for application files kept on HDFS
  • Data transfers between Hadoop components and DataNodes are encrypted
  • Hadoop Web UIs are now restricted and require Kerberos authentication. Since configuring SPNEGO auth on clients was undesirable and would have required broader keytab access, we instead use Apache Knox as a gateway that translates our internal OAuth authentication into Kerberos authentication, seamlessly integrating Hadoop Web UI endpoints with our intranet
  • Monarch EC2 instances are assigned IAM Roles with read access set to a bare minimum of AWS resources. A user attempting to escalate privileges to that of the root worker will find they have access to fewer AWS capabilities than they started with
  • Spark application RPC is encrypted with AES

Taken together, we found these measures to provide an acceptable level of isolation and multi-tenancy for multiple applications running on the same cluster.

S3 Data Access

Monarch Hadoop accesses S3 data via the S3A filesystem implementation. For FGAC, the S3A filesystem has to authenticate itself with CVS, fetch the appropriate STS token, and attach it to S3 requests. We accomplished this via a custom AWS credentials provider as follows:

  • This new provider authenticates with CVS. Internally, Hadoop uses delegation tokens as a mechanism to scale Kerberos authentication. The custom credentials provider securely sends the current application’s delegation token and the Hadoop job’s user identity to CVS.
  • CVS verifies the validity of the delegation token it has received by contacting the Hadoop NameNode via Apache Knox, and validates it against the requested user identity
  • If authentication is successful CVS assembles an STS token with the Managed Policies granted to the user and returns it.
  • The S3A file system uses the user’s STS token to authenticate calls to the S3 file system.
  • The S3 file system authenticates the STS token and authorizes or rejects the requested S3 actions based on the collection of permissions from the attached Managed Policies
  • Authentication failures at any stage result in a 403 error response.
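
Wiring a custom provider into S3A is done through Hadoop configuration: `fs.s3a.aws.credentials.provider` is the standard S3A key, while the provider class name below is a hypothetical stand-in for our implementation:

```xml
<!-- Sketch: point S3A at a custom AWS credentials provider.
     The class name is illustrative. -->
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>com.example.fgac.CvsCredentialsProvider</value>
</property>
```

Because every S3A read and write resolves credentials through this hook, swapping the default provider chain for a CVS-backed one brings FGAC to all Hadoop jobs without per-job changes.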

We utilize in-memory caching on clients in our custom credentials provider and on the CVS servers to reduce the high frequency of S3 accesses and token fetches down to a small number of AssumeRole calls. Caches expire after a few minutes to respond quickly to permissions changes, but this short duration is enough to reduce downstream load by several orders of magnitude. This avoids exceeding AWS rate limits and reduces both latency and load on CVS servers. A single CVS server is sufficient for most needs, with additional instances deployed for redundancy.
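
The caching scheme can be sketched as a small TTL cache keyed by principal; the TTL value and the injectable clock are illustrative choices for the sketch, not the production implementation:

```python
import time

class TokenCache:
    """Minimal TTL cache for vended STS tokens, keyed by principal.

    A short TTL keeps permission changes timely while collapsing bursts
    of S3 accesses into a small number of AssumeRole calls.
    """
    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable for testing
        self._entries = {}  # principal -> (expiry_time, token)

    def get_or_vend(self, principal, vend):
        """Return a cached token for `principal`, or call `vend` (e.g. a
        CVS request that performs AssumeRole) and cache the result."""
        now = self.clock()
        entry = self._entries.get(principal)
        if entry and entry[0] > now:
            return entry[1]  # cache hit: no downstream call
        token = vend(principal)
        self._entries[principal] = (now + self.ttl, token)
        return token
```

Repeated lookups within the TTL are served from memory, so the downstream AssumeRole rate is bounded by (distinct principals) / (TTL) rather than by raw S3 request volume.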

Conclusion and Next Steps

The FGAC system has been an integral part of our efforts to protect data in an ever-changing privacy landscape. The system’s core design remains unchanged after three years of scaling from the first use-case to supporting dozens of unique access roles from a single set of service clusters. Data access controls have continued to increase in granularity, with data custodians easily authorizing specific use-cases without costly cluster creation while still using our full suite of data engineering tools. And while the flexibility of FGAC allows it to manage grants to any IAM resource, not just S3, we are currently focused on applying our core FGAC approaches to building Pinterest’s next-generation Kubernetes-based Big Data Platform.

A project of this level of ambition and magnitude would only be possible with the cooperation and work of a number of teams across Pinterest. Our sincerest thanks to them all and to the initial FGAC team for building the foundation that made this possible: Ambud Sharma, Ashish Singh, Bhavin Pathak, Charlie Gu, Connell Donaghy, Dinghang Yu, Jooseong Kim, Rohan Rangray, Sanchay Javeria, Sabrina Kavanaugh, Vedant Radhakrishnan, Will Tom, Chunyan Wang, and Yi He. Our deepest thanks also to our AWS partners, particularly Doug Youd and Becky Weiss, and special thanks to the project’s sponsors, David Chaiken, Dave Burgess, Andy Steingruebl, Sophie Roberts, Greg Sakorafis, and Waleed Ojeil for dedicating their time and that of their teams to make this project a success.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore life at Pinterest and apply to open roles, visit our Careers page.
