How PayPal Moves Secure and Encrypted Data Across Security Zones

Jay Sen
The PayPal Technology Blog
Mar 17, 2021 · 5 min read


PayPal, like other large companies, has many data centers, regions, and zones with different security levels and restrictions to protect data. This makes data movement far from easy and sets a high bar for maintaining security and data-protection compliance.

The Data Movement Platform (DMP) team within the Enterprise Data Platform (EDP) needed to build a fast and reliable data movement channel to offer to its customers within and outside EDP. This channel needed to be secure and fully InfoSec compliant. Last year, we started the journey to build our next-gen data movement platform. This blog is specifically about the security aspect of the platform.

At PayPal, security is our top priority, and that entails data protection.

Traditionally, to move data across zones, various teams have used either DropZone, a Secure-FTP-based platform built in-house, or Kafka, an easily available service from the Kafka team at PayPal. DropZone provides secure and reliable data storage for offline use-cases, whereas Kafka provides a fast data highway for more real-time use-cases. Both have limitations when moving large datasets and require additional development effort from producers and consumers. Neither platform is well suited to batch data movement, and both suffer from burst data-ingestion problems, not to mention the additional hop of (intermediary) data storage. Apache Pulsar specifically tries to solve this by exposing the storage-layer interface for direct analytics.

What does it mean to move data securely?

Providing secure data movement means complying with the following InfoSec rules:

  1. All connections must be secured using TLS 1.2.
  2. Only a higher security zone can initiate the connection to a lower security zone — not vice-versa.
  3. Data at rest must be encrypted, with no unauthorized and unaudited access.
  4. All authorizations must be provided via either IAM or Kerberos.
  5. All secrets must be managed by KeyMaker — a key-management service.

How we incorporated these principles within the DMP:

The DMP primarily uses Hadoop as its execution platform to achieve high availability, reliability and resiliency for both computing and storage. Hence, both Apache Gobblin and Hadoop need to support all security measures. Let’s look at the high-level deployment architecture.

High-level Hadoop cluster setup in multiple zones with key distribution center (KDC), Key Management Service (KMS), and Apache Ranger.

Securing communication — HTTPS and sockets

In addition to its regular socket addresses, Hadoop provides a secure socket address for each service to interact over. When these are used, the connection can be encrypted over TLS 1.2. More information on Hadoop's secure ports is available in the Hadoop documentation.
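
As a rough sketch of what this looks like from a client's point of view, the snippet below sets the Hadoop properties that are typically involved. The exact property names and values depend on the cluster version, and the keystores themselves live in the ssl-client.xml/ssl-server.xml files, so treat this as an assumption rather than our exact configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SecureClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Prefer the HTTPS (secure) web endpoints of the NameNode/DataNodes.
        conf.set("dfs.http.policy", "HTTPS_ONLY");

        // Encrypt Hadoop RPC and the DataNode block-transfer channel.
        conf.set("hadoop.rpc.protection", "privacy");
        conf.set("dfs.data.transfer.protection", "privacy");

        // Restrict SSL/TLS to 1.2; keystores and truststores are configured
        // separately in ssl-client.xml / ssl-server.xml on the cluster.
        conf.set("hadoop.ssl.enabled.protocols", "TLSv1.2");

        // Any FileSystem created from this Configuration uses the secured channels.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/")));
    }
}
```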

Securing authorizations — via Kerberos

Securing a group of Hadoop clusters is considerably more complex than securing a single cluster and involves several architectural decisions. There are two choices here:

1. Create trust between all the different KDC servers located in different zones, so they can trust each other's TGTs (ticket-granting tickets) and DTs (delegation tokens) with the same security standards for communication.

2. Use a single KDC for both Hadoop clusters to keep authentication and authorization centralized.

We went ahead with the second approach, managing a single central KDC that provides all the required security for all Hadoop clusters, instead of managing a separate KDC per Hadoop cluster.
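
From a client's perspective, the setup looks roughly like the sketch below: every client authenticates against the same central realm via UserGroupInformation. The principal, realm, and keytab path are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class CentralKdcLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Kerberos is the authentication mechanism for every cluster; the realm
        // below is served by the single central KDC (names are hypothetical).
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        UserGroupInformation.loginUserFromKeytab(
                "dmp-svc@CENTRAL.EXAMPLE.COM",
                "/etc/security/keytabs/dmp-svc.keytab");

        // The resulting TGT is honored by every Hadoop cluster that authenticates
        // against this realm, so no cross-realm trust is needed.
        System.out.println("Logged in as " + UserGroupInformation.getLoginUser());
    }
}
```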

We run Apache Gobblin in YARN mode. Out of the box, Apache Gobblin did not support built-in token management for multiple remote Hadoop environments, so we added this feature with GOBBLIN-1308.
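
The idea behind the feature is roughly the following sketch: acquire delegation tokens from every remote cluster up front and ship them with the YARN application. This is a simplified illustration, not the actual GOBBLIN-1308 implementation, and the NameNode URIs and renewer are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;

public class MultiClusterTokens {
    // Collect delegation tokens from every remote cluster the job touches into a
    // single Credentials object, which is then attached to the YARN application
    // so its containers can authenticate to all of those clusters.
    public static Credentials collectTokens(Configuration conf) throws Exception {
        String[] clusters = {                       // NameNode URIs are placeholders
                "hdfs://namenode-zone-a.example.com:8020",
                "hdfs://namenode-zone-b.example.com:8020"
        };
        Credentials creds = new Credentials();
        for (String uri : clusters) {
            FileSystem fs = FileSystem.get(new Path(uri).toUri(), conf);
            // "yarn" is the renewer; depending on the Hadoop version this call can
            // also fetch a KMS token when a key provider is configured.
            fs.addDelegationTokens("yarn", creds);
        }
        return creds;
    }
}
```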

Securing data — using Transparent Data Encryption (TDE)

The Hadoop Platform natively supports a data encryption feature called TDE. Once TDE is enabled, all backend services encrypt and decrypt data transparently without requiring modifications to client code. With the right security configurations, it will work seamlessly.
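
For example, once an administrator has set up an encryption zone, clients keep using the ordinary FileSystem API and never see the key material. The sketch below assumes a hypothetical cluster URI, directory, and a key that already exists in the KMS.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsAdmin;

public class TdeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        URI cluster = URI.create("hdfs://namenode.example.com:8020"); // hypothetical

        // One-time admin step: mark a directory as an encryption zone backed by a
        // key that already exists in the KMS (directory and key name are hypothetical).
        HdfsAdmin admin = new HdfsAdmin(cluster, conf);
        admin.createEncryptionZone(new Path("/data/secure"), "dmp-zone-key");

        // Client code stays the same: writes are encrypted and reads are decrypted
        // transparently, provided the caller is authorized for the key.
        FileSystem fs = FileSystem.get(cluster, conf);
        fs.create(new Path("/data/secure/sample.txt")).close();
        fs.open(new Path("/data/secure/sample.txt")).close();
    }
}
```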

TDE brings a lot of challenges, especially in a multi-cluster Hadoop environment that spans zones protected by firewalls. It adds complexity to token management, KMS configuration, and WebHDFS client API calls.

Managing tokens — delegation token management for authentication & authorization

TDE is usually enabled together with Kerberos. This makes token management even more complex for applications, because TDE requires an additional token from the KMS, the cryptographic key-management server that provides the keys used to encrypt and decrypt data on the underlying storage (HDFS). We can also define the encryption zones where TDE applies via Apache Ranger policies.
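
The KMS is wired into the HDFS client through a key-provider URI, and with that in place the client also needs (and typically fetches) a KMS delegation token alongside the HDFS token. A minimal sketch, assuming the Hadoop 2.7.x property name and a hypothetical KMS endpoint:

```java
import org.apache.hadoop.conf.Configuration;

public class KmsClientConfig {
    public static Configuration withKms(Configuration conf) {
        // Point the HDFS client at the KMS so it can fetch the keys needed to read
        // and write inside encryption zones; with this set, delegation-token
        // collection typically acquires a KMS token alongside the HDFS token.
        // Host and port are hypothetical; the property name is the Hadoop 2.7.x one.
        conf.set("dfs.encryption.key.provider.uri",
                "kms://https@kms.example.com:9600/kms");
        return conf;
    }
}
```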

KMS setup: Hadoop 2.7.x has several known bugs:

  • HADOOP-14441 — KMS delegation token renewal does not try all KMS servers, so a renew request fails when the server it reaches is not the token's issuer. The expected behavior is for the KMS service to figure this out internally.
  • HADOOP-14445 — The KMS does not provide a high-availability (HA) service, so tokens issued by one KMS server cannot be authenticated by another KMS instance in the same HA pool. Since the fix is only available in Hadoop versions newer than 2.8.4, we had to renew the KMS token with all KMS servers and ignore the failed operations, assuming we hit the right one from the list of available KMS servers (see the sketch after this list).
  • HADOOP-15997 & HADOOP-16199 — When a token is renewed, the request does not necessarily go to the right issuer. To make this work, we had to include all issuers in the local Hadoop config. This is due to the token class structure, which does not hold and use the right issuer or UGI when performing the token.renew() call, as described in the JIRA tickets.
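
A rough sketch of the renew-against-every-KMS workaround is shown below. The KMS URIs are placeholders and the exact wiring of the renewer differs across Hadoop 2.7.x releases, so this is an illustration of the approach rather than our exact code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;

public class KmsTokenRenewer {
    // Try to renew every KMS delegation token against each KMS instance in the
    // pool, ignoring failures from instances that did not issue the token.
    public static void renewAll(Credentials creds, Configuration base) {
        String[] kmsUris = {                        // KMS addresses are placeholders
                "kms://https@kms1.example.com:9600/kms",
                "kms://https@kms2.example.com:9600/kms"
        };
        for (Token<?> token : creds.getAllTokens()) {
            if (!"kms-dt".equals(token.getKind().toString())) {
                continue;                           // only KMS tokens need this treatment
            }
            for (String uri : kmsUris) {
                try {
                    Configuration conf = new Configuration(base);
                    conf.set("dfs.encryption.key.provider.uri", uri);
                    token.renew(conf);
                    break;                          // this instance issued the token
                } catch (Exception ignored) {
                    // not the issuer (or unreachable): try the next KMS instance
                }
            }
        }
    }
}
```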

An alternative to the WebHDFS API: In this highly secured environment, we learned that using the WebHDFS API is not recommended, especially when both Kerberos and TDE are enabled on Hadoop 2.7.x. Hadoop 2.7.x has an implementation flaw that can allow the HDFS superuser to be misused to access those encryption zones. WebHDFS also has a known issue of potentially stalling the NameNode. Because these can lead to breaches or incidents, we created a replacement WebHDFS-like API to overcome these challenges. Hopefully, Hadoop 3.x will take care of these issues.
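
We cannot share the internal replacement API here, but the general direction is to favor the native RPC client (hdfs://) over webhdfs:// so that encryption-zone decryption happens in the authenticated client, under the caller's own credentials. A minimal sketch with a hypothetical URI and path:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RpcInsteadOfWebHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Use the native RPC client (hdfs://) instead of webhdfs:// so that
        // encryption-zone decryption happens in the authenticated client, under
        // the caller's own credentials. URI and path are hypothetical.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode.example.com:8020"), conf);
        try (FSDataInputStream in = fs.open(new Path("/data/secure/sample.txt"))) {
            byte[] buffer = new byte[4096];
            System.out.println("read " + in.read(buffer) + " bytes");
        }
    }
}
```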

Overall, this setup gives us a highly secure way of moving data across multiple Hadoop clusters spanning multiple zones. It provides the backbone for the batch DMP and enables multiple internal customer use-cases at PayPal.

It took considerable effort, but it was well worth it for what we learned about the internals of Hadoop and network security. Kudos to PayPal’s Hadoop SRE and InfoSec teams for partnering with us on this journey.

Teams:

Data Movement Platform Team: Jay Senjaliya, Radha Ramasubramanian, Sathish Srinivasan, Anisha Nainani, Rahul Kalita, Jyothsna Kullatira, Sanjiv Prabhunandan, Afroz K. (PM)

Information Security Team: Skip Hathorne, Greg King, Scott Van Schoyck

Hadoop SRE: Raghu Agani, Elliott Brown, William Wang

Engineering Leadership: Sudhir Rao, Bala Natarajan, Prasanna Krishna, Michael Zeltser, Sarah Brydon
