Unified PubSub Client at Pinterest

Vahid Hashemian | Software Engineer, Logging Platform
Jeff Xiang | Software Engineer, Logging Platform

At Pinterest, the Logging Platform team manages the PubSub layer and provides support for clients that interact with it. At the heart of the PubSub layer, there are two main systems responsible for ingress and egress of data in motion:

  1. Apache Kafka® (Kafka hereafter)
  2. MemQ

Over the last several years, we have learned through operational experience that our customers and the business need reduced KTLO (keep-the-lights-on) costs. They want the platform team to own not just the servers and service but also the client / SDK, along with the on-call for any issues that arise from client-server connectivity, so they can focus on application logic.

As a platform team, we want to improve the scalability and efficiency of our platform (e.g. by improving the PubSub systems), which in turn may require making rapid changes to the client / SDK.

To accomplish our objective of improving the Quality of Service (QoS) of the Logging Platform, we decided to invest in and take ownership of the PubSub clients / SDK. Building on top of the native clients to add several value-add features, albeit a huge undertaking, would help us meet our scalability, stability, and dev-velocity objectives. That is why we decided to implement a unified PubSub client library called PubSub Client (PSC).

Why Another Client?

Until about two years ago Kafka was the single PubSub system at Pinterest, and the majority of data pipelines still run on Kafka. Client applications use the Kafka client library to connect to Kafka clusters and produce or consume data. This direct dependency on Kafka clients presents its own challenges:

**1. Reduced Dev Velocity**

Client applications and Kafka brokers are not isolated enough due to direct dependencies between them. On one hand, client applications (users of the PubSub platform) can be easily impacted by changes on the broker side. For example, depending on the version of the client library in use, they could be impacted by broker upgrades.

On the other hand, the PubSub platform team has challenges of its own. It cannot simply upgrade Kafka brokers, since the potential impact on client applications is unknown: there is no visibility into which client library versions are in use (see our earlier article).

The bottom line: there is a lack of clear ownership boundaries between the PubSub layer and the client application, which is due to the tight dependency of applications on the PubSub client library.

**2. Reliability of Service**

Any PubSub client library comes with a comprehensive set of configurations that a typical application developer may not be fully familiar with. When proper client tracking is not in place, the platform team has no visibility into which application is using what configuration. This not only can hinder application performance due to potential misconfigurations, but can also impact the PubSub platform itself, causing cascading issues for other applications that rely on it.

Another reliability issue we face as the PubSub platform team is the tight dependency of client applications on Kafka cluster “endpoints” (i.e. seed brokers). Although means for dynamic discovery of cluster endpoints are available, we cannot simply enforce them, and we end up with outages of client applications that use hardcoded endpoints during cluster maintenance or broker decommissioning/replacement.

**3. Scalability of Service**

Our strategy is to make the underlying PubSub implementation nuances transparent to client applications. This transparency would mean a lot more freedom for the platform team at the PubSub layer:

  • Things such as broker maintenance, reconfigurations, scaling in/out or up/down, or seamless transfer of workloads from one PubSub to another (e.g. Kafka to MemQ, Kafka to Pulsar, or vice versa) can be done with a lot less worry about negative impact to client applications.
  • Client applications would no longer be strictly tied to limitations of a PubSub system or actions taken on it.

Without this transparency, scalability is heavily jeopardized. It just takes one application that is sensitive to certain PubSub maintenance actions to severely limit the platform team in fully automating maintenance operations. The reason for such sensitivity could be the application architecture/logic, particular usage of PubSub client libraries, or potential edge cases in the client library (e.g. static partitioning of the producer). Moving to a unified client library would unify the expectations across all client applications and help reduce these undesired situations.

A unified PubSub client library reduces the interdependency between client applications and the PubSub layer and offers:

  • Consistent metrics and alerts: Metrics/alert templates provide a consistent view into the PubSub layer.
  • Service discovery: Provided out of the box, instead of each client application having to deal with it.
  • Optimized configuration: Certain configurations (such as default values) can be auto-enforced, and conflicting configurations can be auto-corrected.
  • Interception benefits: Numerous features such as chargeback and cost calculations, auditing, corrupt message detection and removal, auto error handling, etc. can be added in this middle layer.
  • Ease of PubSub switch: Since the client application relies on a unified set of PubSub APIs, switching from one PubSub to another is as easy as changing the topic reference. The library can internally handle and sync state (e.g. consumption offsets) before and after the switch.

There are a few reasons we chose to implement a 100% client-side solution rather than a proxy service:

  1. A proxy service comes with the overhead of additional infrastructure costs (10%-20% for us), as there is an additional hop where a copy of the data is created.
  2. A proxy service increases the end-to-end latency from PubSub producers to consumers, which would be unacceptable for some of our client use cases.
  3. Due to the more complex nature of a proxy service (especially compared with the current state of our infrastructure at Pinterest), it is faster to productionalize a fully client-side library than a proxy service.

Despite the above reasons, since the internal implementation of PSC is transparent to client applications, there is no blocker for switching to or adding a proxy service in the future.

Architecture of PSC

The architecture of PSC is shown in Figure 1.

The main components of PSC are the PSC Consumer module and PSC Backend Consumer, and the PSC Producer module and PSC Backend Producer. A backend client for each PubSub system implements the Backend Consumer/Producer interfaces. There are also add-on components such as metrics, service discovery, interceptors, etc.

Figure 1. PSC Architecture

The main PSC interfaces are PSC Producer and PSC Consumer. Each PSC producer and consumer can manage one or more backend (Kafka, MemQ, etc.) producers and consumers. Each backend consumer in the diagram is a PSC construct that directly uses the backend PubSub’s client library. For example, the PSC Kafka consumer uses the Kafka consumer library.

Both PSC Producer and Consumer leverage common functionalities such as metrics collection and reporting, service discovery, environment awareness, ser/des, interceptors, etc.
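
To make this layering concrete, the sketch below shows one way a PSC consumer could delegate to per-backend consumers created lazily from topic URIs. The names (`BackendConsumer`, `PscConsumerMessage`, `PscConsumerSketch`) are assumptions for illustration, not the actual PSC API.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: the interface and class names below are assumptions,
// not the actual PSC API.
interface BackendConsumer<K, V> {
    void subscribe(Collection<String> topicUris);
    Collection<PscConsumerMessage<K, V>> poll(long timeoutMs);
}

// Minimal message wrapper carrying the topic URI the message came from.
record PscConsumerMessage<K, V>(String topicUri, K key, V value) { }

class PscConsumerSketch<K, V> {
    // One backend consumer per backend type (e.g. "kafka", "memq"), created lazily
    // when a topic URI for that backend is first subscribed to.
    private final Map<String, BackendConsumer<K, V>> backendConsumers = new HashMap<>();

    void subscribe(String topicUri) {
        // "plaintext:/rn:kafka:prod:..." -> backend type "kafka"
        String backend = topicUri.split(":")[2];
        backendConsumers
            .computeIfAbsent(backend, this::createBackendConsumer)
            .subscribe(List.of(topicUri));
    }

    private BackendConsumer<K, V> createBackendConsumer(String backend) {
        // In the real library this would wrap the corresponding native client
        // (Kafka consumer, MemQ consumer, ...); a no-op stand-in keeps the sketch
        // self-contained.
        return new BackendConsumer<K, V>() {
            public void subscribe(Collection<String> topicUris) { }
            public Collection<PscConsumerMessage<K, V>> poll(long timeoutMs) { return List.of(); }
        };
    }
}
```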

Highlights

We had to make several design decisions to define how PSC should work. We review some of them in this section.

Topic URI

As mentioned earlier, one of the challenges we wanted to address with PSC was relieving application developers from having to solve cluster discovery and allocation problems. For this reason, we proposed the notion of a Topic URI as a unique identifier of each PubSub topic. It may be in the form of a UUID or any other agreed-upon format. Here is an example of the resource name convention:

protocol:/rn:service:environment:cloud_region:cluster:topic

For example:

plaintext:/rn:kafka:prod:aws_us-west-1:shopping:transaction

PSC Producer and Consumer use the discovery module of PSC to convert this topic URI to proper PubSub endpoints when necessary. Note that the security protocol used for consumption/production of messages is also included in the URI — this removes the need to repeatedly configure protocol-specific configs (e.g. SSL configs) on a per-application basis, which streamlines the client onboarding process and prevents misconfiguration.
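
As a rough illustration of how a discovery module could decompose such a URI into its parts, here is a minimal sketch; the type and method names are hypothetical, and the real parser would handle validation and additional formats.

```java
// Hypothetical sketch of topic URI parsing, following the convention
// protocol:/rn:service:environment:cloud_region:cluster:topic
record TopicUriParts(String protocol, String service, String environment,
                     String region, String cluster, String topic) {

    static TopicUriParts parse(String uri) {
        // Split off the protocol: "plaintext:/rn:..." -> ["plaintext", "rn:..."]
        String[] protocolAndRest = uri.split(":/", 2);
        String[] parts = protocolAndRest[1].split(":");
        // parts = ["rn", service, environment, region, cluster, topic]
        return new TopicUriParts(protocolAndRest[0], parts[1], parts[2],
                                 parts[3], parts[4], parts[5]);
    }
}

class TopicUriExample {
    public static void main(String[] args) {
        TopicUriParts parts =
            TopicUriParts.parse("plaintext:/rn:kafka:prod:aws_us-west-1:shopping:transaction");
        // A discovery module would map (service, environment, region, cluster) to
        // concrete broker endpoints, and the protocol to security configs.
        System.out.println(parts.service() + " / " + parts.cluster() + " / " + parts.topic());
    }
}
```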

PubSub Coverage Scope

By design, a PSC producer or consumer is not tied to any particular PubSub implementation or cluster at creation time. This makes it different from how, for example, a Kafka producer or consumer is configured with a bootstrap server at creation time. This means that one PSC producer or consumer can connect to multiple PubSub clusters of potentially different types.
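
A hedged usage sketch of what this decoupling enables, reusing the hypothetical `PscConsumerSketch` type from the architecture sketch above: a single consumer instance, created without any cluster endpoints, subscribes to topic URIs that resolve to different clusters and even different backends.

```java
// Illustrative only; PscConsumerSketch is the hypothetical type from the earlier sketch.
public class CoverageScopeExample {
    public static void main(String[] args) {
        // No bootstrap servers or cluster endpoints are supplied at creation time.
        PscConsumerSketch<byte[], byte[]> consumer = new PscConsumerSketch<>();

        // One consumer instance can subscribe to topic URIs that resolve to
        // different clusters, or even different PubSub backends (Kafka, MemQ).
        consumer.subscribe("plaintext:/rn:kafka:prod:aws_us-west-1:shopping:transaction");
        consumer.subscribe("plaintext:/rn:memq:prod:aws_us-west-1:logging:impressions");
    }
}
```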

Configuration

PSC configurations cover common configurations from different PubSub systems and, at the same time, provide a pass-through option to inject a configuration related to a specific PubSub client.

PSC also comes with its own dedicated set of configurations. These do not need any translation, as they are not related to any backend PubSub behavior; examples include configurations for metrics reporting, enabling/disabling auto resolution, etc.
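
The sketch below illustrates the intended split between PSC-level, common, and pass-through configuration; the property keys shown are assumptions for illustration, not the real PSC configuration names.

```java
import java.util.Properties;

// Configuration keys below are illustrative assumptions, not the actual PSC key names.
public class ConfigurationExample {
    public static Properties buildConsumerConfig() {
        Properties props = new Properties();

        // PSC-level configuration: interpreted by PSC itself, no backend translation needed.
        props.setProperty("psc.metrics.reporting.enabled", "true");   // hypothetical key
        props.setProperty("psc.auto.resolution.enabled", "true");     // hypothetical key

        // Common configuration: PSC translates this to the equivalent setting of
        // whichever backend client ends up serving the topic URI.
        props.setProperty("psc.consumer.group.id", "shopping-aggregator"); // hypothetical key

        // Pass-through configuration: forwarded verbatim to a specific backend client
        // (here, a Kafka consumer) for settings PSC does not abstract.
        props.setProperty("psc.consumer.kafka.max.poll.records", "500");   // hypothetical prefix
        return props;
    }
}
```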

APIs

Since Kafka is the primary PubSub technology at Pinterest and application developers are already familiar with its client libraries, we decided to stay as close as possible to the Kafka API signatures.
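
For illustration, the sketch below shows how a producer API can keep the familiar Kafka send-with-callback shape while addressing topics by URI; the interface and method names are hypothetical, not the actual PSC signatures.

```java
import java.util.concurrent.CompletableFuture;

// Illustrative sketch of how a PSC producer can mirror the familiar Kafka producer
// signature; the type and method names are assumptions, not the actual PSC API.
interface PscProducerSketch<K, V> {
    // Analogous to KafkaProducer.send(ProducerRecord, Callback), but addressed by
    // topic URI rather than a plain topic name plus a preconfigured cluster.
    CompletableFuture<Void> send(String topicUri, K key, V value);
}

class ApiParityExample {
    static void produce(PscProducerSketch<String, String> producer) {
        producer
            .send("plaintext:/rn:kafka:prod:aws_us-west-1:shopping:transaction",
                  "order-42", "{\"amount\": 19.99}")
            .whenComplete((ignored, error) -> {
                // Same callback-style completion handling a Kafka user would expect.
                if (error != null) {
                    error.printStackTrace();
                }
            });
    }
}
```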

Pluggability

Given that PSC is meant to be a generic client library, it is important to offer pluggability as a first-class feature, allowing additional PubSub client libraries or features to be onboarded. Pluggable modules in PSC include:

  • PubSub producer / PubSub consumer to onboard new PubSub client libraries.
  • Cluster endpoints discovery to onboard custom approaches to discovery for PubSub endpoints.
  • Environment provider to onboard custom environment spec implementations (e.g. AWS EC2). This helps with auto configuration of different aspects of the client functions (e.g. based on locality, instance size, etc.)
  • Interceptor to onboard custom modules that act upon sending / receiving messages (an illustrative interceptor sketch follows this list).
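
As an example of the interceptor plug point, here is a minimal sketch of a chargeback-style interceptor that tracks bytes produced per topic URI; the interface, record, and method names are assumptions, not the actual PSC interceptor API.

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative interceptor plug point; names are assumptions, not the actual PSC API.
interface ProducerInterceptorSketch<K, V> {
    // Invoked before a message is handed to the backend producer; may return a
    // modified message (e.g. with audit headers) or the original one.
    OutgoingMessage<K, V> onSend(OutgoingMessage<K, V> message);

    // Invoked after the backend acknowledges (or fails) the send.
    void onAcknowledgement(String topicUri, Throwable errorOrNull);
}

record OutgoingMessage<K, V>(String topicUri, K key, V value) { }

// Example: a chargeback-style interceptor that counts bytes produced per topic URI.
class ChargebackInterceptor implements ProducerInterceptorSketch<byte[], byte[]> {
    private final ConcurrentHashMap<String, Long> bytesByTopic = new ConcurrentHashMap<>();

    @Override
    public OutgoingMessage<byte[], byte[]> onSend(OutgoingMessage<byte[], byte[]> message) {
        long size = (message.key() == null ? 0 : message.key().length)
                  + (message.value() == null ? 0 : message.value().length);
        bytesByTopic.merge(message.topicUri(), size, Long::sum);
        return message; // unchanged; a corruption detector could drop or repair it here
    }

    @Override
    public void onAcknowledgement(String topicUri, Throwable errorOrNull) {
        // Per-topic error metrics or auditing hooks could go here.
    }
}
```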

Transactional Producer

Given that a PSC client’s connection to a PubSub cluster is not materialized at creation time but rather when a connection to the cluster is actually needed (e.g. when sending a message or polling for messages), the state transitions of transactional PSC producers do not fully match those of transactional producers for PubSub systems such as Kafka, where a client is immediately attached to a single Kafka cluster upon initialization.

PSC clients work as a wrapper around PubSub client libraries (e.g. Kafka producer / consumer). While PSC clients are not directly tied to a PubSub cluster at creation time, backend clients are. Since the backend producer is the entity that directly depends on the PubSub producer (e.g. Kafka producer), its transactional state needs to be synchronized with the PubSub producer’s transactional state. Because of this, we had to implement state charts for PSC producer and the backend producer to work consistently with the PubSub producer (e.g. Kafka producer) when transactional APIs are involved.
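
A rough sketch of the kind of state tracking this implies is shown below, with hypothetical names: transactional calls made before any backend producer exists are recorded, then replayed onto the backend producer once the first send materializes the connection. This is only an illustration of the problem, not the actual PSC state chart.

```java
// Hypothetical sketch of deferred transactional state tracking; not the actual
// PSC state chart, only an illustration of the synchronization it requires.
class TransactionalProducerSketch {
    enum TxState { NONE, IN_TRANSACTION }

    private TxState state = TxState.NONE;
    private KafkaLikeBackendProducer backendProducer; // created lazily, on first send

    // Minimal stand-in for a backend (e.g. Kafka) producer's transactional API.
    interface KafkaLikeBackendProducer {
        void initTransactions();
        void beginTransaction();
        void send(String topic, byte[] key, byte[] value);
    }

    void beginTransaction() {
        // No backend producer may exist yet, so only the intent is recorded here.
        state = TxState.IN_TRANSACTION;
    }

    void send(String topicUri, byte[] key, byte[] value) {
        if (backendProducer == null) {
            // The connection materializes now; replay the recorded transactional
            // state so the backend producer catches up before the first send.
            backendProducer = createBackendProducer(topicUri);
            if (state == TxState.IN_TRANSACTION) {
                backendProducer.initTransactions();
                backendProducer.beginTransaction();
            }
        }
        backendProducer.send(topicUri, key, value);
    }

    private KafkaLikeBackendProducer createBackendProducer(String topicUri) {
        // Omitted: resolve the cluster from the topic URI and wrap the native client.
        return new KafkaLikeBackendProducer() {
            public void initTransactions() { }
            public void beginTransaction() { }
            public void send(String topic, byte[] key, byte[] value) { }
        };
    }
}
```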

Apache Flink Integration

Apache Flink®️ is the streaming platform at Pinterest, supporting a growing number of use cases. In order to ensure wide adoption of PSC, we implemented a Flink PSC connector (based on the Flink Kafka connector implementation). Since this connector uses topic URIs to reference PubSub topics, we implemented a translation mechanism so that the Flink PSC producer and consumer can resume from snapshots (checkpoints, savepoints) generated by the Flink Kafka producer and consumer.
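
At its core, that translation maps the plain Kafka topic references stored in the connector state to the topic URIs PSC expects. The snippet below is a deliberately simplified, hypothetical illustration of that mapping; the real mechanism operates on the Flink Kafka connector’s serialized operator state.

```java
import java.util.Map;

// Simplified, hypothetical illustration of snapshot translation; the real mechanism
// rewrites the Flink Kafka connector's serialized operator state.
class SnapshotTopicTranslation {
    // Supplied mapping from Kafka cluster name to the corresponding topic URI prefix.
    private final Map<String, String> clusterToUriPrefix;

    SnapshotTopicTranslation(Map<String, String> clusterToUriPrefix) {
        this.clusterToUriPrefix = clusterToUriPrefix;
    }

    String toTopicUri(String kafkaCluster, String kafkaTopic) {
        // e.g. ("shopping", "transaction") ->
        //      "plaintext:/rn:kafka:prod:aws_us-west-1:shopping:transaction"
        return clusterToUriPrefix.get(kafkaCluster) + ":" + kafkaTopic;
    }
}
```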

Future Work

As the adoption of PSC is underway, we have started seeing the benefits we described earlier. However, we still need to accomplish several other milestones on our roadmap. One major benefit we and application owners get from PSC is that we can easily change the internals of PSC to develop new features and roll them out with a simple dependency update. Upcoming PSC features include:

  1. Auto error handling and configuration updates: PSC can step in to auto-update dependent configurations when needed. For example, when enabling the Kafka producer’s idempotence, a few other configurations need to be set accordingly. Also, when certain errors are thrown from the backend client, PSC provides an additional layer where auto-resolution logic can be implemented. For example, a client may sometimes fail to properly rotate its SSL certificate (we have seen this issue with Flink clients). In situations like this, resetting the client is a far less disruptive option than bubbling the exception up to the application. PSC provides the error handling framework that ties exceptions to proper handling strategies (skip, throw, retry, reset, etc.); a sketch of such a mapping follows this list.
  2. Message auditing and corruption handling: With PSC architecture, this optional feature can be implemented using a producer/consumer interceptor to track lineage of data flow and where it gets corrupted along the way.
  3. Dynamic client reconfiguration: Certain client configuration and settings can be optimized, for example, by analyzing the throughput (e.g. consumer’s poll timeout or producer’s batch size or batch duration can be continuously monitored and updated with changes in traffic pattern to optimize the health of a PubSub pipeline). We also plan to use a centralized configuration model to allow for pushing both global and specific configuration changes dynamically.
  4. Seamless backend switch: When a PubSub topic is moved from one PubSub type to another, PSC should be able to migrate state, such as consumer offsets, as well.
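
As a sketch of the error handling framework mentioned in the first item above, the snippet below maps exception types to hypothetical handling strategies; the names and the policy table are illustrative assumptions, not the actual PSC framework.

```java
import java.util.Map;

// Hypothetical sketch of mapping backend exceptions to handling strategies;
// names are illustrative, not the actual PSC framework.
class ErrorHandlingSketch {
    enum Strategy { SKIP, THROW, RETRY, RESET_CLIENT }

    // Example policy: an SSL handshake failure (e.g. from a stale certificate)
    // triggers a client reset instead of bubbling the exception up to the application.
    private static final Map<Class<? extends Exception>, Strategy> POLICY = Map.of(
        javax.net.ssl.SSLHandshakeException.class, Strategy.RESET_CLIENT,
        java.util.concurrent.TimeoutException.class, Strategy.RETRY
    );

    static Strategy strategyFor(Exception e) {
        return POLICY.getOrDefault(e.getClass(), Strategy.THROW);
    }
}
```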

We also plan to implement PSC in other languages such as C++, Python, and Go, and we are starting the process of open sourcing the project to engage the broader community.

Acknowledgement

The current state of PSC would not have been possible without the continuous and significant design and technical support of Ambud Sharma. Ping-Min Lin has also contributed substantially to the design and implementation of the project. Special thanks to the Logging Platform and Xenon Platform teams, Chunyan Wang, and Dave Burgess for their continuous guidance, feedback, and support.

Disclaimer

Apache®️, Apache Kafka, Kafka, Apache Flink, and Flink are trademarks of the Apache Software Foundation.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog, and visit our Pinterest Labs site. To view and apply to open opportunities, visit our Careers page.
