Running Unified PubSub Client in Production at Pinterest

Pinterest Engineering
Pinterest Engineering Blog
Nov 7, 2023


Jeff Xiang | Software Engineer, Logging Platform

Vahid Hashemian | Software Engineer, Logging Platform

Jesus Zuniga | Software Engineer, Logging Platform

At Pinterest, data is ingested and transported at petabyte scale every day, bringing inspiration for our users to create a life they love. A central component of data ingestion infrastructure at Pinterest is our PubSub stack, and the Logging Platform team currently runs deployments of Apache Kafka and MemQ. Over the years, operational experience has taught us that our customers and business would greatly benefit from a unified PubSub interface that the platform team owns and maintains, so that application developers can focus on application logic instead of spending precious hours debugging client-server connectivity issues. Value-add features on top of the native clients can also help us achieve more ambitious goals for dev velocity, scalability, and stability. For these reasons, and others detailed in our original PubSub Client blog post, our team has decided to invest in building, productionalizing, and most recently open-sourcing PubSub Client (PSC).

In the 1.5 years since our previous blog post, PSC has been battle-tested at large scale in Pinterest with notably positive feedback and results. From dev velocity and service stability improvements to seamless migrations from native client to PSC, we would like to share some of our findings from running a unified PubSub client library in production.

Dev Velocity Improvements

In a distributed PubSub environment, complexities related to client-server communication can often be hard blockers for application developers, and solving them often requires a joint investigation between the application and platform teams. One of the core motivations driving our development of PSC was to hide these complexities from application developers, so that precious time spent debugging such issues can instead go toward the application logic itself.

Highlights

  • Full automation in PubSub service endpoint discovery
  • Estimated 80% reduction in time spent setting up new PubSub producers and consumers
  • Optimized client configurations managed by platform team

How We Did It

Automated Service Discovery

PSC offers a simple and familiar solution to automate PubSub service endpoint discovery, which hides those complexities away from application developers. Through the introduction of Resource Names (RNs), PubSub resources (e.g. topics) are now uniquely identified with an RN string that contains all the information PSC needs in order to establish a connection with the servers that contain the resource in question. This is a similar concept to Internet URIs and Amazon ARNs. For example,

secure:/rn:kafka:prod:aws_us-west-1:shopping:transaction

is an RN that tells PSC exactly which topic, cluster, region, and PubSub backend the client needs to connect to. Furthermore, the protocol prefixed to the RN makes it a complete Uniform Resource Identifier (URI), letting PSC know exactly how the connection should be established.
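
To make this concrete, below is a minimal consumer sketch in the style of the open-source PSC-Java library. PscConsumer and PscConfiguration are the client classes named in this post; the exact config key and group name used here are illustrative assumptions rather than verbatim API documentation.

    import java.util.Collections;

    import com.pinterest.psc.config.PscConfiguration;
    import com.pinterest.psc.consumer.PscConsumer;

    public class RnConsumerSketch {
        public static void main(String[] args) throws Exception {
            // Only client-level settings are needed; no bootstrap servers,
            // SSL keystores, or broker hostnames. (The config key and value
            // below are illustrative.)
            PscConfiguration config = new PscConfiguration();
            config.setProperty("psc.consumer.group.id", "shopping-analytics");

            PscConsumer<byte[], byte[]> consumer = new PscConsumer<>(config);

            // The URI (protocol + RN) tells PSC which backend, environment,
            // region, cluster, and topic to connect to, and how.
            consumer.subscribe(Collections.singleton(
                    "secure:/rn:kafka:prod:aws_us-west-1:shopping:transaction"));
        }
    }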

This simplification stands in stark contrast to some common pitfalls of using the native client, such as hardcoding potentially invalid hostname/port combinations, scattering SSL passwords across client configurations, and mistakenly connecting to a topic in the wrong region. With endpoint discovery fully automated and consolidated, client teams now rarely, if ever, report the kinds of issues that once required time-consuming investigations from our platform team.

Optimized Configurations and Tracking

Prior to productionalizing PSC, application developers were required to specify their own client configurations. With this liberty came issues, notably:

  1. Some client-specified configurations could degrade performance for both client and server
  2. Application developers often had a limited understanding of each configuration and its implications
  3. The platform team had no visibility into which client configurations were in use

At Pinterest, PSC now ships out-of-the-box with client configurations that are optimized and standardized by the platform team. This removes the need for application developers to research dozens of individual settings while tuning their applications; instead, they focus only on the configurations that matter to them, and our platform team spends significantly less time investigating the performance and connectivity issues that client misconfigurations used to cause.

PSC takes this one step further with config logging. With psc.config.logging.enabled=true, our platform team now has real-time insight into the client configurations used across the PubSub environment.
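
Both platform-managed switches named in this post can also be set per client. A minimal sketch, assuming the same property-setting style as the consumer example above:

    import com.pinterest.psc.config.PscConfiguration;

    public class PlatformDefaultsSketch {
        static PscConfiguration withPlatformDefaults() {
            PscConfiguration config = new PscConfiguration();
            // Stream the client's effective configuration back to the
            // platform team in real time.
            config.setProperty("psc.config.logging.enabled", "true");
            // Keep automated error handling on (the default; see the
            // Stability & Scalability section below).
            config.setProperty("psc.auto.resolution.enabled", "true");
            return config;
        }
    }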

These features amount to not only significant dev velocity improvements but also gains in stability and reliability of our PubSub services.

Stability & Scalability Improvements

Highlights

  • >80% reduction in Flink application restarts caused by remediable client exceptions
  • Estimated 275+ FTE hours / year saved in KTLO work by application and platform teams

How We Did It

Prior to PSC, client applications often encountered PubSub-related exceptions that resulted in application failure or restarts, severely impacting the stability of business-critical data jobs and increasing KTLO burden for both platform and application teams. Furthermore, many of these exceptions were resolvable via a client reset or even just a simple retry, meaning that the KTLO burden caused by these issues was unnecessarily large.

For instance, we noticed that client and server metadata can go out of sync during regular Kafka cluster maintenance and scaling operations, such as broker replacements and rolling restarts. When that happens, the client begins throwing related exceptions and becomes unstable, and it does not self-recover until it is reconstructed or reset. These auto-remediable issues threatened our ability to scale PubSub clusters efficiently to meet business needs and caused significant KTLO overhead for all teams involved.

Automated Error Handling

To combat these risks, we implemented automated error handling within PSC. Positioned between the native client and application layers, PSC has the unique advantage of being able to catch and remediate known exceptions thrown by the backend client, all without causing disruption to the application layer.
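
In spirit, the remediation resembles the following hypothetical wrapper around a native Kafka consumer: catch an exception known to be remediable, reset the backend client, and retry, all below the application layer. This is a sketch of the idea, not PSC's actual implementation, which matches exceptions against a maintained catalog of remediation strategies.

    import java.time.Duration;
    import java.util.Collection;
    import java.util.function.Supplier;

    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.errors.RetriableException;

    // Hypothetical sketch of catch-and-remediate, not PSC's real logic.
    class AutoResolvingConsumer<K, V> {
        private final Supplier<KafkaConsumer<K, V>> factory;
        private final Collection<String> topics;
        private KafkaConsumer<K, V> delegate;

        AutoResolvingConsumer(Supplier<KafkaConsumer<K, V>> factory,
                              Collection<String> topics) {
            this.factory = factory;
            this.topics = topics;
            this.delegate = factory.get();
            this.delegate.subscribe(topics);
        }

        ConsumerRecords<K, V> poll(Duration timeout) {
            try {
                return delegate.poll(timeout);
            } catch (RetriableException e) {
                // A known-remediable failure (e.g. stale metadata after a
                // broker replacement): reset the backend client and retry
                // once, without surfacing the exception to the application.
                delegate.close();
                delegate = factory.get();
                delegate.subscribe(topics);
                return delegate.poll(timeout);
            }
        }
    }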

We ship PSC with psc.auto.resolution.enabled=true by default, allowing all PubSub clients to run out-of-the-box with automated error handling managed by our platform team. Taking Flink-Kafka clients as an example, we have observed a more than 80% reduction in job failures caused by remediable client exceptions after migrating them to PSC, all without any changes to our regular Kafka broker environment or scaling / maintenance activities:

Figure 1: More than 80% of remediable Flink job restarts were prevented via PSC

As a result of automated error handling in PSC, we have saved an estimated 275+ FTE hours per year in KTLO work across application and platform teams, driving significant improvements in the stability of client applications and the scalability of our PubSub environment. We are also actively expanding PSC's catalog of known exceptions and remediation strategies as our understanding of these issues grows, and we are exploring proactive, rather than reactive, measures to prevent such issues from happening in the first place.

Seamless Migrations from Native Client

Highlights

  • >90% of Java applications migrated to PSC (100% for Flink)
  • 0 incidents caused by migration
  • Full integration test suite and CICD pipeline

How We Did It

Feature and API Parity

Built with ease-of-adoption in mind, PSC comes with 100% feature and API parity to the native backend client version it supports. PSC is currently available for Kafka clients in Java, and we have been able to migrate >90% of Pinterest's Java applications to it with minimal changes to their code and logic. In most cases, the only changes required on the application side were (a producer sketch follows the list):

  1. Replace the native client imports and references with the corresponding PSC ones
  2. Update the client configuration keys to match PSC’s
  3. Remove all previous configurations related to service discovery / SSL and replace them with just the Resource Name (RN) string
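
As an illustration, here is what those three steps might look like for a producer. The before half is standard Kafka client code; the after half follows the open-source PSC-Java naming (PscProducer is named in this post), though the exact message class and config key shown are assumptions.

    import com.pinterest.psc.config.PscConfiguration;
    import com.pinterest.psc.producer.PscProducer;
    import com.pinterest.psc.producer.PscProducerMessage;

    public class ProducerMigrationSketch {
        public static void main(String[] args) throws Exception {
            // Before (native client), roughly:
            //   Properties props = new Properties();
            //   props.put("bootstrap.servers", "broker01:9093,broker02:9093");
            //   props.put("ssl.keystore.location", "/etc/ssl/client.jks");
            //   props.put("ssl.keystore.password", "...");
            //   KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            //   producer.send(new ProducerRecord<>("transaction", key, value));

            // After (PSC): swapped imports, PSC-style config keys, and all
            // discovery / SSL settings replaced by the RN-based topic URI.
            PscConfiguration config = new PscConfiguration();
            config.setProperty("psc.producer.client.id", "shopping-producer"); // illustrative key

            PscProducer<String, String> producer = new PscProducer<>(config);
            producer.send(new PscProducerMessage<>(
                    "secure:/rn:kafka:prod:aws_us-west-1:shopping:transaction",
                    "key", "value"));
            producer.close();
        }
    }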

Simple, low-effort migrations enabled by feature and API parity have been a strong selling point for application teams to quickly and efficiently migrate their clients to PSC. We have observed zero incidents caused by migrations so far and do not expect this number to increase.

Apache Flink Integration

To support the ever-growing share of clients using the Apache Flink data streaming framework, we developed a Flink-PSC connector that allows Flink jobs to leverage the benefits of PSC. Given that around 50% of Java clients at Pinterest run on Flink, Flink integration was key to achieving our platform goal of fully migrating Java clients to PSC.

With Flink jobs, we had to ensure that migrations from Flink-Kafka to Flink-PSC were seamless: newly migrated Flink-PSC jobs must be able to recover from checkpoints generated by their pre-migration Flink-Kafka predecessors. This is critical because Flink jobs store offsets and other state information within checkpoint files. It presented a technical challenge that required opening up the Flink-Kafka checkpoint files, understanding their contents, and understanding how those contents are processed by Flink source and sink operators. Ultimately, we were able to achieve 100% adoption of Flink-PSC at Pinterest with the following efforts:

  1. We implemented Kafka to PSC checkpoint migration logic within FlinkPscProducer and FlinkPscConsumer to ensure that state and offset information from the pre-migration Flink-Kafka checkpoint is recoverable via a Flink-PSC job
  2. We added a small amount of custom code in our internal release of Flink-Kafka connector to ensure Flink-Kafka and Flink-PSC checkpoints are deemed compatible from the perspective of Flink’s internal logic
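
From the application's point of view, the connector reads much like the native Flink-Kafka one, addressed by a topic URI instead of a topic name plus broker properties. A rough usage sketch follows; the package path and the assumption that the constructor mirrors FlinkKafkaConsumer's (topic, schema, properties) shape are ours, not confirmed API details.

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    import com.pinterest.flink.streaming.connectors.psc.FlinkPscConsumer; // assumed path

    public class FlinkPscJobSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            Properties config = new Properties();
            config.setProperty("psc.consumer.group.id", "flink-shopping-job"); // illustrative key

            // FlinkPscConsumer is named in this post; the constructor shape
            // here is assumed to mirror FlinkKafkaConsumer's.
            DataStream<String> transactions = env.addSource(new FlinkPscConsumer<>(
                    "secure:/rn:kafka:prod:aws_us-west-1:shopping:transaction",
                    new SimpleStringSchema(),
                    config));

            transactions.print();
            env.execute("flink-psc-sketch");
        }
    }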

Robust Integration Tests and CICD

Because PSC is in the active path for data ingestion and processing at Pinterest, we have taken extra care to ensure that it is robustly tested at all levels prior to each release, notably integration testing and dev / staging environment testing. PSC therefore comes out-of-the-box with a full integration test suite covering many common scenarios from our PubSub operational experience. Furthermore, we have cataloged the public APIs of both PscConsumer and PscProducer and built a CICD pipeline that launches a PSC client application processing production-level traffic and exercising all of those public APIs. Robust integration testing and CICD, alongside expansive unit test coverage, have been instrumental in building our confidence in PSC's ability to take on business-critical data workloads from day one.

Future Work

Having been battle-tested at scale for over a year, PSC is now a core piece of Pinterest's data infrastructure. More work is planned to increase its technical capability and its value to our business.

Error Handling Improvements

As PSC has been onboarded to more client applications, we have begun to notice and catalog the variety of remediable errors that PSC cannot yet automatically resolve, and we are adding these capabilities with each new release. One example is detecting expiring SSL certificates, so that a proactive client reset can load a fresh certificate into the client's memory before expiration, preventing interruptions for clients using the SSL protocol.
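
A minimal sketch of the kind of expiry check such proactive resolution could build on, using only standard java.security APIs (the grace-window policy and method names are our own illustration):

    import java.io.FileInputStream;
    import java.security.KeyStore;
    import java.security.cert.Certificate;
    import java.security.cert.X509Certificate;
    import java.time.Duration;
    import java.time.Instant;
    import java.util.Enumeration;

    public class CertExpirySketch {
        // Returns true if any certificate in the client keystore expires
        // within the grace window, signaling that a proactive client reset
        // should reload credentials before SSL connections start failing.
        static boolean needsProactiveReset(String keystorePath, char[] password,
                                           Duration graceWindow) throws Exception {
            KeyStore keyStore = KeyStore.getInstance(KeyStore.getDefaultType());
            try (FileInputStream in = new FileInputStream(keystorePath)) {
                keyStore.load(in, password);
            }
            Instant threshold = Instant.now().plus(graceWindow);
            Enumeration<String> aliases = keyStore.aliases();
            while (aliases.hasMoreElements()) {
                Certificate cert = keyStore.getCertificate(aliases.nextElement());
                if (cert instanceof X509Certificate
                        && ((X509Certificate) cert).getNotAfter().toInstant()
                                .isBefore(threshold)) {
                    return true; // expires soon: reset the client proactively
                }
            }
            return false;
        }
    }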

Cost Attribution and Chargeback

PSC gives us the ability to track our clients, surfacing valuable information such as their attributed projects, hostnames, and configurations. One potential use of this newfound visibility is a chargeback framework for PubSub clients, allowing platform teams to break down how PubSub costs are attributed to the various client projects and teams.

C++ and Python

PSC is currently available in Java. To expand the scope of PSC, C++ support is being actively developed while Python support is on the horizon.

Check it out!

PSC-Java is now open-sourced on GitHub with Apache License 2.0. Check it out here! Feedback and contributions are welcome and encouraged.

Acknowledgements

The current state of PSC would not have been possible without significant contributions and support provided by Shardul Jewalikar and Ambud Sharma. Ping-Min Lin has also contributed substantially to design and implementation of the project. Special thanks to Logging Platform and Xenon Platform Teams, Chunyan Wang and Dave Burgess for their continuous guidance, feedback, and support.

Disclaimer

Apache®, Apache Kafka, Kafka, Apache Flink, and Flink are trademarks of the Apache Software Foundation.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.
