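Pinterest is operated by a team called Cold Brew Labs, based in Palo Alto, California, USA. Its founders are Ben Silbermann, Paul Sciarra, and Evan Sharp, and it officially launched in 2010. The name "Pinterest" combines the words "pin" and "interest". Among social networking sites, its traffic ranks behind only Facebook, YouTube, VKontakte, and Twitter.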
Several years ago, Pinterest had a short incident due to oversights in the policy delivery engine. This engine is the technology that ensures a policy document written by a developer and checked into source control is fully delivered to the production system evaluating that policy, similar to OPAL. This incident began a multi-year journey for our team to rethink policy delivery and migrate hundreds of policies to a new distribution model. We shared details about our former policy delivery system in a conference talk at KubeCon 2019.
At a high level, there are three important architectural decisions we’d like to bring attention to for this story.
At Pinterest, our mission is to bring everyone the inspiration to create a life they love. People often come to Pinterest when they are considering what to do or buy next. Understanding this evolving user journey while balancing across multiple objectives is crucial to bring the best experience to Pinterest users and is supported by multiple recommendation models, with each providing real-time inference with an overall latency of 200–300 milliseconds. In particular, our machine learning powered ads ranking systems are trying to understand users’ engagement and conversion intent and promote the right ads to the right user at the right time. Our engineers are constantly discovering new algorithms and new signals to improve the performance of our machine learning models. A typical development cycle involves offline model training to realize offline model metric gains and then online A/B experiments to quantify online metric movements. However, it is not uncommon that offline metric gains do not translate into online business metric wins. In this blog, we will focus on some online and offline discrepancies and development cycle learnings we have observed in Pinterest ads conversion models, as well as some of the key platform investments Pinterest has made to minimize such discrepancies.
A Journey from GBDT to Multi-Task Ensemble DNN.
Better performance, lower cost and less code complexity.
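Goku is Pinterest's in-house time series database, used for monitoring and alerting. The team changed the data write path and ingestion model, adopting a pull-based, shard-aware ingestion model and introducing a Goku-side Kafka. They also replaced EFS with local disk and S3 for persistent data and backups. These changes reduced GokuS recovery time from 90-120 minutes to under 40 minutes and provide efficient query routing. GokuL uses RocksDB to store time series data in tiers: smaller, newer SST files in the lower tiers are compacted into larger, older SST files in the higher tiers. The GokuL cluster stores and serves data older than one day, with a retention period of 1 year. The specific data tiering strategy and storage cluster details can be found in the GokuL blog post and the cost reduction blog post.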
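As a rough illustration of the age-based split described above, the sketch below picks a cluster for a data point from its age. The one-day boundary and one-year retention come from the text; the function name `pick_cluster`, the approximation of one year as 365 days, and the assumption that GokuS serves everything newer than one day are ours, not details confirmed by the Goku posts.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ONE_DAY = timedelta(days=1)
RETENTION = timedelta(days=365)  # "1 year", approximated as 365 days (assumption)


def pick_cluster(point_time: datetime, now: datetime) -> Optional[str]:
    """Return the cluster assumed to hold a data point, or None if it has expired.

    Hypothetical helper: the real query routing logic is not described here.
    """
    age = now - point_time
    if age > RETENTION:
        return None       # past the one-year retention window
    if age > ONE_DAY:
        return "GokuL"    # long-term store: data older than one day
    return "GokuS"        # assumption: recent data lives in the short-term store


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
recent = pick_cluster(now - timedelta(hours=3), now)
older = pick_cluster(now - timedelta(days=30), now)
expired = pick_cluster(now - timedelta(days=400), now)
```

A real router would also have to handle queries whose time range spans both clusters, which this sketch leaves out.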
At Pinterest, data is ingested and transported at petabyte scale every day, bringing inspiration for our users to create a life they love. A central component of data ingestion infrastructure at Pinterest is our PubSub stack, and the Logging Platform team currently runs deployments of Apache Kafka and MemQ. Over the years, operational experience has taught us that our customers and business would greatly benefit from a unified PubSub interface that the platform team owns and maintains, so that application developers can focus on application logic instead of spending precious hours debugging client-server connectivity issues. Value-add features on top of the native clients can also help us achieve more ambitious goals for dev velocity, scalability, and stability. For these reasons, and others detailed in our original PubSub Client blog post, our team has decided to invest in building, productionalizing, and most recently open-sourcing PubSub Client (PSC).
In the 1.5 years since our previous blog post, PSC has been battle-tested at large scale in Pinterest with notably positive feedback and results. From dev velocity and service stability improvements to seamless migrations from native client to PSC, we would like to share some of our findings from running a unified PubSub client library in production.
Modern compute platforms are foundational to accelerating innovation and running applications more efficiently. At Pinterest, we are evolving our compute platform to provide an application-centric and fully managed compute API for the 90th percentile of use cases. This will accelerate innovation through platform agility, scalability, and a reduced cost of keeping systems up to date, and will improve efficiency by running our users’ applications on Kubernetes-based compute. We refer to this next generation compute platform as PinCompute, and our multi-year vision is for PinCompute to run the most mission critical applications and services at Pinterest.
PinCompute aligns with the Platform as a Service (PaaS) cloud computing model, in that it abstracts away the undifferentiated heavy lifting of managing infrastructure and Kubernetes and enables users to focus on the unique aspects of their applications. PinCompute evolves Pinterest architecture with cloud-native principles, including containers, microservices, and service mesh. It reduces the cost of keeping systems up to date by providing and managing immutable infrastructure, operating system upgrades, and Graviton instances. It also delivers cost savings by applying enhanced scheduling capabilities to large multi-tenant Kubernetes clusters, including oversubscription, bin packing, resource tiering, and trough usage.
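To make the bin packing idea concrete, here is a minimal sketch of the classic first-fit decreasing heuristic, which places the largest workloads first and opens a new node only when no existing node has room. It is a textbook technique used purely for illustration, not PinCompute's scheduler; the function name, workload names, and the single CPU dimension are assumptions.

```python
from typing import Dict, List


def first_fit_decreasing(requests: Dict[str, int], node_capacity: int) -> List[List[str]]:
    """Pack workloads onto as few equally sized nodes as the heuristic finds.

    requests maps a hypothetical workload name to its CPU request.
    """
    nodes: List[List[str]] = []   # workloads placed on each node
    free: List[int] = []          # remaining capacity of each node

    # Consider the largest requests first.
    for name, cpu in sorted(requests.items(), key=lambda kv: kv[1], reverse=True):
        if cpu > node_capacity:
            raise ValueError(f"{name} does not fit on any node")
        for i, remaining in enumerate(free):
            if cpu <= remaining:
                nodes[i].append(name)
                free[i] -= cpu
                break
        else:
            # No existing node has room, so open a new one.
            nodes.append([name])
            free.append(node_capacity - cpu)
    return nodes


placement = first_fit_decreasing({"a": 4, "b": 3, "c": 2, "d": 2, "e": 1}, node_capacity=6)
```

With these inputs the five workloads (12 CPUs in total) fit on two 6-CPU nodes. A production scheduler would weigh several resources at once, along with constraints such as affinity and tenancy.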
In this article, we discuss the PinCompute primitives, architecture, control plane and data plane capabilities, and showcase the value that PinCompute has delivered for innovation and efficiency at Pinterest.
In this blog, we present a pragmatic way of integrating analytics, written in Python, with our distributed anomaly detection platform, written in Java. The approach here could be generalized to integrate processing done in one language/paradigm into a platform in another language/paradigm.
To support metrics reporting for ads from external advertisers and real-time ad budget calculations at Pinterest, we run streaming pipelines using Apache Flink. These jobs have guaranteed an overall 99th percentile availability to our users; however, every once in a while some tasks get hit with nasty direct out-of-memory (OOM) errors on multiple operators.
Pinterest’s mission is to bring everyone the inspiration to create a life they love. The closeup team helps with this mission by providing a feed of relevant, context- and user-aware recommendations when a Pinner closes up on any Pin.
The recommendations are powered by innovative and cutting-edge machine learning technologies. We have published a detailed blog post of its modeling architecture. While adopting the newest architectures improves a model’s capabilities, building a solid training foundation stabilizes the model and further up-levels the model’s potential.
Training foundations cover many aspects, from training preparation (training data logging, feature freshness, sampling strategies, hyperparameter tuning, etc.), to training efficiency optimization (distributed training, model refreshes, GPU training, etc.), to post-training validation (offline replay, etc.).
Pinterest’s mission as a company is to bring everyone the inspiration to create a life they love. “Everyone” has been the north star for our Inclusive AI and Inclusive Product teams. These teams work together to ensure algorithmic fairness, inclusive design, and representation are an integral part of our platform and product experience.
Our commitment is evidenced by our history of building products that champion inclusivity. In 2018, Pinterest announced the skin tone signal and skin tone ranges. In 2020, we announced the integration of skin tone ranges into Try on for Beauty. In 2021, we announced hair pattern search. In early 2023, we announced how we have been using our skin tone signal to shape our recommendations to increase skin tone representation across several surfaces. Now, we are expanding the latter to also include body type representation in fashion related results across search and closeup recommendations (AKA related feeds).
Our mission at Pinterest is to bring everyone the inspiration to create the life they love. Machine Learning plays a crucial role in this mission. It allows us to continuously deliver high-quality inspiration to our 460 million monthly active users, curated from billions of pins on our platform. Behind the scenes, hundreds of ML engineers iteratively improve a wide range of recommendation engines that power Pinterest, processing petabytes of data and training thousands of models using hundreds of GPUs.
Pinterest’s mission is to bring everyone the inspiration to create a life they love. We rely on an extensive suite of AI-powered products to connect over 460M users to hundreds of billions of Pins, resulting in hundreds of millions of ML inferences per second and hundreds of thousands of ML training jobs per month, run by just a couple of hundred ML engineers.
In 2021, ML was siloed at Pinterest with 10+ different ML frameworks relying on different deep learning frameworks, framework versions, and boilerplate logic to connect with our ML platform. It was a major bottleneck for ML innovation at Pinterest because the amount of engineering resources spent by each ML team to maintain their own ML stack was immense and there was limited knowledge sharing across teams.
Businesses collect many different types of data. Each dataset needs to be securely stored with minimal access granted, to ensure it is used appropriately and can easily be located and disposed of when necessary. As businesses grow, so does the variety of these datasets and the complexity of their handling requirements. Consequently, access control mechanisms also need to scale constantly to handle the ever-increasing diversification. Pinterest decided to invest in a new technical framework to implement finer-grained access control (FGAC). The result is a multi-tenant Data Engineering platform that gives users and services access to only the data they require for their work. In this post, we focus on how we enhanced and extended Monarch, Pinterest’s Hadoop-based batch processing system, with FGAC capabilities.
Time series is a critical part of Observability at Pinterest, powering 60,000 alerts and 5,000 dashboards. A time series is an identifier with values where the values are associated with a timestamp. Given the widespread use and critical nature of time series, it’s important to give engineers the ability to adequately express what operations to perform on the time series in a readable, understandable, and efficient manner. In this post, we will cover the background of time series at Pinterest, the goals of designing an expressive time series language, and some examples of how we are using this language today.
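The definition above (an identifier with timestamped values) can be pictured with a small data structure. This is only an illustrative sketch: the class name, the name-plus-tags form of the identifier, and the sample metric are assumptions, not Pinterest's actual data model or query language.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TimeSeries:
    """An identifier plus a list of (timestamp, value) points."""

    name: str                  # hypothetical metric name
    tags: Dict[str, str]       # hypothetical tags that complete the identifier
    points: List[Tuple[int, float]] = field(default_factory=list)

    def add(self, timestamp: int, value: float) -> None:
        """Record one value at a Unix timestamp (seconds)."""
        self.points.append((timestamp, value))

    def values_between(self, start: int, end: int) -> List[float]:
        """Return the values whose timestamp falls in [start, end]."""
        return [v for t, v in self.points if start <= t <= end]


series = TimeSeries(name="requests.count", tags={"host": "web-1"})
series.add(1700000000, 12.0)
series.add(1700000060, 15.0)
series.add(1700000120, 9.0)
window = series.values_between(1700000000, 1700000060)
```

Operations such as filtering by time range, shown here as a plain method, are the kind of thing an expressive time series language lets engineers state declaratively.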