Pinterest is operated by Cold Brew Labs, a team based in Palo Alto, California, founded by Ben Silbermann, Paul Sciarra, and Evan Sharp. The site officially launched in 2010. The name "Pinterest" is a blend of the words "pin" and "interest"; among social networking sites, its traffic trails only Facebook, YouTube, VKontakte, and Twitter.
Since its inception, Pinterest's philosophy has centered on data. As a data-driven company, we store all ingested data for further use: about 600 terabytes of new data every day, and over 500 petabytes in total. At this scale, big data tooling plays a critical role in enabling the company to gather meaningful insights. This is where the workflow team comes in. We facilitate over 4,000 workflows, which produce 10,000 daily flow executions and 38,000 daily job executions on average.
In our last blog post, we discussed how we decided to move from our legacy system, Pinball, to our new system, Spinner, which is built on top of the Apache Airflow project, and how we carried out that move. As a reminder, Spinner is based on a custom branch cut from Airflow's 1.10-stable branch, with some features cherry-picked from the master branch.
In this post, we will explain how we approached and designed the migration, identified requirements, and coordinated with all of our engineering teams to seamlessly migrate 3,000+ workflows to Airflow. We will dive deep into the trade-offs we made, but first we want to share our learnings.
Engineers hate migrations. What do engineers hate more than migrations? Data migrations. Especially critical, terabyte-scale, online serving migrations which, if done badly, could bring down the site, enrage customers, or cripple hundreds of critical internal services.
So why did the Key-Value Systems Team at Pinterest embark on a two-year realtime migration of all our online key-value serving data to a single unified storage system? Because the cost of not migrating was too high. In 2019, Pinterest had four separate key-value systems owned by different teams, each with its own API and feature set. This resulted in duplicated development effort, high operational overhead, elevated incident counts, and confusion among engineering customers.
In unifying all of Pinterest's 500+ key-value use cases (over 4PB of unique data serving hundreds of millions of QPS) onto a single interface, we not only made huge gains in reducing system complexity and lowering operational overhead; we also achieved a 40–90% performance improvement by moving to the most efficient storage engine, and saved the company a significant amount per year by moving to the most optimal replication and versioning architecture.
In this blog post, we selected three (out of many more) innovations to dive into that helped us notch all these wins.
Despite the explosive growth of the internet over the past couple of decades, much of its digitized knowledge has been curated for human understanding and remains unfriendly to machine comprehension. Even promising efforts toward a semantic web, like the Resource Description Framework in Attributes (RDFa), the Web Ontology Language (OWL), JSON-LD, and the Open Graph Protocol, are in their infancy and fall short for commercial applications due to data sparsity and high variance in data quality across websites. Hence, Web Information Extraction (WIE), colloquially known as scraping, is the dominant knowledge acquisition strategy for many organizations in advertising, commerce, search, travel, and beyond. Pinterest uses this approach to bring high-level information (like price and product description) from saved websites to the Pin level, to provide Pinners with more information along with a link back to the original website, and to ultimately help them take action.
Reading both parts of this series will give you insight into some of the debugging techniques we use on the Pinterest Engineering Key-Value Systems team (a team that grew out of the previous Serving Systems team). Related projects owned by this team are covered in blog posts and presentations on Terrapin, Rocksplicator (parts 1 and 2), Aperture, and Realpin.
Ideas fuel innovation. Innovation drives our product toward our mission of bringing everyone the inspiration to create a life they love. The speed of innovation is determined by how quickly we can get a signal or feedback on the promise of an idea so we can learn whether to pursue or pivot. Online experimentation is often used to evaluate product ideas, but it is costly and time-consuming. Could we predict experiment outcomes without even running an experiment? Could it be done in hours instead of weeks? Could we rapidly pick only the best ideas to run an online experiment? This post will describe how Pinterest uses offline replay experimentation to predict experiment results in advance.
In our efforts to shift left (in which testing is performed earlier, or moved left on the project timeline), this blog covers how we began running a large end-to-end UI test suite before every commit to our Android and iOS repositories. This project involved careful coordination of UI testing, test infrastructure, and developer productivity.
Like many companies, Pinterest sees an increase in traffic in the last three months of the year. We need to make sure our systems are ready for this increase so we don't run into unexpected problems, especially as Pinners come to Pinterest at this time for holiday planning and shopping. Therefore, we run a yearly exercise of testing our systems under additional load, verifying that they can handle the expected traffic increase. For Druid, we run several checks to verify:
- Queries: We make sure the service is able to handle the expected increase in QPS while at the same time supporting the P99 Latency SLA our clients need.
- Ingestion: We verify that the real-time ingestion is able to handle the increase in data.
- Data size: We confirm that the storage system has sufficient capacity to handle the increased data volume.
In this post, we’ll provide details about how we run the holiday load test and verify Druid is able to handle the expected increases mentioned above.
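The query-side checks above boil down to two measurable criteria: sustained QPS and a p99 latency ceiling. Below is a minimal sketch of how such a check could be expressed; the function names, the nearest-rank percentile method, and the pass/fail criteria are illustrative assumptions, not Pinterest's actual load-test harness.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def passes_load_test(latencies_ms, duration_s, p99_sla_ms, target_qps):
    """Check the two query-side criteria: p99 latency stays under
    the SLA while the service sustains at least the target QPS."""
    observed_qps = len(latencies_ms) / duration_s
    p99 = percentile(latencies_ms, 99)
    return observed_qps >= target_qps and p99 <= p99_sla_ms
```

A real holiday load test would replay or synthesize production-shaped queries against the cluster; this sketch only shows how the acceptance criteria might be evaluated once the latency samples are collected.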
As Pinterest continues to evolve from a place to simply save ideas into a platform for discovering content that inspires action, there has been an increase in native content from creators publishing directly to Pinterest. With the creator ecosystem on Pinterest growing, we're committed to ensuring Pinterest remains a positive and inspiring environment through initiatives like the Creator Code, a content policy that requires creators to accept guidelines (such as "be kind" and "check facts") before they can publish Idea Pins. We also have guardrails in place on Idea Pin comments, including positivity reminders, tools for comment removal and keyword filtering, and spam prevention signals. On the technical side, we use cutting-edge machine learning techniques to identify and enforce against policy-violating comments in near real time. We also use these techniques to surface the most inspiring and highest-quality comments first, bringing Pinners a more productive experience and driving engagement.
Since machine learning solutions were introduced in March to automatically detect potentially policy-violating comments before they’re reported and take appropriate action, we’ve seen a 53% decline in comment report rates (user comment reports per 1 million comment impressions).
Here, we share how we built a scalable near-real time machine learning solution to identify policy-violating comments and rank comments by quality.
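Conceptually, the pipeline combines two model outputs per comment: a policy-violation score used for enforcement and a quality score used for ranking. The sketch below is a toy illustration of that two-step shape; the field names, threshold, and scores are hypothetical and stand in for the real model outputs described later in the post.

```python
def moderate_and_rank(comments, violation_threshold=0.9):
    """Toy two-step pipeline (hypothetical, not Pinterest's actual models):
    1. Filter out comments whose policy-violation score exceeds a threshold.
    2. Surface the remaining comments by descending quality score."""
    kept = [c for c in comments if c["violation_score"] < violation_threshold]
    return sorted(kept, key=lambda c: c["quality_score"], reverse=True)
```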
Pinterest is a visual discovery engine that helps Pinners find inspirational ideas. Advertisers use Pinterest to connect with Pinners on these journeys to inspiration, and seek to promote products or services efficiently.
The Ads Intelligence team at Pinterest builds products that help advertisers maximize the value they get out of their ad campaigns. As part of that initiative, we have recently launched the Campaign Budget Optimization product for Pinterest Ads.
Campaign Budget Optimization, or CBO, is an automated ads product that benefits advertisers by automatically distributing each campaign's advertising budget across its underlying ad groups. The goals of Campaign Budget Optimization are to:
- Maximize advertiser value, for example driving clicks or conversions, depending on the campaign
- Improve the budget utilization of the campaign by allowing the budget to be shared across ad groups
- Simplify the advertiser experience and eliminate the need for manual budget adjustments
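To make the idea of automated distribution concrete, here is a toy allocator that splits a campaign budget across ad groups in proportion to each group's predicted value (e.g. expected clicks or conversions). This is purely illustrative; CBO's actual allocation algorithm is not shown here, and the function and field names are assumptions.

```python
def distribute_budget(campaign_budget, predicted_values):
    """Toy proportional allocator (hypothetical, not Pinterest's algorithm):
    each ad group receives a share of the campaign budget proportional
    to its predicted value."""
    total = sum(predicted_values.values())
    if total == 0:
        # No signal yet: fall back to an even split across ad groups.
        even = campaign_budget / len(predicted_values)
        return {group: even for group in predicted_values}
    return {group: campaign_budget * value / total
            for group, value in predicted_values.items()}
```

For example, a $100 campaign with one ad group predicted to drive three times the value of another would be split 75/25, rather than relying on the advertiser to rebalance the two budgets by hand.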
The Logging Platform powers all data ingestion and transportation at Pinterest. At its heart are distributed PubSub systems that help our customers transport and buffer data and consume it asynchronously.
In this blog post we introduce MemQ (pronounced "mem queue"), an efficient, scalable PubSub system developed for the cloud at Pinterest. It has been powering near-real-time data transportation use cases for us since mid-2020 and complements Kafka while being up to 90% more cost-efficient.
Pinterest surfaces billions of ideas to people every day, and the neural modeling of embeddings for content, users, and search queries are key in the constant improvement of these machine learning-powered recommendations. Good embeddings — representations of discrete entities as vectors of numbers — enable fast candidate generation and are strong signals to models that classify, retrieve and rank relevant content.
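As a minimal illustration of how embeddings enable candidate generation, the sketch below retrieves the Pins whose embedding vectors are most similar to a query embedding under cosine similarity. The brute-force scan and all names are assumptions for illustration; production retrieval at this scale uses approximate nearest-neighbor indexes rather than exhaustive search.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_candidates(query_emb, pin_embs, k=2):
    """Brute-force nearest-neighbor retrieval: rank every Pin embedding
    by similarity to the query embedding and keep the top k."""
    scored = sorted(pin_embs.items(),
                    key=lambda item: cosine(query_emb, item[1]),
                    reverse=True)
    return [pin_id for pin_id, _ in scored[:k]]
```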
We began our representation learning workstream with Visual Embeddings, a convolutional neural network (CNN) based Image representation, then moved toward PinSage, a graph-based multi-modal Pin representation. We expanded into more use cases such as PinnerSage, a user representation based on clustering a user’s past Pin actions, and have since worked with even more entities including search queries, Idea Pins, shopping items and content creators.
In this blog post we focus on SearchSage, our search query representation, and detail how we built and launched SearchSage for search retrieval and ranking to increase relevance of recommendations and engagement in search across organic Pins, Product Pins, and ads. Now used for 15+ use cases, this embedding is one of the most important features in both our organic and ads relevance models, and has led to metric wins such as an 11% increase in 35s+ click-throughs on product Pins in search, and a 42% increase in related searches.
Pinterest's Batch Processing Platform, Monarch, runs most of the company's batch processing workflows. At the scale shown in Table 1, it is important to manage platform resources to provide quality of service (QoS) while achieving cost efficiency. This article shares how we do that and what we plan next.
The Pinterest ads business has grown multi-fold over the past couple of years, with respect to both advertisers and users. As we scale our revenue, it becomes imperative to:
- Distribute advertiser spend smoothly over the course of the day
- Avoid overspending beyond the advertiser's daily or lifetime budget
- Maximize advertiser value
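A common way to meet the first two goals is a pacing rule that admits an ad into the auction only while its cumulative spend tracks a schedule, with a hard stop at the budget cap. The sketch below uses a uniform schedule purely for illustration; the function name, the uniform curve, and the decision rule are assumptions, not Pinterest's actual pacer (which would typically follow the expected traffic curve over the day).

```python
def pacing_decision(spent_so_far, daily_budget, fraction_of_day_elapsed):
    """Toy pacing rule (hypothetical): participate in the auction only
    while cumulative spend is at or below a uniform spend schedule,
    and always stop once the daily budget is exhausted."""
    if spent_so_far >= daily_budget:
        return False  # hard stop: never exceed the daily budget
    target_spend = daily_budget * fraction_of_day_elapsed
    return spent_so_far <= target_spend
```

Under this rule, a campaign that has spent well ahead of schedule is throttled until the schedule catches up, which smooths spend across the day instead of exhausting the budget in the first few hours.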
Pinterest is a place where users (Pinners) can save and discover content on both web and mobile platforms, and where, increasingly, Creators can publish native content directly to Pinterest. We hold billions of pieces of content (Pins) in our corpus and serve personalized recommendations that inspire Pinners to create a life they love. One of the key and most complex surfaces at Pinterest is the home feed, where Pinners see feeds personalized to their engagement and interests. In this blog, we discuss how we unified our lightweight scoring layer across the various candidate generators that power home feed recommendations.
In this blog post series, we discuss Pinterest's Analytics as a Platform on Druid and share some learnings from using Druid. This third post in the series covers learnings from optimizing Druid for real-time use cases.