Pinterest is operated by Cold Brew Labs, a team based in Palo Alto, California, founded by Ben Silbermann, Paul Sciarra, and Evan Sharp. It officially launched in 2010. The name "Pinterest" combines the words "pin" and "interest"; among social networking sites, its traffic trails only Facebook, YouTube, VKontakte, and Twitter.
In this blog post series, we’ll discuss Pinterest’s Analytics as a Platform on Druid and share some learnings on using Druid. The first post covers a short history of our switch to Druid, the system architecture built around Druid, and learnings on optimizing host types for memory-mapped (mmap) segment files. The second post discusses learnings on optimizing Druid for batch use cases.
We describe a novel approach we took to improving S3 read throughput and how we used it to improve the efficiency of our production jobs. The results have been very encouraging: a standalone benchmark showed a 12x improvement in S3 read throughput (from 21 MB/s to 269 MB/s), and the increased throughput allowed our production jobs to finish sooner. As a result, we saw a 22% reduction in vcore-hours, a 23% reduction in memory-hours, and a similar reduction in the run time of a typical production job. Although we are happy with these results, we are exploring additional enhancements, which are briefly described at the end of this blog.
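To make the general idea concrete, here is a minimal sketch of parallel ranged reads against S3 using the AWS SDK for Java v1. The chunk size, thread count, and read-into-memory approach are illustrative assumptions, not Pinterest's actual implementation:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

/** Sketch: read one S3 object as fixed-size chunks fetched in parallel. */
public class ParallelS3Reader {
    private static final int CHUNK_SIZE = 8 * 1024 * 1024; // 8 MiB ranged GETs

    // Sketch only: assumes the whole object fits in memory.
    public static byte[] read(AmazonS3 s3, String bucket, String key, long size)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        List<Future<byte[]>> parts = new ArrayList<>();
        for (long start = 0; start < size; start += CHUNK_SIZE) {
            final long from = start;
            final long to = Math.min(start + CHUNK_SIZE, size) - 1; // inclusive end
            parts.add(pool.submit(() -> {
                // Each task issues an independent ranged GET for its chunk.
                GetObjectRequest req = new GetObjectRequest(bucket, key).withRange(from, to);
                try (S3Object obj = s3.getObject(req)) {
                    return obj.getObjectContent().readAllBytes();
                }
            }));
        }
        byte[] result = new byte[(int) size];
        int offset = 0;
        for (Future<byte[]> f : parts) {
            byte[] chunk = f.get(); // futures were submitted in offset order
            System.arraycopy(chunk, 0, result, offset, chunk.length);
            offset += chunk.length;
        }
        pool.shutdown();
        return result;
    }
}
```

Because each ranged GET is an independent request, many chunks can be in flight at once, which is what lifts throughput well above what a single sequential stream achieves.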
In May 2020, Pinterest launched a partnership with Shopify that allowed merchants to easily upload their catalogs to the Pinterest platform and create Product Pins and shopping ads. This vastly increased the number of shopping ads in our corpus available for our recommendation engine to choose from when serving an ad on Pinterest. In order to continue to support this rapid growth, we leveraged a key-value (KV) store and some memory optimizations in Go to scale the size of our ad corpus by 60x.
One of our biggest priorities at Pinterest is keeping Pinners safe, and that includes protecting them from spam. The Trust & Safety team’s goal is not only to catch spam, but to remove it as quickly as possible to minimize Pinner impact.
The goal of spammers is to make money, and the best way to do this is to spam at scale. It’s a numbers game: one million spam emails are much more effective than one spam email. In order to remove spam quickly, we look at common trends in spam attacks to identify suspect behavior.
To achieve the scale required to be effective, spammers must automate their actions, and each of these “attacks” can be thought of as a cluster. Each event within the attack cluster may share some common features, but different clusters will have a different set of common features.
For example, during an attack where a large number of Pins are created, a spammer might point all Pins to the same domain. While the domain may change between attacks, spammers are still trying to direct traffic to the same spam site.
One of our spam mitigation tactics is our rule engine, Guardian, which helps to identify common features in spam attacks.
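As a toy illustration of the kind of rule such an engine might evaluate (the class, event shape, and threshold below are hypothetical, not Guardian's actual API), consider flagging domains that receive an unusually large burst of new Pins in one time window:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Toy illustration: flag domains shared by an unusually large burst of new Pins. */
public class SameDomainRule {
    // Hypothetical threshold: legitimate users rarely create this many
    // Pins pointing at one domain within a single time window.
    private static final long MAX_PINS_PER_DOMAIN = 500;

    /** A Pin-creation event reduced to the features this rule cares about. */
    record PinEvent(String userId, String domain) {}

    /** Returns the domains whose Pin counts in this window look automated. */
    public static List<String> suspectDomains(List<PinEvent> window) {
        Map<String, Long> pinsPerDomain = window.stream()
                .collect(Collectors.groupingBy(PinEvent::domain, Collectors.counting()));
        return pinsPerDomain.entrySet().stream()
                .filter(e -> e.getValue() > MAX_PINS_PER_DOMAIN)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

The point of a rule engine is that the shared feature (domain, image, description, IP range, and so on) varies between attacks, so rules like this one are parameterized and combined rather than hard-coded.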
Pinterest is where people go to plan and shop, making ideas and ads from brands helpful in taking Pinners from inspiration to action. It’s our goal to ensure ads continue to be additive and not intrusive on Pinterest. Because of the unique and powerful first party signals on the platform, advertisers can reach Pinners based on their interests, intent and engagement on the platform.
To help in delivering the right ads to the right Pinners in an audience of hundreds of millions of people, we offer advertisers features to achieve relevance, including Actalike (AAL) audiences, known in the industry as Lookalike audiences. AAL audiences help advertisers reach new potential customers through audience expansion.
In this blog, we’ll focus on the machine learning model component of relevant ads delivery and explain how we achieve high quality audience expansion through universal user embedding representations together with per-advertiser classifier models. We demonstrate the power of the combined approach by showing that it outperforms both regression-based and similarity-based approaches.
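A simplified sketch of the scoring step in such a combined approach might look like the following, assuming each user has a universal embedding and each advertiser's classifier reduces to a weight vector plus bias (a logistic-regression stand-in chosen here purely for illustration):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch: score users for one advertiser with a linear model over user embeddings. */
public class ActalikeScorer {
    /** Logistic score of a user embedding under a per-advertiser weight vector. */
    static double score(float[] userEmbedding, float[] advertiserWeights, float bias) {
        double dot = bias;
        for (int i = 0; i < userEmbedding.length; i++) {
            dot += userEmbedding[i] * advertiserWeights[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot)); // sigmoid -> probability-like score
    }

    /** Pick the top-k highest-scoring users as the expanded (Actalike) audience. */
    static List<String> expandAudience(Map<String, float[]> embeddings,
                                       float[] weights, float bias, int k) {
        return embeddings.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, float[]> e) -> -score(e.getValue(), weights, bias)))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Training the per-advertiser weights on that advertiser's seed list, then ranking all users by score, is what turns a shared embedding space into an advertiser-specific expansion.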
HBase is one of the most critical storage backends at Pinterest, powering many of our online traffic storage services like Zen (graph database) and UMS (wide column data store). Although HBase has many advantages, such as row-level strong consistency under high request volumes, flexible schemas, low-latency access to data, and Hadoop integration, it doesn’t natively support advanced indexing and querying. Secondary indexing is one of the features our clients request most, but supporting it directly in HBase is quite challenging: maintaining a separate index table per index doesn’t scale in query efficiency or code complexity as the number of indexes grows. This motivated us to build a storage solution called Ixia, which provides near-real-time secondary indexing on HBase. The design is largely inspired by Lily HBase Indexer.
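The core read path of an index-then-fetch design can be sketched as follows, with the search engine abstracted behind a placeholder interface and a hypothetical table name, since this is not Ixia's actual code:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Sketch of the read path for index-then-fetch secondary indexing:
 * look up matching row keys in a search index, then read the
 * authoritative rows from HBase.
 */
public class SecondaryIndexReader {
    /** Placeholder for whatever search engine backs the secondary index. */
    interface SearchIndex {
        List<String> findRowKeys(String indexedField, String value) throws IOException;
    }

    public static Result[] query(Connection hbase, SearchIndex index,
                                 String field, String value) throws IOException {
        List<String> rowKeys = index.findRowKeys(field, value); // index answers "which rows?"
        List<Get> gets = new ArrayList<>();
        for (String rowKey : rowKeys) {
            gets.add(new Get(Bytes.toBytes(rowKey)));
        }
        // "my_table" is a hypothetical name; HBase remains the source of truth.
        try (Table table = hbase.getTable(TableName.valueOf("my_table"))) {
            return table.get(gets);
        }
    }
}
```

Keeping the index responsible only for row-key lookup and HBase responsible for the data itself is what makes the index "near real-time": index updates can lag slightly without compromising the authoritative store.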
To best serve Pinners, creators, and advertisers, Pinterest leverages Flink as its stream processing engine. Flink is a data processing engine for stateful computation over data streams. It provides rich streaming APIs, exactly-once processing semantics, and state checkpointing, which are essential for building stable and scalable streaming applications. Flink is widely used at companies like Alibaba, Netflix, and Uber for mission-critical use cases.
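As a minimal example of the checkpointing and exactly-once features mentioned above, using Flink's DataStream API (the interval and the toy pipeline are placeholders):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a consistent snapshot of all operator state every 60s,
        // with exactly-once semantics for state updates.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // On failure, the job restarts from the last completed checkpoint,
        // so each record affects state exactly once.
        env.fromElements("a", "b", "c")
           .map(s -> s.toUpperCase())
           .print();

        env.execute("checkpointed-example");
    }
}
```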
To achieve our mission of bringing everyone inspiration through our visual discovery engine, Pinterest relies heavily on making data-driven decisions to improve the Pinner experience for over 475 million monthly active users. Reliable, fast, and scalable interactive querying is essential to make those data-driven decisions possible. In the past, we published how Presto at Pinterest serves this function. Here, we’ll share how we built a scalable, reliable, and efficient interactive querying platform that processes hundreds of petabytes of data daily with Apache Spark SQL. Through a detailed discussion of our architecture choices, the challenges along the way, and our solutions to those challenges, we share how we made interactive querying with Spark SQL a success.
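At the user level, an interactive Spark SQL session boils down to something like the sketch below; the table and query are hypothetical stand-ins for an ad-hoc analysis a data scientist might run:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class InteractiveQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("interactive-query")
                .enableHiveSupport() // resolve tables from the shared metastore
                .getOrCreate();

        // Hypothetical table and columns: a typical ad-hoc aggregation.
        Dataset<Row> daily = spark.sql(
                "SELECT dt, COUNT(*) AS events FROM events_table " +
                "WHERE dt >= '2021-01-01' GROUP BY dt ORDER BY dt");
        daily.show();

        spark.stop();
    }
}
```

The platform work is everything around this snippet: keeping sessions warm, multiplexing users, and scheduling resources so queries like this return interactively rather than as batch jobs.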
At Pinterest we’ve worked to greatly improve data processing efficiency. One quote that resonates with our unique approach is from writer Antoine de Saint-Exupéry: “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”
We process petabytes of Thrift-encoded data at Pinterest, and most jobs that access this data need only part of it. To meet that need, we devised a way to efficiently deserialize only the desired subsets of Thrift structures in each job. Our solution significantly decreased our data processing resource usage: about a 20% reduction in vcore usage, a 27% reduction in memory usage, and a 36% reduction in intermediate data (mapper output).
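To show why skipping is cheaper than full deserialization, here is a minimal sketch built on the stock Apache Thrift Java runtime that decodes a single string field and skips everything else. Pinterest's actual solution is more general; this only illustrates the underlying idea:

```java
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TField;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.protocol.TProtocolUtil;
import org.apache.thrift.protocol.TType;
import org.apache.thrift.transport.TMemoryInputTransport;

/** Sketch: pull one field out of a serialized Thrift struct, skipping the rest. */
public class PartialThriftReader {
    /** Reads only the string field with the given id; skips every other field. */
    public static String readStringField(byte[] serialized, short wantedFieldId)
            throws TException {
        TProtocol in = new TBinaryProtocol(new TMemoryInputTransport(serialized));
        String result = null;
        in.readStructBegin();
        while (true) {
            TField field = in.readFieldBegin();
            if (field.type == TType.STOP) {
                break; // end of struct
            }
            if (field.id == wantedFieldId && field.type == TType.STRING) {
                result = in.readString(); // decode only what the job needs
            } else {
                TProtocolUtil.skip(in, field.type); // advance without materializing
            }
            in.readFieldEnd();
        }
        in.readStructEnd();
        return result;
    }
}
```

Skipped fields are never turned into Java objects, which is where the vcore and memory savings come from.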
We describe an enhancement to Kafka MirrorMaker that reduces CPU and memory pressure.
Pinterest is a visual platform at its core, so the need to understand and act on images is paramount. A couple of years ago, the Content Quality team designed and implemented our own batch pipeline to detect similar images. The similarity signal is widely used at Pinterest for use cases ranging from improving recommendations based on similar images to taking down spam and abusive content. However, it was taking several hours for the signal to be computed for newly created images, giving spammers and abusers a long window to harm the platform. So recently, the team implemented a streaming pipeline to detect similar images in near real-time.
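One common building block for this kind of near-real-time similarity detection, shown below as an assumption about the general technique rather than the team's exact method, is locality-sensitive hashing over image embeddings: images whose embeddings share a signature become candidate near-duplicates.

```java
import java.util.Random;

/** Toy random-hyperplane LSH: similar embeddings tend to share signatures. */
public class ImageLsh {
    private final float[][] hyperplanes; // one random hyperplane per signature bit

    public ImageLsh(int numBits, int dim, long seed) { // numBits <= 64
        Random rnd = new Random(seed);
        hyperplanes = new float[numBits][dim];
        for (float[] plane : hyperplanes) {
            for (int i = 0; i < plane.length; i++) {
                plane[i] = (float) rnd.nextGaussian();
            }
        }
    }

    /** Each bit records which side of a hyperplane the embedding falls on. */
    public long signature(float[] embedding) {
        long sig = 0L;
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0;
            for (int i = 0; i < embedding.length; i++) {
                dot += hyperplanes[b][i] * embedding[i];
            }
            if (dot >= 0) {
                sig |= 1L << b;
            }
        }
        return sig;
    }
}
```

A streaming job can then key incoming images by their signature so candidate duplicates land in the same partition and get compared within moments of Pin creation instead of hours later.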