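Company: Pinterest
Pinterest (Chinese name: 缤趣) is a web and mobile application that lets users treat its platform as a visual discovery tool for personal creative ideas and projects. Some also regard it as an image-sharing social networking site, where users can add and organize their image collections by topic and share them with friends. The site uses a waterfall layout (Pinterest-style layout).
Pinterest is operated by a team named Cold Brew Labs in Palo Alto, California, USA. Its founders are Ben Silbermann, Paul Sciarra, and Evan Sharp. It officially launched in 2010. The name “Pinterest” combines the words “pin” and “interest”. Among social networking sites, its traffic ranks behind only Facebook, YouTube, VKontakte, and Twitter.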
Pinterest’s Analytics as a Platform on Druid (Part 2 of 3)
In this blog post series, we’ll discuss Pinterest’s Analytics as a Platform on Druid and share some learnings on using Druid. This is the second of the blog post series, and will discuss learnings on optimizing Druid for batch use cases.
Pinterest’s Analytics as a Platform on Druid (Part 1 of 3)
In this blog post series, we’ll discuss Pinterest’s Analytics as a Platform on Druid and share some learnings on using Druid. This is the first of the blog post series with a short history on switching to Druid, system architecture with Druid, and learnings on optimizing host types for Mmap.
Improving efficiency and reducing runtime using S3 read optimization
We describe a novel approach we took to improving S3 read throughput and how we used it to improve the efficiency of our production jobs. The results have been very encouraging. A standalone benchmark showed a 12x improvement in S3 read throughput (from 21 MB/s to 269 MB/s). Increased throughput allowed our production jobs to finish sooner. As a result, we saw a 22% reduction in vcore-hours, a 23% reduction in memory-hours, and a similar reduction in the run time of a typical production job. Although we are happy with the results, we are exploring additional enhancements, which are briefly described at the end of this blog.
How we scaled the size of Pinterest’s ad corpus by 60x
In May 2020, Pinterest launched a partnership with Shopify that allowed merchants to easily upload their catalogs to the Pinterest platform and create Product Pins and shopping ads. This vastly increased the number of shopping ads in our corpus available for our recommendation engine to choose from, when serving an ad on Pinterest. In order to continue to support this rapid growth, we leveraged a key-value (KV) store and some memory optimizations in Go to scale the size of our ad corpus by 60x.
Fighting Spam using Clustering and Automated Rule Creation
One of our biggest priorities at Pinterest is keeping Pinners safe, and that includes protecting them from spam. The Trust & Safety team’s goal is not only to catch spam, but to remove it as quickly as possible to minimize Pinner impact.
The goal of spammers is to make money, and the best way to do this is to spam at scale. It’s a numbers game: one million spam emails are much more effective than one spam email. In order to remove spam quickly, we look at common trends in spam attacks to identify suspect behavior.
To achieve the scale required to be effective, spammers must automate their actions, and each of these “attacks” can be thought of as a cluster. Each event within the attack cluster may share some common features, but different clusters will have a different set of common features.
For example, during an attack where a large number of Pins are created, a spammer might point all Pins to the same domain. While the domain may change between attacks, spammers are still trying to direct traffic to the same spam site.
One of our spam mitigation tactics is our rule engine, Guardian, which helps to identify common features in spam attacks.
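The shared-domain example above can be illustrated with a minimal sketch: group Pin-creation events by the domain they link to and flag any domain whose event count reaches a threshold. The event shape, the `find_suspect_domains` helper, and the threshold value are assumptions made for illustration; they are not Guardian’s actual rules or API.

```python
from collections import Counter
from urllib.parse import urlparse


def find_suspect_domains(events, min_cluster_size=3):
    """Return domains that appear in at least `min_cluster_size` events.

    `events` is a list of dicts with a hypothetical "link" key holding the
    URL that a newly created Pin points to.
    """
    counts = Counter(urlparse(e["link"]).netloc.lower() for e in events)
    return {domain: n for domain, n in counts.items() if n >= min_cluster_size}


# Hypothetical events: three Pins from different accounts share one domain.
events = [
    {"user": "a1", "link": "http://spam.example/offer1"},
    {"user": "a2", "link": "http://spam.example/offer2"},
    {"user": "a3", "link": "http://SPAM.example/offer3"},
    {"user": "b1", "link": "https://recipes.example/pasta"},
]
suspects = find_suspect_domains(events)
```

A real system would likely cluster on several features at once (for example domain, IP range, or account age); a single-feature count is only the simplest case.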
The machine learning behind delivering relevant ads
Pinterest is where people go to plan and shop, making ideas and ads from brands helpful in taking Pinners from inspiration to action. It’s our goal to ensure ads continue to be additive and not intrusive on Pinterest. Because of the unique and powerful first party signals on the platform, advertisers can reach Pinners based on their interests, intent and engagement on the platform.
To help in delivering the right ads to the right Pinners in an audience of hundreds of millions of people, we offer advertisers features to achieve relevance including Actalike (AAL) audiences, also known in the industry as Lookalike audiences. AAL audiences help advertisers reach potentially new users via audience expansion.
In this blog, we’ll focus on the machine learning model component of relevant ads delivery and explain how we achieve high quality audience expansion through universal user embedding representations together with per-advertiser classifier models. We demonstrate the power of the proposed combined approach by showing better performance over both regression-based and similarity-based approaches.
Building scalable near-real time indexing on HBase
HBase is one of the most critical storage backends at Pinterest, powering many of our online traffic storage services like Zen (graph database) and UMS (wide column data store). Although HBase has many advantages like strong consistency at row level in high volume requests, flexible schema, low latency access to data, and Hadoop integration, it doesn’t natively support advanced indexing and querying. Secondary indexing is one of the most demanded features by our clients, but supporting that directly in HBase is quite challenging. Maintaining separate index tables as the number of indexes grows is not a scalable solution in terms of query efficiency and code complexity. This motivated us to build a storage solution called Ixia, which provides near real-time secondary indexing on HBase. The design is largely inspired by Lily HBase Indexer.
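As a rough illustration of what secondary indexing means here, the sketch below keeps an inverted index (column value to row keys) beside a primary key-value store. The `IndexedStore` class and its methods are hypothetical and say nothing about how Ixia is implemented; in particular, this toy updates the index synchronously, whereas a near real-time design would typically update it asynchronously from a change stream.

```python
from collections import defaultdict


class IndexedStore:
    """Toy row store with one secondary index kept beside the primary data."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                 # row key -> row dict (primary store)
        self.index = defaultdict(set)  # column value -> set of row keys

    def put(self, key, row):
        old = self.rows.get(key)
        if old is not None and self.indexed_column in old:
            # Remove the stale index entry before writing the new one.
            self.index[old[self.indexed_column]].discard(key)
        self.rows[key] = row
        if self.indexed_column in row:
            self.index[row[self.indexed_column]].add(key)

    def find_by(self, value):
        # Look up row keys in the index, then fetch rows from the primary store.
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]


store = IndexedStore("board")
store.put("pin1", {"id": "pin1", "board": "travel"})
store.put("pin2", {"id": "pin2", "board": "food"})
store.put("pin1", {"id": "pin1", "board": "food"})  # pin1 moves to another board
food_pins = store.find_by("food")
```

Without the index, answering `find_by` would require scanning every row, which is the query-efficiency problem the paragraph describes.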
Unified Flink Source at Pinterest: Streaming Data Processing
To best serve Pinners, creators, and advertisers, Pinterest leverages Flink as its stream processing engine. Flink is a data processing engine for stateful computation over data streams. It provides rich streaming APIs, exactly-once support, and state checkpointing, which are essential to build stable and scalable streaming applications. Nowadays Flink is widely used in companies like Alibaba, Netflix, and Uber in mission critical use cases.
Interactive Querying with Apache Spark SQL at Pinterest
To achieve our mission of bringing everyone inspiration through our visual discovery engine, Pinterest relies heavily on making data-driven decisions to improve the Pinner experience for over 475 million monthly active users. Reliable, fast, and scalable interactive querying is essential to make those data-driven decisions possible. In the past, we published how Presto at Pinterest serves this function. Here, we’ll share how we built a scalable, reliable, and efficient interactive querying platform that processes hundreds of petabytes of data daily with Apache Spark SQL. Through an elaborate discussion on various architecture choices, challenges along the way, and our solutions for those challenges, we share how we made interactive querying with Spark SQL a success.
Improving data processing efficiency using partial deserialization of Thrift
At Pinterest we’ve worked to greatly improve data processing efficiency. One quote that resonates with our unique approach is from writer Antoine de Saint-Exupéry: “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”
Ultimately, we process petabytes of Thrift encoded data at Pinterest. Most jobs that access this data need only a part of it. To meet our unique needs, we devised a way to efficiently deserialize only the desired subsets of Thrift structures in each job. Our solution enabled us to significantly decrease our data processing resource usage: about 20% reduction in vcore usage, 27% reduction in memory usage, and 36% reduction in intermediate data (mapper output).
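The idea of deserializing only the needed fields can be sketched with a simplified tag-length-value encoding. This is not the Thrift wire format and not Pinterest’s implementation; the `encode` and `decode_partial` helpers and the 6-byte header layout are assumptions chosen to keep the example short. The point is that a length prefix lets the reader jump over unwanted fields without decoding them.

```python
import struct


def encode(fields):
    """Encode {field_id: bytes} as repeated (id: uint16, length: uint32, payload)."""
    out = bytearray()
    for field_id, payload in fields.items():
        out += struct.pack(">HI", field_id, len(payload)) + payload
    return bytes(out)


def decode_partial(buf, wanted_ids):
    """Return only the wanted fields; skip the other payloads entirely."""
    result, offset = {}, 0
    while offset < len(buf):
        field_id, length = struct.unpack_from(">HI", buf, offset)
        offset += 6  # header size: 2-byte id + 4-byte length
        if field_id in wanted_ids:
            result[field_id] = buf[offset:offset + length]
        offset += length  # jump past the payload either way
    return result


# A record with a large middle field that this job does not need.
record = encode({1: b"pin-123", 2: b"x" * 1000, 3: b"board-9"})
subset = decode_partial(record, {1, 3})
```

Here the 1000-byte payload of field 2 is never copied or parsed, which is where the savings would come from when most jobs read only a small part of each record.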
What we learned from an iOS app OOMs incident
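In early 2020, we started to see a marked rise in the out-of-memory (OOM) crash rate of Pinterest’s iOS app. The incident caused the crash-free user rate (CFUR) to drop sharply, from 99% to 96%. What happened?
We improved many systems along the way, but those lessons could fill a blog post of their own. This post focuses on sharing with the wider iOS community what we learned from this iOS-specific issue.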
Building a Label-Based Enforcement Pipeline for Trust & Safety
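As Pinterest grows and the number of users and businesses increases, providing a safe and trustworthy experience is one of our top priorities. Every day, the platform serves billions of Pins and boards to inspire Pinners. With so many Pins and so much activity, making timely and consistent content safety decisions while still allowing high-quality content to spread can be a challenge. In this blog post, we take a technical deep dive into the problems we faced and describe how we built a label-based enforcement pipeline to solve them and fight abuse at scale.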
How Pinterest Fights Spam Using Machine Learning
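Hundreds of millions of people regularly visit Pinterest to visually discover inspiring ideas among billions of Pins. Inspiration is a high bar, and we must stay vigilant to ensure Pinners do not see spam, harmful content, or misinformation. To enforce our community policies and maintain an inspiring environment, we use the latest machine learning techniques to build automated systems that quickly detect and act on spam content and spammers.
Our anti-spam system has reactive and proactive components to effectively counter adversarial abusers, that is, users who deliberately try to evade the system. The proactive system consists of sophisticated machine learning models, while the reactive system includes rules and lightweight machine learning models executed in a real-time rule engine. We not only use the latest modeling techniques but also iterate on these models regularly by adding new data and exploring new technical breakthroughs, so that their performance holds or improves over time and spam is handled effectively.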
Shallow Mirror
Enhancement to Kafka MirrorMaker to reduce CPU/memory pressure.
How we reduced Pinterest’s iOS app size by 30+% / 50MB
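We all know that app size (both download size and on-device install size) matters, and that app size correlates with customer engagement. People often decide whether to use software based on its size, and some even pay for bandwidth by the megabyte. What’s more, uninstall rates can rise when an app grows and users try to free up disk space on their devices.
Recently, we made improvements to version 9.1 of Pinterest’s iOS app that significantly reduced its size.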
Detecting Image Similarity in (Near) Real-time Using Apache Flink
Pinterest is a visual platform at its core, so the need to understand and act on images is paramount. A couple of years ago, the Content Quality team designed and implemented our own batch pipeline to detect similar images. The similarity signal is widely used at Pinterest for use cases varying from improving recommendations based on similar images to taking down spam and abusive content. However, it was taking several hours for the signal to be computed for newly created images, which was a long window for spammers and abusers to harm the platform. So recently, the team implemented a streaming pipeline to detect similar images in near-real-time.