Fighting spam with Guardian, a real-time analytics and rules engine

Pinterest Engineering
Pinterest Engineering Blog
11 min read · Feb 25, 2021


Hongkai Pan | Software Engineer, Trust & Safety

As the Trust & Safety team, one of our major responsibilities is protecting Pinners from spam. Without protections, spam could potentially be all over Pinterest since spammers can script and generate activity at a rate much higher than actual users.

One of our most valuable tools to fight spam at Pinterest is our use of a rules engine. A rules engine allows us to look at a stream of events and take actions (such as blocking the message, deactivating the sender, or flagging the user for human review) when a rule’s criteria are met.

Here we’ll share the evolution of our spam-fighting rules and query engines, and what we’ve learned throughout the process.

History: Starting in Python

Our rules engine began in Python, where we could run rules to take action on specific scenarios. A simple rule to block an attack of spam messages might look something like:
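The original code sample isn’t reproduced here, but a minimal sketch of such a Python rule might look like the following. The event field names and the `deactivate_user` helper are illustrative assumptions, not Pinterest’s actual API:

```python
def should_deactivate(event):
    """Match the spam pattern: 'Hi baby,' messages sent from
    Android devices by accounts less than 10 days old."""
    return (event["message_text"].startswith("Hi baby,")
            and event["device_type"] == "android"
            and event["account_age_days"] < 10)

def deactivate_user(user_id):
    # Stub: in production this would call the enforcement service.
    print(f"deactivated user {user_id}")

def handle_message_event(event):
    if should_deactivate(event):
        deactivate_user(event["user_id"])
```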

Such a rule would deactivate new users who sent messages starting with the phrase “Hi baby,” on Android devices with accounts less than 10 days old. It’s very specific, but rules like this are often effective since adversarial actors create thousands of accounts whose activity all follows the same pattern. Our actual rules tend to be quite a bit more complicated than this one. Some are remediation-focused (like the one above), using specific attributes to identify user accounts that are actively attacking Pinterest. Others focus on proactive detection, using machine learning-based approaches to detect adversarial actors.

This Python-based system, internally called Stingray, served us until 2018. We had a UI that allowed people to easily create rules, click “enable,” and voila, a piece of code like the above would be run on all events hooked into our rules engine. A rule could be created and go live in production in less than 10 minutes. And who doesn’t like Python — simple process, right? (More on our vision for a spam rules engine in our blog post from 2015.)

In practice, however, rule creation involved much more than simply writing Python and getting the rule live in production. We found the full cycle looked more like:

  1. Wait for data to be logged and populated in Hive.
  2. Write Presto/Hive queries until you find the spam pattern and evaluate for false positives. You’ll likely need to query many different tables to get the attributes you want.
  3. Translate the Presto/Hive query results into Python to create a rule.
  4. Enable the rule in “dry run” mode (a mode that only logs the action but doesn’t take it). This is necessary since you don’t want to risk bugs that might deactivate good users when translating to Python.
  5. Validate that the rule is catching the intended users/Pins/IPs.
  • If all is good, enable!
  • If there’s a bug, go back to step 3.

Ultimately, only a small part of the workflow was actually discovering spam patterns. The full cycle took days, involved a lot of context switching, and was monotonous and bug-prone.

Guardian and the Improved Rule Creation Workflow

In late 2017, we realized the Python-based approach wasn’t scalable and set out to streamline the rule-creation workflow with a new system written in Elixir. We built Guardian from scratch as a real-time query engine with a rules engine layered on top. The dataset consists of a single denormalized table with 10B+ rows and 10k+ columns, stored in a customized columnar format on disk across a cluster of machines.

Guardian combines into one place all the data we need for fighting spam. Each row represents an action and is enriched with our spam-fighting signals. Instead of performing complex joins across numerous tables, all information is in one place, making queries (and spam-fighting) simpler and faster. Additionally, we moved away from Python and all rules are in SQL-query format, further streamlining the rule creation and back-testing processes.

To illustrate more concretely, the example from above would look like the following in Guardian:
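A hedged sketch of what that Guardian rule could look like; the action syntax and column names below are illustrative assumptions, since the exact Guardian dialect isn’t shown in this post:

```sql
-- Hypothetical Guardian rule: same logic as the Python version, expressed
-- as a condition over the single denormalized events table.
deactivate_user
WHERE event = 'message_send'
  AND message_text LIKE 'Hi baby,%'
  AND device_type = 'android'
  AND account_age_days < 10
```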

To back-test, the rule can be copy and pasted and translated into a query like so:
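For instance, the back-test might select the rows the rule would have matched instead of taking the action (again, column names and syntax are illustrative assumptions):

```sql
-- Hypothetical back-test query: identical conditions to the rule above,
-- but returning the users that would have been deactivated.
SELECT user_id, message_text, device_type, account_age_days
WHERE event = 'message_send'
  AND message_text LIKE 'Hi baby,%'
  AND device_type = 'android'
  AND account_age_days < 10
```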

Rule creators can then easily see information about all users that would’ve been deactivated by the rule had it been active. Conversely, rules being queries also means that as soon as rule creators finish their analysis in Guardian, they can copy and paste their final query into a new rule. No additional time is needed to translate results into rule-engine syntax: what you get when querying is exactly what will happen once the query is turned into a production rule.

Developing a new query engine

Note the query syntax isn’t exactly SQL. Because we created our own query engine, we developed the language as well. It’s based on SQL, but we’ve also ported over popular functions used in Presto and Hive such as count_distinct (which uses HyperLogLog), regex_extract, and url_extract_path, and created numerous custom functions that analysts and engineers have requested, such as calculating the standard deviation of a list of numbers or the Shannon entropy of a string. With a built-in auto-completer, these functions are largely self-discoverable (though we also have documentation outlining every function).
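As an aside, the Shannon entropy function mentioned above fits in a few lines. This is a generic sketch of the standard formula, not Guardian’s internal implementation:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string in bits: -sum(p * log2(p)), where p is
    the frequency of each distinct character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())
```

Entropy is a handy spam signal: randomly generated usernames or URLs tend to have much higher entropy than human-chosen ones.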

The full cycle for simple rule creation now takes less than an hour (vs. days), with most of the time spent on analysis. We removed the rule-translation step completely, which made Guardian popular amongst analysts and engineers alike and opened up room for even more improvements.

Alongside the main rules engine and query engine, we added functionality to help with the analysis and rule-creation process, including the following.

Feature Discovery

Clicking on a row pops up a UI that allows users to search columns and view all the features of the row. With thousands of columns, there’s no way we can display them all in a query result, so this functionality is essential for seeing which columns are available for a given event.

Graph Visualizations

Spam often happens in spikes, and so we built in visualizations that make the spikes easily discernible in real time.
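For example, a segmented query over user-creation events might look like this (illustrative syntax and column names, not the exact Guardian dialect):

```sql
-- Hypothetical query: count user-creation events per ISP, which Guardian
-- renders as one time series per ISP.
SELECT isp, count(*)
WHERE event = 'user_create'
GROUP BY isp
```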

The above query segments by ISPs. Guardian will graph the most popular ISPs matching the conditions. We can easily see that there are abnormal spikes in user-creation traffic from the ‘Beeline’ ISP, allowing us to further narrow down the attack pattern.

Backfilling Rules

Our query engine stores three weeks of data. Because rules are just queries, backfilling is easy and fast. We run the compiled rule query through our query engine and send the resulting actions downstream.

Instantaneous Query Results

Unlike Hive or Presto, we don’t wait to gather all the results. We display results as soon as we get them and then continuously append and update results. This way, users get feedback on their queries the instant they click “Run Query.”
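For example, a top-N aggregation over saves might be written like this (illustrative syntax; only the imageSignature column name comes from the post):

```sql
-- Hypothetical query: image signatures saved from the greatest number of
-- distinct IP addresses.
SELECT imageSignature, count_distinct(ip)
WHERE event = 'pin_save'
GROUP BY imageSignature
ORDER BY count_distinct(ip) DESC
```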

The above query finds the image signature that’s been saved by the greatest number of distinct IP addresses (note we have custom functions to display an image if given the imageSignature). The query is only 4% done, but you can already get a sense of the most saved Pins, and the UI is updated every two seconds with the latest results.

Presto Connector

Though the dataset within Guardian is already rich with data relevant for spam, we occasionally want to JOIN it against other tables. Due to the table’s enormous size, it isn’t feasible to dump the data into Hive or any other database format. Instead, we built a Presto connector, which makes the Guardian dataset just another table in Presto. Queries using the table cause Presto to make an RPC to Guardian to extract the selected data.
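From Presto’s side, such a join is ordinary SQL; the catalog, schema, and table names below are made-up placeholders for illustration:

```sql
-- Hypothetical Presto query joining the Guardian dataset (exposed through
-- the connector) against a Hive table.
SELECT g.user_id, h.signup_country
FROM guardian.default.events AS g
JOIN hive.default.user_signups AS h
  ON g.user_id = h.user_id
WHERE g.event = 'pin_save'
```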

Monitoring and Alerting

Along with actions like “deactivate_user,” “hide_pin,” and “force_password_change,” users can also specify an action called “monitor,” which increments a metric whenever the condition matches. The metric is hooked up to our internal monitoring and alerting tool, where dashboards and alerts can then be created.
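A monitoring rule could be as simple as the following sketch (illustrative syntax and metric name, building on the ‘Beeline’ example from earlier):

```sql
-- Hypothetical rule: bump a metric for every user creation from the
-- 'Beeline' ISP, without taking any enforcement action.
monitor('beeline_user_creations')
WHERE event = 'user_create'
  AND isp = 'Beeline'
```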

Counters / Aggregations

We found we frequently wanted to create aggregations over our stream of events, such as counters for the number of Pins created per user. Use cases include creating a new rate limit, feeding the counter into a model as a feature, or building a strike system for policy enforcement. Computing these counters was a prerequisite for a multitude of our rules. For the longest time, these counters were incremented in code or via config files, but this was hard to maintain, especially when the conditions to increment were complex.

We then realized that the counter-creation process could be made much smoother in Guardian, where the conditions to increment are just queries.

To illustrate, this statement in Guardian will populate a counter tracking the number of distinct IPs flagging a domain as spam:
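A hedged sketch of such a statement; the counter name comes from this post, while the rest of the syntax is an illustrative assumption:

```sql
-- Hypothetical counter definition: distinct IPs reporting each domain as
-- spam, over a one-week window.
CREATE COUNTER pin_report_distinct_ips_per_domain_1w AS
SELECT domain, count_distinct(ip)
WHERE event = 'pin_report'
  AND report_reason = 'spam'
GROUP BY domain
OVER WINDOW 1w
```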

pin_report_distinct_ips_per_domain_1w will then be just another column visible in Guardian and usable in rules. A counter can depend on another counter. The rule engine automatically calculates the dependency graph and determines the order in which rules are run.

Note that besides counters, the count_distinct function can be replaced by many of our other aggregations such as sum, array_agg, and simpsons_index. Again, since it’s our custom language, we can add any aggregation that analysts or engineers request. This allows us to experiment with and deploy simple models quickly.

Infrastructure Cost Efficiency

This new rules engine is significantly cheaper than our old Python rules engine. Previously, we had to run Python code on every single event. With Guardian, we first turn a large batch of events into a table and then, for each rule, run a single query over that table. Each additional rule is computationally cheap, since running one more query over an existing batch costs very little.

For example, the Python rule given at the beginning would be pre-compiled into the following query:
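Concretely, the “Hi baby,” rule from the beginning might compile into a single query over the batched table along these lines (illustrative syntax):

```sql
-- Hypothetical compiled form: one query per rule, run once per micro-batch
-- rather than once per event.
SELECT user_id
WHERE event = 'message_send'
  AND message_text LIKE 'Hi baby,%'
  AND device_type = 'android'
  AND account_age_days < 10
```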

Infrastructure Design

Feature Enrichment

We tap Kafka for user activity events. We also expose a Thrift API that allows Pinterest servers to call Guardian directly and get back a response on whether the event should be blocked (rules can specify the “block” action). Our feature-enrichment cluster takes in these events with just a few attributes and enriches them with information from a multitude of sources, including our geolocation library, the user database, the Pin database, ML model scores based on image / user_id / domain, website crawling data, and much more. The end result is one gigantic event that is sent to the rules engine. The rule/query engine is schemaless, so users are free to add or remove features within the feature-enrichment layer.

Rules Engine

After feature enrichment, we use a process called micro-batching, which allows us to batch different API requests together and create a table with hundreds of events that can be queried. As described previously, rules are then run as queries over this new table. Resulting actions are sent to Kafka, and a downstream service listens to the Kafka topic and takes the actions that Guardian specifies.

Some rules involve fetching and storing counters/aggregations. These rules store, update, and fetch the aggregations using Memcached and can add new columns to the table. Rules that run later in the dependency graph can use the data from the newly created columns.

Rules are stored in a MySQL database. A separate process within the rules engine is in charge of listening for changes and compiling the rules. Whenever a rule is activated or deactivated in the UI, this process re-fetches all rules, compiles them into queries, calculates the dependencies between rules, and stores the list of compiled rules in a local cache.

After all the rules have run, all events (which now include data from counters/aggregations) are passed over to the query engine.

Real-time Query Engine

The query engine listens for new events produced by the rules engine. Each node groups events into segments of 16k events each and stores each segment on disk. The delay between an event happening and it being available for queries is around a second or two.

Within each segment, events are stored in columnar format: instead of storing row-by-row, we store column-by-column. Queries only ever use a fraction of the available columns, and with columnar storage we read from disk only the columns a query needs. Within each column, we store the “type” (boolean, string, float, etc.) of each entry via an inverted index (a map from “type” to a 16k bitmap). This lets us store entries of the same type together within a column, which gives us a big boost in performance. It also makes Guardian schemaless: instead of relying on the user to specify each column’s type, Guardian calculates and stores the types automatically. For an even greater performance boost, much of the storage and query code was rewritten and optimized in C, where we have fine-grained control over how memory is accessed and stored.
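As a rough illustration of the per-column type index, here is a toy Python sketch. Real Guardian segments use packed 16k bitmaps and C; the class and field names here are made up, and Python sets stand in for the bitmaps:

```python
from collections import defaultdict

class ColumnSegment:
    """Toy columnar segment: an inverted index maps each runtime type name
    to the set of row positions holding a value of that type."""

    def __init__(self, values):
        self.values = values
        self.type_index = defaultdict(set)
        for row, v in enumerate(values):
            self.type_index[type(v).__name__].add(row)

    def rows_of_type(self, type_name):
        # A query needing only, say, floats can skip every other row
        # without inspecting it.
        return self.type_index.get(type_name, set())

seg = ColumnSegment([1.5, "spam", 2.0, True, "ham"])
```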

For parallelization, hundreds of mapper processes and a handful of reducer processes are spun up per query. Segments are distributed evenly across the mappers. If aggregation is needed (i.e., if there’s a GROUP BY clause), each mapper hashes the GROUP BY key and sends its results to the corresponding reducer, so that all entries with the same key end up on the same reducer. The UI polls every two seconds for results.
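The mapper-to-reducer hand-off described above can be sketched as follows. This is a toy model of hash-partitioned GROUP BY counting, not Guardian’s actual implementation:

```python
from collections import Counter

def run_mapper(keys, num_reducers):
    """Partially aggregate counts per key, partitioned by hashed key so
    that every key is routed to exactly one reducer."""
    partitions = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partitions[hash(key) % num_reducers][key] += 1
    return partitions

def run_reducers(mapper_outputs, num_reducers):
    """Merge the partial counts destined for each reducer."""
    reducers = [Counter() for _ in range(num_reducers)]
    for partitions in mapper_outputs:
        for i, part in enumerate(partitions):
            reducers[i].update(part)
    return reducers
```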

These optimizations made Guardian significantly faster than Hive and Presto for our queries, and gave us a much more compact way to store the data as well.

Summary

Building a system to execute rules is fundamental to fighting spam, but building a system that streamlines the entire spam-discovery and rule-creation process is far more valuable. Guardian’s core design decisions, such as unifying the language for queries and rules, having one single denormalized table, keeping the language highly customizable, and ensuring queries return results in seconds, gave us significant productivity gains and made Guardian an indispensable part of our spam-fighting toolkit.

We hope this post shed some light on the challenges of developing a rules engine.

Acknowledgements

Huge thanks to Harry Shamansky, Maisy Samuelson, and Kate Flaming for help with this blog post! Thanks to Preston Guillory, Cathy Yang, Sharon Xie, Alok Singhal, Farran Wang, and the rest of the Trust & Safety Signals team for helping to design and build Guardian, a monumental project for Trust and Safety at Pinterest!

To learn more about Engineering at Pinterest, check out our Engineering Blog, and visit our Pinterest Labs site. To view and apply to open opportunities, visit our Careers page.
