How PayPal Uses Real-time Graph Database and Graph Analysis to Fight Fraud

Quinn Zuo
The PayPal Technology Blog
11 min readDec 14, 2021

--

Prevent organized and repeat fraudsters by using a home-grown graph platform

By Quinn Zuo, David Zhang, Yan Zhang, and Nitin Sharma

Photo by Lukas Tennie on Unsplash

Introduction

E-commerce has grown exponentially in recent years, fueled by the COVID-19 pandemic. Digital payments are the lifeblood of e-commerce, and with their rise, comes the rise of payments fraud. Detection and prevention of fraud associated with digital payments is complex and ever-evolving, since fraudsters are constantly trying to find new ways to monetize. The payments industry itself is also a targeted domain by organized and repeat fraudsters.

PayPal supports over 400 million active consumers and merchants worldwide. Every second there are several thousand payment transactions. The relationships in this payment network are strong predictors of behaviors and great risk indicators that can reveal fraud. However, these relationships are too huge to traverse and analyze if they were to be stored in a relational database. Graph database, on the other hand, flattens the view and treats relationships (connections) as a first-class citizen and efficiently traverses through connections. Graph analysis can then be performed on top of the graph data to derive insights, such as if a buyer account is compromised and could lead to account takeover, or if a buyer is likely to interact with a risky seller in the next three months, etc.

Graph Database and Graph Analysis

Graph is a universal language for describing relationships between data points (entities) with two simple objects — vertices (nodes) and edges. PayPal’s two-sided network enables us to build a payments graph that maps transactions between buyers and sellers.

In the following diagram, buyers (Elina and Alex) and seller (Online store) are modeled as vertices in the graph. Edges are connections between the vertices. Edges can be relationships such as sending payments or accounts within the same household, etc. Although the graph data model outlined in this blog is not used in PayPal’s production platform, the principle of creating an adequate graph data model on top of the graph database still applies. That is, the design of a graph data model needs to take into consideration of query patterns, latency, and graph modeling needs.

Figure 1. Logical Graph

There are two types of graph analysis we perform. One type is discrete graph analysis. In this type of analysis, we analyze vertices like buyers, sellers, transactions, etc., to create risk assessment. We identify risk through associations of these vertices to known negative lists (e.g., bad tagging, anomalies, etc.).

Another type of analysis is connected graph analysis (or graph analytics). In this type of analysis, we can leverage graph algorithms like page rank to identify the most influential sellers or high frequency paths; or we can use connected components, clustering, and centrality algorithms to detect network or community to determine whether the connected network is a group of friends or a fraud ring formed by a set of coordinated fraudsters, etc.

Graph Platform Overview

PayPal’s graph platform is an integrated platform that consists of real-time, interactive, and analytics graph technology stacks.

Figure 2. Graph Platform

The real-time graph platform serves use cases where the graph query results are needed within a sub-second. The returned query results are features used in risk strategy rules, or embeddings used in machine learning models. The real-time graph is suitable for online fraud prevention. Interactive graph platform serves use cases where the query latency can be within a few seconds or minutes. Use cases for interactive graph usually need graph visualization and are suitable for risk back office or anti-money-laundering (AML) investigations. The analytics graph is used to uncover unknown patterns using graph algorithms, create graph features or embedding, and train graph ML models.

Figure 3. Integrated Graph Platform

The real-time, interactive, and analytics graph technology stacks work cooperatively to form a graph ecosystem that addresses different graph use case needs.

Real-time Graph Database

At PayPal, connecting different relationships in (near) real-time is important for fraud detection. Potential fraudulent activities can be linked in diverse ways soon after the connections are identified and created in graph. When we started the real-time graph platform, we considered build vs. buy. After evaluated several graph vendors and open source graph projects on the market, we found none of these option could fulfill our graph data scalability and query performance needs. We then decided to build a real-time graph database ourselves. Our needs for real-time graph platform include:

  • Customizable high performance query APIs
  • Sub-second level query latency and optimization
  • Seconds level data freshness (near real-time update)
  • Horizontal scalability with fault tolerance
  • Million QPS (query per second) throughput
  • Flexible data backfill from offline to online
  • Compliant to PayPal’s data security and privacy requirements
Figure 4. Architecture of Real-time Graph Stack

The above architecture elaborates real-time graph in both write path and read path. In the write path, offline data loading and event-based graph data updates are abstracted into one graph data process service (details below). While in read path, the graph query engine, with Gremlin[1] as a unified interface, is embedded into graph query service for upstream domain services integration.

Historical graph data loading, incremental graph data loading, and second-level event-based graph data update are requirements for real-time graph to support our business. The below figure is the offline batch loading channel and (near) real-time update channel combined to solve those data loading challenges:

Figure 5. Offline and Near real-time Data Loading

First, the offline channel is set up for graph snapshot data loading and continues to support daily or weekly graph delta data updates. Our graph instance must grow to account for new types of vertices, edges or properties. Thus, historical data of those new data types must be loaded by an offline data loading channel. From disaster recovery perspective, it’s important to have a way to recover data from offline data sources.

Second, event-based near real-time updating is another key to data refresh. Production data sources in PayPal have been abstracted as events/messages (such as login, transaction, and other business-oriented messages). By consuming those events/messages, new account vertices, and new edge links can be directly created in graph database when the business event occurs. Thus, our graph instance can be updated with the latest data in a second-level gap in near real-time.

Graph Query Language — Gremlin [1]

How do we support graph compute in real-time (like 10 milliseconds)? How do we compute a two-hops query when a merchant receives a payment in our repeat fraudster use case (which we’ll discuss further in the next section)? Like SQL in relational database, Gremlin as an Apache open-source graph DSL framework is leveraged in PayPal’s graph platform to conduct relationship query search.

Why Gremlin? In Figure 6, there is a simple graph query from IP ‘192.168.1.1’ to 1st hop linked account and then from account to asset in 2nd hop. By simple computation, you get a count of 2-hop asset. Gremlin is straightforward and easy to understand. More importantly, as real-time graph query requirement from our use case, 2-hops graph queries should be finished in 10 milliseconds on average. It’s easy to rewrite Gremlin storage layer to our own backend storage Aerospike [2] and optimization layer bound with storage Aerospike NoSQL database and some other in-parallel/batch/cache optimizations from PayPal’s graph platform.

Figure 6. Gremlin Based Graph Query Layer

Risk based decisioning is real-time, as payment events have low latency and high throughput characteristics. Traversal (seed) based graph compute should respond in real-time (100 milliseconds SLA). Our query service wraps Gremlin queries as template APIs. These APIs integrate with upstream compute or Risk decisioning micro services. This is the typical real-time graph compute usage in terms of scalability, resiliency, and high performance.

Graph Viewer

Besides real-time graph compute capability, PayPal’s graph platform has a graph-viewer which is another critical component to run Gremlin Graph query and present linking vertices and edges in a graphical view. The following figure, and any Gremlin-based graph query, can be executed from a web-based viewer after choosing a graph instance (like a database instance in a relational database). This is an intuitive way to view linking relationships and sub-graphs. Other UI based graph functionalities, like template based queries or expanding by double clicking vertex, are also supported to enhance graph exploration. Graph viewer has been widely adopted by agents, analysts, data scientists, and developers at PayPal.

Figure 7. Graph viewer

Generating Graph Embeddings with Mirror Graph

Given the temporal nature of fraud evolution, and the heterogeneity of e-commerce represented by activity performed by PayPal customers, one of the key challenges is to mathematically model the relationships and interactions between customer profile assets (such as IP, address, etc.). From a machine learning design point of view, typically predictive modeling tasks can be performed at [3]:

  1. Node Level, where in the problem involves classification of a node into one or more categories, prediction/regression on a continuous feature representing a node or simply dividing a graph into a collection of nodes based on one or more similarity metric akin to clustering.
  2. Edge Level, so as to classify an edge into one or more categories or to discern whether two or more nodes are linked by a set of edges.
  3. Graph Level which involve learning representations through approaches such as regression or classification or when given a series of graphs, to find groupings or similarities between them.

Representations, such as embeddings, can be learned by encoding the graph into lower dimensions at node, edge, or graph levels, too. Although such structurally enriched information has been classically learned through manual feature engineering on node neighborhoods, ensemble learning using all possible paths between node pairs, graph summary statistics or graph kernels, the resulting features may not be robust to the ever-evolving context of concept drift or systematic changes with time [4]. Thus, there is a significant case to be made for learning embeddings or representations directly on graphs depicted by the buyer-seller ecosystems at PayPal. More broadly, such embeddings have been found useful in the following ways:

  1. When trained in an unsupervised context, as features for fraud detection such as account take-over or stolen financial instruments (credit cards, for example).
  2. To assess potential connivance between two fraudulent entities posing as buyers and sellers, attempting to defraud protection programs offered by PayPal.
  3. To identify and learn the characteristics and growth trajectories of power users segregated by type of economic activity (buying/selling, goods/services purchased, sending money to friends and family) with a purpose for targeted marketing action, and micro-economic financial data oriented creditworthiness assessment.
  4. To detect organized fraud that occurs when a group of fraudsters collude to form implicit fraud rings either through identity theft or through fabrication thereof.
  5. Given a temporal sequence of graphs (or the associated lower dimensional representations such as embeddings), to predict evolution of the e-commerce activity for proactive fraud mitigation, identify opportunities for marketing action to improve adoption of specific products and services, to detect churn and to identify gaps in the current strategy action.
Figure 8. Hop neighbors and embedding space [5]

Specifically, for node embedding, the overarching idea is to:

  • Define an encoding from the nodes to embeddings and a node similarity function for the graph under consideration
  • Formulate the problem as a node parameter optimization, such that the similarity between the mappings of nodes in embedding space is as close as possible to the similarity between the nodes in the original graph [5]

Apart from several similarity approaches we have experimented, one that is of particular interest from a computational complexity and platform infrastructure perspective is related to multi-hop similarity, where embeddings can be trained to predict k-hop neighbors or overlaps can be measured between node neighborhoods.

Computationally, generating pairwise node similarities and optimizing for such similarities through low-dimensional embeddings is expensive with a large parameter space for exploration. The graph platform offers a mirror graph capability that allows scalable generation of multi-hop neighborhoods (simultaneously through parallel execution at account level, IP, or device level) for large graph-based networks which are then used to generate node embeddings.

Case Study: Repeat Fraudsters

In this case study, we will look at how repeat fraudsters’ pattern happens and how we leverage real-time graph to detect this fraud pattern in real-time.

Based on comprehensive exploratory analytics, derivation of story-based insights and fraud pattern assessments, we have found that fraud patterns from actors with malicious intent are sporadic, ever-evolving and rapidly reactive. Given the very short turnaround time for relaunching of fraud attack after the previous attempts have been foiled, it becomes imperative to anticipate such attacks swiftly, and dissuade such activities that could have potential for iterative fraud losses.

To demonstrate the idea, here’s a hypothetical situation. There are seven bad accounts, A-G, being restricted due to bad behavior in the past. All seven accounts have something in common — they are all linked to the same profile asset which was added within seven consecutive days. On the eighth day, there is a new account adding the same profile asset which is already a known bad. With the real-time graph, we can link the newly added account to the other existing bad actors in near real-time and prevent potential damage.

Figure 9. A Hypothetical Repeat Fraudsters Pattern

The benefits of real-time graph to address this type of fraudulent activities are as following:

1. The relationship query is much easier. A lot of the connections are already stored in the graph platform which makes the query between two connections only happens within a sub-second.

2. The visualization of the graph provides an intuitive way for teammates to inspect and investigate the connections between two accounts through their profile assets or digital fingerprint like IP address, etc.

3. Risk teammates can identify any relationships much more easily and quickly with graph viewer to spot fraud rings.

Conclusion

Graph technologies have proven to be very effective in fraud detection and prevention. Today, almost all PayPal risk models are using graph features. However, the fraudsters are getting more and more intelligent and fighting back. We continue to enhance the graph platform in the areas of storage and query scalability, on-demand loading, and optimize Graph Neural Networks training, and model time to market.

References

[1] Gremlin: Apache TinkerPop

[2] Aerospike: Aerospike Real-time Data Platform Aerospike (database) — Wikipedia

[3] A Comprehensive Survey on Graph Neural Networks

[4] Representation Learning on Graphs: Methods and Applications

[5] Representation Learning on Networks, SNAP, WWW 2018

--

--