Building a User Signals Platform at Airbnb

How Airbnb built a stream processing platform to power user personalization.

By: Kidai Kwon, Pavan Tambay, Xinrui Hua, Soumyadip (Soumo) Banerjee, Phanindra (Phani) Ganti

Overview

Understanding user actions is critical for delivering a more personalized product experience. In this blog, we will explore how Airbnb developed a large-scale, near real-time stream processing platform for capturing and understanding user actions, which enables multiple teams to easily leverage real-time user activities. Additionally, we will discuss the challenges encountered and valuable insights gained from operating a large-scale stream processing platform.

Background

Airbnb connects millions of guests with unique homes and experiences worldwide. To help guests make the best travel decisions, providing personalized experiences throughout the booking process is essential. Guests may move through various stages — browsing destinations, planning trips, wishlisting, comparing listings, and finally booking. At each stage, Airbnb can enhance the guest experience through tailored interactions, both within the app and through notifications.

This personalization can range from understanding recent user activities, like searches and viewed homes, to segmenting users based on their trip intent and stage. A robust infrastructure is essential for processing extensive user engagement data and delivering insights in near real-time. Additionally, it’s important to platformize the infrastructure so that other teams can contribute to deriving user insights, especially since many engineering teams are not familiar with stream processing.

Airbnb’s User Signals Platform (USP) is designed to leverage user engagement data to provide personalized product experiences with many goals:

Ability to store both real-time and historic data about users’ engagement across the site.
Ability to query data for both online use cases and offline data analyses.
Ability to support online serving use cases with real-time data, with an end-to-end streaming latency of less than 1 second.
Ability to support asynchronous computations to derive user understanding data, such as user segments and session engagement.
Ability to allow various teams to easily define pipelines to capture user activities.

USP System Architecture

USP consists of a data pipeline layer and an online serving layer. The data pipeline layer is based on the Lambda architecture with an online streaming component that processes Kafka events near real-time and an offline component for data correction and backfill. The online serving layer performs read time operations by querying the Key Value (KV) store, written at the data pipeline layer. At a high-level, the below diagram demonstrates the lifecycle of user events produced by Airbnb applications that are transformed via Flink, stored in the KV store, then served via the service layer:

Figure 1. USP System Architecture Overview

Key design choices that were made:

We chose Flink streaming over Spark streaming because we previously experienced event delays with Spark due to the difference between micro-batch streaming (Spark streaming), which processes data streams as a series of small batch jobs, and event-based streaming (Flink), which processes event by event.
We decided to store transformed data in an append-only manner in the KV store with the event processing timestamp as a version. This greatly reduces complexity because with at-least once processing, it guarantees idempotency even if the same events are processed multiple times via stream processing or batch processing.
We used a config based developer workflow to generate job templates and allow developers to define transforms, which are shared between Flink and batch jobs in order to make the USP developer friendly, especially to other teams that are not familiar with Flink operations.

USP Capabilities

USP supports several types of user event processing based on the above streaming architecture. The diagram below is a detailed view of various user event processing flows within USP. Source Kafka events from user activities are first transformed into User Signals, which are written to the KV store for querying purposes and also emitted as Kafka events. These transform Kafka events are consumed by user understanding jobs (such as User Segments, Session Engagements) to trigger asynchronous computations. The USP service layer handles online query requests by querying the KV store and performing any other query time operations.

Figure 2. USP Capabilities Flow

User Signals

User signals correspond to a list of recent user activities that are queryable by signal type, start time, and end time. Searches, home views, and bookings are example signal types. When creating a new User Signal, the developer defines a config that specifies the source Kafka event and the transform class. Below is an example User Signal definition with a config and a user-defined transform class.

- name: example_signal
type: simple
signal_class: com.airbnb.usp.api.ExampleSignal
event_sources:
- kafka_topic: example_source_event
transform: com.airbnb.usp.transforms.ExampleSignalTransform

public class ExampleSignalTransform extends AbstractSignalTransform {

public boolean isValidEvent(ExampleSourceEvent event) {

}

public ExampleSignal transform(ExampleSourceEvent event) {

}

Developers can also specify a join signal, which allows joining multiple source Kafka events with a specified join key near real-time via stateful streaming with RocksDB as a state store.

- name: example_join_signal
type: left_join
signal_class: com.airbnb.usp.api.ExampleJoinSignal
transform: com.airbnb.usp.transforms.ExampleJoinSignalTransform
left_event_source:
kafka_topic: example_left_source_event
join_key_field: example_join_key
right_event_source:
kafka_topic: example_right_source_event
join_key_field: example_join_key

Once the config and the transform class are defined for a signal, developers run a script to auto-generate Flink configurations, backfill batch files, and alert files like below:

$ python3 setup_signal.py --signal example_signalGenerates:[1] ../flink/signals/flink-jobs.yaml[2] ../flink/signals/example_signal-streaming.conf[3] ../batch/example_signal-batch.py[4] ../alerts/example_signal-events_written_anomaly.yaml[5] ../alerts/example_signal-overall_latency_high.yaml

[6] ../alerts/example_signal-overall_success_rate_low.yaml

User Segments

User Segments provide the ability to define user cohorts near real-time with different triggering criteria for compute and various start and expiration conditions. The user-defined transform exposes several abstract methods which developers can simply implement the business logic without having to worry about streaming components.

For example, the active trip planner is a User Segment that assigns guests into the segment as soon as the guest performs a search and removes the guests from the segment after 14 days of inactivity or once the guest makes a booking. Below are abstract methods that the developer will implement to create the active trip planner User Segment:

inSegment: Given the triggered User Signals, check if the given user is in the segment.
getStartTimestamp: Define the start time when the given user will be in the segment. For example, when the user starts a search on Airbnb, the start time will be set to the search timestamp and the user will be immediately placed in this user segment.
getExpirationTimestamp: Define the end time when the given user will be out of the segment. For example, when the user performs a search, the user will be in the segment for the next 14 days until the next triggering User Signal arrives, then the expiration time will be updated accordingly.

public class ExampleSegmentTransform extends AbstractSegmentTransform {

protected boolean inSegment(List<Signal> inputSignals) {

}

public Instant getStartTimestamp(List<Signal> inputSignals) {

}

public Instant getExpirationTimestamp(List<Signal> inputSignals) {

}

Session Engagements

The session engagement Flink job enables developers to group and analyze a series of short-term user actions, known as session engagements, to gain insights into holistic user behavior within a specific timeframe. For example, understanding the photos of homes the guest viewed in the current session would be useful to derive the guest preference for the upcoming trip.

As transform Kafka events from User Signals get ingested, the job splits the stream into keyed streams by user id as a key to allow the computation to be performed in parallel.

The job employs various windowing techniques, such as sliding windows and session windows, to trigger computations based on aggregated user actions within these windows. Sliding windows continuously advance by a specified time interval, while session windows dynamically adjust based on user activity patterns. For example, as a user browses multiple listings on the Airbnb app, a sliding window of size 10 minutes that slides every 5 minutes is used to analyze the user’s short term engagement to generate the user’s short term trip preference.

The asynchronous compute pattern empowers developers to execute resource intensive operations, such as running ML models or making service calls, without disrupting the real-time processing pipeline. This approach ensures that computed user understanding data is efficiently stored and readily available for rapid querying from the KV store.

Figure 3. Session Engagements Flow

Flink Operations

USP is a stream processing platform built for developers. Below are some of the learnings from operating hundreds of Flink jobs.

Metrics

We use various latency metrics to measure the performance of streaming jobs.

Event Latency: From when the user events are generated from applications to when the transformed events are written to the KV store.
Ingestion Latency: From when the user events arrive at the Kafka cluster to when the transformed events are written to the KV store.
Job Latency: From when the Flink job starts processing source Kafka events to when the transformed events are written to the KV store.
Transform Latency: From when the Flink job starts processing source Kafka events to when the Flink job finishes the transformation.

Figure 4. Flink Job Metrics

Event Latency is the end-to-end latency measuring when the generated user action becomes queryable. This metric can be difficult to control because if the Flink job relies on client side events, the events themselves may not be readily ingestible due to the slow network on the client device or the batching of the logs on the client device for performance. With these reasons, it’s also preferable to rely on server side events over client side events for the source user events, only if the comparables are available.

Ingestion Latency is the main metric we monitor. This also covers various issues that can happen in different stages such as overloaded Kafka topics and latency issues when writing to the KV store (from client pool issues, rate limits, service instability).

Improving Flink Job stability with standby Task Managers

Flink is a distributed system that runs on a single Job Manager that orchestrates tasks in different Task Managers that act as actual workers. When a Flink job is ingesting a Kafka topic, different partitions of the Kafka topic are assigned to different Task Managers. If one Task Manager fails, incoming Kafka events from the partitions assigned to that task manager will be blocked until a new replacement task manager is created. Unlike the online service horizontal scaling where pods can be simply replaced with traffic rebalancing, Flink assigns fixed partitions of input Kafka topics to Task Managers without auto reassignment. This creates large backlogs of events from those Kafka partitions from the failed Task Manager, while other Task Managers are still processing events from other partitions.

In order to reduce this downtime, we provision extra hot-standby pods. In the diagram below, on the left side, the job is running at a stable state with four Task Managers with one Task Manager (Task Manager 5) as a hot-standby. On the right side, in case of the Task Manager 4 failure, the standby Task Manager 5 immediately starts processing tasks for the terminated pod, instead of waiting for the new pod to spin up. Eventually another standby pod will be created. In this way, we can achieve better stability with a small cost of having standby pods.

Figure 5. Flink Job Manager And Task Manager Setup

Conclusion

Over the last several years, USP has played a crucial role as a platform empowering numerous teams to achieve product personalization. Currently, USP processes over 1 million events per second across 100+ Flink jobs and the USP service serves 70k queries per second. For future work, we are looking into different types of asynchronous compute patterns via Flink to improve performance.

Acknowledgments

USP is a collaborative effort between Airbnb’s Search Infrastructure and Stream Infrastructure, particularly Derrick Chie, Ran Zhang, Yi Li. Big thanks to our former teammates who contributed to this work: Emily Hsia, Youssef Francis, Swaroop Jagadish, Brandon Bevans, Zhi Feng, Wei Sun, Alex Tian, Wei Hou.