Events: The 4th pillar of Booking.com’s Observability platform

TL;DR: Metrics, Logs and Traces are the three main pillars of observability. At Booking.com, we mainly rely on Events, a proprietary solution developed in-house, to generate traces, logs and most of our metrics. Events fulfil the majority of our observability needs, and we handle tens of millions of them per second.

Booking.com has distributed services running across a hybrid cloud, and each one serves a different purpose. For example, “Service A” takes care of order processing, “Service B” handles inventory, and so on. All services need observability so that we can examine their availability and performance.

What are Events at Booking.com?

The main building block of Booking.com Events is an Event. An Event is a key-value structure that captures detailed information about a particular unit of work.

For example, the Event for an HTTP request would contain all warnings and errors generated while the request was processed. It can also include performance information, such as the duration of the request and the number of database queries and their latency, as well as enabled A/B experiments, application-specific information, etc.

An Event snippet would look like this:

{
"availability_zone" : "london",
"created_epoch": "1660548925.3674",
"Service_name": "service A",
"git_commit_sha": "..", …

}

At first glance, it may resemble a structured log, but Events have some important differences.

Log messages usually focus on a single info/warn/error message, potentially with some context metadata attached to it. An Event, by contrast, accumulates information over an extended period from various functions within the unit of work.
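To make the difference concrete, here is a minimal sketch of how an Event could accumulate information over a unit of work before being emitted once at the end. It is an illustration only: the `Event` class, its method names and every field other than `service_name` and `created_epoch` are assumptions, not the actual Events library API.

```python
import json
import time


class Event:
    """Hypothetical accumulator: one Event per unit of work (not the real library)."""

    def __init__(self, service_name, clock=time.time):
        self._clock = clock
        self._started = clock()
        self.fields = {
            "service_name": service_name,
            "created_epoch": self._started,
            "warnings": [],
            "errors": [],
            "db_queries": 0,
            "db_latency_ms": 0.0,
        }

    def warn(self, message):
        # Warnings are collected, not emitted one by one like log lines.
        self.fields["warnings"].append(message)

    def error(self, message):
        self.fields["errors"].append(message)

    def record_db_query(self, latency_ms):
        self.fields["db_queries"] += 1
        self.fields["db_latency_ms"] += latency_ms

    def finish(self):
        # The whole Event is serialised once, when the unit of work ends.
        self.fields["duration_ms"] = (self._clock() - self._started) * 1000.0
        return json.dumps(self.fields, sort_keys=True)


def handle_request():
    """Different functions in the request path add to the same Event."""
    event = Event("service A")
    event.record_db_query(latency_ms=12.5)
    event.record_db_query(latency_ms=7.5)
    event.warn("cache miss for inventory lookup")
    return event.finish()
```

A logger would have produced three separate lines here; the Event produces a single record that carries all of them together with the timing data.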

Why does Booking.com use Events?

Events give us full context about a given unit of work (HTTP request, cronjob, etc.): user input, performance characteristics, runtime environment and so on. This information is later used to generate the classical observability pillars: metrics, logs and traces. It also allows us to run analytics queries over the Events data.

Events help answer questions that span different interconnected systems. For example, if we detect errors during the flight booking process, we can determine from the Events whether these errors affect a particular group of users, whether they are caused by bots, whether they are related to ongoing experiments on our site, and more.
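As a rough illustration of this kind of question, the sketch below counts failing Events per value of a chosen field. The field names `errors`, `is_bot` and `experiments` are assumptions made for the example; the real Event schema is not shown in this article.

```python
from collections import Counter


def error_breakdown(events, key):
    """Count Events that carry errors, grouped by the value of `key`.

    List-valued fields (e.g. several enabled experiments) count once per item.
    """
    counts = Counter()
    for event in events:
        if not event.get("errors"):
            continue  # only failing units of work are of interest here
        value = event.get(key)
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return counts
```

Calling `error_breakdown(events, "is_bot")` and `error_breakdown(events, "experiments")` on the same set of Events answers two different questions without any extra instrumentation, which is the point of keeping the full context in one record.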

The rich context of Events allows us to retrieve information that spans multiple software components in the Booking.com ecosystem.

How do we use the Events to support observability needs?

Events are sent from applications to Kafka. Those Events are then consumed to generate Metrics, Logs and Traces.

At a high level, in Kubernetes, the application creates Events using the ‘Events library’. The library sends the Events to the Event-proxy daemon running on the host machine. The Event-proxy does three important things:

It enriches the Event with various metadata, e.g. the physical host on which it received the Event.
It routes the Events to different Kafka topics based on custom routing rules, e.g. sending all Events from the order service to the order Kafka topic.
It splits a single incoming Kafka message into multiple smaller Kafka messages for efficient consumption.
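The three steps can be sketched as plain functions. This is a simplified model under stated assumptions, not the Event-proxy itself: the routing table, the `proxy_host` field, the default topic name and the idea that an incoming message is a batch of Events chunked by count are all hypothetical, and producing to Kafka is left out.

```python
import json

# Hypothetical routing table: service name -> Kafka topic.
ROUTES = {"order": "order"}
DEFAULT_TOPIC = "events"


def enrich(event, host):
    """Step 1: attach metadata, here the host that received the Event."""
    return {**event, "proxy_host": host}


def route(event):
    """Step 2: pick a topic from the custom routing rules."""
    return ROUTES.get(event.get("service_name"), DEFAULT_TOPIC)


def split(events, max_per_message):
    """Step 3: break one large batch into smaller messages."""
    return [
        events[i:i + max_per_message]
        for i in range(0, len(events), max_per_message)
    ]


def process(batch, host, max_per_message=2):
    """Return (topic, payload) pairs that a Kafka producer could send."""
    by_topic = {}
    for event in batch:
        enriched = enrich(event, host)
        by_topic.setdefault(route(enriched), []).append(enriched)

    messages = []
    for topic, events in by_topic.items():
        for chunk in split(events, max_per_message):
            messages.append((topic, json.dumps(chunk)))
    return messages
```

With three Events from the order service and one from another service, `process` would yield two messages for the order topic and one for the default topic, each small enough for a consumer to handle quickly.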

For brevity, other compute platforms are excluded from the picture:

Bare-metal servers — Yes, we do have them! Their Events setup design is quite similar to the Kubernetes platform.
Cloud native compute — Cloud-native computing does not use the Events system; instead, it uses OpenTelemetry and cloud-native observability. For instance, cloud-native serverless platforms like Lambda use CloudWatch.

The Event-proxy then sends the Events to Kafka clusters. Several ‘Event consumers’ consume those Kafka messages, each serving a different purpose.

Here are three important Event consumers.

Distributed Tracing Consumer: This consumer handles distributed tracing at Booking.com by uploading spans to our partner, Honeycomb.
APM generator: This consumer generates different APM (application performance monitoring) metrics and stores them in Graphite, e.g. the distinct count of actions for each application and HTTP failure percentiles.
Failed Event Processor: This consumer is interested only in Events with errors and warnings, and writes them to Elasticsearch. Developers across the company use those errors and warnings for debugging.
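The core logic of the last two consumers can be approximated in a few lines. The sketch below is an assumption-heavy illustration: the `action`, `errors` and `warnings` fields and the `apm.<service>.distinct_actions` metric path are made up for the example, and reading from Kafka or writing to Graphite and Elasticsearch is omitted. The output lines follow Graphite's plaintext format of `path value timestamp`.

```python
from collections import defaultdict


def has_failures(event):
    """Failed Event Processor filter: keep Events with errors or warnings."""
    return bool(event.get("errors") or event.get("warnings"))


def distinct_actions(events):
    """APM generator: distinct count of actions for each application."""
    seen = defaultdict(set)
    for event in events:
        seen[event["service_name"]].add(event["action"])
    return {service: len(actions) for service, actions in seen.items()}


def graphite_lines(counts, timestamp):
    """Format the counts as Graphite plaintext lines (hypothetical metric path)."""
    return [
        f"apm.{service}.distinct_actions {count} {timestamp}"
        for service, count in sorted(counts.items())
    ]
```

Both consumers read the same stream of Events and simply project it differently: one aggregates it into metrics, the other filters it into searchable failure records.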

Okay, what’s next?

Events are great: they fulfil most of our observability needs. However, there are a couple of major challenges with Events.

Events are Booking.com-specific, so we need to maintain the Events libraries for code instrumentation ourselves.
Third-party tools that we use inside the company can’t be onboarded to Booking.com’s observability infrastructure without writing custom conversion code.

While addressing these challenges, we evaluated the open-source observability framework OpenTelemetry, or OTel for short. OTel is a framework for instrumenting, generating, collecting and exporting telemetry data such as traces, metrics and logs.

There are a few reasons why we would like to use OTel.

OpenTelemetry code instrumentation is supported by a wide range of programming languages. This will eliminate the need to maintain the Events library code for instrumentation.
OpenTelemetry is extensible. This will help our existing internal tools adopt OpenTelemetry seamlessly.
OTel is natively supported by a number of vendors, so there is no vendor lock-in.

So, while Booking.com Events do a fantastic job of generating the three pillars of observability (Metrics, Logs and Traces), we believe OTel is worth a try for the reasons above.

In the future, we will share in detail our journey of transitioning the Booking.com observability ecosystem from custom Events to OpenTelemetry.

Till then, Happy Observability!