How We Migrated from StatsD to Prometheus in One Month

Eddie Bracho
Mixpanel Engineering
8 min read · Sep 29, 2023


We recently migrated all of our infrastructure metrics from StatsD to Prometheus and are very pleased with the results. The migration was a ton of work and we learned a lot along the way. This post aims to shed some light on why we migrated to Prometheus, as well as outline some of the technical challenges we faced during the process.

Background

Metrics are such a vital, commonplace component of today’s distributed systems that it’s easy to forget that popular open source metrics protocols only started appearing in the last decade. StatsD came out of Etsy in 2011, with Prometheus following shortly afterwards from SoundCloud in 2012. Both collect the same kinds of measurements (counters, gauges, and sample distributions), but they’re architected in significantly different ways that reflect the priorities of the distributed systems they came from. For example, StatsD’s push-based model makes it dead-simple to collect metrics from any process that can send a UDP packet, whereas Prometheus’s decentralized pull-based model makes it easier to scale metrics collection.

Mixpanel was not slow to jump on the metrics bandwagon — we are an analytics company after all. We first began collecting metrics through StatsD in 2013 (just 2 years after our landmark Why We Moved Off the Cloud blog post 😛). For a fast-moving, multi-language tech stack running on a fixed number of machines, StatsD made a lot of sense. Scaling metrics was not even close to being a question in anyone’s mind, and client libraries were incredibly easy to implement and reason about.

Fast-forward several years and Mixpanel’s infrastructure had taken on a completely different shape. We moved back onto the cloud (no landmark blog post this time), which gave us much more freedom to decouple and scale our performance-critical services. Our services now ran on Kubernetes, with individual pods collecting tens of thousands of metric samples per second. Engineers also developed much stricter expectations for how many metric samples could be dropped (none). To deal with these new requirements, we converted our StatsD sidecar connections to TCP and moved metric sample aggregation into the client process (which reduced the contention induced by large flushes over StatsD connections).

At some point we had outgrown traditional StatsD and accidentally engineered our own proprietary metrics architecture. So when the time came for us to change metrics vendors, we realized we had several good reasons to move into the Prometheus ecosystem:

Performance

A lot of the performance work we put into our internal metrics libraries closely mirrors Prometheus’s architectural decisions. Prometheus’s pull-based model eliminates the risk of dropped metric samples, and recording a sample boils down to a single atomic operation in the client process.
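To make that concrete, here’s a minimal sketch using the official Prometheus Go client (prometheus/client_golang). The metric and handler names are hypothetical, not our real instrumentation; the point is that recording a sample is an atomic add on an in-process value, and samples only leave the process when Prometheus scrapes the /metrics endpoint.

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counter, registered with the default registry via promauto.
var requestsTotal = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "myservice_requests_total",
        Help: "Total requests handled, by status.",
    },
    []string{"status"},
)

func handle(w http.ResponseWriter, r *http.Request) {
    // Recording a sample is a single atomic add on local memory;
    // nothing is pushed over the network on the hot path.
    requestsTotal.WithLabelValues("ok").Inc()
    w.Write([]byte("hello"))
}

func main() {
    http.HandleFunc("/", handle)
    // Prometheus pulls samples by scraping this endpoint on its own schedule,
    // so a slow or unavailable metrics backend can't cause dropped samples.
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}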

Reduce Vendor Lock-in

PromQL and Grafana are open standards, meaning that the work we put into migrating our collectors, dashboards, and alert queries could follow us to a variety of different platforms. This would reduce vendor lock-in and give us a ton of flexibility to trial new metrics platforms.

Ecosystem

We love metrics but we’re not an infrastructure metrics company. By investing in the Prometheus ecosystem we can inherit all the updates and improvements that the community produces (e.g., exemplars, native histograms) and free up our attention to focus on other things that are important to our business.

Challenges

The challenges of this migration can be placed into three main categories: metrics collection, metrics reshaping, and query rewrites.

Metrics Collection

The first problem we had to deal with was how to ingest our existing metric samples into a Prometheus metrics stack. We imposed a few requirements for a solution:

  1. Metrics must be dual emitted in StatsD and Prometheus protocols during the migration period.
  2. Services must not be required to re-instrument application code.
  3. No client-side performance degradation.
  4. Access to standard Prometheus distribution metric types (summaries, histograms, native histograms).

There’s an existing tool called statsd_exporter, maintained by the Prometheus organization, that works great for services emitting the standard StatsD protocol. It relays StatsD traffic from your application to your normal StatsD server while also converting the samples to the Prometheus exposition format and publishing them on a /metrics HTTP endpoint.

[Figure: Exporter sidecar architecture]
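From the application’s side, nothing changes: it keeps pushing plain StatsD lines over UDP, just aimed at the local sidecar instead of the StatsD server. A minimal sketch of that push side (":9125" is the port statsd_exporter conventionally listens on for StatsD traffic, and the metric name is made up):

package main

import (
    "fmt"
    "net"
)

func main() {
    // Send a StatsD counter increment over UDP to the local statsd_exporter
    // sidecar, which relays it onward and also exposes it on /metrics.
    conn, err := net.Dial("udp", "localhost:9125")
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Classic StatsD wire format: <name>:<value>|<type>.
    fmt.Fprint(conn, "myservice.requests.ok:1|c")
}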

Unfortunately statsd_exporter didn’t quite fit the bill for us, since many of our services aggregate metric samples in-process: the exporter would only ever see pre-aggregated values, so we would not have access to Prometheus’s distribution metric types. Whatever solution we came up with would need to be embedded in the client-side process.

So that’s exactly what we did. All our performance-critical services are written in Go, and fortunately statsd_exporter is open source and also written in Go. Our solution was to import statsd_exporter into our internal Go metrics SDK and have it process metric samples directly in memory. This approach also had some additional benefits:

  1. Services could seamlessly start converting to the official Prometheus Go SDK at their own pace.
  2. We would have a sidecar-less collection system after the migration.

After a bit of performance-tuning work we had a solution that met all our requirements!

[Figure: Embedded exporter architecture]
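Our actual SDK embeds statsd_exporter’s own parsing and mapping code, but the general idea can be sketched with prometheus/client_golang alone: StatsD-style calls from application code are translated in-process into native Prometheus metrics and served from the same process. The Timing and Serve functions below are hypothetical stand-ins for our internal SDK surface, not its real API.

package metrics

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical latency metric backed by a real Prometheus histogram, which is
// only possible because raw samples are observed in-process rather than being
// pre-aggregated before they reach an exporter.
var latency = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "myservice_request_duration_seconds",
        Help:    "Request latency in seconds.",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method"},
)

// Timing mimics a StatsD-style SDK call so existing call sites don't need to
// be re-instrumented; the sample lands directly in the Prometheus histogram.
func Timing(method string, d time.Duration) {
    latency.WithLabelValues(method).Observe(d.Seconds())
}

// Serve exposes the converted metrics from the client process itself,
// which is what lets us drop the sidecar entirely after the migration.
func Serve(addr string) error {
    http.Handle("/metrics", promhttp.Handler())
    return http.ListenAndServe(addr, nil)
}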

Takeaways

  • Metrics collection can be solved ahead of time. It should be dead-simple for service owners to instrument their service for Prometheus.
  • statsd_exporter has already solved this problem, so leverage it however you can.
  • Ensure there’s a path for services to start using native Prometheus client SDKs going forward.

Metrics Reshaping

The next challenge we faced was how to reshape our metrics. In this context, the shape of a metric refers to its name, labels, and units. While the underlying data models of modern StatsD and Prometheus are virtually identical, their conventions differ significantly, which can have frustrating implications for query ergonomics.

StatsD originally did not support tags and would encode every metric dimension in the metric name. While most implementations now support tags, the dot-namespace convention is still widely used and many metrics use a combination of name prefixes and tags to encode dimensions.

Prometheus on the other hand strongly discourages encoding dimensions in the metric name. In fact, certain PromQL queries will completely break when you try to aggregate across dimensions encoded in the metric name, such as the following success rate query:

sum(rate(
  arb_lqs_grpc_server_LQS_Execute_OK[5m]
))
/
sum(rate(
  {__name__=~"arb_lqs_grpc_server_LQS_Execute_.+"}[5m]
))

// Error expanding series: vector cannot contain metrics with the same labelset

In our initial experiments we quickly realized we had some very important type-1 decisions to make with regard to metric shaping. While it is possible to write a query that hacks around this limitation, we absolutely did not want to normalize that for our engineers for years to come. We needed to find a way to move dimensionality entirely into labels, again without requiring engineers to re-instrument application code.

The first piece of low-hanging fruit was to replace our RPC metrics interceptors with standard open-source Prometheus interceptors, such as the one provided by grpc-ecosystem/go-grpc-middleware:

sum(rate(
  grpc_server_handled_total{
    grpc_service="LQS",
    grpc_method="Execute",
    grpc_code="OK"
  }[5m]
))
/
sum(rate(
  grpc_server_handled_total{
    grpc_service="LQS",
    grpc_method="Execute"
  }[5m]
))

// No error!

This one change ended up having a huge impact, since RPC metrics are by far the most queried metrics.
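For reference, wiring those interceptors in takes only a few lines. The sketch below reflects our understanding of the go-grpc-middleware v2 Prometheus provider’s API, so double-check the package’s own docs before copying; the service registration is elided.

package main

import (
    "net"

    grpcprom "github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus"
    "github.com/prometheus/client_golang/prometheus"
    "google.golang.org/grpc"
)

func main() {
    // Server metrics such as grpc_server_handled_total, with the service,
    // method, and status code exposed as labels instead of in the metric name.
    srvMetrics := grpcprom.NewServerMetrics()
    prometheus.MustRegister(srvMetrics)

    srv := grpc.NewServer(
        grpc.ChainUnaryInterceptor(srvMetrics.UnaryServerInterceptor()),
        grpc.ChainStreamInterceptor(srvMetrics.StreamServerInterceptor()),
    )

    // Register your gRPC services here, e.g. pb.RegisterLQSServer(srv, &lqsServer{}).

    // Pre-populate the metric series for all registered methods so they
    // appear at zero before the first request arrives.
    srvMetrics.InitializeMetrics(srv)

    lis, err := net.Listen("tcp", ":9090")
    if err != nil {
        panic(err)
    }
    _ = srv.Serve(lis)
}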

For other metrics, statsd_exporter provides a config interface for defining metric mappings, using glob-matching rules to reshape metrics on the fly. In practice, however, we found that engineers (who have limited time) would rather write a hacky query than figure out how to write a mapping rule that requires a full re-deploy of their service. It’s our hope that these remaining problematic metrics will eventually be reshaped and re-instrumented with the native Prometheus SDK.
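For completeness, a mapping rule for the name-encoded metric from the earlier example might look like the following. The rule syntax is statsd_exporter’s documented mapping config; loading it through the pkg/mapper package is our reading of that codebase, so treat the exact import path and method name as assumptions.

package main

import (
    "fmt"
    "log"

    "github.com/prometheus/statsd_exporter/pkg/mapper"
)

// Reshape a dot-namespaced StatsD metric like
// "arb.lqs.grpc.server.LQS.Execute.OK" into a single Prometheus metric name,
// moving the service, method, and code dimensions into labels.
const mappingConfig = `
mappings:
  - match: "arb.lqs.grpc.server.*.*.*"
    name: "grpc_server_handled_total"
    labels:
      grpc_service: "$1"
      grpc_method: "$2"
      grpc_code: "$3"
`

func main() {
    m := &mapper.MetricMapper{}
    if err := m.InitFromYAMLString(mappingConfig); err != nil {
        log.Fatalf("invalid mapping config: %v", err)
    }
    fmt.Println("mapping config parsed OK")
}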

Takeaways

  • PromQL does not handle name-encoded dimensions well.
  • Replacing RPC interceptors is a huge win for metric reshaping and requires very little effort.
  • Take extra care to figure out how you want to reshape metrics before migrating queries.

Query Rewrites

The final challenge was to rewrite all our dashboard and alert queries in PromQL. Dashboards needed to be migrated from our old vendor’s proprietary format to Grafana, while alerts needed to be migrated to our new vendor’s proprietary format.

Can We Automate it?

Initially we looked into tools to automatically rewrite our existing queries in PromQL, but it turns out that’s basically an unsolvable problem. There are just too many nuances in the differences between time series query languages and their underlying data, even among PromQL implementations, let alone across entirely different query language specs.

While we couldn’t automatically port queries, it technically would’ve been possible to programmatically convert dashboard and alert templates so that engineers only needed to fill in the queries. Unfortunately, by the time we thought of this approach we were too far into the migration for it to have much of an impact.

Reduce the Pain

Since there was no way around the rote labor of manually rewriting queries, our job was to make that process as painless as possible.

The work we put into the first two categories was a great start; Prometheus instrumentation for a service could be enabled with a single feature flag and our metrics were already reshaped and ready to be queried. Additionally, we set up multiple office hours per week where we could help engineers with PromQL questions or issues. We made sure to document everything to help bring all our engineers up-to-speed with PromQL as quickly as possible.

In total there were ~300 dashboards and ~600 alerts to migrate, totaling roughly 4,000 query rewrites. Split across 20 engineers, we were able to complete this phase of the migration in just over a month!

Takeaways

  • Make sure you fully understand the amount of work before committing to a migration.
  • Automate conversion of dashboard and alert templates if possible.
  • Become a PromQL expert beforehand and schedule time to impart your knowledge to everyone else.

Final Thoughts

I’m incredibly proud of how efficiently our engineering team was able to execute this migration. I’m also excited to see how the Prometheus ecosystem will evolve over time, and I’m glad to say that Mixpanel will be along for the ride!

Do you love metrics as much as we do? If so, the DevInfra team at Mixpanel is hiring! Apply here and mention this blog post in your application!
