How Mixpanel delivers funnels up to 7x faster than the data warehouse

Co-authored with Illirik Smirnov

Introduction

Now that Mixpanel can connect to tables from the data warehouse, many are curious how Mixpanel stacks up to SQL, the traditional interface for warehouse data in terms of query performance. Mixpanel not only eliminates the need to write SQL but also saves time with improved performance, thanks to our custom-built database optimized for product analytics.

Today, we’ll benchmark Mixpanel’s performance against Snowflake, a popular data warehouse. Our focus will be on funnels, a critical type of product analytics query used to determine conversion rates and time-to-convert within defined product and business flows.

At Mixpanel, we built our own database called Arb. It’s an efficient column store, with a purpose-built query engine, based on the event data model. By modeling client, server, and warehouse data as events, we were able to optimize the performance of many of the common product analytics queries that data warehouses struggle with.

To be clear before we get to the comparison— at Mixpanel, we love the data warehouse for many use cases. We store and process almost all of our business data in BigQuery, maintain a large DBT project, and thoroughly dogfood Mixpanel’s data warehouse integrations (Data Pipelines and Warehouse Connectors). However, warehouses struggle with many common product analytics queries; their relational model, while flexible, isn’t optimized for that use case.

In our benchmarking, we found that Mixpanel is 3–7X faster than a 6XL Snowflake warehouse, the largest warehouse size available. Read on to learn more about:

  • How the architecture and event data model of Arb enables much more efficient funnel queries vs. Snowflake and its relational model
  • Why these advantages persist even with common warehouse optimizations, like clustered tables and warehouse scaling
  • The steps your data team can take to use Mixpanel with the data that’s already in your warehouse

How funnel queries work

First, let’s explain how funnel queries work. Imagine an online store wants to see how well their ads perform. They want to measure the conversion rate, or the percentage of users who complete a purchase after clicking the ad. They use a funnel to track users who first see an ad, then click on it, add the item to their cart, and finally make a purchase.

However, one metric doesn’t tell the full story. Some customers might buy a different product, some ads might perform better on certain platforms or with specific demographics, and some products with lower conversion rates might have higher profit margins. All of these examples require further slicing and dicing of the funnel query.

When funnel queries are fast and efficient, people can easily explore their data, test hypotheses, and uncover key insights. However, in a data warehouse, funnel queries are often slow and inefficient. Here’s what a typical funnel query looks like in SQL; a data warehouse has to scan large event tables, with complex joins on multiple conditions.

A typical funnel query in SQL:

WITH funnel_entries AS (
SELECT a."user_id", a."time",

ROW_NUMBER() OVER(PARTITION BY a."user_id" ORDER BY a."time" ASC) AS funnel_entry_order

FROM
TABLE AS a
WHERE
a."event" = 'watch video'
AND a."time" BETWEEN DATEADD(DAY, -365, CURRENT_DATE()) AND CURRENT_DATE() )

SELECT

COUNT(DISTINCT a."user_id") AS funnel_start_count,
COUNT(DISTINCT b."user_id") AS funnel_end_count
FROM
funnel_entries AS a
LEFT JOIN
TABLE AS b
ON
a."user_id" = b."user_id"
AND (a."time" < b."time" AND (TIMESTAMPDIFF(DAY, a."time", b."time") < 2))
AND b."event" = 'add to cart'
LEFT JOIN
TABLE AS c
ON
b."user_id" = c."user_id"
AND (b."time" < c."time" AND (TIMESTAMPDIFF(DAY, b."time", c."time") < 2))
AND c."event" = 'checkout'
WHERE
a.funnel_entry_order = 1

Here’s what this query looks like in the Snowflake query planner; as expected, scans and joins dominate the runtime. Performance will improve significantly after the scanned data enters cache (where it becomes Local Disk I/O, rather than Remote Disk I/O), but exploratory analysis will usually have a very diverse query pattern, reducing the benefits of caching.

On the other hand, Mixpanel was designed for fast and efficient exploratory analysis, including funnels, by using an events-based data model. Events contain an action, identifier, timestamp, and associated properties, and are ingested through infrastructure designed for events-based data. This allows us to shard events for efficient processing by our in-memory query engine, eliminating the costly fact-on-fact joins and full table scans that dominate the runtime of the equivalent query in warehouse. Combine this with cloud-scale multi-tenancy, and the result is a substantial performance improvement in many workloads — including funnels.

Benchmarking methodology

To benchmark Mixpanel’s query performance against Snowflake, we conducted a series of tests focusing on funnel queries, especially advanced features like holding property constant and specific order funnels. We mirrored these Mixpanel queries with corresponding SQL queries in Snowflake, testing each on the same dataset with different clustering keys. Here’s how we performed our benchmark:

Data Preparation

  • We prepared a sample dataset with relevant columns like user_id, event_name and time to simulate the events data.
  • The dataset contained 116M rows, spanning over one year. This ensures the dataset contains enough records to evaluate the impact of clustering keys on query performance.

Table Creation

  • We created three separate tables in Snowflake, each using a different clustering key configuration.
  • We loaded the same dataset into each table to maintain consistency for comparison.

Funnel Query Design

  • We designed a set of funnel queries to benchmark the advanced funnel functionality in Mixpanel:
  • Hold Property Constant: Analyze user behavior through funnel steps while maintaining consistency in properties like browser.
  • Specific Order Funnel: Ensure users follow a specific sequence of events to complete the funnel.
  • We tested five queries: two- and three-step funnels without these features, and three two-step advanced funnels (using each combination of advanced features).

SQL Query Execution

  • We executed these queries on each table with an empty cache, recording the performance metrics. All queries were run twice; once on an X-Small warehouse, and once on a 6X-Large, using standard warehouses (i.e. not Snowpark-optimized).
  • We used Snowflake’s query history and performance views to get query duration, as well as other metrics like bytes/rows scanned and compilation time. The runtime for the best clustering configuration was used.

Results

Mixpanel consistently outperforms Snowflake in funnel query speeds. Mixpanel demonstrates approximately 7X faster performance than the fastest Snowflake clustering configuration for the table being queried. This is on Snowflake’s X-Small warehouse, the most commonly used size. We also run the same queries on the largest Snowflake warehouse size possible, 6X-Large. There, Mixpanel still outperforms Snowflake by a factor of about 5X.

Here is the Hex notebook we used to generate and run the SQL queries.

Scaling, clustering, and why they don’t close the gap

The most common ways that slow warehouse queries get accelerated are compute scaling and clustered tables. However, our benchmarking results show that even with clustered tables and a larger warehouse, Mixpanel’s performance advantage persists.

Scaling our warehouse to a 6X-Large, the largest compute node Snowflake offers, improved performance for most queries, but remained slower than Mixpanel in all cases. Why can’t a warehouse with a ~$1,500/hr list price close the gap?

The answer is that scaling your warehouse just gives it more compute, memory, and local cache — it doesn’t magically speed up remote disk I/O or reduce synchronization time. Even on an X-Small, our queries don’t spill to disk, so the extra memory isn’t used. The query involves relatively little compute, and since there are no other queries running, we don’t benefit from reduced resource contention. This shouldn’t be a surprise; Snowflake’s docs themselves say that increasing warehouse size won’t necessarily speed up your queries.

Clustering also improved performance somewhat in our testing; the fastest configuration clustered first on event_name, then on user_id, then on time. However, with relatively small datasets such as this one (3.7GB), Snowflake’s default micropartitioning behavior is already quite performant, and the benefits may not outweigh the increased load during writes to the table. In any case, clustering only improves performance for queries that can take advantage, and you may not be able to anticipate the queries that your users will run during exploratory analysis.

How to get your existing data into Mixpanel

The best way to get your existing warehouse data into Mixpanel is to use our warehouse connectors, which allows you to configure syncs that bring data from your warehouse into Mixpanel on a regular basis. If historical data in your warehouse changes over time (e.g. enrichment data or events relying on late-arriving data), Mirror will propagate those changes into Mixpanel, ensuring that your metrics are consistent across Mixpanel and your warehouse.

Mixpanel also offers many other integration options, such as APIs that can be used in data pipelines and partnerships with leading reverse ETL providers. You can learn more about integration options with Mixpanel here!

Caveats and conclusion

As always, your mileage may vary. Your data, schemas, queries, and warehouse load will all impact how long your queries take. Snowflake and Mixpanel both have caching behaviors that can improve performance when similar queries are run in succession, which can provide significant performance improvements for use cases like dashboards and scheduled jobs.

Of course, the best way to see just how much faster your funnels can be with Mixpanel is to give it a try for yourself!

For detailed metrics and further insights, refer to the complete dataset and query results linked in the methodology section.

首页 - Wiki
Copyright © 2011-2024 iteam. Current version is 2.137.1. UTC+08:00, 2024-11-22 10:57
浙ICP备14020137号-1 $访客地图$