Columnar DB File Reader V2: A Complete Rewrite
One of the main pillars of Mixpanel is our proprietary columnar store database, ARB, which we designed specifically to meet the needs of our customers. In this blog post, we delve into a comprehensive rewrite of the event reader code responsible for parsing the columnar files. The primary objective is to significantly improve query performance, particularly for queries with selective filters.
Given that the new abstractions deviated quite substantially from the old ones, we seized this opportunity to also migrate from C to C++ to modernize our codebase. With the introduction of the V2 reader, queries are now 12% faster on average, with some of the slower ones showing improvements of up to 75%.
slower and more selective queries benefit the most from the V2 implementation
Reader V1
Let’s imagine a Mixpanel customer, a streaming service, who wants to find the number of times users played romantic movies released before they were born. Typically, the user’s year of birth is fetched from user profiles, and the movie’s year and genre from a lookup table by movie name/ID. However, for the sake of simplicity, let’s assume they are all just event properties for the time being. An eval node (a recursive tree-like structure for representing a selector expression) for the given filter would look like this:
Filters’ root nodes always output a boolean. The property eval nodes are highlighted in yellow
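To make the structure concrete, here is a minimal C++ sketch of such a tree for the filter above. The type and helper names are hypothetical, not our actual reader code:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of a recursive eval node; not the real reader's types.
struct EvalNode {
    enum class Kind { And, Eq, Lt, Property, Literal };
    Kind kind;
    std::string value;  // property name for Property, constant for Literal
    std::vector<std::unique_ptr<EvalNode>> children;
};

std::unique_ptr<EvalNode> node(EvalNode::Kind kind, std::string value = "") {
    auto n = std::make_unique<EvalNode>();
    n->kind = kind;
    n->value = std::move(value);
    return n;
}

// genre == "Romance" AND release_year < birth_year
std::unique_ptr<EvalNode> buildFilter() {
    auto eq = node(EvalNode::Kind::Eq);
    eq->children.push_back(node(EvalNode::Kind::Property, "genre"));
    eq->children.push_back(node(EvalNode::Kind::Literal, "Romance"));

    auto lt = node(EvalNode::Kind::Lt);
    lt->children.push_back(node(EvalNode::Kind::Property, "release_year"));
    lt->children.push_back(node(EvalNode::Kind::Property, "birth_year"));

    auto root = node(EvalNode::Kind::And);  // the root always outputs a boolean
    root->children.push_back(std::move(eq));
    root->children.push_back(std::move(lt));
    return root;
}
```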
In V1, the ARB reader first loads all columns needed for filter evaluation via file memory mapping (mmap). It then instantiates a column cursor object for each loaded column. The reader iterates through the data one event at a time, advancing the cursors accordingly. For each event, it retrieves the corresponding values from the columns and forwards them to the filter’s eval node, which produces a boolean output determining whether the event should be included or excluded. Upon inclusion, the event is passed to the query-specific code that does further processing based on the query type (insights, funnels, flows, etc.). Whether or not the movie’s genre is Romance, the reader still loads the release year and user birth year values from file into memory. You can already begin to see how this can be made more efficient.
The property eval nodes are evaluated whether their parent eval node gets evaluated or not
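A rough sketch of the V1 loop illustrates the problem. The types and function names here are illustrative stand-ins, not the real reader’s API:

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-ins for the real reader types.
struct ColumnCursor {
    void advanceTo(size_t /*row*/) {}             // touches the mmap'd column data
    double valueAt(size_t /*row*/) { return 0; }  // decodes the value
};
struct EvalTree {
    bool evaluate(const std::vector<double>&) { return true; }
};
struct Query {
    void process(size_t /*row*/) {}  // insights / funnels / flows, etc.
};

// V1: one event at a time, every filter column loaded for every event,
// even when the filter would short-circuit.
void readV1(size_t numEvents, std::vector<ColumnCursor>& cursors,
            EvalTree& filter, Query& query) {
    std::vector<double> values(cursors.size());
    for (size_t row = 0; row < numEvents; ++row) {
        for (size_t c = 0; c < cursors.size(); ++c) {
            cursors[c].advanceTo(row);
            values[c] = cursors[c].valueAt(row);
        }
        if (filter.evaluate(values))  // root eval node outputs a boolean
            query.process(row);
    }
}
```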
Reader V2
1- Predicate Pushdown
The main idea of predicate pushdown is to defer evaluation of the “property[x]” eval nodes until their values are actually needed, avoiding unnecessary loading of column values from files. To achieve that, we break the eval node tree into subtrees whose root nodes evaluate to a boolean that can be combined via AND or OR. Each minimal subtree then becomes responsible for the column cursors of its own properties, incrementing them only when needed.
The property eval nodes are evaluated only if their parent node gets evaluated.
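Sketching the idea in C++ (again with hypothetical types), each minimal subtree owns its own cursors, and the combining AND short-circuits so later subtrees never touch their columns for excluded rows:

```cpp
#include <cstddef>
#include <vector>

// Illustrative types; the real subtrees hold arbitrary expressions.
struct ColumnCursor {
    void skipTo(size_t /*row*/) {}  // only advanced when its subtree runs
};

struct Subtree {
    std::vector<ColumnCursor> cursors;  // cursors for this subtree's columns only
    bool evaluate(size_t row) {
        for (auto& cursor : cursors) cursor.skipTo(row);
        return /* this subtree's expression over its column values */ true;
    }
};

// Subtrees combined via AND: as soon as one evaluates to false, the later
// subtrees never run, so their columns are never read for this row. For the
// example filter, a non-Romance genre means the year columns stay untouched.
bool evaluateAnd(std::vector<Subtree>& subtrees, size_t row) {
    for (auto& subtree : subtrees)
        if (!subtree.evaluate(row)) return false;  // short-circuit
    return true;
}
```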
2- Batching and CPU Cache
The ARB reader uses mmap to access the file’s data, so to improve the cache hit ratio on data access it’s better to read as much data as possible from one column before reading from the next. The new code architecture allows us to do just that. Instead of evaluating the entire tree one row at a time, we evaluate each subtree, pertaining to a single column, for a batch of rows before progressing to the next subtree. Only the ranges of rows that still require further evaluation within a batch are passed to the next subtree. For instance, if the subtrees are combined by an “AND” node, only the ranges that evaluated to true are passed on to later subtrees; conversely, if they are combined by an “OR” node, only the ranges that evaluated to false are passed on. Additionally, this approach enables a column cursor to jump ahead to the next checkpoint in the column when the number of skipped events is large enough. Because values are variable-sized, we can’t always index directly into the desired event in a column; hence we store frequent checkpoints for fast skipping.
no work is done for excluded ranges at every node
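A simplified sketch of the range-passing mechanism might look like the following; the Subtree and RowRange types are illustrative, and the real code handles more cases:

```cpp
#include <cstddef>
#include <vector>

struct RowRange { size_t begin, end; };  // half-open [begin, end)

// Illustrative subtree: consumes the ranges that survived the previous
// subtree and emits only the sub-ranges that still need evaluation.
struct Subtree {
    bool parentIsAnd;  // AND parent: keep rows that are true; OR: keep false
    bool evaluateRow(size_t /*row*/) { return true; }  // stand-in logic

    std::vector<RowRange> evaluateBatch(const std::vector<RowRange>& input) {
        std::vector<RowRange> output;
        for (const RowRange& range : input) {
            size_t runStart = 0;
            bool inRun = false;
            for (size_t row = range.begin; row < range.end; ++row) {
                bool keep = (evaluateRow(row) == parentIsAnd);
                if (keep && !inRun) { runStart = row; inRun = true; }
                else if (!keep && inRun) { output.push_back({runStart, row}); inRun = false; }
            }
            if (inRun) output.push_back({runStart, range.end});
        }
        return output;  // only these ranges reach the next subtree
    }
};
```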
3- Repeat Encoding
In columnar format, if a value is the same for many events in a row, we only store the value once, followed by a special encoding for specifying how many times it repeats. This is a common compaction strategy. The new architecture, which processes columns in batches, allows for efficiency gains by using repeat encoding to skip over consecutive rows with the same value, thus avoiding redundant evaluations.
We use repeat encoding to compact consecutive identical values
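In code, exploiting a run is as simple as evaluating once and crediting the whole run. This sketch assumes a hypothetical run-decoding cursor:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative run-decoding cursor: each run is (value, runLength).
struct RunCursor {
    int64_t value = 0;
    size_t runLength = 1;  // how many consecutive rows share `value`
    void nextRun() {}      // stand-in: decode the next run from the column
};

// One evaluation covers an entire run of identical values.
size_t countMatches(RunCursor& cursor, size_t numRows, int64_t wanted) {
    size_t matches = 0;
    for (size_t row = 0; row < numRows; ) {
        if (cursor.value == wanted)      // evaluated once per run, not per row
            matches += cursor.runLength;
        row += cursor.runLength;
        cursor.nextRun();
    }
    return matches;
}
```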
4- Eval Node Cache
Some of the most expensive single-property eval nodes involve strings. For example, Lower(Event[“genre”]) in [“romance”, “romcom”, “action”, “drama”, “thriller”] necessitates first copying the string in lowercase and then conducting five string comparisons. String columns often exhibit low cardinality, typically corresponding to categories such as countries, states, brands, or movie genres. In our columnar format, we consolidate all strings for a specific column into a dictionary and store dictionary indices within the column’s data section (check out our blog post on column dictionaries). To avoid computing this eval node for the same value tens of thousands of times, we cache the results whenever the property’s value is a string. We use two bit vectors to do so: the first records whether a value has been previously evaluated, and the second stores the evaluation result; both are indexed by the string’s dictionary index. In addition to minimizing evaluations, this also minimizes dictionary-index-to-string lookups.

This only works when the eval node subtree needs a single property. It gets complicated for subtrees that depend on multiple properties, such as property[“origin”] == property[“destination”], because evaluation results are cached by dictionary index. We address this in the next section. We avoid caching for other value types for two reasons. Firstly, bit vectors aren’t suitable for unbounded values. Secondly, while hash maps are an option, they are expensive, especially when compared to single-property eval nodes.
We store dictionary indices rather than the strings themselves in the column’s data section
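A minimal sketch of the two-bit-vector cache, assuming a hypothetical EvalNodeCache type and a caller-supplied evaluation function:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Two bit vectors indexed by the string's dictionary index: one marks
// whether we've seen the index before, the other stores the result.
struct EvalNodeCache {
    std::vector<bool> evaluated;
    std::vector<bool> result;

    explicit EvalNodeCache(size_t dictSize)
        : evaluated(dictSize, false), result(dictSize, false) {}

    template <typename EvalFn>
    bool evaluate(uint32_t dictIndex, EvalFn&& expensiveEval) {
        if (!evaluated[dictIndex]) {
            // First sighting: do the expensive work once (dictionary lookup,
            // lowercasing, string comparisons), then remember the answer.
            result[dictIndex] = expensiveEval(dictIndex);
            evaluated[dictIndex] = true;
        }
        return result[dictIndex];
    }
};
```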
5- Multi Property Eval Node Cache
Oftentimes the slowest queries at Mixpanel are the ones with the most complicated filters: filters that contain many multi-property eval node subtrees that can’t easily benefit from the previous performance improvements. So we decided to add an eval node cache for multi-property eval node subtrees as well. In theory, the cardinality of the cache could be too high to provide any benefit, but in practice it works quite well; how well depends heavily on both the data shape and the filter shape.
In many cases the columns in a filter are not fully independent. For example, let’s imagine a query that filters for watched movies or TV shows with a rating higher than 7.5. If the event is a movie, the rating is fetched from the movie lookup table; otherwise it is fetched from the shows lookup table based on the show’s name and season number.
This subtree cannot be broken down further, and it depends on three unique properties: type, name, season
As you can quickly discern, the cardinality in this context is not excessively high: a given name will consistently have the same type, while the season property is only present for shows. Consequently, the total number of evaluations falls far short of the product of the cardinalities of these three properties. Filters of this shape are quite common.
We opted to implement a cache keyed on the concatenation of these properties’ dictionary indices (rather than their actual strings), utilizing a flat hashmap, since bit vectors are infeasible in this scenario. Maps are worth it here because these types of filters can be expensive to compute. To prevent unnecessary overhead (both in terms of memory and CPU) in cases of excessively high cardinality, we wait until the cache map accumulates 10k entries, at which point we evaluate its cache hit ratio. If it is deemed insufficiently beneficial, we discontinue its usage and population.
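The following sketch captures the mechanism. Here std::unordered_map stands in for the flat hashmap to stay self-contained, the key packing assumes each dictionary index fits in 21 bits, and the 50% hit-ratio cutoff is illustrative only (the 10k entry count is from the text):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct MultiPropertyCache {
    static constexpr size_t kEntriesBeforeCheck = 10'000;

    std::unordered_map<uint64_t, bool> cache;
    size_t lookups = 0, hits = 0;
    bool enabled = true;

    // Concatenate the three dictionary indices into one 64-bit key.
    static uint64_t key(uint32_t type, uint32_t name, uint32_t season) {
        return (uint64_t(type) << 42) | (uint64_t(name) << 21) | season;
    }

    template <typename EvalFn>
    bool evaluate(uint64_t k, EvalFn&& expensiveEval) {
        if (!enabled) return expensiveEval();
        ++lookups;
        if (auto it = cache.find(k); it != cache.end()) {
            ++hits;
            return it->second;
        }
        bool r = expensiveEval();
        cache.emplace(k, r);
        if (cache.size() >= kEntriesBeforeCheck && hits * 2 < lookups) {
            enabled = false;  // insufficiently beneficial: stop using and populating
            cache.clear();
        }
        return r;
    }
};
```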
We conducted an experiment comparing this branch to the main branch to validate our hypothesis regarding cardinality. The results revealed an impressive 94% cache hit rate for queries utilizing a multi-property filter cache, accompanied by a noteworthy 21% performance enhancement. Despite only a fraction of queries reaping that benefit, they notably comprised some of the slowest ones. Consequently, the implementation of the cache proved to be worthwhile as we are targeting both the average latency as well as the tail latency.
Results restricted to queries whose filters utilize the multi-property eval node cache
Future Work
In addition to these significant performance enhancements, we modularized and modernized our codebase, eliminating substantial technical debt in the process. Yet there remains ample opportunity for further refinement, for instance by preprocessing filters and rewriting them in a more optimal form. The most selective eval node subtrees can be moved to be the first children of an “AND” node and the last children of an “OR” node, minimizing the total number of evaluations (see the sketch after the figure below). Furthermore, certain filters can be rearranged so that each subtree contains only a single property eval node, maximizing the advantages offered by the V2 reader. For instance, the previously mentioned filter can be rearranged to minimize the number of unique properties found in each eval node subtree:
the filter is transformed into a tree of ANDs and ORs where most subtrees reference a single property
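As a sketch of the reordering idea (hypothetical types, with selectivity meaning the estimated fraction of rows a subtree passes):

```cpp
#include <algorithm>
#include <vector>

// Illustrative subtree carrying only its estimated pass rate.
struct Subtree {
    double selectivity;  // estimated fraction of rows that evaluate to true
};

void reorderChildren(std::vector<Subtree>& children, bool parentIsAnd) {
    std::sort(children.begin(), children.end(),
              [parentIsAnd](const Subtree& a, const Subtree& b) {
                  // AND: rarely-true subtrees first, to short-circuit to false early.
                  // OR: often-true subtrees first, to short-circuit to true early.
                  return parentIsAnd ? a.selectivity < b.selectivity
                                     : a.selectivity > b.selectivity;
              });
}
```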
If you enjoy low level performance optimizations like the ones mentioned in this blog — Mixpanel engineering is hiring!