Improving Efficiency Of Goku Time Series Database at Pinterest (Part — 3)
Monil Mukesh Sanghavi; Software Engineer, Real Time Analytics Team | Ming-May Hu; Software Engineer, Real Time Analytics Team | Xiao Li; Software Engineer, Real Time Analytics Team | Zhenxiao Luo; Software Engineer, Real Time Analytics Team | Kapil Bajaj; Manager, Real Time Analytics Team |
At Pinterest, one of the pillars of the observability stack provides internal engineering teams (our users) the opportunity to monitor their services using metrics data and set up alerting on it. Goku is our in-house time series database that provides cost efficient and low latency storage for metrics data. Underneath, Goku is not a single cluster but a collection of sub-service components including:
- Goku Short Term (in-memory storage for the last 24 hours of data and referred to as GokuS)
- Goku Long Term (SSD- and HDD-based storage for older data and referred to as GokuL)
- Goku Compactor (time series data aggregation and conversion engine)
- Goku Root (smart query routing)
You can read more about these components in the blog posts on GokuS Storage, GokuL (long term) storage, and Cost Savings on Goku, but a lot has changed in Goku since those were written. We have implemented multiple features that increased the efficiency of Goku and improved the user experience. This three part blog post series covers those efficiency improvements (see Part 1 and Part 2), and this final part covers how we reduced the overall cost of Goku at Pinterest.
Overview of Goku Ecosystem
Architectural overview of Goku
- A sidecar running on every host in Pinterest emits metrics into a Kafka topic.
- Goku’s ingestor component consumes from this Kafka topic and then produces into another Kafka topic (each partition corresponds to a GokuS shard).
- GokuS consumes from this second Kafka topic and backs up the data into S3.
- From S3, the Goku Shuffler and Compactor create the long term data ready to be ingested by GokuL.
- Goku Root routes the queries to GokuS and GokuL.
In this blog, we focus (in order) on architectural changes we made in Goku to achieve the following:
- Provide features to the client team for cost saving initiatives
- Reduce resource footprint of the storage nodes
- Help adopt less expensive instance types without affecting SLA
Helping Client Team Reduce Time Series Storage
The Goku team provided two features (metrics namespaces and reporting of the top write-heavy metrics) to the client Observability team, which helped them reduce the data stored on Goku. This was a good example of collaboration with effective results.
Metrics Namespace
Initially, Goku had a fixed set of properties for the metrics it stored, such as:
- in-memory storage for metrics data less than one day old,
- secondary storage (a mix of SSD and HDD) for metrics data between one and 365 days old,
- raw metrics data retained for 24 days, and
- rolled-up time series at 15 minute and then one hour granularity as the data gets older, etc.
These were static properties defined during cluster setup. Adding a metric family with different properties required setting up new clusters and pipelines. As time passed, we received requests to support more metric families with different configurations. Aiming for a generic solution to these ad-hoc requests, we added namespace support in Goku.
Definition: A namespace is a logical collection of a unique set of metric configurations/properties like rollup support, backfilling capability, TTL, etc. A metric belonging to a namespace inherits all of the namespace’s configured properties. A metric can also belong to more than one namespace, in which case multiple copies of the metric may be present in storage.
Some example namespace (NS) configurations are shown below:
Fig 1: Example configurations of three namespaces named ns1, ns2, and ns3
In Fig 1, a metric belonging to namespace NS1 will have its last one day’s data points stored in memory, while older data points will be stored and served from disk. The metric will also have 15 minute rolled up data available for the last 80 days and one hour rolled up data afterwards. However, a metric belonging to namespace NS2 will not have any data stored on disk. Note how a metric belonging to namespace NS3 will have the capability to ingest data as old as three days (backfill allowed), whereas metrics in NS1 and NS2 cannot ingest data older than two hours.
Based on their requirements, a user can select their metric to be included in an existing namespace or have a new namespace created. The information about which namespace holds what metrics is stored in the same namespace configuration as a list of metric prefixes under each namespace (see metrics: [] in each configuration). For example, in Fig 1, metrics with prefixes metric2 and metric3 are stored in namespace NS2, whereas metric4 and metric5 prefixed metrics are stored in NS3. All other metrics are stored in the default namespace NS1.
The namespace configurations (Fig 1) are stored in a dynamic shared config file watched by all hosts in the Goku ecosystem (Fig 2). Whenever the contents of this file change, the Goku process running on each host is notified and parses the new content to pick up the changes.
Fig 2: Goku architecture with namespace
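To make these properties concrete, a parsed namespace entry could map to a plain struct like the hedged sketch below. The field names, types, and defaults are hypothetical illustrations of the configuration described above, not Goku's actual schema.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of a parsed namespace entry; the real Goku config
// schema is internal and may differ.
struct NamespaceConfig {
  std::string name;                         // e.g. "ns1"
  std::vector<std::string> metricPrefixes;  // the metrics: [] list of prefixes owned by this namespace
  std::string kafkaTopic;                   // e.g. "ns1_topic"
  std::string s3Bucket;                     // e.g. "ns1_bucket"
  uint32_t inMemoryTtlHours = 24;           // served from GokuS (in memory)
  uint32_t onDiskTtlDays = 365;             // served from GokuL (SSD/HDD); 0 = no disk storage
  uint32_t rawRetentionDays = 24;           // raw (un-rolled-up) data retention
  bool rollupEnabled = true;                // 15 minute / 1 hour rollups for older data
  uint32_t backfillWindowHours = 2;         // how old an incoming data point may be
};
```

A watcher on the shared config file would simply rebuild the list of such entries whenever the file's contents change.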
Write path
The Goku Ingestor watches the namespace config file and internally builds a prefix tree, which maps metric prefixes to the namespaces they belong to. For every data point, it queries the prefix tree with the data point’s metric name to determine which namespace the data point belongs to. It then forwards the data point to the Kafka topic configured for that namespace (see Fig 1: for example, ns1_topic is the Kafka topic for namespace NS1). The GokuS cluster consumes data points from all the Kafka topics (i.e. from every namespace). Internally, it manages a list of shards for each namespace and forwards each data point to the correct shard in its namespace. For backup, the metrics data in each namespace is stored in S3 under a separate directory (see the S3 bucket in each namespace configuration in Fig 1; for example, ns1_bucket is the S3 bucket for namespace NS1 while ns3_bucket is for NS3). For all namespaces that require disk based storage for old data, the data points are processed by the Goku Shuffler and Compactor and then ingested by GokuL for serving.
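As a rough illustration of the prefix-to-namespace mapping, the sketch below shows a minimal character-level prefix tree with longest-prefix lookup. Class and method names are hypothetical, and the default namespace is assumed; Goku's actual Ingestor code is internal and more elaborate.

```cpp
#include <memory>
#include <string>
#include <string_view>
#include <unordered_map>

// Minimal sketch of a prefix tree mapping metric-name prefixes to namespaces,
// assuming a default namespace for metrics that match no configured prefix.
class NamespacePrefixTree {
 public:
  void addPrefix(std::string_view prefix, std::string ns) {
    Node* node = &root_;
    for (char c : prefix) {
      auto& child = node->children[c];
      if (!child) child = std::make_unique<Node>();
      node = child.get();
    }
    node->ns = std::move(ns);
  }

  // Returns the namespace of the longest configured prefix of `metricName`,
  // or the default namespace if none matches.
  std::string lookup(std::string_view metricName,
                     std::string defaultNs = "NS1") const {
    const Node* node = &root_;
    std::string best = defaultNs;
    for (char c : metricName) {
      auto it = node->children.find(c);
      if (it == node->children.end()) break;
      node = it->second.get();
      if (!node->ns.empty()) best = node->ns;
    }
    return best;
  }

 private:
  struct Node {
    std::unordered_map<char, std::unique_ptr<Node>> children;
    std::string ns;  // non-empty only at the end of a configured prefix
  };
  Node root_;
};
```

The Ingestor would then publish the data point to the Kafka topic configured for the returned namespace (e.g. ns1_topic for NS1).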
Read path
After reading the namespace config file, Goku Root maintains a prefix tree (mapping prefixes to namespaces) similar to the Ingestor’s. Using the tree and the query’s time range, the query is routed to the correct namespace and shard. Depending on its time range, a query can cross two namespaces.
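Time range also decides whether the short term (GokuS) or long term (GokuL) tier serves part of a query. Below is a deliberately simplified sketch of that split, assuming a 24 hour in-memory window; the actual Goku Root routing (including namespace and shard selection) is more involved.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified sketch of how a root-level router might split a query's time
// range between short term (GokuS, last 24h in memory) and long term (GokuL)
// storage. The actual Goku Root routing logic is internal and more involved.
enum class Backend { kGokuS, kGokuL };

struct SubQuery {
  Backend backend;
  int64_t startTs;  // inclusive, seconds since epoch
  int64_t endTs;    // inclusive
};

std::vector<SubQuery> splitByTimeRange(int64_t startTs, int64_t endTs,
                                       int64_t nowTs) {
  constexpr int64_t kInMemoryWindowSec = 24 * 60 * 60;
  const int64_t cutoff = nowTs - kInMemoryWindowSec;
  std::vector<SubQuery> out;
  if (startTs < cutoff) {
    out.push_back({Backend::kGokuL, startTs, std::min(endTs, cutoff - 1)});
  }
  if (endTs >= cutoff) {
    out.push_back({Backend::kGokuS, std::max(startTs, cutoff), endTs});
  }
  return out;
}
```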
Cost savings
The client Observability team did some analysis (not in the scope of this document) on the metrics data and inferred that a subset of old metrics data doesn’t need to be present at host granularity. Hence, a namespace was introduced to store a subset of metrics for only a day (only in GokuS and not GokuL). The partial storage reduction coming out of this is shown in Fig 6 below.
This feature will also be useful in the future when we move metrics to cost efficient storage based on their usage.
Providing Top Write-Heavy Metrics to Client Team for Analysis
Goku collects the top 20K metrics based on cardinality and cumulative data point count every couple of hours and dumps the data into an online analytical database for real time analysis. The Observability team has a tool that uses this data, applies some heuristics on top (not in the scope of this document), and determines which metrics should be blocked.
Fig 3: Number of time series stored in GokuS in early 2023 is ~16B
Fig 4: Number of time series stored in GokuS currently (end of 2023) is ~10B
With the help of the features provided by Goku, the Observability team was able to reduce the number of time series stored in GokuS by almost 37% from 16B (Fig 3) to ~10B (Fig 4).
Fig 5: Blocked metrics with high cardinality
Almost 60K metrics with high cardinality have been blocked so far.
Fig 6: ~27% decrease in disk usage on GokuL nodes
And the results show up in the disk usage reduction of the GokuL hosts.
Architecture Changes/Fixes For Cost Reduction
Apart from the metrics namespace and the provision of top write heavy metrics, the Goku team made improvements and changes in the Goku architecture, including but not limited to design and code improvements (in GokuS, Goku Compactor, and Goku Ingestor), process memory analysis (in GokuS), and cluster machine hardware evaluation (GokuS, Goku Compactor, and GokuL). The following sections focus on these improvements, which were made primarily to reduce system resource consumption so that we could cut capacity (pack more into less) and hence reduce cost.
Indexing Improvements for Metric Name (GokuS)
A time series’ metadata, or key, consists of the following:
Multiple hosts can emit time series for a unique metric name (e.g. cpu, memory, disk usage, or some application metric).
The table above is a close example of a very small subset of possible time series names. Note that many tag=value strings repeat across the time series names. For example: cluster=goku appears in four time series, az=us-east-1a appears in six time series, and os=Ubuntu-1 appears in all eight time series.
In the Goku process, every time series would store the full metric name like this: “proc.stat.cpu:host=abc:cluster=goku:az=us-east-1a:os=ubuntu-1” where “:” is the tag separator. We added tracking for the cumulative size of the strings used to store these metric names and inferred that these strings consumed a lot of the process virtual memory (and hence host memory, i.e. physical RAM).
Fig 7: Almost 12 GB memory per host consumed by metric name strings
Fig 8: ~8TB per cluster consumed by metric name strings
As we can see from Fig 7 and Fig 8, ~12 GB per host and ~8 TB per Goku cluster of host memory was consumed by full metric name strings. We then added tracking in the code for the cumulative size of the unique full metric name substrings (i.e. only the metric name and the individual tag=value strings, which we’ll call metric name components).
Fig 9: Cluster wide full metric name size and metric name components size
Fig 10: String duplication of almost 30X
As seen in Fig 9, the cumulative size of the metric name components in the cluster was far smaller. The full metric name size was almost 30–40x the cumulative size of the metric name components. This meant we had a huge opportunity to reduce the memory footprint of the process by storing a single copy of each metric name component and pointing to that copy, rather than holding multiple copies of the same string.
When we implemented the change to replace the full metric name stored per time series with an array of string pointers (this was replaced by an array of string_views later for write path optimizations), we achieved good results.
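The core of the change can be illustrated with a small interning sketch like the one below, using string_views into a single pool of component strings. This is a hedged simplification with hypothetical names; Goku's production code also has to handle concurrency, cleanup of unused components, and the ":" separators.

```cpp
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Minimal sketch of the dictionary-encoding idea: keep a single interned copy
// of each metric-name component ("proc.stat.cpu", "cluster=goku", ...) and let
// every time series hold lightweight views into that copy.
class ComponentDictionary {
 public:
  // Returns a view over the single stored copy of `component`. insert() is a
  // no-op if the string is already present; references into an unordered_set
  // stay valid, so the returned view is stable.
  std::string_view intern(std::string_view component) {
    auto it = pool_.insert(std::string(component)).first;
    return std::string_view(*it);
  }

 private:
  std::unordered_set<std::string> pool_;
};

struct TimeSeriesKey {
  // e.g. {"proc.stat.cpu", "host=abc", "cluster=goku",
  //       "az=us-east-1a", "os=ubuntu-1"}
  std::vector<std::string_view> components;
};

TimeSeriesKey makeKey(ComponentDictionary& dict,
                      const std::vector<std::string>& parts) {
  TimeSeriesKey key;
  key.components.reserve(parts.size());
  for (const auto& p : parts) {
    key.components.push_back(dict.intern(p));
  }
  return key;
}
```

Each time series now pays only for a small vector of views, while each unique component string is stored once per process.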
Fig 11: Cluster-wide memory consumption for storing metric name
Fig 12: Host level memory consumption for storing metric name
After we deployed the change, we saw a sharp fall in the memory footprint for metric name storage: a reduction of ~9 GB (12 GB to 3 GB) on average per host (see Fig 12) in the Goku cluster. This architectural improvement will also pay off in the future as the number of time series per metric (i.e. its cardinality) grows.
We had to make sure the code changes did not affect the query SLA we had set with the client team. Most of the query path was unaffected by these modifications; the only change needed was to supply a custom iterator to the boost::regex_match function (used for regex matching the tags of a time series). The function previously operated on the full metric name string but now works with a custom bidirectional iterator over a vector of string views.
Compaction Improvements (Goku Compactor)
The above method of using dictionary encoding to store full metric names as vectors of unique string pointers rather than the original strings helped in the compaction service as well. As explained in the overview of the Goku architecture at the start of this blog, the compactor creates long term data ready for GokuL ingestion. To do this, the compactor routine loads multiple buckets of lower tiered metrics data (the tiering strategy is explained in the Goku Long Term Architecture Summary in this blog) into memory and creates a single bucket for the next highest tier. It can also roll up the data at this step if needed. While doing this, it loads a lot of time series metadata (i.e. full metric names) as well as the Gorilla-compressed data points into memory. For higher tier compaction, multiple billions of time series can be loaded into memory at the same time. This would cause out of memory scenarios on the hosts, and on-callers would have to mitigate by replacing the affected host with one with more RAM. We also had to limit the number of threads doing these heavy tier compactions.
Fig 13: Memory usage drop seen per host doing heavy tier compaction after dictionary encoding
Fig 14: No OOMs after dictionary encoding deployed
After deploying the dictionary encoding change, we not only avoided OOM scenarios (Fig 14) but were also able to permanently move to instance types with less memory (Fig 13) than before. We also have not needed to add on demand instances. This helped us immensely in reducing the infrastructure costs of Goku Compactor.
Memory Allocation Statistics and Analysis (for GokuS)
GokuS is an in memory database storing the last 24 hours of metrics data. It uses hundreds of memory optimized machines, which are costly. Since infrastructure cost savings were always important, it was essential to understand the memory usage of the Goku application, with the aim of finding leads to cut the usage per host (and hence the capacity) or simply be able to pack more data in. Note that the analysis was of the application’s virtual memory usage. Since Goku has a tight query latency SLA, it would not be acceptable to end up in a situation where memory is swapped to disk as virtual memory usage nears the physical RAM capacity. The out of memory (OOM) score of the Goku process was not modified, so it would be OOM-killed when its virtual memory usage reached the physical thresholds.
The GokuS application uses the jemalloc library for memory management. More details can be found in this link. To summarize, it uses arena-based allocation and thread caches for faster memory operations in multi threaded applications. It also uses the concept of size classes (bins) for efficient memory management. Jemalloc also exposes an API (malloc_stats_print) that prints the application’s current usage statistics. More information about the stats can be found on its man page (look for “stats.”).
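As a hedged sketch (assuming the process is linked against jemalloc with its default, unprefixed API), dumping these statistics from inside the application can be as simple as:

```cpp
#include <jemalloc/jemalloc.h>

// Dump jemalloc's current statistics. With a null write callback the output
// goes to stderr; a custom callback could capture it into a buffer instead.
// The "J" option (in recent jemalloc versions) requests JSON output, which is
// easier to post-process than the human-readable form.
void dumpJemallocStats() {
  malloc_stats_print(/*write_cb=*/nullptr, /*cbopaque=*/nullptr, /*opts=*/"J");
}
```

The allocated/active/resident totals and per-bin counters in this output are the kind of signals referenced in the figures below.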
Fig 15: Memory usage increasing over time (60% -> 80%) and then stabilizing
We had noticed (Fig 15) that the memory usage of the storage nodes in GokuS would drop on every restart of the application and then, over multiple days, rise by almost 20–25% before stabilizing. This happened even when the number of time series stored on the cluster remained the same. We ran address sanitizer enabled builds to detect any memory leak but didn’t find one. Confused by this behavior, we analyzed the jemalloc stats to learn more.
Fig 16: Difference between active and allocated is the memory fragmentation
From the stats, we concluded there was memory fragmentation of almost 20–25 GB per host. At that point, we were not sure if this was internal fragmentation (memory consumed by the allocation library but not actively used by the application due to fixed allocation bucket sizes) or external fragmentation (caused by a pattern of allocation/deallocation requests). We decided to look for internal fragmentation first. We tracked the bin sizes that had high memory allocations and tried to map them to objects in the code (one can do this via the jeprof utility). Soon enough, we got our first hint.
The Case of Over Allocated folly::IOBufs
Before we get into the details of the issue, let’s understand the concept of bins or size classes in jemalloc context.
Fig 17: Bin sizes specified in the Size column, and spacing in bytes between two consecutive bins specified in column Spacing
A size class/bin/bucket size is the rounded up final capacity of an application’s allocation request. For example, an application memory allocation request of 21 KiB will be allocated 24 KiB based on the size classes in Fig 17 above. Similarly, a memory allocation request of 641 KiB will be allocated 768 KiB. Internal fragmentation refers to memory consumed by the allocation library but not actively used by the application due to these fixed allocation bucket sizes. Through the spacing heuristic defined in the table above, jemalloc limits internal fragmentation to about 20%.
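This rounding can be observed directly. The standalone sketch below (again assuming the binary is linked against jemalloc with its unprefixed API) uses nallocx(), which reports the size a request would actually receive without performing the allocation.

```cpp
#include <cstddef>
#include <cstdio>
#include <jemalloc/jemalloc.h>

int main() {
  // Requests from the examples above: 21 KiB, 641 KiB, and a plain 1 MiB.
  const std::size_t requests[] = {21 * 1024, 641 * 1024, 1024 * 1024};
  for (std::size_t req : requests) {
    // nallocx() returns the real (rounded-up) size class for the request.
    std::size_t sizeClass = nallocx(req, /*flags=*/0);
    std::printf("request %zu bytes -> size class %zu bytes\n", req, sizeClass);
  }
  return 0;
}
```

With typical default size classes this should report 24 KiB and 768 KiB for the first two requests, matching the table in Fig 17.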
Fig 18: Large bin stats from a host. The 1st column is the size of the bin, the 3rd column is the total active memory allocated by the library using that bin, and the last column is the number of active allocations.
We were looking at the large bin category stats (Fig 18) and noticed a bin of size 1310720 bytes (1.25 MiB) with ~43 GB allocated. This was the highest memory allocated in any bin, small or large. In the Goku application, we had added bookkeeping metrics on our end to track the memory consumers.
Fig 19: Approximate memory used by most allocated objects in Goku
From the metrics we added, we knew that finalized or immutable time series data consumed almost ~32 GB of memory. This data was allocated using IOBuf, a buffer utility from the third party folly library.
To summarize, folly::IOBuf manages heap allocated byte buffers and buffer related state such as size, capacity, and a pointer to the next writable byte. It also facilitates sharing (via reference counting) of byte buffers between different IOBuf objects. More details on folly::IOBuf can be found here.
Goku created multiple folly::IOBufs of capacity 1 MiB to store finalized data. Since the bin size of the ~43 GB of allocated memory was 1310720 bytes (1.25 MiB, Fig 18), the next consecutive bin after 1 MiB (Fig 17), and the number of active allocations was close to 32K (Fig 18) (32K * 1 MiB ~ 32 GiB), we were almost sure that these allocations were for the finalized data. What confused us was why the allocations landed in the 1.25 MiB bin rather than the 1 MiB one matching the capacity Goku requested. It implied that folly::IOBuf’s internal logic adds some extra bytes (in the final malloc) to the capacity requested by the application. We browsed through the code of the folly version we were using.
Fig 20: Code snippets of the functions used in the create api for an IOBuf
We realized that the folly::IOBuf library was allocating extra memory for its SharedInfo structure in the same buffer used for the data (Fig 20). The SharedInfo structure is nowhere near 0.25 MiB, but appending it pushes the request just past 1 MiB, so jemalloc rounds the allocation up to the next size class and most of the extra 0.25 MiB is wasted. We therefore made a change to allocate the data buffer in the application and transfer its ownership to the IOBuf, rather than creating it using the provided API.
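A hedged sketch of the shape of that fix is below, using folly::IOBuf::takeOwnership to wrap an application-allocated buffer. The function and constant names here are illustrative, not Goku's actual code.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

#include <folly/io/IOBuf.h>

constexpr std::size_t kBufCapacity = 1024 * 1024;  // 1 MiB of finalized data

std::unique_ptr<folly::IOBuf> makeFinalizedBuffer() {
  // Before: IOBuf::create(kBufCapacity) appended its own SharedInfo to the
  // request, pushing the allocation into the next (1.25 MiB) size class.
  // auto buf = folly::IOBuf::create(kBufCapacity);

  // After: allocate exactly 1 MiB ourselves and hand it over. The IOBuf's
  // bookkeeping now lives in its own small allocation, the data buffer stays
  // in the 1 MiB bin, and IOBuf releases it with free() by default.
  void* raw = std::malloc(kBufCapacity);
  if (raw == nullptr) {
    throw std::bad_alloc();
  }
  return folly::IOBuf::takeOwnership(raw, kBufCapacity, /*length=*/0);
}
```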
Fig 21: Allocations after the transfer ownership fix shows 1 MiB bin allocation
As can be seen from Fig 21, after the fix the finalized data started being allocated in the correct bin size of 1 MiB, and the memory allocated dropped from 43 GB to 32 GB. This fix saved 8–11 GB on each host in the GokuS clusters.
Async Logging Queue Sharing
This can be thought of as a code improvement. Notice how the second highest memory allocation is ~13 GiB (last line in Fig 18), coming from eight allocations in the bin of size 1610612736 bytes, i.e. 1.5 GiB. Previously, it used to be 16 allocations of the same size, which would be ~26 GiB. We realized that these were statically allocated multiple producer multiple consumer (MPMC) queues, again from the folly library. The queues were used in the write path for async logging purposes (see the Fast Recovery section in blog 1). The code created eight queues with space for millions of elements in each queue, costing almost 1.5 GiB per queue, and these eight queues were created per namespace. Since the queues were almost never fully utilized, we implemented a change to share them amongst namespaces, yielding a reduction of almost ~13 GiB per host.
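A rough sketch of the sharing idea is shown below, using folly::MPMCQueue. The entry type, queue count, capacity, and shard-based selection are hypothetical illustrations; Goku's actual async logging path differs.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

#include <folly/MPMCQueue.h>

// Illustrative sketch of sharing a fixed set of async-logging queues across
// all namespaces instead of creating eight per namespace.
struct LogEntry {
  std::string metricKey;
  int64_t timestamp;
  double value;
};

constexpr std::size_t kNumQueues = 8;
constexpr std::size_t kQueueCapacity = 1 << 20;  // ~1M entries per queue

using LogQueue = folly::MPMCQueue<LogEntry>;

// One process-wide set of queues, shared by every namespace.
std::array<std::unique_ptr<LogQueue>, kNumQueues>& sharedLogQueues() {
  static std::array<std::unique_ptr<LogQueue>, kNumQueues> queues = [] {
    std::array<std::unique_ptr<LogQueue>, kNumQueues> q;
    for (auto& ptr : q) {
      ptr = std::make_unique<LogQueue>(kQueueCapacity);
    }
    return q;
  }();
  return queues;
}

// Any namespace's write path picks a queue by shard (not by namespace),
// so the memory cost stays at ~8 queues regardless of namespace count.
void enqueueLog(std::size_t shardId, LogEntry entry) {
  sharedLogQueues()[shardId % kNumQueues]->blockingWrite(std::move(entry));
}
```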
Object Pooling and the Case for Empty Time Series Streams
As mentioned before, we wanted to target memory fragmentation.
Fig 22: Jemalloc stats show a huge number of allocation and deallocation requests (see nmalloc and ndalloc)
Another insight we got from the stats was that the number of memory allocation/deallocation requests made by the application was huge. Since GokuS is primarily a store for time series objects, we decided to track and analyze the characteristics of the time series stored, including churn rate, emptiness or sparseness, data patterns, etc. We reached the following conclusions:
- We observed a churn rate of almost 50% per day, i.e. almost half of the time series stored were deleted and replaced by new ones every day in the GokuS cluster.
- Nearly half of the time series had no data in the last four hours. This could mean either sparse data points in the time series or a time series receiving a burst of data points only once in its lifetime. The time series were mostly empty.
We have a project planned to store the time series in object pools for reuse and to represent empty parts of a time series in a more memory efficient way, avoiding wastage due to overhead. We are hopeful this will resolve the fragmentation. An additional advantage of object pools would be a more accurate view of memory usage by different objects.
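As a hedged illustration of the planned direction (not Goku's actual implementation), a minimal object pool might look like this:

```cpp
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Bare-bones sketch of the object-pooling idea: recycle time series objects
// instead of freeing and reallocating them as series churn, which should cut
// allocator traffic and, with it, fragmentation.
template <typename T>
class ObjectPool {
 public:
  std::unique_ptr<T> acquire() {
    std::lock_guard<std::mutex> lock(mu_);
    if (free_.empty()) {
      return std::make_unique<T>();
    }
    std::unique_ptr<T> obj = std::move(free_.back());
    free_.pop_back();
    return obj;
  }

  void release(std::unique_ptr<T> obj) {
    obj->reset();  // assumes T knows how to clear itself for reuse
    std::lock_guard<std::mutex> lock(mu_);
    free_.push_back(std::move(obj));
  }

 private:
  std::mutex mu_;
  std::vector<std::unique_ptr<T>> free_;
};
```

Reusing pooled objects keeps allocation and deallocation traffic down, which is the main lever against the fragmentation described above, and it makes per-object memory accounting more direct.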
Benefits of Architecture Changes
With the above improvements, we were able to reduce the virtual memory usage of the GokuS application by almost 30–40 GiB.
Also, even without object pooling, we have started observing flatter graphs (compared to Fig 15) for memory usage.
Goku Ingestor Improvements
A cost reduction of almost 50% was achieved in the ingestor component of the Goku ecosystem by replacing legacy, unoptimized code. More details can be found here.
Instance Type Re-evaluation
We re-evaluated the hardware for almost all Goku services and moved to less compute intensive instance types. Although this provided huge cost benefits, we had to improve the write path in GokuS to accommodate the new instance types, and we have plans to improve it further. For GokuL, we had to improve the ingestion process and data storage format, as stated in the first blog, to achieve faster recovery with less powerful nodes.
We are pleased to say that while our client team helped reduce the time series storage by 40%, we were able to reduce our costs by 70%. In fact, we should now be able to accommodate a 30% organic increase in storage without increasing capacity.
Acknowledgements
Huge thanks to the Observability team (our client team which manages the monitoring systems) at Pinterest for helping us in the cost savings initiatives by reducing the unused storage. Thanks to Miao Wang, Rui Zhang and Hao Jiang for helping implement the namespace feature above. Thanks to Ambud Sharma for suggesting cost efficient hardware alternatives for Goku. And finally, thanks to the Goku team for superb execution.