Monitoring Reinvented: Prometheus and Thanos Evolutionary Tale

Monitoring is the process of collecting and analyzing data to understand the performance of a platform and to identify and resolve reliability issues. It is essential for any organization that relies on IT systems to deliver products and services reliably. However, monitoring at scale can be challenging, particularly for organizations with complex and distributed IT environments. Prometheus is a popular tool that many organizations use to monitor their systems. It scrapes metrics from HTTP endpoints and stores them as time series data. This blog discusses how to perform monitoring at scale using Prometheus and how this approach has evolved over the last few years.

Standalone Prometheus

Grafana is an open-source platform for data visualization, monitoring, and analysis. It enables users to create dashboards with panels and visualize data stored in Prometheus (and other sources). Traditionally, an organization starts with a standalone Prometheus scraping metrics and a Grafana instance for visualization. As the scale grows, organizations tend to run multiple standalone Prometheus instances for different purposes and a Grafana instance with multiple data sources for visualization. This setup, however, does not scale well and comes with several challenges.

Challenges of using Standalone Prometheus

A standalone Prometheus supports only vertical scaling. As the number of metrics and targets to be scraped increases, a single Prometheus instance struggles to cope with the escalating workload, eventually hitting memory constraints that hinder its performance and functionality. What is needed is horizontal scaling. Scaling horizontally allows the targets being scraped, the data, and the queries to be distributed across multiple instances, ensuring improved performance, fault tolerance, and resource utilization. This approach not only enhances scalability but also facilitates easier management.

Prometheus operates in a stateful manner and does not support replication, making it challenging to achieve high availability, that is, keeping the monitoring system available even when individual instances fail. Consequently, having multiple replicas behind simple load balancing won't suffice. For example, after a crash, a replica might come back online, but it will not have data from the period it was offline; queries routed to that replica will return empty results for that interval. We need a better way to ensure the high availability of the monitoring system.

Long-term storage of metrics (such as HTTP requests received per minute, request latencies, orders per minute, etc.) plays a pivotal role in various operational aspects, including troubleshooting, debugging, root cause analysis, and capacity planning. Prometheus stores data on the local disk. As the volume of data grows, keeping all of it on local disks becomes very expensive. Moving data that is less frequently accessed or older than x days to object storage is a cost-effective alternative, but this is not something a standalone Prometheus can do.

When multiple users access a shared dashboard simultaneously, our data sources end up processing identical queries repeatedly. The challenge escalates with the introduction of a refresh interval, causing the same query (a search expression used to retrieve data) to reach our data sources at slightly different times. As a result, the data sources process redundant requests, which is inefficient, increases their load, and increases query response times. Caching query responses and reusing them can mitigate these issues and improve query response time.

So in short, a standalone Prometheus

  • Is unable to scale horizontally
  • Lacks high availability
  • Does not support long-term storage or archiving
  • Lacks a good caching mechanism

In response to the limitations of standalone Prometheus in meeting the demands of highly scalable monitoring systems, organizations look for alternatives. In this quest, Thanos has emerged as a viable solution to enhance scalability and meet evolving monitoring requirements.

What is Thanos

Thanos is an open-source project that serves as an extension of Prometheus, aiding in its scalability. Originally developed by Improbable, a British gaming technology company, it is now hosted by the CNCF. Thanos helps make Prometheus highly scalable, allowing for horizontal scaling and offering long-term storage and archiving capabilities. It introduces a few components into the monitoring system to provide these features.

  • Sidecar: Deployed alongside the Prometheus instance, it provides a Store API for effectively querying metrics from Prometheus and uploading them to object storage for long-term storage.
  • Querier: This component makes calls to the necessary sidecars running alongside Prometheus. It retrieves, merges, and deduplicates metrics, evaluates queries, and provides results. It is also known as the Query.
  • Query Frontend: Positioned as the point of contact to Grafana, it resides on top of the querier. This component is responsible for query splitting and result caching.
  • Store: This component helps in querying metrics from object storage, which were uploaded by the Sidecar, and returns results to the Querier.
  • Compactor: This component aids in downsampling and deleting historical data uploaded to object storage.

Note: Thanos has two more components (Receiver and Ruler), which we did not use.
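The Sidecar, Store, and Compactor all talk to the same object storage bucket through a small configuration file. The snippet below is a minimal sketch of such a file for an S3-compatible bucket; the bucket name, endpoint, and credentials are placeholders.

# objstore.yaml - object storage configuration shared by Sidecar, Store, and Compactor
type: S3
config:
  bucket: thanos-metrics               # placeholder bucket name
  endpoint: s3.us-east-1.amazonaws.com # placeholder endpoint
  access_key: <ACCESS_KEY>
  secret_key: <SECRET_KEY>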

Horizontal Scaling

The Prometheus Operator and the Thanos Querier are great enablers in achieving horizontal scaling. The Prometheus Operator is responsible for dividing scrape targets into multiple manageable groups, ensuring that each group can be effectively handled by an individual Prometheus instance. It achieves this in an interesting way by leveraging native Prometheus relabeling actions, without introducing new components.

- source_labels: [__address__]
  separator: ;
  regex: (.*)
  modulus: 6
  target_label: __tmp_hash
  replacement: $1
  action: hashmod
- source_labels: [__tmp_hash]
  separator: ;
  regex: "5"
  replacement: $1
  action: keep

It accomplishes this by using the hashmod action. In the example configuration above, the total number of shards is 6, and the configuration shown belongs to the last shard. Prometheus first computes a hash of the target's address modulo 6 using the hashmod action and stores it in a temporary label __tmp_hash. In the subsequent step, the keep action checks whether this hash matches the specified shard number, which in this case is 5. If it matches, the target is kept; otherwise, it is dropped. This mechanism selects the group of targets belonging to a specific shard. Additionally, we can implement functional sharding, for example by dedicating specific shards to specific kinds of targets.
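When the Prometheus Operator is used, this relabeling does not have to be written by hand. As a rough sketch, setting the shards field on the Prometheus custom resource makes the operator generate hashmod rules like the ones above and run one Prometheus StatefulSet per shard (the resource name and namespace below are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s              # illustrative name
  namespace: monitoring
spec:
  shards: 6              # targets are split across 6 Prometheus shards via hashmod
  replicas: 2            # each shard can also run replicas for high availability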

Monitoring Setup with Horizontal scaling

With the help of the hashmod action, targets can be split across multiple standalone Prometheus instances, and the Querier helps in fetching metrics from all of them. The Querier establishes connections with the sidecars that run alongside the Prometheus instances using the Store API. Periodically, it requests information from these sidecars through Info calls, obtaining details about the metrics and label sets stored in the connected Prometheus instances. When a query is made to the Querier, it uses the metadata it has previously collected to select the required Prometheus sources. It then fans out the query to these selected Prometheus instances to gather results. Once the results are collected, the Querier merges them and sends them back to the user.
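As a minimal sketch, the container arguments for a Sidecar and a Querier wired together might look like the following; the service names and ports are assumptions, and objstore.yaml is the object storage configuration shown earlier:

# Thanos Sidecar, deployed in the same pod as Prometheus
- sidecar
- --prometheus.url=http://localhost:9090
- --tsdb.path=/prometheus
- --objstore.config-file=/etc/thanos/objstore.yaml
- --grpc-address=0.0.0.0:10901

# Thanos Querier, pointed at the sidecars of all Prometheus shards
- query
- --http-address=0.0.0.0:9090
- --store=dnssrv+_grpc._tcp.prometheus-shard-0.monitoring.svc   # illustrative service names
- --store=dnssrv+_grpc._tcp.prometheus-shard-1.monitoring.svc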

High Availability

We can achieve high availability by running two replicas of the same Prometheus and placing the Querier, rather than simple load balancing, on top of them. When a query reaches the Querier, it fetches data from both replicas, deduplicates it, fills any gaps, and sends the results back to the user. This hides the empty results of a replica that was down for a while.

Monitoring Setup with HA

Also, having multiple Prometheus replicas does not necessarily mean we have duplicate data. The scrapes and rule evaluations won't happen at exactly the same time. As a result, a query executed against each replica would give slightly different results, and merging them improves the overall accuracy of the query.
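The deduplication relies on each replica advertising a replica label in its external labels, which the Querier is told to ignore when merging series. A minimal sketch, with replica as the assumed label name:

# prometheus.yml on each replica (the value differs per replica)
global:
  external_labels:
    replica: A          # "B" on the second replica

# Querier flag telling Thanos which label distinguishes replicas
- --query.replica-label=replica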

Long Term Storage

Prometheus stores data on disk as TSDB blocks, which are immutable, compressed segments organized by time. The Sidecar, operating alongside each Prometheus server, can upload these TSDB blocks to a designated object storage system. The Store, in turn, can query the object storage. It maintains a small amount of information about the TSDB blocks present in object storage and uses it to query object storage efficiently and return results to the Querier. Therefore, we don't need a significant amount of local disk space. The Compactor helps in downsampling historical data: it fetches TSDB blocks from object storage, creates downsampled blocks, and uploads them back to object storage. It also deletes TSDB blocks from object storage after the retention period. By combining these components, we get a comprehensive long-term storage solution.

Monitoring Setup with Long term retention

As the Sidecar continuously uploads TSDB blocks to object storage, when a query is initiated, the Querier fans out the query to the Store along with the other Prometheus instances. The Store queries the metrics from object storage and returns them to the Querier, which merges them with the rest of the results and sends the response back.
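A rough sketch of the Store and Compactor container arguments, with retention periods chosen purely for illustration and objstore.yaml referring to the object storage configuration shown earlier:

# Thanos Store gateway, serving blocks from object storage to the Querier
- store
- --data-dir=/var/thanos/store       # local cache of block indexes
- --objstore.config-file=/etc/thanos/objstore.yaml

# Thanos Compactor, downsampling and applying retention to uploaded blocks
- compact
- --wait
- --data-dir=/var/thanos/compact
- --objstore.config-file=/etc/thanos/objstore.yaml
- --retention.resolution-raw=30d     # keep raw data for 30 days (illustrative)
- --retention.resolution-5m=90d      # keep 5m-downsampled data for 90 days
- --retention.resolution-1h=180d     # keep 1h-downsampled data for 180 days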

Caching

Caching can be effectively implemented by using the Query Frontend. The Query Frontend sits in front of the Querier and is the initial point where queries are received. It handles query splitting and result caching. The Query Frontend can cache query results in memory or use Memcached or Redis.

Monitoring Setup with Caching

When a long-duration query comes to the Query Frontend, it splits the query into multiple short-duration queries, enhancing parallelization. This approach also helps prevent out-of-memory (OOM) issues caused by long-duration queries. Subsequently, the Query Frontend executes these queries in parallel on downstream Queriers and returns the results. Additionally, it caches query results to reuse them on subsequent queries. This improves query response times and reduces load.
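A minimal sketch of the Query Frontend arguments, splitting long queries by day and caching responses in memory; the downstream URL, file path, and cache size are assumptions:

# Query Frontend container arguments
- query-frontend
- --http-address=0.0.0.0:9090
- --query-frontend.downstream-url=http://thanos-query.monitoring.svc:9090  # illustrative Querier address
- --query-range.split-interval=24h   # split long-range queries into 1-day slices
- --query-range.response-cache-config-file=/etc/thanos/cache.yaml

# /etc/thanos/cache.yaml: in-memory response cache (Memcached or Redis also supported)
type: IN-MEMORY
config:
  max_size: "512MB"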

Final Architecture with multiple Kubernetes clusters
