Failing to Auto Scale Elasticsearch in Kubernetes


Introduction

At Lounge by Zalando, we run an Elasticsearch cluster in Kubernetes to store user-facing article descriptions. Our business model is such that we receive about three times the normal load during the busy hour in the morning, so we use schedules to automatically scale applications in and out to handle that peak. If scaling out in the morning fails, we face a potential catastrophe. This is the story of one such case.

First anomaly

Early Tuesday morning, our on-call engineer received an alert about too few running Elasticsearch nodes. We started executing the playbook for such a case, but before we had time to go through all the steps, the missing nodes popped up and the alert closed on its own. Catastrophe avoided for now, but after a cup of coffee came the root cause analysis.

Investigating the logs, we found that the cluster had failed to fully scale down for the night. It was configured to run 6 nodes during the night, but got stuck running 7.

To understand why that happened and why it is interesting, a little bit of context is required. We run Elasticsearch in Kubernetes using es-operator. Es-operator defines a Kubernetes custom resource, ElasticsearchDataSet (EDS), that describes the Elasticsearch cluster. Es-operator monitors changes to the EDS and maintains a StatefulSet consisting of the pods and volumes that implement the Elasticsearch nodes. We’ve configured our cluster so that the pods running it are spread across all AWS availability zones, and Elasticsearch is configured to spread the shards across the zones.

For us, the schedule-based scaling is implemented by a fairly complex set of cronjobs that change the number of nodes by manipulating the EDS for our cluster. There are separate cronjobs for scaling up and for scaling down at various times of day.

The pods in a StatefulSet are numbered, and the one with the highest number is always chosen for removal when scaling in. Just before the nightly scale-in started, we were running the following pods in the availability zones shown:

es-data-production-v2-0    eu-central-1b
es-data-production-v2-1    eu-central-1c
es-data-production-v2-2    eu-central-1b
es-data-production-v2-3    eu-central-1c
es-data-production-v2-4    eu-central-1c
es-data-production-v2-5    eu-central-1c
es-data-production-v2-6    eu-central-1a

The pod to be scaled in next is es-data-production-v2-6. The first step is for es-operator to drain the node, i.e. to request Elasticsearch to relocate all shards away from it. Here though, the node to be drained is the only one located in eu-central-1a. Due to our zone awareness configuration, Elasticsearch refused to relocate the shards stored on it. Es-operator has quite simple logic here: it requests the shards to be relocated, checks whether that happened, and keeps retrying up to 999 times before giving up. This went on throughout the night and, quite unbelievably, the retries ran out just two minutes after we got the alert. Then es-operator carried on with scaling out and the problem resolved itself. The timing is quite surprising, but occasionally such things occur.
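To make the situation concrete, here is a minimal sketch of the two rules at play: the StatefulSet removes the highest-numbered pod first, and (as a deliberately simplified assumption about our zone awareness setup) a node can only be drained if another node remains in the same zone. The function names are ours for illustration and are not part of es-operator or Elasticsearch.

```python
from collections import Counter

# Pod-to-zone placement copied from the listing above.
PLACEMENT = {
    "es-data-production-v2-0": "eu-central-1b",
    "es-data-production-v2-1": "eu-central-1c",
    "es-data-production-v2-2": "eu-central-1b",
    "es-data-production-v2-3": "eu-central-1c",
    "es-data-production-v2-4": "eu-central-1c",
    "es-data-production-v2-5": "eu-central-1c",
    "es-data-production-v2-6": "eu-central-1a",
}


def next_pod_to_remove(placement):
    """Pick the pod with the highest ordinal, as a StatefulSet does when scaling in."""
    return max(placement, key=lambda name: int(name.rsplit("-", 1)[1]))


def can_drain(pod, placement):
    """Simplified assumption: draining works only if the zone keeps another node."""
    zone_sizes = Counter(placement.values())
    return zone_sizes[placement[pod]] > 1


victim = next_pod_to_remove(PLACEMENT)
print(victim, can_drain(victim, PLACEMENT))
```

Under this model the next pod to go is es-data-production-v2-6, and it cannot be drained because it is alone in eu-central-1a, whereas any of the four pods in eu-central-1c could be.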

Initial root cause analysis

Something in the above is not quite right though. The intended behaviour of es-operator is as follows: it constantly monitors updates to EDS resources, and if a change is observed, it compares the state of the cluster to the description and starts modifying the cluster to match it. If the EDS changes again during that process, es-operator should abort the process and start modifying the cluster to match the new desired state.

This was exactly our case. Es-operator was still processing the EDS update for the nightly scale-in when it received another EDS update to start scaling out for the morning. We spent much of the next day tracing through the es-operator source code and finally realised there was a bug in the retry logic for draining nodes when scaling in: in this one specific retry loop, context cancellation is not handled. The bug is specific to draining a node and doesn’t apply to other processes. It’s fixed now, so remember to upgrade if you are running es-operator yourself.
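The shape of the fix can be sketched as a retry loop that checks for cancellation on every iteration. This is a Python analogue, not es-operator's actual Go code: a `threading.Event` stands in for the cancelled context, and `shards_remaining` is a hypothetical callback that reports how many shards are still on the node.

```python
import threading


def drain_with_retries(shards_remaining, cancelled, max_retries=999, wait_seconds=0.0):
    """Retry until the node is empty, the retries run out, or the caller cancels.

    Returns a (status, attempts_made) tuple.
    """
    for attempt in range(1, max_retries + 1):
        # The check the buggy loop was missing: stop as soon as a newer
        # desired state has superseded this operation.
        if cancelled.is_set():
            return "cancelled", attempt - 1
        if shards_remaining() == 0:
            return "drained", attempt
        # Event.wait doubles as a sleep that ends early on cancellation.
        cancelled.wait(wait_seconds)
    return "gave up", max_retries


cancel = threading.Event()
calls = []


def stuck_node():
    """Simulates our night: shards never move, and a new update arrives mid-way."""
    calls.append(1)
    if len(calls) == 3:
        cancel.set()  # the morning scale-out update arrives during attempt 3
    return 5


result = drain_with_retries(stuck_node, cancel)
print(result)
```

Without the `cancelled.is_set()` check, the loop above would only stop after exhausting all 999 attempts, which is what kept our scale-out waiting until the morning.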

Still, something is not quite right. Why did this happen on Tuesday and never before? We never scale in to fewer than 6 pods and, as explained above, the pod to scale in is always the one with the highest number. Therefore, the pods numbered 0 to 5 should remain untouched. The pods running Elasticsearch are run as a StatefulSet by es-operator. If that StatefulSet were using EBS-backed volumes, Kubernetes would guarantee not to move the pods between zones. We, however, don’t store unrecoverable data in our Elasticsearch, so we can afford to run it on top of ephemeral storage. Nothing is strictly guaranteed for us then. Normally, pods remain quite stable in a zone nevertheless, but on Monday, the day before the first anomaly, our Kubernetes cluster was upgraded to version 1.28. This process likely caused pods to be rescheduled onto nodes in different availability zones, though we have not done a full deep dive into the upgrade process to confirm this.

The first fix that didn’t work

As a quick fix, we just increased the number of nodes running during the night. This way, the nightly scale-in job wouldn’t try to drain es-data-production-v2-6, the last node in eu-central-1a, and it wouldn’t get stuck the way it did the previous night. We might want to consider something else for the longer term, but this should stop us from failing to scale out the next morning.

Still, the next morning, we received the exact same alert once again. And after a few minutes, the alert closed on its own the same way as the day before.

This time we were unable to scale in from 8 to 7 nodes, which did work fine the day before. Looking at the node distribution:

es-data-production-v2-0    eu-central-1b
es-data-production-v2-1    eu-central-1c
es-data-production-v2-2    eu-central-1b
es-data-production-v2-3    eu-central-1c
es-data-production-v2-4    eu-central-1c
es-data-production-v2-5    eu-central-1c
es-data-production-v2-6    eu-central-1a
es-data-production-v2-7    eu-central-1a

Why was es-operator not able to drain es-data-production-v2-7? This time it’s not the last node in eu-central-1a.

Digging into this revealed another bug in es-operator. The process for scaling in a node, in a bit more depth, looks like the following:

1. Mark the node excluded (cluster.routing.allocation.exclude._ip) in Elasticsearch. This instructs Elasticsearch to start relocating shards from it.
2. Check from Elasticsearch whether any shards are still located in the given node. If yes, repeat from the beginning.
3. Remove the corresponding pod from the StatefulSet.
4. Clean up node exclusion list (cluster.routing.allocation.exclude._ip) in Elasticsearch.

Pondering the above, you can probably guess what was wrong this time. If the scale-in process gets interrupted, the cleanup phase is never executed and the node stays in the exclusion list forever. So es-data-production-v2-6, which failed to scale in the day before, was still marked as excluded and Elasticsearch was unwilling to store any data on it. In effect, es-data-production-v2-7 was the only usable node in eu-central-1a.
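The two nights can be replayed with a small in-memory model. This is a sketch under simplifying assumptions, not es-operator's implementation: `FakeCluster` is a hypothetical stand-in for Elasticsearch plus the StatefulSet, the exclusion list holds pod names rather than IPs, and shards are assumed to be relocatable only to another non-excluded node in the same zone.

```python
ZONES = ["1b", "1c", "1b", "1c", "1c", "1c", "1a", "1a"]  # zones for ordinals 0..7


def pod(ordinal):
    return f"es-data-production-v2-{ordinal}"


class FakeCluster:
    def __init__(self, zones):
        self.placement = {pod(i): f"eu-central-{z}" for i, z in enumerate(zones)}
        self.excluded = set()  # models cluster.routing.allocation.exclude._ip

    def usable_nodes(self, zone):
        return sorted(p for p, z in self.placement.items()
                      if z == zone and p not in self.excluded)

    def scale_in(self, name):
        self.excluded.add(name)                          # first step: exclude the node
        if not self.usable_nodes(self.placement[name]):  # second step: shards cannot move
            return False                                 # interrupted, no cleanup happens
        del self.placement[name]                         # third step: remove the pod
        self.excluded.discard(name)                      # fourth step: clean up exclusion
        return True


cluster = FakeCluster(ZONES[:7])
first_night = cluster.scale_in(pod(6))        # the only node in eu-central-1a
cluster.placement[pod(7)] = "eu-central-1a"   # morning scale-out adds an eighth node
usable_before = cluster.usable_nodes("eu-central-1a")
second_night = cluster.scale_in(pod(7))       # fails: v2-6 is still excluded
cluster.excluded.clear()                      # the manual cleanup
after_cleanup = cluster.scale_in(pod(7))
print(first_night, usable_before, second_night, after_cleanup)
```

The leftover exclusion from the first night is what turns es-data-production-v2-7 into the last usable node of its zone, so the second scale-in fails for the same reason as the first.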

The second fix

Manually removing the “zombie” node from the exclusion list is simple, so we did exactly that to mitigate the immediate problem.
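For illustration, the cleanup amounts to sending Elasticsearch's cluster settings API (`PUT /_cluster/settings`) a new value for the exclusion setting. The sketch below only builds the request body; the IP addresses are made up, and whether the setting lives under `transient` or `persistent` in a given cluster is an assumption you should check before sending anything.

```python
import json

SETTING = "cluster.routing.allocation.exclude._ip"


def exclusion_update(current_value, stale_ip, scope="transient"):
    """Build a body for PUT /_cluster/settings that drops one IP from the exclusion list.

    current_value is the comma-separated list currently stored in the setting.
    """
    remaining = [ip.strip() for ip in current_value.split(",")
                 if ip.strip() and ip.strip() != stale_ip]
    # A null value resets the setting when nothing is left to exclude.
    value = ",".join(remaining) if remaining else None
    return {scope: {SETTING: value}}


# Hypothetical addresses, for illustration only.
body = exclusion_update("10.2.0.17,10.2.0.23", "10.2.0.17")
print(json.dumps(body))
```

Reading the current value first (with `GET /_cluster/settings`) matters, because writing the setting replaces the whole list rather than removing a single entry.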

Fixing the underlying bug in a reliable and safe way is much more involved. Just adding a special if clause for cleaning up in case of cancellation would solve the simple instance of this problem. But we are potentially dealing with partial failure here. No number of if clauses would solve the problem if es-operator crashes in the middle of the draining process. There’s a PR in progress to handle this, but at the time of writing the bug remains, and we currently accept the need to deal with these kinds of exceptional situations manually.
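The difference between the two failure modes can be sketched as follows. The first function shows cleanup on cancellation; the second shows one possible way to recover after a crash by comparing the exclusion list against the drains that are actually in progress. Both are illustrative assumptions with hypothetical names, and the second is not necessarily what the PR mentioned above does.

```python
class Cancelled(Exception):
    """Raised by the hypothetical relocate callback when the operation is superseded."""


def scale_in(node_ip, excluded, relocate, remove_pod):
    """Exclude a node, drain it, remove its pod, and always undo the exclusion."""
    excluded.add(node_ip)
    try:
        relocate(node_ip)
        remove_pod(node_ip)
    finally:
        # Runs on success, error, and cancellation, but not if the process dies.
        excluded.discard(node_ip)


def stale_exclusions(excluded, draining):
    """Exclusions that no in-progress drain accounts for, e.g. left over after a crash."""
    return sorted(set(excluded) - set(draining))


def cancelled_relocate(node_ip):
    raise Cancelled()


excluded = set()
try:
    scale_in("10.2.0.17", excluded, cancelled_relocate, lambda ip: None)
except Cancelled:
    pass
print(excluded)  # the cancelled drain left nothing behind

# After a crash the cleanup never ran, so leftovers must be detected from state.
leftovers = stale_exclusions({"10.2.0.17", "10.2.0.23"}, draining={"10.2.0.23"})
print(leftovers)
```

The point of the second function is that recovery is derived from observed state instead of from in-process control flow, which is the only thing that survives a crash.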

Finally

As an embarrassing postlude to this story, we received the same alert one more time the next day. The quick fix we did the day before only touched the major nightly scale down job, but ignored another one related to a recent experimental project. It was a trivial mistake, but enough to cause a bit of organisational hassle.

Well, we fixed the remaining cronjob and that was finally it. Since then we’ve been running hassle free.

What did we learn from all this? Well: read the code. For solving difficult problems, understanding the related processes in abstract terms might not be enough. The details matter, and the code is the final documentation for those. It also mercilessly reveals any bugs that lurk around.
