Migrating From Elasticsearch 7.17 to Elasticsearch 8.x: Pitfalls and Learnings

With Elasticsearch, moving from one major version to another is a big jump. Usually, it is updated in gradual increments, minor to minor version. It is difficult to make a big move. There's no official step-by-step and usually, it just doesn't happen. So, how did we approach it? Read on to find out.

Maryna Cherniavska

Senior Software Engineer

Posted on Nov 20, 2023

Tags:

What this article is about

What kind of changes we had to make to the codebase
How we did the actual upgrade
What challenges we faced
How we did the data transfer
How the data was kept in sync

What this article is not

A step-by-step guide on how to upgrade Elasticsearch (read on to find out why).

Who we are

We are a team from the Search & Browse department, the department in Zalando that is responsible for all things search (read: relevance, personalisation, sorting, filters, full text search, ... in short, everything that forms the search experience). The search applications are using Elasticsearch as the main datastore, so we are also the ones responsible for its well-being.

Why upgrade

We have been using Elasticsearch for a long time. It was upgraded more or less on a regular basis, but we were always a bit behind the latest version (Elastic has a regular release schedule; the releases are all scheduled well in advance). We were on version 7.17 for a while, and while we were pretty happy with it, we still had a few reasons to upgrade to 8.x.

First, we wanted to use the new features that were introduced in 8.0. Namely, the approximate kNN (k nearest neighbors) - or ANN-search. The vector search was already used in Search & Browse, but it was the exact kNN search, the brute-force and less performant one. What Elastic says about the approximate vs exact kNN search is this:

In most cases, you’ll want to use approximate kNN. Approximate kNN offers lower latency at the cost of slower indexing and imperfect accuracy.
Exact, brute-force kNN guarantees accurate results but doesn’t scale well with large datasets. With this approach, a script_score query must scan each matching document to compute the vector function, which can result in slow search speeds. However, you can improve latency by using a query to limit the number of matching documents passed to the function. If you filter your data to a small subset of documents, you can get good search performance using this approach.

There is also a great article about ANN on Elastic blog by Julie Tibshirani - read it, you won't regret it.

Second, we also wanted to be on the latest version for performance and security reasons, because obviously, every new release has a lot of security fixes and performance improvements.

Why it's difficult to upgrade

You don't just upgrade Elasticsearch

Boromir telling you that you don't just upgrade Elasticsearch

Usually, Elasticsearch is updated in gradual increments, minor to minor version, and it's difficult, not to mention dangerous, to make such a big move as going from one major version to another. Also, the documentation on the official website, while ample, is pretty disorganized, and there's no complete step-by-step for such an endeavor. And even if you were to gather all the information from the docs, it's still not enough. You need to know what to do with your data, how to keep it in sync, and how to make sure that the new version is working as expected.

In Zalando, the size of data is pretty massive. We have millions of articles in each country, and while the gender root page for women in Germany will show you 450k items, it's simply not the full picture. This number is just how many items at most get scanned to show you the first page. The actual number of items is much higher. And we currently have 28 domains (country + language combos), each with its own catalog. So in short, we have a lot of data, and we need to make sure that it's not lost or corrupted during the upgrade.

How we approached the upgrade

Another reason why one can't just go and upgrade Elasticsearch is because, well, it's not an island.

What I mean is, it's not some independent entity that has a value all by itself. It's our datastore, and it's used by a lot of our services. So before one goes and upgrades this massive thing, one should think of possible breaking changes in the product. And also, one should think about how it changes the actual usage of Elasticsearch.

The main search application in Zalando, the one that deals directly with Elasticsearch queries, is called Origami. From the description on its (internal) repository page:

Origami is the Zalando Core Search API. It provides a powerful information retrieval language and engine that integrates several microservice components built by the Search Department. In the landscape of Zalando Search and Browse platform, Origami is the connector - coordinating all search intelligence to serve correct search results to customers.
Origami builds on top of Elasticsearch and our internal/Zalando-specific suite of APIs. These APIs will facilitate composing/serving search and discovery, navigation, and analytics functionalities.

The application is written in Scala and using a Java High Level REST Client, which got deprecated in Elasticsearch 7.15.0 and replaced by ElasticSearch Java API client, so first of all, we had to update the codebase to use the new client.

Updating the codebase

However, updating the codebase was also not a one-step task. (This just goes deeper into the rabbit hole, doesn't it?)

Origami has 443k lines of code in 846 files. Of course, a lot of these files are the configs and tests and test resources, so the actual number of Scala files is much lower. But still, it's a lot of code, and a lot of it is dealing with Elasticsearch.

Upgrading the Elasticsearch API to be able to work with version 8.x also represented a choice. We could either use the official Elasticsearch Java API Client, or we could use the Elasticsearch Scala client which seemed to be quite popular and had a lot of contributors (and stars) on GitHub. Both options were available and viable. Both had their pros and cons.

With the Elasticsearch Java API, the advantages would be:

The library is officially supported and its versions match the Elasticsearch releases;
There is a ready-made DSL for all the REST APIs;
It’s open source and the code is available on GitHub. The license is Apache License 2.0.

However:

It’s in Java. This means that all the lambda types, collection types, etc. are not directly interoperable and special transformations should be done within our code;
We’re missing on the other Scala advantages like built-in immutability, null safety and so on.

The unofficial Scala client is advertised as:

Providing a type-safe, concise DSL;
Integrating with standard Scala futures or other effects libraries;
Using Scala collections library over Java collections;
Returning Option where the Java methods would return null;
Using Scala Durations instead of strings/longs for time values;
Supporting typeclasses for indexing, updating, and search backed by Jackson, Circe, Json4s, PlayJson and Spray Json implementations;
Supporting Java and Scala HTTP clients such as Akka-Http;
Providing reactive-streams implementation;
Providing a testkit subproject ideal for tests.

The disadvantages, however, could not be ignored:

It’s not official and the releases are not closely following Elastic’s release schedule. At the time we were looking at it, Elasticsearch was already at v8.7 and this library’s last version was 8.5.4. (It could work with Elasticsearch up to version 8.6 though);
Because it did not implement all the new features, there was no DSL for kNN search. KNN search was still available via sending a pure JSON query, but it was not a pretty option.

In the end, we decided to go with the Elasticsearch Java API client. The main reason was that it was officially supported and the releases were closely following the Elasticsearch releases, and it wouldn't just disappear into thin air in the unlikely case when its creator would suddenly want to quit. Also, it had DSL for all the REST APIs. The absense of the kNN search DSL in the Scala library was really disappointing, because approximate kNN search was one of the main reasons why we wanted to upgrade in the first place.

So, the choice was made.

But.

As I said before, this was a large application.

How does one make sure that no existing functionality is going to break when upgrading the API? How does one make sure that all the existing queries are still going to work?

Obviously, you write a test.

Writing a test

There was one more decision that we made while selecting a migration strategy, and that was to start with compatibility mode. This meant that we would use the Elasticsearch High Level Rest Client from version 7.x, but in the compatibility mode, so that it would instruct Elasticsearch 8.x to behave like the old client. This way we would be able to upgrade the Elasticsearch cluster first, and then upgrade the client gradually. With this approach, we would avoid rewriting too much code at once. And afterward, we would be able to use one of the transition strategies, recommended by Elasticsearch, to gradually upgrade the client.

This approach was also a good fit, since we assumed that we might have a time during the transition phase when the application would have to deal with both Elasticsearch 7.x and Elasticsearch 8.x. Because our Elasticsearch was a multi-cluster deployment, it would be practically impossible to upgrade in one go. We would have to start with less mission-critical clusters, and then gradually move to the more important ones. So, we would definitely have to deal with both versions of Elasticsearch for some time.

So how to write such a test?

This is where Testcontainers shine. Basically, we had a helper class looking like this:

object ESContainers {
  val Version7179 = "7.17.9"
  val Version86 = "8.6.2"
  val Version88 = "8.8.2"

  val VersionDefault = Version7179

  def initAndStartESContainer(version: String = VersionDefault): ElasticsearchEndPoint = {
    val container =
      new ElasticsearchContainer(s"docker.elastic.co/elasticsearch/elasticsearch:$version")
        .withReuse(true)
        .withCreateContainerCmdModifier(cmd => cmd.getHostConfig.withCapAdd(Capability.SYS_CHROOT))
    container.start()
    val hostAndPort = container.getHttpHostAddress.split(":")
    ElasticsearchEndPoint(hostAndPort(0), hostAndPort(1).toInt, container)
  }
}

And then, in the test, we would just do this to start Elasticsearch with the version we needed.

private lazy val endpoint = ESContainers.initAndStartESContainer(Version88)

Since at some point we'd have to deal with both versions of the API, we had to test three combinations:

Elasticsearch 7.x with Elasticsearch 8.x API;
Elasticsearch 8.x with Elasticsearch 8.x API;
Elasticsearch 8.x with Elasticsearch 7.x API.

And with each, we needed to make sure that the common types of actions, done by the application, continue to work as expected.

So this is exactly what we did. We wrote three test classes:

NewClientWithOldElasticTest
OldClientWithNewElasticTest
NewClientWithNewElasticTest

Why is there no OldClientWithOldElasticTest? Because we already knew that it was working. It was what the application we already had.

Each class was checking that the application was able to do the following:

Create an index;
Create a document;
Create kNN vector mappings;
Index kNN vector data;
Search for a document with a kNN query;
Delete an index;
Close the client.

The tests were not covering all the queries that we ran - only the common types. But even with this simplified approach we were able to discover a few issues, for which we had to make changes to the codebase.

Issues discovered and fixes applied

Elasticsearch 8 deprecated the _type field in search response, so we had to remove it from all the test case resources that represented example JSONs for the expected response.
Elasticsearch 8 didn't allow null in the is_write parameter when creating an alias for the index. Therefore, code was added to set this flag explicitly.
Range query based on date/epoch_second didn't work with upper/lower bounds specified as numbers. (According to the Elastic team, it was a feature and would not be fixed). Due to that, the range boundaries had to be stringified before being passed to Elasticsearch.
In Elasticsearch 8, a cluster setting called action.destructive_requires_name now defaults to true instead of false. Since our e2e tests were dropping all test indexes by wildcard before starting, they all started crashing. So, a change was introduced to update this setting on a cluster to allow the test suits run this action. The method that was doing it was only used in test suites, because for a real production cluster, it's pretty unsafe.

Moreover, when we started to switch the other, more detailed integration tests to Elasticsearch 8, we found an issue that was a little more involved. Some of those tests started to fail with the following error:

{
  "type": "query_shard_exception",
  "reason": "it is mandatory to set the [nested] context on the nested sort field: [trace.origami.timestamp].",
  "index_uuid": "_xvEa8gNSFyCDm0aFXqYhg",
  "index": "article_1"
}

That seemed to refer to the sort clause that we had in the e2e test suite:

"sort": [
  {
    "trace.origami.timestamp": {
      "order": "desc"
    }
  }
]

The page about sorting on a nested field for ES 8.8 (current at that time) says that there should be a path specified in a "nested.path" clause of the sort. However, the same page for ES 7.17 states exactly the same, but the query still runs fine without that clause.

So something changed between the versions in such a way that it started erroring out in ES8, whereas in ES7 it was working fine, despite the docs stating that the parameter is non-optional (the thread I created on ES discussion board suggests there was a bug and it was fixed). So, we had to add the nested.path clause to the sort clauses in the queries that were sorting on nested fields, meaning that the sort clause from the example above would now look like this.

"sort": [
  {
    "trace.origami.timestamp": {
      "order": "desc",
      "nested": {
        "path": "trace"
      }
    }
  }
]

Deprecating Elasticsearch settings in preparation for 8.x migration

Summary of changes:

Remove fixed_auto_queue_size thread pool. It’s replaced with the normal fixed thread pool configuration.
Replace deprecated transport.tcp.compress.
Replace node role settings with new node.roles settings (see one and two).
Due to an existing bug, the coordinating role needs to be set as a default which can in turn be overridden by setting the node.roles environment variables with specific values.
Remove deprecated gateway.recover_after_master_nodes setting.
Add human approval to prevent upgrading master nodes before data nodes.
Explicitly disable the serial GC using -XX:-UseSerialGC to avoid the following error messages during start up: text Error occurred during initialization of VM Multiple garbage collectors selected even though -XX:+UseZGC or -XX:+UseG1GC is explicitly enabled. Most likely an intermediate script was logging this message. In ES 8.x the container can unsuccessfully exit because of this error.
Coordinating nodes are enabled by default by specifying an empty value.
Data nodes will only have the “data” role defined.
Monitoring checks had to be updated because the role abbreviations changed and became stricter than before.

How we did the actual upgrade

Finally, it seemed that the application was prepared to work with non-homogenous Elasticsearch versions. At last, it was time to upgrade the Elasticsearch cluster itself.

There is a documentation page with some advice about going from 7.x to 8.x, and it states that first, one should move to 7.17. From there, it is recommended to use an Upgrade Assistant tool to help prepare for the upgrade. As an alternative, is also recommended to use the Reindex API to reindex the data from the old version to the new one.

So in short, Elasticsearch provides two ways to upgrade:

The rolling upgrade approach;
Upgrading via reindex.

First one is upgrading live. It means that you upgrade the cluster node by node, and the cluster is still available during the upgrade. The second one is upgrading via reindex. It means that you create a new cluster, and you reindex the data from the old cluster to the new one. Then you switch the traffic to the new cluster and shut down the old one.

In general, Elastic recommends doing a rolling upgrade in a following way:

Upgrade the data nodes first;
Upgrade other non-master nodes (ML-dedicated, coordinating, etc.);
Upgrade the master nodes.

This is because the data nodes can join the cluster with the master nodes of a lower version, but older data nodes can't always join the newer cluster. So, if you upgrade the master nodes first, the data nodes might fail to join it, and the cluster will be unavailable.

In general, the rolling upgrade is the recommended way to upgrade, because it's less disruptive. However, in our case, it represented too many dangers. First of all, we have a multi-cluster deployment, and the clusters are pretty large, so we're talking about some terabytes of data. It would take a lot of time to upgrade the cluster node by node, and during this time, the cluster would be in a mixed state, with some nodes being upgraded and some not, with relocating shards, and in general in a degraded state.

That, in itself, wouldn't be so scary. What would indeed be bad is if something were to go wrong. If we faced data loss, we'd have no choice but to go with restoring the data from snapshots and then resetting the input streams to bring the data up to date. This would take quite some time, because we'd have to do it for all the indices in the cluster, and during all this time, the catalog of products would either be unavailable or would have stale or partial data.

So, we decided to go with the second option, the reindexing. It meant that we'd have to create a new cluster, reindex the data from the old one, and then gradually switch the traffic to the new cluster. It would take more time, but it would be way less risky and less disruptive, because when the data would be in sync, going to the new cluster would be just a matter of switching the routing. If something went wrong, the rollback procedure would be almost instantaneous as it would again be just the routing switched back.

And last but not least, having both clusters running side by side would give us time to test the new cluster and make sure that it was working as expected and performed at the same level. We could first test if with shadow traffic, and then gradually increase the traffic to the new cluster and decrease it on the old one.

Procedure per cluster

The procedure for each of out cluster would be similar and would include the following steps:

Deploy ES8 cluster.
Setup monitoring.
Create index templates (because if we were to index the data from the old cluster, we'd have to make sure that the new cluster has the same index templates as the old one).
Restore data from the latest snapshot.
Set up the shadow intake traffic. This meant that the data would gradually converge with the old cluster, but the queries would still be served by the old cluster. If we were to consider the moment the snapshot was taken as point A and the moment shadow intake was enabled on the new cluster as point B, then it would mean that we have full data from beginning to A, and then from B to the end.
That left us with the gap between points A and B, so the next step would be to perform the data update by resetting the data streams to the point of just before the snapshot was taken.
Shadow query traffic. This would be performed gradually, with monitoring for errors.
Verify that the new cluster works as expected and compare the cluster performance with the old one.
Switch the live traffic to ES8 cluster (again, gradually shifting the percentages).
Remove old traffic and clean up old cluster resources.

If these steps sound familiar, it is because they are. It is basically the Blue/Green procedure that is usually used for disaster recovery (failover cluster), or for testing something new. The only difference is that we were using it for the one-time Elasticsearch cluster upgrade and not keep the second cluster around. (We are also looking into applying the same approach for the failover cluster, but since our deployments are very large and complicated, we're still getting there.) This Blue/Green approach was also used by the team behind Zalando Lounge which has a separate catalog of products, also backed by Elasticsearch, so we had some in-house experience to compare with.

Routing and shadowing

The whole mechanism is based on a delicate balance of routing and shadowing. We use an open-sourced solution called Skipper as an ingress controller, which gives us access to filters. For the routing, we're using a custom resource type called RouteGroup. For example, to ensure that the intake pipeline ingests data into the new cluster, the route group configuration needs to be modified to shadow the intake traffic for the /bulk and /_alias/{index}_write endpoints. Here is a somewhat simplified example configuration for shadowing the specified endpoints:

apiVersion: zalando.org/v1
kind: RouteGroup
spec:
  hosts:
    - cluster-name-{{{CLIENT}}}.ingress.cluster.local
  backends:
    - name: backend-old
      type: network
      address: "http://backend-old.ingress.cluster.local"
    - name: backend-new
      type: network
      address: "http://backend-new.ingress.cluster.local"
  routes:
    ## match to shadow /_bulk, /_alias/{index}_ad*_write to new backend with ES8
    - pathSubtree: /
      pathRegexp: ^/(_bulk|_alias/(index-name-template)_[\d]+_write)$
      predicates:
       - HeaderRegexp("elasticsearch-index-name", "^(index-name-template)_[\d]+($|_.*)")
      filters:
       - teeLoopback("intake_shadow")
       - preserveHost("false")
      backends:
       - backendName: backend-old

    ## shadow "intake_shadow" matched requests to new backend with ES8
    - pathSubtree: /
      pathRegexp: ^/(_bulk|_alias/(index-name-template)_[\d]+_write)$
      predicates:
       - HeaderRegexp("elasticsearch-index-name", "^(index-name-template)_[\d]+($|_.*)")
       - Tee("intake_shadow")
       - Weight(2) ## hack required to not match route with Traffic() and teeLoopback()
      filters:
       - preserveHost("false")
      backends:
       - backendName: backend-new

But that's not all. Before shadowing the intake, the mapping templates should be created. One way to do it would be to just grab them and recreate to the new cluster. But that would mean that we'd have to do it manually, and also we might miss the updates to them if they were to happen while the clusters were still running side by side. Since the templates are stored in our code repos and updated (based on the version) on application restart, the traffic related to template creation also should have been shadowed, so we had to capture this specific traffic too. Snippet of code (shortened):

spec:
  routes:
    - path: /:index/_mapping
      predicates:
        - HeaderRegexp("elasticsearch-index-name", "^(index-name-template)_[\d]+($|_.*)")
    ## <...>
    - path: /_template/*
      predicates:
        - HeaderRegexp("elasticsearch-index-name", "^(index-name-template)_[\d]+($|_.*)")

Monitoring

The whole process would make no sense if we were going blind. Since it was a multistep procedure, we needed to see how each step is changing the data, affecting the cluster, performing compared to the old cluster, etc. So we needed to set up monitoring. It was based on creating Lightstep streams and setting up the dashboards in Grafana. The dashboards were showing the traffic from both clusters side by side per endpoint, and the key metrics like latency and error rate. We also monitored CPU and memory consumption via Kubernetes.

One of the most important things was that the data would be in sync, so the boards also had index sizes and the difference between them for the old and new cluster. This way, we could see if say restoring from the snapshot was indeed successful and if the follow-up of shadow intake and stream resetting was resulting in data converging in the end.

Alerting

And last but not least, before each new cluster went live, we had to update alerts and checks that were set up on the corresponding old cluster. We had to make sure that the alerts were pointing to the new cluster and that the checks were still working as expected. We also had to make sure that the alerts were not firing during the upgrade.

Backing up the data

And of course, as soon as the new cluster went live serving queries and the data on the old cluster stopped being updated (or preferably before that), we set up the snapshotting. We had to make sure that the data was backed up, using the same policies that the previous cluster was using.

Challenges we faced

The process of upgrading the cluster was not without challenges. Some of them were expected, some were not, and some were purely based on people never having performed some procedures before, or on something slipping one's attention.

One such thing resulted in duplicates being shown in the product catalog country-wide, because there was a routing error while switching the country index from an old cluster to the new one, so one extra index was created automatically (and erroneously) and for some time two different indices with duplicate content were existing behind the same alias. But that was quickly fixed, and the duplicates were removed by just dropping the mistakenly created index. (And hey, it's better to show the product twice than not to show it at all, right?)

In general, the whole process was an amazing learning experience, and the whole team is now better prepared for the next upgrade and feels more confident tackling Elasticsearch in general. So, while assuredly sh*t still can and will happen, what matters is how you deal with it and what you learn from it.

For example, the difficulty experienced by team members while restoring the data was a good indicator that our existing procedure of restoring from snapshot was extremely fussy and error-prone, which resulted in looking for alternative solutions, like Kibana-based workflows, to make the process more straightforward and more obvious. Historically, we were using custom scripts and our CI pipeline for that, but now we're aiming to get our engineers better acquainted with Kibana. The scripts are still the default way, but we're getting there.

Success!

As always after a big project, we had a retrospective, and the team was pretty happy with the results. The upgrade was successful, and the new cluster was performing at the same level as the old one. The new features were working as expected, and the new cluster was stable. The monitoring was set up, and the dashboards were showing the data in sync. The alerts were firing as expected, and the checks were working. So all in all, it was a success.

But you know what?

Products keep upgrading. Progress is the only constant thing in the world. So, we're already looking into the next upgrade, and we're already thinking about how to make it even better.

And we will keep evolving, because that's what we do.

We're Zalando. We dress code.

(See what I did here? Even though I can't take any credit for this. This is a slogan that we once had on our company hoodies!)