Breaking the Loop: How we migrated our backup catalog for 250+ MySQL clusters to AWS

This post was originally published internally on May 9, 2025 and has been reworked for a public audience.

Authors: Ioannis Androulidakis, Mohammed Gaafar

Intro

The Database Engineering team at Booking.com is constantly looking for ways to improve database reliability and support scaling our business. In 2025 we completed a major milestone in our effort to modernize our infrastructure: we seamlessly migrated the backend of the orchestrator that schedules and manages the daily backups of 250+ production MySQL clusters. More specifically, we moved from a self-managed MySQL database running on premises to a managed Amazon RDS MySQL database running in the cloud. This blog post delves into the challenges we faced, the solutions we implemented, and some key lessons we learned along the way.

We are moving to the cloud (gradually)

Cloud adoption is a journey, not a switch. Over the past few years the adoption of AWS solutions at Booking.com has been growing rapidly across different business units, allowing teams to be more autonomous, run their databases in the cloud and solve known issues with their old on-premises setups. Like our customer teams, we run our own databases to power the core services that we offer for managing databases efficiently at scale. To name a few: automatic failover of writable primaries, auto-scaling of read-only replicas, service discovery, capacity planning, online schema changes, user access management, volume backups, etc.

Traditionally, the Database Engineering team has been running databases on-premises. In the emerging era of cloud databases, we wanted to compare our in-house offerings with existing cloud databases and bridge the gap between the two. Knowing that we lacked deep, hands-on experience with the operational realities of running critical infrastructure on AWS, we saw this as a unique opportunity to learn more about cloud databases while also improving the reliability of our systems. Long story short, we decided to prioritize the migration of one mission-critical database that we fully own from our premises to the cloud and benefit from the following:

  1. Solve known, business-critical issues related to databases running solely on-prem (e.g., circular dependencies, infrastructure costs, maintenance effort, etc.).
  2. Get first-hand experience with managed Amazon databases (RDS), understand their features and limitations, and familiarize ourselves with the ecosystem.
  3. Align with the company’s cloud strategy. Use the available Infrastructure-as-Code (IaC) tooling to understand the new operational model for managed services and improve the foundation that we build for other teams to use cloud databases (dogfooding).

NOTE: At the time of this migration we were also experimenting with Amazon Aurora. However, due to business priorities, our goal was to migrate one of our databases to Amazon RDS.

Bacula and database backups

Database backups are crucial for compliance, business continuity and, of course, disaster recovery. At Booking.com, and for quite some years now, we have been using Bacula in conjunction with an in-house DBA control plane to orchestrate database backups. Bacula is an enterprise-level backup system that manages, automates and verifies backups of stateful services, including MySQL clusters — or replication chains, as we prefer to call them internally. Much like any backup orchestrator out there, Bacula needs to maintain state in order to operate. More specifically, it stores information about available clients as well as metadata about scheduled/running/finished backup and restore jobs. The core Bacula backup system components (Director, File Daemon, Storage Daemon) are shown in the figure below:

Figure 1: Bacula architectural diagram

The Bacula Director (controller) uses a catalog to store metadata for backup and restore operations, including a record of all Jobs, Clients, and Files backed up, together with useful information such as permissions, dates and physical storage locations on volumes. The actual file data flows from Bacula’s File Daemon, which runs inside a clone source (a dedicated MySQL replica to take backups from), to Bacula’s Storage Daemon, which runs offsite and writes the data it receives to a different medium and location.

Bacula provides various utilities and supports multiple backends for its catalog, including storing it in a MySQL schema. Back when we adopted Bacula as the backup solution for our MySQL clusters, we decided to create the bacula schema and host it ourselves on dbadb, a MySQL replication chain that is home to various DBA-related schemas, is entirely owned by the Database Engineering team and is (still) running on premises with replicas across 3 regions within Europe:

Figure 2: dbadb on-prem MySQL replication chain (3 regions)
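To make the catalog's role more concrete, the sketch below shows the kind of question the bacula schema can answer, for example: when did each client last have a successful backup? This is a hypothetical illustration; connection details are placeholders, and the table and column names follow the community Bacula catalog layout (Job, Client), which may differ between versions.

import pymysql

# Placeholder connection details for the server hosting the `bacula` schema.
conn = pymysql.connect(host="catalog-host", user="bacula_ro",
                       password="***", database="bacula")

# Latest successful backup per client: Type 'B' marks backup jobs and
# JobStatus 'T' marks jobs that terminated normally in the Bacula catalog.
QUERY = """
SELECT c.Name AS client, MAX(j.EndTime) AS last_successful_backup
FROM Job j
JOIN Client c ON c.ClientId = j.ClientId
WHERE j.Type = 'B' AND j.JobStatus = 'T'
GROUP BY c.Name
"""

with conn.cursor() as cur:
    cur.execute(QUERY)
    for client, last_ok in cur.fetchall():
        print(f"{client}: last successful backup at {last_ok}")
conn.close()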

Nowadays our Bacula installation is responsible for taking backups of 250+ MySQL replication chains daily. We take both regular and immutable backups in each of our 2 main regions in Europe, which adds up to ~1000 new backups per day. These backups are stored for at least 7 days on our storage servers, depending on the business criticality of each MySQL replication chain.

On top of this, and to ensure integrity, we have set up an automated framework that runs on Booking Kubernetes Service (BKS) and performs restore tests from these backups to verify their usability in a real disaster recovery scenario — ensuring zero errors when restoring backups is an essential step in modern backup systems (also see the 3–2–1–1–0 backup rule and its variations). Additionally, developers and data owners are able to perform backup-related operations on-demand via an internal portal that we provide, e.g., take a new backup or restore an existing one. Also, backups are occasionally used to seed newly provisioned database servers with data to reduce their spin-up time, especially if a MySQL replication chain lacks a clone source.

Figure 3: Internal portal — Database Backups
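Coming back to the restore tests mentioned above, here is a minimal, hypothetical sanity check of the kind that could run after a backup has been restored onto a scratch server. The hostnames, credentials and schema name are placeholders; this is not our actual BKS framework, just a sketch of the idea.

import pymysql

RESTORED_HOST = "restored-scratch-replica"  # placeholder: server the backup was restored to
EXPECTED_SCHEMA = "example_schema"          # placeholder: schema the chain is expected to contain

conn = pymysql.connect(host=RESTORED_HOST, user="restore_check", password="***")
with conn.cursor() as cur:
    # 1. The schema restored from the backup must exist.
    cur.execute("SHOW DATABASES LIKE %s", (EXPECTED_SCHEMA,))
    assert cur.fetchone(), f"schema {EXPECTED_SCHEMA} missing after restore"

    # 2. Every table must be readable and non-empty: a crude but useful
    #    smoke test that the restored data is actually usable.
    cur.execute(
        "SELECT table_name FROM information_schema.tables WHERE table_schema = %s",
        (EXPECTED_SCHEMA,),
    )
    for (table,) in cur.fetchall():
        cur.execute(f"SELECT COUNT(*) FROM `{EXPECTED_SCHEMA}`.`{table}`")
        assert cur.fetchone()[0] > 0, f"table {table} restored empty"
conn.close()
print("restore test passed")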

Identifying a long-lived circular dependency

As mentioned earlier, one of the primary purposes of a backup system is to enable data recovery to a certain point in time in the past in case of data loss or corruption. Let’s imagine the following undesirable scenario: a massive natural disaster, ransomware attack or data corruption problem affects our premises and we lose dbadb entirely, together with all the schemas that it hosted, including the bacula schema, that is, the catalog that holds metadata about backups of all MySQL replication chains.

There would be no automated way to restore dbadb from one of its backups taken by Bacula. The reason is that to restore any MySQL replication chain from an existing backup (including dbadb itself) automatically, a working Bacula Director is needed to create and schedule the underlying restore job. And, in turn, a working Bacula Director requires access to a valid catalog (the dbadb.bacula schema) to retrieve critical information about backup jobs and restore volumes: namely, each backup record contains the date of the backup as well as the storage server(s) where it is stored.

Figure 4: Circular dependency between Bacula Director and dbadb

Of course this observation is not new: operating with circular dependencies in a large-scale system is a strategic trade-off rooted in architectural complexity and pragmatism. Over the years our systems have grown larger and larger, and teams do not always have the time to retrospect, redesign and modernize them. We do so when we see meaningful gains during major initiatives, like the one we are discussing in this article: moving services to the cloud. Adding to this, relying on third parties does not (always) guarantee resilience, as we saw rather recently. Still, the above poses serious questions about the automation around our Disaster Recovery (DR) mechanism and gives us strong motivation to make our systems more resilient.

NOTE: it is technically possible to restore dbadb manually from a backup (assuming we find where the backup is stored), set up a Bacula Director that talks to it and start restoring the rest of MySQL replication chains using older backups. However, we treat this as our last resort since it comes with great risk, complexity and, of course, extended downtime.

Choosing the right candidate for our cloud migration

We examined many different MySQL replication chains for the aforementioned cloud migration, including complex ones that our core DBA control plane relies on. Since this was the (very) first time that we migrated a database from on-prem to the cloud and we did not want to risk our business, we came up with the following criteria that a good candidate should meet:

  1. High Impact/ROI: considerably improve the resiliency of our systems once the database is migrated to the cloud.
  2. Simple Patterns: single-schema, single-tenant, relatively low connection rate, predictable traffic, no need for (very) low latency responses.
  3. Tolerance: few hours of downtime are acceptable without negatively impacting business.

Yes, you guessed correctly — we chose to extract and migrate Bacula’s catalog from dbadb to Amazon RDS since it ticked all boxes:

  • Migrating the bacula schema outside our premises would eliminate a long-standing circular dependency in our disaster recovery mechanism. By leveraging Amazon’s fully managed services, backups of the bacula schema will now be independently managed by RDS and stored securely in multiple AWS locations (S3), utilizing both automated backups and manual snapshots.
  • With the entire process confined to a single, relatively simple schema (dbadb.bacula) that is consumed exclusively by the Bacula Director (a single instance running on-prem), the migration seemed manageable and conceptually clear. Bacula itself has chosen not to separate read-only from read-write database handles, so all it needs is a single server to send all its queries to. Even though Bacula’s database powers the daily backups for all MySQL replication chains that we manage, it operates with a moderate and predictable average number of User Connections (~200), Connected Threads (~24) and Queries per second (~92); numbers of this kind can be sampled from standard MySQL status counters, as sketched right after this list.
  • Our compliance standards incorporate an error budget for database backups, giving us the flexibility to sustain backup system downtime in the order of hours. This means that during or after the migration we could do extra testing, maintenance, resizing of instances or fine-tuning before resuming normal operations.
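A minimal sketch of how such workload numbers can be sampled from standard MySQL status counters; the hostname and credentials are placeholders, and our real figures come from our regular metrics pipeline rather than an ad-hoc script.

import time
import pymysql

conn = pymysql.connect(host="dbadb-primary", user="monitor", password="***")  # placeholders

def global_status(cur, name):
    # SHOW GLOBAL STATUS returns (Variable_name, Value) rows.
    cur.execute("SHOW GLOBAL STATUS LIKE %s", (name,))
    return int(cur.fetchone()[1])

with conn.cursor() as cur:
    # Point-in-time view of connected threads.
    threads = global_status(cur, "Threads_connected")

    # Queries per second over a short sampling window.
    before = global_status(cur, "Queries")
    time.sleep(10)
    after = global_status(cur, "Queries")
    qps = (after - before) / 10

print(f"threads_connected={threads} qps={qps:.0f}")
conn.close()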

Migration plan and challenges

Once we chose bacula as the schema to be migrated from our premises to the cloud, we started working on a migration plan that met business requirements. Even though the migration plan was ready in Q2 2024, it was only marked as ready-for-execution in late Q3 2024, and by that time new business requirements had come up which changed the original timeline and considerably expanded the scope of our work.

First, a high-priority company initiative related to operating systems came in. Since CentOS 8 (COS8) had already reached EOL, we had to migrate our fleet to Oracle Linux 9 (OL9), which was chosen as its successor. Of course all our Bacula on-prem hosts had to be upgraded as well: 2 Bacula Directors (one active, one passive) and 38 Storage Daemons distributed across 2 regions. Second, a newer version of Bacula Enterprise had been released (v16) — the version we were using (v12) was already old and had to be upgraded as well, since it did not support the new operating system we were transitioning to (v12 RPMs are not available for OL9). In short, we had to adapt our migration plan and prioritize these items for compliance, security and compatibility reasons.

Generally speaking, and based on standard engineering wisdom, one should not combine migrations. Usually, this strategy adds time overhead, increases the blast radius in case of failure and makes debugging harder, due to inter-dependencies and multiple variables changing simultaneously. However, in this specific case the “3-in-1” migration made sense: by choosing a “side-by-side” migration strategy instead of an “in-place” one, we could deal with the additional requirements quite effectively. That is, we could build a new Bacula v16 cluster on OL9 from scratch (Greenfield advantage) and then simply follow the steps from the (original) cloud migration plan:

  • OS Upgrade: Provision new machines running OL9 for the new Bacula installation.
  • Bacula Upgrade: Run Bacula Enterprise v16 in the newly provisioned OL9 machines.
  • Database Migration: Migrate Bacula’s catalog from on-prem dbadb to Amazon RDS.

The upgrade(s) to OL9 and Bacula v16 could be tested independently without affecting the existing backup system, and we could treat these steps as “preparatory” ones for the migration of the bacula schema to Amazon RDS. Eventually, we decided to treat the new stack (OL9 + Bacula v16 + RDS MySQL) as a single, cohesive unit to which we could incrementally move backups, while keeping the old stack (COS8 + Bacula v12 + on-prem MySQL) around for continuity and easier rollback. This phased approach would enable us to renovate our backup infrastructure at 3 different levels and give us enough breathing room for the years to come — at the cost of the whole migration taking longer to complete.

At this point we should note that we deliberately rejected a replication-based migration to avoid potential schema conflicts between the old and new Bacula versions (v12, v16) together with the technical friction of configuring an RDS replica from an on-premise source. After all, since we would be running the old and new Bacula clusters in parallel for longer than our backup retention window, we could achieve a clean ‘fresh start’ where the new system naturally built its own complete history for compliance without us having to copy/import historical data.

With this plan in mind, we proceeded and discussed more topics around the upcoming migration:

Business Continuity: daily backups of all MySQL replication chains must continue to operate smoothly throughout the migration. For this, we would have to add missing logic and features to our DBA control plane so that it could communicate with multiple Bacula clusters instead of one, given that the old and new backup systems would run in parallel during the testing/validation phase.

IaC Dogfooding: we wanted to have a customer-like experience during the migration when it came to the provisioning of a new MySQL database on Amazon RDS. To this end, we decided to use the same Terraform modules that are available to customer teams in order to create and manage needed cloud resources, e.g., DB Instance, IAM Roles, IAM Policies, Keys, Secrets, etc. This would surely help us feel customer pains and likely lead to our Amazon RDS Terraform module becoming more robust and mature.

Multi-region Support: when we chose Amazon RDS as the target system for our cloud migration we were well aware that it does not support multi-region deployments, only multi-AZ (also see RDS SLAs). The source system for our migration (on-prem MySQL) was already running in two different regions, so we knew we had to invest time into extending the Terraform module for Amazon RDS and provision a second DB Instance in a different AWS region that would replicate from the primary one. We saw this both as a challenge and a good learning exercise. After all, and as mentioned earlier, Bacula Director (the consumer of the bacula schema) runs as a single instance in a single region in our premises.

NOTE: Yes, Amazon Aurora offers multi-region redundancy out of the box which would mirror our on-prem setup. However, we deliberately chose to use Amazon RDS as our target system to align with our strategic priority of ‘dogfooding’ our Amazon RDS Terraform module, gain operational experience with Amazon RDS and its ecosystem, while also keeping cloud costs lower.

Performance vs Costs: right-sizing an RDS instance based on the on-prem metrics that we were collecting would be difficult. Remember, the bacula schema was co-hosted with other schemas on an on-prem MySQL replication chain. At minimum, we wanted the newly created Bacula Amazon RDS instance to perform equally well with our old setup, while keeping cloud costs under control.

Separation of Concerns: moving the bacula schema out of the dbadb on-prem database should remain our top priority and driving force. Chain splits are a standard technique that we apply to “shared” MySQL replication chains in our continuous effort to reduce multi-tenancy and separate concerns.

Figure 5: New setup with Bacula catalog on Amazon RDS

The Migration

We kicked off the migration in Q3 2024.

Step 1: Provision new Bacula cluster

The first step was to build the nodes for the new Bacula cluster from scratch using Oracle Linux 9 and Bacula v16. Note that the new Bacula Director and Storage Daemons would continue to run on-premises, while Bacula Director’s catalog would be running on AWS. This step required ingesting and mirroring the latest Bacula Enterprise RPMs, and it also brought important changes to our Puppet code. Overall, the node provisioning phase was smooth and gave us fresh ground to work on.

Step 2: Provision new Amazon RDS DB instance for Bacula

Next, we had to build a new Amazon RDS DB instance to host the Bacula catalog. For this, we started working with the shared Amazon RDS Terraform module within the company, which was owned by a different team back then. Since we wanted a flexible, yet opinionated, way to provision DB instances on Amazon RDS, we introduced a thin, internal Terraform module that inherited from the parent one and was entirely under our control. Eventually, after tweaking some input parameters and familiarizing ourselves with the cloud resources that would be created under the hood, we had a multi-AZ Amazon RDS DB Instance up and running, ready to host Bacula Director’s catalog:

Figure 6: Internal Terraform Amazon RDS module

One of the first questions when migrating a database from on-prem to AWS is how to size the new instances correctly. For MySQL replication chains that host a single schema this is usually straightforward. However, the old Bacula schema was shared on an on-prem database server (dbadb) together with other schemas, making it nearly impossible to isolate its specific CPU and IOPS usage.

Usually, instance sizing in RDS is based on the number of CPUs, memory size, disk size, IOPS and IO throughput. For databases running on-prem we do collect disk space metrics per schema; however, this is not the case for the other metrics, which are collected per host/machine. Even metrics like Queries Per Second (QPS) per schema don’t always translate linearly to cloud resources: a schema doing 10K lightweight QPS might need less CPU than one doing 1K heavy QPS. The InnoDB buffer pool size usually gives a good indication of the amount of memory required for high performance — however, this metric is also collected per host and not per schema. In short, lacking precise per-schema metrics while taking a schema out of an on-prem database and moving it to a cloud one, we had to estimate. Due to the criticality of the backup system operating smoothly, we decided to begin with instances equal in size to the servers used for the dbadb MySQL replication chain, with a suitable disk size for the Bacula schema.
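Disk space is the one dimension that can be measured per schema; a rough way to obtain it directly from MySQL (a sketch, with placeholder connection details) is to aggregate information_schema.tables:

import pymysql

conn = pymysql.connect(host="dbadb-primary", user="monitor", password="***")  # placeholders
with conn.cursor() as cur:
    # Approximate on-disk footprint per schema (data + indexes), in GiB.
    cur.execute("""
        SELECT table_schema,
               ROUND(SUM(data_length + index_length) / POW(1024, 3), 1) AS gib
        FROM information_schema.tables
        GROUP BY table_schema
        ORDER BY gib DESC
    """)
    for schema, gib in cur.fetchall():
        print(f"{schema}: {gib} GiB")
conn.close()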

At this point we faced a choice: attempt to “right-size” aggressively and cut down on budget early, or “over-provision” to ensure reliability and optimize later. Eventually, and since our budget allowed this, we chose the latter and, after reviewing the available instance types for Amazon RDS, we decided to go with db.r7g.2xlarge and conditionally downsize in the future. That is, we provisioned an RDS instance that matched our on-prem capacity, accepting higher initial costs to eliminate the risk of performance bottlenecks during the migration. Over-provisioning would help us determine the right size in the future, based on the current usage after the system runs at full capacity for enough time. We viewed this as a “Day 2” optimization — reliability must always come first.

Step 3: Extend DBA control plane to talk to multiple Bacula Directors

Applying a “side-by-side” migration strategy with parallel environments (old and new) meant that we had to extend our DBA control plane and tooling to be aware of multiple backends for backups instead of one. Now, we had two active Bacula Directors running: one scheduling backups using the old Bacula catalog hosted on-prem (dbadb) and one scheduling backups using the new Bacula catalog hosted on Amazon RDS. Then, we needed a way to verify the stability of the new stack with real production data while the old stack continued to handle the majority of the load.

In this direction, we made our backup configuration mechanism more flexible by introducing a new runtime option to route specific subsets of our backups (e.g., regional, regular, immutable, chain-specific) to individual Bacula clusters:

{
  "bacula-blue": {
    "region1": {
      "take_regular_backups": false,
      "take_immutable_backups": true,
      "chains": ["chain1", "chain2", "chain3"]
    }
  },
  "bacula-green": {
    "region2": {
      "take_regular_backups": true,
      "take_immutable_backups": true
    }
  }
}
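To illustrate how such a config can drive routing decisions, here is a hypothetical helper; the function and file names are made up, and the real logic lives in our DBA control plane.

import json

def clusters_for(config, region, chain, backup_type):
    """Return the Bacula clusters that should take `backup_type` backups
    of `chain` in `region`, according to a routing config like the one above."""
    selected = []
    for cluster, regions in config.items():
        rules = regions.get(region)
        if not rules:
            continue  # this cluster does not serve the region at all
        if not rules.get(f"take_{backup_type}_backups", False):
            continue  # this backup type is disabled for the cluster/region
        if "chains" in rules and chain not in rules["chains"]:
            continue  # an explicit chain list restricts the cluster further
        selected.append(cluster)
    return selected

with open("bacula_routing.json") as f:  # hypothetical file name
    config = json.load(f)
print(clusters_for(config, "region1", "chain2", "immutable"))  # -> ['bacula-blue']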

This JSON object is parsed and interpreted by our in-house DBA control plane which generates and pushes the desired Bacula Director configuration. For a given period of time (2–3 weeks) we took backups on both the old and the new system and gradually moved all backup jobs to the new system as we gained more confidence. During this time we were monitoring both the health of the newly created Bacula Amazon RDS instance as well as the success rate of daily backup and restore jobs:

Figure 7: Grafana dashboard to monitor Bacula jobs

Note that to monitor multiple Bacula clusters we had to make adjustments both to our backup/restore-related metrics (e.g., add label for Bacula cluster) as well as our Grafana dashboards (e.g., add variable to select Bacula cluster). Eventually, this architecture allowed us to bring the new cloud-based system online without taking the old on-prem system offline and, thus, avoid risking our compliance posture during the transition.

Step 4: Set up regional disaster recovery

While we were gradually moving backups to the new Bacula cluster powered by Amazon RDS, we knew we had to deal with its Disaster Recovery capabilities — since Bacula is our main disaster recovery mechanism. By default, and as mentioned earlier, a single Amazon RDS DB Instance cannot survive a total regional failure. In our on-prem setup, dbadb was running in 3 regions with automatic primary failover. This means that accessing the Bacula catalog would still be possible in case one region went down. Of course we did not want to compromise this resiliency by moving to Amazon RDS, so we wanted to ensure data availability in at least two AWS regions, that is, move from a zonal to a regional failure domain.

First, since our backup system can tolerate some downtime, we had the option to rely on Amazon RDS’s automated backups and also enable automated cross-region replication. In case of a regional failure we would have at least one (older) snapshot of the Bacula catalog in order to restore it. Second, we opted for provisioning a second Amazon RDS DB instance in a different AWS region which would replicate from the primary one. Note that this custom setup does not guarantee an automatic failover, but it makes a manual failover much easier. More specifically, we would have a “stand-by” primary ready to take over using fresh data instead of restoring from an older snapshot. Again, this is possible since our Bacula-based backup system and compliance requirements allow for a few hours of downtime. If you manage more sensitive systems that cannot tolerate such a long downtime this approach is not recommended — instead, you should invest in a solution that natively supports multi-region deployments.
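Mechanically, this second instance is a cross-region read replica of the primary. A minimal boto3 sketch of the idea is shown below; all identifiers, regions and ARNs are placeholders, and in our case the actual resources are managed through Terraform.

import boto3

# The replica is created in the destination region; the source is referenced by ARN.
rds = boto3.client("rds", region_name="eu-west-1")  # placeholder destination region

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="bacula-catalog-replica",
    SourceDBInstanceIdentifier="arn:aws:rds:eu-central-1:123456789012:db:bacula-catalog",
    SourceRegion="eu-central-1",  # lets boto3 pre-sign the cross-region request
    KmsKeyId="arn:aws:kms:eu-west-1:123456789012:key/example",  # required when the source is encrypted
    DBInstanceClass="db.r7g.2xlarge",
    MultiAZ=True,
)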

Looking back at step 2, we initially provisioned a single-region multi-AZ RDS DB Instance, planning to add a cross-region replica later. Well, this backfired on us, as we soon realized that AWS KMS encryption keys (which are automatically integrated into Amazon RDS and used to protect data stored on the underlying disks) must be configured as “multi-region” right from creation — this cannot be changed later. Because we started with a single-region KMS encryption key, we couldn’t simply add a replica in another region later. To resolve this, we went through the following manual steps, sketched in the snippet right after the list:

  1. Create a new multi-region KMS key
  2. Snapshot the existing database
  3. Build a new cluster from the snapshot. This required decryption of the snapshot using the old key and re-encryption of the DB instance’s disk with the new key.
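These steps map roughly to the following boto3 calls. This is a simplified sketch with placeholder identifiers: waiters, error handling and parameters such as subnet or parameter groups are omitted, and in practice the change went through our Terraform module.

import boto3

kms = boto3.client("kms", region_name="eu-central-1")  # placeholder region
rds = boto3.client("rds", region_name="eu-central-1")

# 1. A multi-Region KMS key; this property can only be set at creation time.
key = kms.create_key(Description="bacula-catalog-mrk", MultiRegion=True)
new_key_arn = key["KeyMetadata"]["Arn"]

# 2. Snapshot the existing instance, which is still encrypted with the old key.
rds.create_db_snapshot(
    DBInstanceIdentifier="bacula-catalog",
    DBSnapshotIdentifier="bacula-catalog-pre-rekey",
)
# (wait for the snapshot to become available before the next step)

# 3. Copying the snapshot under the new key re-encrypts it ...
rds.copy_db_snapshot(
    SourceDBSnapshotIdentifier="bacula-catalog-pre-rekey",
    TargetDBSnapshotIdentifier="bacula-catalog-rekeyed",
    KmsKeyId=new_key_arn,
)

# ... and the new instance is restored from the re-encrypted copy.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="bacula-catalog-v2",
    DBSnapshotIdentifier="bacula-catalog-rekeyed",
    DBInstanceClass="db.r7g.2xlarge",
    MultiAZ=True,
)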

This was quite a painful lesson that taught us to engineer for failure and prioritize multi-region requirements sooner rather than later.

Step 5: Chain Split

The most critical outcome of this migration was to break the circular dependency and move the bacula schema outside the dbadb MySQL replication chain. With the Bacula catalog now living in Amazon RDS, its failure domain sits outside the systems it protects. If our on-prem environment suffers a catastrophic failure, the backup catalog remains available in the cloud, ready to orchestrate the recovery.

As mentioned earlier, we chose not to copy or import any data from the old on-prem catalog to the new one stored in Amazon RDS; instead, we configured our DBA control plane to schedule backups on both Bacula clusters for a given period of time. Once we were confident enough and fully transitioned to the new system, we took a snapshot of the on-prem bacula schema, archived it and finally dropped it from dbadb, after ensuring that it had zero consumers.
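A simple way to gain confidence that a schema really has zero consumers before dropping it (a sketch, with placeholder connection details) is to look for live sessions that still touch it:

import pymysql

conn = pymysql.connect(host="dbadb-primary", user="monitor", password="***")  # placeholders
with conn.cursor() as cur:
    # Any session using the schema as its default database, or running a
    # statement that references it, still counts as a consumer.
    cur.execute(
        """
        SELECT id, user, host, db, info
        FROM information_schema.processlist
        WHERE db = %s OR info LIKE %s
        """,
        ("bacula", "%bacula.%"),
    )
    consumers = cur.fetchall()
conn.close()

if consumers:
    for row in consumers:
        print("still in use:", row)
else:
    print("no live consumers of the bacula schema")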

NOTE: Taking backups in two Bacula clusters instead of one (even for a fixed period of time) comes with additional compute and storage requirements. We closely co-ordinated with the Infrastructure and Storage teams on this to ensure that needed resources were available for us during the migration.

Step 6: Right Sizing on Day 2

By this stage, we had a fully functional Bacula cluster with its catalog stored in Amazon RDS and replicated across two AWS regions. As noted earlier, we initially launched with over-provisioned instances to mitigate migration risks. After two months of gathering production metrics, we analyzed CPU, memory, and IO usage to begin right-sizing. We optimized the instances to match our actual baseline while ensuring sufficient headroom for spikes — such as ‘thundering herds’ of jobs following downtime or heavy compliance reporting queries. Ultimately, right-sizing is not a one-off task; it is a periodic operational exercise essential for ensuring your capacity evolves alongside your system.
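The raw numbers behind such a right-sizing exercise can be pulled straight from CloudWatch. The sketch below uses a placeholder instance identifier and region; in practice these metrics feed into dashboards rather than ad-hoc scripts.

import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="eu-central-1")  # placeholder region
end = datetime.now(timezone.utc)
start = end - timedelta(days=60)  # two months of production baseline

def rds_stat(metric, stat):
    datapoints = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "bacula-catalog"}],
        StartTime=start,
        EndTime=end,
        Period=21600,  # 6-hour buckets keep us well under the API datapoint limit
        Statistics=[stat],
    )["Datapoints"]
    return [p[stat] for p in datapoints]

print(f"peak CPU %:       {max(rds_stat('CPUUtilization', 'Maximum')):.1f}")
print(f"min freeable mem: {min(rds_stat('FreeableMemory', 'Minimum')) / 2**30:.1f} GiB")
print(f"peak write IOPS:  {max(rds_stat('WriteIOPS', 'Maximum')):.0f}")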

Figure 8: Amazon RDS Bacula DB instance metrics

Currently, our Bacula cluster is running with db.r7g.xlarge instances on Amazon RDS and we are quite happy with it.

Key Lessons Learned

Through this process, we crystallized several lessons that will guide our future database cloud migrations:

Start Simple

Always choose a fairly simple service for your first cloud migration over a more complex one; this will give you some peace of mind to experiment with the new target system without risking too much. For us, Bacula was a great candidate: a schema with a simple structure, consumed by a single tenant, and with considerable ROI once migrated to the cloud.

Ensure Business Continuity

In complex scenarios you should carefully choose your migration strategy. By building a parallel stack or enabling replication between on-prem and cloud databases you can dramatically reduce the risk of breaking your system. In our case, performing a 3-in-1 migration was actually safer than a sequential migration because it avoided the “intermediate hell”. If the new stack failed or misbehaved during testing, the old production system would remain untouched.

Prioritize Reliability

Right-sizing resources while also managing cloud costs is important. Depending on the business case and available budget one might choose to prioritize system reliability and resilience. When migrating critical systems, don’t try to over-optimize costs on Day 1. When in doubt, and assuming that your budget allows it, match your on-prem capacity even if this means over-provisioning for a given period of time. It is safer to request extra capacity to avoid disruptions and optimize later, once you have real baseline data.

Design for Failure Early

If your system has disaster recovery needs, put this high in your priority list. That is, design and build a multi-region architecture right from the start instead of “adding it later”. Due to unforeseen issues or major blockers (like the KMS key issue we faced) you might need to rebuild your target system from scratch — or even worse — abandon your migration altogether.

Conclusion

Overall, the “3-in-1” migration of the Bacula catalog was a success and contributed to the company’s cloud transformation effort. Not only did we move our first database to Amazon RDS, but we also significantly renovated our backup system (OS and software). Most importantly though, we broke a long-standing circular dependency in our disaster recovery mechanism and secured our ability to recover from a total on-prem failure.

This journey proved that the best way to learn cloud operations is to do it with a real production workload. Along the way we extended our DBA control plane and tooling, submitted numerous improvements to the company’s IaC modules and built a blueprint for other teams at Booking.com to follow, making their future migrations easier.
