Cart migration from Cassandra to Mongo for an enhanced shopping experience

Parul Agnihotri
Published in Myntra Engineering · 11 min read · Sep 14, 2023

Co-Contributors: Shubham Shailesh Doijad, Pramod MG, Ajit Kumar Avuthu, Vindhya Shanmugam, Subhash Kumar Mamillapalli

Introduction

Cart is a critical component of an e-commerce business. A smooth checkout process increases the user's confidence in the platform. Given the importance of the Cart in preserving the user's intent to purchase, we need to make sure the Cart systems are highly stable, performant, fault-tolerant & resilient to high traffic.

Myntra's user base continues to grow at a rapid pace. The more users onboard onto the platform, the more shopping carts are created and the more operations each Cart sees. Cart data is persisted in the Cassandra datastore. Cassandra was a compelling choice for Cart persistence due to its massive scalability for write operations and its fault-tolerance capabilities. However, why it ultimately fell short of being our perfect choice, and our migration journey towards MongoDB, is what we will discuss in this blog post.

The decision to migrate from Cassandra to MongoDB

During periods of high traffic, Cart services started to observe ~0.05% error rates due to timeouts, with latencies increasing to ~1 sec. Latencies from Cassandra were shooting above 250 ms. This left a large number of users unable to ‘Add to Cart’ and caused dips in our revenue. Some of the major observations with Cassandra are described below.

Problems observed with Cassandra

We were using Cassandra as our primary database to persist user carts. Though it initially made sense given Cassandra's high availability and scalability, of late we started noticing quite a few performance issues affecting the overall stability of the cart application.

  • Tombstones causing high read latencies: In the cart, almost 35 percent of operations are updates (overwrites), deletes, or TTL expiries, which resulted in a very high number of tombstones in Cassandra. Each delete/update operation on the cart resulted in 16 tombstones. When the number of tombstones grows too large, read queries slow down because the DB has to scan a large number of tombstones before reaching live data. In our case this was causing latency spikes of more than 250 ms, breaching our SLA and degrading service performance. To fix this issue we needed regular repair and compaction of our DB, especially before HRD events.
  • Using batch statements: The multiple lookup tables created in Cassandra resulted in at least 4–5 queries for reading and updating tables whenever a cart got created, merged, or deleted. These batch statements were used for consistency, but they resulted in higher network traffic to the DB, higher overall network bandwidth utilization, and increased latencies (a sketch of this pattern follows the list).
  • Counter column limitations: Cassandra lacks support for counter columns in the same table as non-counter columns. We needed a separate table just to track the count of products, which increased the complexity of managing multiple updates.
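
For illustration, here is a minimal sketch of the batched multi-table write pattern described above, using the Python Cassandra driver. The keyspace, table, and column names are assumptions for the example, not our actual schema:

```python
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

# Hypothetical keyspace/table names, used only to illustrate the pattern.
session = Cluster(["cassandra-host"]).connect("cart_keyspace")

insert_cart = session.prepare(
    "INSERT INTO cart (cart_id, items) VALUES (?, ?)"
)
insert_lookup = session.prepare(
    "INSERT INTO cart_lookup_by_user (user_id, cart_id) VALUES (?, ?)"
)

# Every cart create/merge/delete touches the cart table plus the lookup
# tables, so the writes were wrapped in a batch to keep them consistent.
batch = BatchStatement()
batch.add(insert_cart, ("cart-123", '["sku-1", "sku-2"]'))  # items as a JSON string in this toy schema
batch.add(insert_lookup, ("user-42", "cart-123"))
session.execute(batch)
```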

Cassandra Improvements

A few improvements were made to rectify these issues.

  • GC Grace Period Reduction: To reduce tombstone accumulation, the GC grace period was decreased from the default 10 days to just 6 hours, resulting in faster tombstone removal and improved performance. We were using QUORUM consistency, which ensured we did not see deleted data reappear due to tombstones not being propagated to replicas before compaction.
  • GC Long Pauses: Frequent system stalls and read latency spikes were observed, coinciding with GC pauses lasting 2 to 3 seconds. Enabling GC logging and analyzing GC performance helped identify this issue, and addressing the long pauses enhanced system stability and reduced the read latency spikes.
  • Off-Heap Caching: To address performance concerns, cache settings in Cassandra were reconfigured. The key_cache_size_in_mb and row_cache_size_in_mb were reduced from 2GB to 512MB and from 4GB to 512MB, respectively, while file_cache_size_in_mb was increased from 8GB to 18GB. Off-heap caching options like the chunk cache were used to cache parts of SSTable reads from disk in parallel. Reconfiguring the cache settings and using off-heap caching improved cache efficiency and sped up read operations.
  • Tuning Bloom Filter Size: Bloom filters speed up read operations by avoiding unnecessary lookups in SSTables. Tuning their size in memory further accelerated reads.
  • Compaction Strategy: Choosing an appropriate compaction strategy for the workload is crucial. The default Size Tiered Compaction Strategy is suitable for write-heavy workloads, while Leveled Compaction may be preferable for read-heavy ones. Selecting the right strategy for the workload optimized data storage and query response times (see the sketch after this list).
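
Two of these tunings are applied per table. As a rough sketch via the Python driver, assuming hypothetical keyspace and table names:

```python
from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("cart_keyspace")  # hypothetical keyspace

# 6 hours (21600 s) instead of the 10-day default, so tombstones become
# purgeable sooner. Safe here because reads and writes use QUORUM consistency.
session.execute("ALTER TABLE cart WITH gc_grace_seconds = 21600")

# Leveled compaction keeps each read touching fewer SSTables, which suits a
# read-heavy table better than the default size-tiered strategy.
session.execute(
    "ALTER TABLE cart WITH compaction = {'class': 'LeveledCompactionStrategy'}"
)
```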

Overall, these measures led to a more efficient and stable Cassandra system with fewer stalls, better cache usage, and compaction strategies suited to specific workloads. After these changes we were able to support 1.5 times the load we were handling previously. However, this was not sufficient to solve our problems, because our data is inherently transient and needed a solution highly optimized for reads & deletes.

Choosing Mongo

There were some major considerations in choosing a data store for Cart, such as strong consistency and support for secondary indexes, which helped us evaluate whether Mongo was the right choice.

  • Cart is a read-heavy system, with a read:write ratio of 3:1. Heavy read loads can be handled by adding more slave nodes to the MongoDB cluster.
  • MongoDB supports secondary indexes on any field in a document making this more suitable for some of the Cart use cases.
  • Cassandra nodes required periodic repair & compaction operations, which could be avoided with the MongoDB cluster.
  • Mongo came across as a low-maintenance option requiring fewer nodes than Cassandra to support the same traffic. We reduced the footprint by 40%.
  • Mongo provides ease of mapping the business data to the schema with optional attributes compared to Cassandra.

A few other considerations we took into account before moving forward with Mongo:

  • The entire cart data is stored as one single document. While this optimizes the reads where the entire cart is loaded in one go, it can be taxing in terms of data throughput on the disk because every read or write has to incur a very high amount of IO. However, for workloads where the entire cart is loaded, this model can offer better performance through internal DB caching and OS page cache.
  • The entire cart load for checkout operations needs to be served by the Master/Primary node because checkout requires highly consistent data (sketched below).
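
As a rough sketch of how that split looks with the PyMongo driver (the connection string and collection name are assumptions):

```python
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://mongo-host:27017/?replicaSet=cartRS")  # hypothetical URI
db = client.cart_db

# Cart page loads can be spread across slave/secondary nodes.
browse_carts = db.get_collection(
    "carts", read_preference=ReadPreference.SECONDARY_PREFERRED
)

# Checkout reads stay on the primary so the data is strongly consistent.
checkout_carts = db.get_collection("carts", read_preference=ReadPreference.PRIMARY)
```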

Data Modelling & Schema Design

Old Cassandra Schema

One major issue with the old Cassandra schema was the use of multiple lookup tables for fetching data. To access a cart, the system had to perform two read calls: one to the lookup table for the cart ID and another to the cart table for the associated data. This meant a minimum of two read calls for every cart read. Similarly, any insert or delete required at least two write calls.
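
In terms of the Python Cassandra driver, the old read path looked roughly like the following two round trips; the table and column names here are hypothetical:

```python
def fetch_cart(session, user_id):
    """Old read path: hit the lookup table first, then the cart table."""
    lookup = session.execute(
        "SELECT cart_id FROM cart_lookup_by_user WHERE user_id = %s", (user_id,)
    ).one()
    if lookup is None:
        return None
    # Second round trip for the actual cart data.
    return session.execute(
        "SELECT * FROM cart WHERE cart_id = %s", (lookup.cart_id,)
    ).one()
```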

Mongo Schema

A few changes were made in the new Mongo schema to correct the above issue. Instead of using a lookup table to fetch a cart ID, we decided to make an explicit user identifier part of the cart ID, removing the need for multiple lookup tables. Illustrative document shapes follow the list below.

  • Logged-In-Cart: This document stores the cart ID and session ID along with the cart-related data. The decision to store the session ID explicitly in the logged-in cart document was to reduce the number of merge operations.
  • Non-Logged-In-Cart: This document contains a non-logged-in user's cart. It has a different expiry than a logged-in cart, and we don't store any user-related data in it. This cart gets deleted once the cart merge process is completed.
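
Purely for illustration, the two document shapes might look something like this; the field names, ID prefixes, and TTL values below are assumptions, not the actual production schema:

```python
from datetime import datetime, timedelta, timezone

user_id, session_id = "u12345", "s67890"   # sample identifiers
now = datetime.now(timezone.utc)

# Logged-in cart: the cart ID embeds the user identifier, so no lookup table
# is needed, and the session ID is kept alongside to cut down merge lookups.
logged_in_cart = {
    "_id": "LI-" + user_id,
    "sessionId": session_id,
    "items": [{"skuId": 1234, "quantity": 2}],
    "expiresAt": now + timedelta(days=60),   # longer TTL for logged-in carts (value assumed)
}

# Non-logged-in cart: keyed by session, carries no user data, has a shorter
# TTL, and is deleted once the cart merge completes.
non_logged_in_cart = {
    "_id": "NLI-" + session_id,
    "items": [{"skuId": 5678, "quantity": 1}],
    "expiresAt": now + timedelta(days=7),    # shorter TTL (value assumed)
}
```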

Cart Merge Process

When a non-logged-in user creates a cart and then logs into their Myntra account, they expect the items from both their non-logged-in cart and their logged-in cart to show up after logging in. This process of merging the logged-in and non-logged-in carts is called cart merge.
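
A simplified sketch of the merge using PyMongo and the document shapes from the earlier sketch (quantities are simply added for a common SKU; the real merge rules may differ):

```python
def merge_carts(db, user_id: str, session_id: str) -> None:
    """Fold the non-logged-in cart's items into the logged-in cart, then delete it."""
    nli = db.carts.find_one({"_id": "NLI-" + session_id})
    if nli is None:
        return
    li = db.carts.find_one({"_id": "LI-" + user_id}) or {"_id": "LI-" + user_id, "items": []}

    # Merge items by SKU, summing quantities for SKUs present in both carts.
    by_sku = {item["skuId"]: item for item in li["items"]}
    for item in nli["items"]:
        if item["skuId"] in by_sku:
            by_sku[item["skuId"]]["quantity"] += item["quantity"]
        else:
            by_sku[item["skuId"]] = item

    li["items"] = list(by_sku.values())
    db.carts.replace_one({"_id": li["_id"]}, li, upsert=True)
    db.carts.delete_one({"_id": nli["_id"]})   # guest cart is dropped after the merge
```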

Cassandra vs. Mongo DB Processing

Cassandra Flow

Mongo Flow

TTL Solutioning for expired Carts in Mongo

The system maintains a single collection for both logged-in and non-logged-in user carts, each with a different TTL (Time-to-Live). The index's expireAfterSeconds is set to 0, and the actual expiry time is managed by the application via a date-time field on each document. A TTL monitor periodically scans the Mongo index created over the TTL field and identifies documents due for expiry. Checkpointing occurs every 50 seconds, and during the next 10 seconds the monitor asks the MongoDB thread to expire as many documents as possible.

Unexpired documents accumulate in a stack if not processed within 10 seconds, and the MongoDB thread is then asked to expire them in bulk. After expiration, a reindexing process is triggered. To minimize downtime during high indexing, the expiry of documents is scheduled for midnight when overall traffic is lower. This approach reduces the number of indexes created and ensures that all expirations occur during off-peak traffic hours.
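
A minimal sketch of this TTL setup with PyMongo, assuming the "expiresAt" field and collection name used in the earlier sketches:

```python
from pymongo import MongoClient, ASCENDING

carts = MongoClient("mongodb://mongo-host:27017").cart_db.carts  # hypothetical connection/collection

# expireAfterSeconds=0 means a document becomes eligible for removal as soon
# as the clock passes the date stored in its own "expiresAt" field, so the
# application decides the effective TTL per document (logged-in vs guest carts).
carts.create_index([("expiresAt", ASCENDING)], expireAfterSeconds=0)
```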

MongoDB Benchmarking for Scale

The goal of benchmarking was to identify the right hardware and software configuration to handle the cart traffic, keeping in mind the Cart application's consistency requirements relative to Cassandra.

  • The first step was to determine the NFRs for benchmarking:
      • Number of concurrent requests, keeping future needs in mind
      • Expected throughput for each of the APIs
      • All the ongoing features, considered for scale
      • Expected latency levels
  • Finalize the data model to be used with Mongo for the benchmarking exercise
  • Determine the hardware configuration:
      • Number of cores
      • RAM, considering the hot data (say 25–30%) of the workload
      • Estimated storage capacity
      • Hard disk required, with caching enabled
  • Determine the MongoDB configurations:
      • Mongo versions
      • Master/Slave setup (1 Master, 2 Slaves to start with)
      • Checkpoint: every 1 minute
      • Journal sync: every 100 ms
      • Cache configs
      • Consistency level
  • Determine the workloads based on the current traffic patterns:
      • Read-heavy, with a read:write ratio of 3:1
      • Prepare a sample data set
  • Determine the client configurations:
      • Connection pool size
  • Execute different workload runs:
      • All reads
      • Mixed load
      • Different consistency levels
  • Capture both the application & DB metrics:
      • Latency, throughput
      • CPU utilization, RAM usage
      • Number of IOPS per sec on disk for reads & writes
      • Disk IO utilization
      • Disk IO throughput
      • P99 & P99.9 latencies

We were able to benchmark ~1.7 million RPM with ~50 ms latency (p99.9).
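
For context, a toy version of the mixed 3:1 read:write workload in PyMongo might look like the sketch below; the real benchmarking used dedicated load-generation tooling, and the connection details and document shape here are assumptions:

```python
import random
import time
from pymongo import MongoClient

carts = MongoClient("mongodb://mongo-host:27017").cart_db.carts  # hypothetical

def one_operation() -> None:
    user_id = "u%05d" % random.randint(0, 99_999)
    if random.random() < 0.75:          # roughly 3 reads for every write
        carts.find_one({"_id": "LI-" + user_id})
    else:
        carts.update_one(
            {"_id": "LI-" + user_id},
            {"$push": {"items": {"skuId": random.randint(1, 10_000), "quantity": 1}}},
            upsert=True,
        )

start, ops = time.time(), 0
while time.time() - start < 60:          # one-minute run from a single client thread
    one_operation()
    ops += 1
print(f"{ops} ops in 60s from one client thread")
```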

Migration Strategy

Below are the goals set for the migration strategy:

  • Zero downtime
  • Zero impact on the traffic
  • Enable AB for routing traffic

Enable bi-directional dual sync between Cassandra & Mongo

Before performing the one-time migration, we established a data synchronization pipeline between Cassandra and MongoDB: any write to Cassandra is synced to MongoDB and vice versa. This is handled asynchronously so there is no impact on the write latency of incoming user requests. We handled all the race conditions, failures & retries so that the data always stays in sync between the two datastores.

As part of the dual sync process, all operations performed in one DB are synced to the other. A Kafka-based pipeline from the primary DB to the secondary DB is established. A journal entry is created whenever the user performs an operation, capturing the action performed, and an event is created and published to the Kafka pipeline. This journal entry in the DB makes sure that we don't switch the user's allocated DB before the event has been consumed into the secondary DB. Once a consumer consumes the event, the journal entry is deleted, and if no entry is found for the user, the user becomes eligible for migration.
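
A rough sketch of the write-side journaling and event publishing, using kafka-python and PyMongo; the topic, collection, and field names are assumptions:

```python
import json
from kafka import KafkaProducer          # kafka-python client
from pymongo import MongoClient

producer = KafkaProducer(
    bootstrap_servers=["kafka-host:9092"],                      # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
journal = MongoClient("mongodb://mongo-host:27017").cart_db.sync_journal  # hypothetical

def on_cart_write(user_id: str, primary_db: str, change: dict) -> None:
    # Record the pending change against the user, then publish it for the
    # consumer that applies it to the secondary datastore.
    journal.replace_one(
        {"_id": user_id},
        {"_id": user_id, "primaryDb": primary_db, "change": change},
        upsert=True,
    )
    producer.send("cart-dual-sync", {"userId": user_id, "primaryDb": primary_db, "change": change})

def on_sync_applied(user_id: str) -> None:
    # Once the secondary DB has consumed the event, the user is again eligible
    # for a DB switch / migration.
    journal.delete_one({"_id": user_id})
```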

One-time data migration

The plan involves a one-time data migration of the complete cart dataset from Cassandra to MongoDB. This will be accomplished using a batch application that reads the lookup tables and moves carts associated with each entry by querying the Cassandra Cart table to get the latest data. To prevent redundant migration, records already migrated through dual sync will not undergo bulk migration.
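
A condensed sketch of such a batch job, with hypothetical Cassandra table/column names and the Mongo document shape from the earlier sketches:

```python
from cassandra.cluster import Cluster
from pymongo import MongoClient

cassandra = Cluster(["cassandra-host"]).connect("cart_keyspace")          # hypothetical
mongo_carts = MongoClient("mongodb://mongo-host:27017").cart_db.carts     # hypothetical

def migrate_all_carts() -> None:
    # Walk the lookup table, fetch the latest cart row for each entry, and
    # upsert it into Mongo; skip carts the dual-sync pipeline already copied.
    for row in cassandra.execute("SELECT user_id, cart_id FROM cart_lookup_by_user"):
        mongo_id = "LI-" + row.user_id
        if mongo_carts.find_one({"_id": mongo_id}, {"_id": 1}) is not None:
            continue                                   # already present via dual sync
        cart = cassandra.execute(
            "SELECT items FROM cart WHERE cart_id = %s", (row.cart_id,)
        ).one()
        if cart is None:
            continue
        mongo_carts.replace_one(
            {"_id": mongo_id},
            {"_id": mongo_id, "items": cart.items},    # schema transformation omitted
            upsert=True,
        )
```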

Incremental traffic routing using A/B

After the one-time migration, we set up configuration for both a feature gate and user-driven A/B. We enabled A/B so that traffic for, say, 1% of users flowed to the MongoDB cluster, and rolled out to 100% of users once no issues were reported.
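
A minimal sketch of deterministic, percentage-based bucketing behind a feature gate; the gate flag, rollout value, and hashing choice are placeholders, not the actual A/B system:

```python
import hashlib

FEATURE_GATE_ON = True       # global switch for the Mongo path
ROLLOUT_PERCENT = 1          # start at 1% of users, then dial up

def route_to_mongo(user_id: str) -> bool:
    """Same user always lands in the same bucket, so a cart stays pinned to one datastore."""
    if not FEATURE_GATE_ON:
        return False
    bucket = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT
```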

Ensuring Fault Tolerance

Dual Sync Journaling

  • In order to ensure that there is no data loss during dual sync or the one-time migration, we decided to maintain a journal. The idea is that whenever a write operation is performed on the user's primary DB (which can be Cassandra or MongoDB), we keep that event in the journal until the data is synced to the other DB.
  • Consider the case where a user is making changes to their cart and we enable A/B for that user at the same time: there can be data loss, because dual sync is an asynchronous process and takes some time to replicate all the data.
  • So whenever a user tries to access their cart, we first check the journal for an existing event for that user and read the primary DB (meaning where the previous write happened), which is also maintained in that schema, as sketched after this list.
  • Once the dual sync job is completed that event is deleted from the journal.
  • Additionally, we maintain only a single event per user in the journal: if the user performs two writes back to back and the first write has not yet been replicated to the secondary DB, we keep only the event of the latest write. This ensures that the latest data is always synced to the other DB.
  • For all the records that fail to be published in Kafka (due to any reason whatsoever), the sync job will pick up the DB entries related to the change and then update the second DB based on those entries.
  • During the one-time migration, we had a job that continuously sampled data from both DBs for random users to detect any discrepancy. This was very important, as we had major changes in our schema design across the two data stores.
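
Building on the journal sketch above, the read path might check for a pending event before honouring the A/B assignment; again, the names here are illustrative:

```python
def resolve_cart_db(user_id: str, ab_assigned_db: str) -> str:
    """If a write is still waiting to be replicated, keep serving the user
    from the DB where that write landed instead of switching stores."""
    pending = journal.find_one({"_id": user_id})   # journal collection from the earlier sketch
    if pending is not None:
        return pending["primaryDb"]      # last write not yet synced
    return ab_assigned_db                # safe to follow the A/B routing
```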

Metrics, Alerting & Monitoring

Below are some of the metrics captured that helped with resolving issues faster.

One-time data migration metrics

  • One-time migration (rate, overall)
  • Errors in One time migration (rate, overall)
  • Data validation errors(rate, overall)

Dual sync metrics

  • Update Events published (rate, overall)
  • Update Events consumed (rate, overall)
  • Errors in publishing, consumption (rate, overall)

Deployment Strategy

MongoDB was ‘Dark Launched’ to avoid any risk to the incoming traffic. Sequencing the rollout played a great part in the seamless migration to Mongo.

  • Enable dual sync
  • Enable sync journaling (the DB decision table that makes sure a user's DB doesn't change before their data is fully migrated)
  • Perform the one-time migration of the existing carts (all the carts that were not moved by dual sync)
  • Reconcile all the errors from the one-time migration
  • Turn the feature gate ON
  • Turn the A/B ON for a small percentage of users to route their traffic to Mongo
  • Monitor the dashboards for any anomalies
  • Fix any sync-related issues between the two data stores
  • Increase the A/B users incrementally
  • Reach 100% of user traffic flowing to Mongo
  • Clean up the existing data in Cassandra
  • Plan for decommissioning of the Cassandra nodes once stable

Conclusion

It's been a couple of months now; we have not faced any issues since the migration and have successfully resolved all the performance bottlenecks we had observed with Cassandra. Tens of millions of carts have been migrated, and latency has improved from 250 ms (during spikes) to under 25 ms. Read performance has drastically improved thanks to the better schema design and MongoDB's read path. The migration has also reduced our hardware footprint by 40% and removed the heavy maintenance & operation cost that Cassandra required. That said, we don't rule out that Cassandra, with a better schema right from the beginning of Cart's evolution, could have achieved better performance than what we got from it.

Overall, we are now equipped to handle up to 3x our current traffic volume by scaling Mongo nodes before we need to reconsider our next best choice.
