The Staging Dichotomy: Part One

Editor’s Note: A two-part series on how eBay turned a staging environment that had become an impediment into one of its biggest assets for developer productivity.

The act of staging, whether in theater, real estate or software development, has been an industry practice for many decades. Staging methods have evolved over the years and continue to play a critical role in their respective fields. When it comes to software development, staging is the environment where software is tested end-to-end before it reaches customers. It is a near-exact replica of the production environment, requiring a copy of the same configurations of hardware, servers and databases — basically the whole nine yards — but on a smaller scale. However, in recent years, eBay engineers continued to field the question: “Do we really need staging?”

Do We Really Need Staging?

The benefits of having a staging environment versus the cost of maintaining one may appear lopsided, and there is some truth to that. Maintaining a fully functional staging environment is hard, really hard. Even if you get it to an ideal state, without a proper system in place it quickly starts regressing. Moreover, with software releases transitioning from waterfall to continuous delivery, changes become small and incremental, opening up the possibility of direct canary testing in production. There is even an excellent InfoQ talk on this subject, titled “Production — Designing for Testability.”

At eBay, we were facing the same question. eBay has had a staging environment since its dawn. It is embedded as part of the development workflow and has been crucial in making eBay the global marketplace it is now — but in recent years, the environment has been deteriorating and has become an impediment to developer productivity. In early 2020, we ran a companywide developer survey, and a constant theme emerged: a broken staging experience. It was one of the topmost concerns.

We had two obvious choices — either fix staging or abandon it. We had rigorous debates on this topic at various levels of the organization and across all stakeholders. The issue with abandoning staging is that you end up moving your integration and end-to-end testing to production. There are a few ways this could be approached.

Options

One option would be to create a separate zone in production that is not exposed to the public and is open only to internal eBay traffic. Developers can deploy the outgoing software in this zone and run their entire suite of integration test cases before deploying to production. We do in fact have a zone like this at eBay, called pre-production. The issue here, though, is that the data source behind pre-production is the same as production. This means all your test data creation and mutation happen alongside production data. When we tried this in the past, it ended up being an analytics nightmare, where the continuous runs skewed production metrics. Creating a “test” versus “customer” metrics dimension helped a little. However, the data corruption ran deep into production databases and became a real issue. Even with data teardown being part of the test suites, the massive scale of integration tests run continuously across the entire marketplace can leave the production data store in an egregious state.

Another option would be to create a separate instance of the data sources (i.e., databases) in the same pre-production zone, just like the machines, accessible only internally. You can already sense where we are heading: We are slowly creating a whole new environment and calling it by a different name instead of staging. It would soon face the same problems, and we would be back to square one. The third option would be to build sufficient confidence in our unit and functional testing and test directly in production with actual users.

Context Matters

All of the above approaches have a fundamental limitation, and this is where context matters. eBay is an ecommerce platform. Transactions are essential to whatever we do. And when there are transactions, there are payments involved. We are talking about actual items, transacted between genuine sellers and buyers with real money. The margin of error has to be minuscule. It is just not possible to execute all your test cases in production. Even if we start with a tiny amount of traffic, we need to ensure that all the dependent services work harmoniously to keep the transactions accurate. These services are also rapidly changing, and the assumption that they will just work when put together in production is not worth the risk, especially when payments are involved, even in the smallest quantities.

For automated and continuous software deployments to work, the software has to go through a pipeline. 

A typical pipeline looks like the one below.

Software Release Pipeline

The goal of each stage in the pipeline is to boost confidence that when the software eventually reaches production, it is thoroughly tested and ready to serve customers. A vital component of that is integration testing. We need to ensure that the system works end-to-end in all possible scenarios with zero compromises. Developers should not have any impediments in writing and running these integration test suites. We need an environment that is isolated from production and yet has all production qualities, where we can boldly execute these test cases. That environment can very well be called staging. The staging step can be the final confidence booster before software enters the prod zone. 
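
The idea of the pipeline as a sequence of confidence gates can be sketched roughly as follows. This is only an illustration, not our actual release tooling; the stage names and gate functions are placeholders.

```python
# A minimal sketch of a release pipeline as a sequence of confidence gates.
# Stage names and gate functions are placeholders, not eBay's actual tooling.
from typing import Callable, List, Tuple

def run_pipeline(build_id: str, stages: List[Tuple[str, Callable[[str], bool]]]) -> bool:
    """Run each stage in order; stop at the first gate that fails."""
    for name, gate in stages:
        if not gate(build_id):
            print(f"{build_id}: failed at '{name}', not promoted")
            return False
        print(f"{build_id}: passed '{name}'")
    print(f"{build_id}: promoted to production")
    return True

# Hypothetical gates; each would call the real build, test and deploy systems.
stages = [
    ("build + unit tests", lambda build: True),
    ("functional tests", lambda build: True),
    ("staging integration tests", lambda build: True),  # the final confidence booster
    ("production canary", lambda build: True),
]

if __name__ == "__main__":
    run_pipeline("release-1234", stages)
```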

Again, these conditions apply to the majority of eBay use cases. Others may not see this as a limitation. They can skip staging, deploy directly to production as a canary and ramp up traffic. The whole flow can work seamlessly. Even within eBay, a few domains follow this model of bypassing staging and canary testing directly in production. However, they are restricted to read-only use cases. The rest still need staging to build that confidence.

Now that we had reached a consensus that a staging environment is indeed needed, the next step was to fix it. We formed a companywide staging initiative and a dedicated team to work on it. However, we had a problem, and that was with the problem statement itself.

The Problem with the Problem Statement

Developers were all entrenched in the notion of a broken staging environment, but to say “staging is broken” is unactionable and lacks specificity. Yes, we all knew that staging was broken, but what does that really mean, and how could we work toward a solution?

Generic statements are easy to make but difficult to act on.

As appealing as it may seem to solve, vagueness can make you spin in circles without reaching the desired outcome. So as a first step, the core staging team set out to break the generic statement, “staging is broken,” into specific problems that would enable us to design holistic solutions. Specificity was the key here.

Actual (or Actionable) Problems

We embedded ourselves into the software development workflow of a few critical applications to understand the actual bottlenecks of staging. After a thorough firsthand experience, we derived the following conclusions.

  • Staging data was unusable. It lacked quality and, as a result, usable quantity. Prolonged misuse had left the data stale and irrelevant. Yes, there were terabytes of data, but they were not relevant for most test cases — see the example below from screenshots taken in early 2020.

Screenshots of search and item pages with low-quality staging data for the query “boys shoes.”

  • Staging infra was unstable. The hardware, monitoring, remediation and support systems were inconsistent and not on par with production. The biggest annoyance here was unpredictability. A staging machine would be up and then suddenly go down. The same project would take a few minutes to build in one instance, but more than an hour in another. All of these were symptoms of infrastructure instability.
  • A vicious cycle — the chicken and egg problem. The application teams faced challenges in keeping their functionality up and running in staging, citing the lack of data and infrastructure. And since the applications were stale and not deployed regularly, the infra teams were not incentivized to monitor and scale up the systems. We were trapped in a vicious cycle.

Being specific paid off: we now had a set of actionable problem statements.

Data

A common and well-established idea proposed to address data issues is to create quality data in large quantities before executing the test cases and tear it down once done. Most organizations have well-defined APIs to create data; why not leverage them? In reality, though, this is easier said than done.

Again, context matters here. It is nearly impossible at eBay to create the millions of permutations and combinations of listings required to execute thousands of test cases across the marketplace. You can create monotonous data in large quantities. However, creating a listing with multiple SKUs — with each SKU having a valid image, reserve price, 30-day return policy, and an international shipping offer with an immediate payment option — can quickly get out of hand. There is no straightforward API to create listings like this, and we need them to automate many of our Priority one (P1) use cases.
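
To make the combinatorial problem concrete, here is a hypothetical sketch of what a single such test listing would need to specify. The field names are illustrative and do not correspond to an actual eBay API; the point is that every P1 scenario needs a different, internally consistent permutation of these attributes.

```python
# Hypothetical payload for one complex test listing. Field names are illustrative
# and not an actual eBay API; each P1 scenario needs a different permutation.
listing = {
    "title": "Test listing with multiple SKUs",
    "skus": [
        {
            "sku_id": f"SKU-{i}",
            "image_url": f"https://img.example.test/sku-{i}.jpg",  # each SKU needs a valid image
            "reserve_price": {"value": 25.00 + i, "currency": "USD"},
        }
        for i in range(3)
    ],
    "return_policy": {"accepted": True, "window_days": 30},
    "shipping": {"international_offer": True, "immediate_payment_required": True},
}

# Multiply this by thousands of permutations (sites, categories, seller states,
# payment options, ...) and create-then-tear-down data setup quickly gets out of hand.
print(f"{len(listing['skus'])} SKUs configured for this hypothetical listing")
```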

We have tried this many times in the past, and it did not work. There had to be a better way. We had to look at it from a different perspective. An idea emerged, which now, in hindsight, seems quite obvious.

Take a subset of production data and move it to staging in a privacy-preserving way.

eBay has 1.5 billion listings in production. Just a tiny (0.1%) subset of the listings, along with their dependency graphs, should be sufficient to execute all the test cases confidently. We have to make sure that the subset is well-distributed to cover the breadth of eBay inventory. Coming from production, the data is naturally of high quality. But the most important thing to us was privacy.

At eBay, we take privacy very seriously. It has been a core pillar for us since the very beginning. Fortunately for us, a listing and most of its associated attributes are public information. The seller’s and buyer’s Personally Identifiable Information (PII), along with a few item aspects like reserve price, max bid price, etc., have to be anonymized and privacy preserved. To build this pipeline, we partnered with Tonic.ai, a privacy company that does exactly this.

At a high level, our pipeline looks like this.

Production to staging data pipeline

The boxes labeled “Tonic” were developed by the Tonic team. Though many boxes appear in the pipeline, only a few components are vital to the workflow.

Subsetting

At eBay, everything starts with a listing. The goal of subsetting is twofold: identify the listing IDs that are required to execute all our test cases, and plot a course to fetch all the required and auxiliary information associated with those listings. To begin with, we took one domain (the item page) and extracted all the regression test cases necessary to confidently certify a release to production. This included even the rare and complex data scenarios. From those test cases, we formulated a set of SQL queries that ran against our Hadoop clusters. The queries included listings from all sites and across all categories, based on hundreds of item and user flags. The final output is a list of unique listing IDs that specifically target the domain test cases.
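
To make the curation step concrete, here is a simplified, hypothetical query of the kind described above. The table, columns and flags are illustrative stand-ins; the real queries run on our Hadoop clusters and span hundreds of item and user flags across all sites and categories.

```python
# A simplified, hypothetical curation query; table and column names are stand-ins.
CURATION_SQL = """
SELECT DISTINCT item_id
FROM items
WHERE site_id IN (0, 3, 77)          -- example site IDs
  AND category_id IN (11450, 15709)  -- example category IDs
  AND is_multi_sku = 1
  AND return_window_days >= 30
  AND offers_international_shipping = 1
LIMIT 1000
"""

def curate_item_ids(run_query):
    """run_query is any callable that executes SQL and returns rows of (item_id,)."""
    return [row[0] for row in run_query(CURATION_SQL)]

# Example with a stand-in query runner returning fake IDs:
if __name__ == "__main__":
    print(curate_item_ids(lambda sql: [(1101234567890,), (1109876543210,)]))
```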

The above-extracted item IDs are fed as input to the subsetter. The job of the subsetter is to plot a course, starting with the main item table. To do that, we topologically arrange the tables to map their dependencies. Next, using the curated IDs as queries, the algorithm goes upstream of the target table to fetch the optional auxiliary data (e.g., the bids on an item; optional because an item can exist with no bids), followed by going downstream to fetch all required data (these are mandatory; e.g., a bid must have an item). Once all downstream requirements are satisfied, the subsetting is complete. We call this referential integrity.
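
A minimal sketch of this subsetting walk is shown below, assuming a toy schema of items, bids and users. The table names, keys and the fetch_rows helper are assumptions for illustration; the real subsetter topologically orders many more tables and enforces referential integrity across all of them.

```python
# A minimal sketch of the subsetting walk. Table names, keys and the fetch_rows
# helper are illustrative; the real subsetter covers many more tables.

def subset(curated_item_ids, fetch_rows):
    """fetch_rows(table, key_column, ids) -> list of row dicts; hypothetical helper."""
    # Start from the main item table with the curated IDs.
    items = fetch_rows("items", "item_id", curated_item_ids)

    # Upstream (optional) pass: auxiliary data such as bids. An item can exist
    # with no bids, so an empty result is fine.
    bids = fetch_rows("bids", "item_id", curated_item_ids)

    # Downstream (required) pass: a bid must have an item, and an item must have
    # a seller, so pull every referenced user to keep the subset insertable.
    seller_ids = sorted({row["seller_id"] for row in items})
    users = fetch_rows("users", "user_id", seller_ids)

    # Referential integrity check: every foreign key in the subset resolves.
    item_ids = {row["item_id"] for row in items}
    assert all(b["item_id"] in item_ids for b in bids)
    assert {row["seller_id"] for row in items} <= {u["user_id"] for u in users}

    return {"items": items, "bids": bids, "users": users}
```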

Anonymization

Once a set of production tables is identified from which data will be copied, the workflow alerts our information security and privacy teams, and the pipeline is halted. This is a deliberate step to ensure that none of the data leaves the production zone without the review and approval of our security and compliance systems. It only happens when a new table is recognized or an existing table is modified, so our daily runs (explained below), configured only with previously approved tables, are mostly uninterrupted. There is a set of PII-related columns within a table that are flagged for anonymization by default.

The first step for our information security team is to go over these columns and flag more as appropriate. They have their own set of criteria based on international compliance rules and policies, which may not be obvious just by looking at the data. This process flagged approximately 27% of our columns.

The second step is to take a sample of anonymized data and verify that the standards are met. The information security team’s process is a mix of automation and manual verification. Since the process is triggered only for new tables, it was not a hindrance. Establishing this tight feedback loop, along with a stop-when-in-doubt setup, helped us ensure that privacy is always preserved.
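
As a rough illustration of the halt-on-new-table behavior, the gate could look something like the sketch below. The approved-table store and the notification hook are hypothetical placeholders, not our actual review workflow.

```python
# A sketch of the stop-when-in-doubt gate. The approved-table store and the
# notify_infosec hook are hypothetical placeholders.
APPROVED_TABLES = {
    "items": {"title", "price", "category_id", "seller_id"},
    "bids": {"item_id", "bidder_id", "amount"},
}

def notify_infosec(table, columns):
    print(f"Halting run: table '{table}' needs privacy review ({sorted(columns)})")

def may_proceed(run_tables: dict) -> bool:
    """Return True only if every table in this run was previously approved unchanged."""
    for table, columns in run_tables.items():
        if table not in APPROVED_TABLES or set(columns) != APPROVED_TABLES[table]:
            notify_infosec(table, columns)  # pipeline halts until review and approval
            return False
    return True  # only previously approved tables: the daily run proceeds uninterrupted

if __name__ == "__main__":
    print(may_proceed({"items": ["title", "price", "category_id", "seller_id"]}))  # True
    print(may_proceed({"orders": ["order_id", "buyer_id"]}))  # False, new table detected
```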

The technical novelty of the anonymization is important to highlight. We cannot apply some random encryption to anonymize the data. We need Format-Preserving Encryption (FPE), where data in one domain maps to the same domain after encryption, and it should not be reversible (e.g., encrypting a 16-digit credit card number yields another 16-digit number). In eBay’s context, this becomes very critical; otherwise, most test cases would fail. Using a Feistel network and cycle walking, we can create a bijection between any domain and itself, e.g., the domain of 16-digit credit card numbers.
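
To show the idea behind the Feistel-plus-cycle-walking construction, here is a minimal sketch for the 16-digit example. The key handling and round function are illustrative only; a production system would use a vetted FPE scheme rather than this toy code.

```python
# A minimal sketch of format-preserving encryption via a Feistel network plus
# cycle walking, for the 16-digit example. Illustrative only; not production FPE.
import hashlib
import hmac

DIGITS = 16
DOMAIN = 10 ** DIGITS        # 16-digit numbers: 0 .. 10^16 - 1
BITS = 54                    # smallest even bit width with 2^BITS >= DOMAIN
HALF = BITS // 2
MASK = (1 << HALF) - 1
ROUNDS = 8

def _round(half: int, key: bytes, rnd: int) -> int:
    """Keyed round function: HMAC of the half block, truncated to HALF bits."""
    msg = rnd.to_bytes(1, "big") + half.to_bytes(8, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") & MASK

def _feistel(value: int, key: bytes) -> int:
    """A balanced Feistel permutation over the 2^BITS space."""
    left, right = value >> HALF, value & MASK
    for rnd in range(ROUNDS):
        left, right = right, left ^ _round(right, key, rnd)
    return (left << HALF) | right

def encrypt_16_digits(number: int, key: bytes) -> int:
    """Map a 16-digit number to another 16-digit number (a bijection on the domain)."""
    out = _feistel(number, key)
    while out >= DOMAIN:  # cycle walking: re-apply until the result lands back in the domain
        out = _feistel(out, key)
    return out

if __name__ == "__main__":
    print(encrypt_16_digits(4111_1111_1111_1111, b"demo-key"))
```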

Merging and Post Processing

The anonymized data moves from the production zone to the staging zone, adhering to all our firewall protocols. Now comes the merger, whose primary responsibility is to insert the subsetted, anonymized production data into the corresponding staging tables. In the actual implementation, there is much more nuance to it. For instance, remapping previously migrated sellers to their new items is a complex and costly endeavor. A good side effect of the merger is that it helps identify schema differences between staging and production tables, which did exist due to prolonged staging misuse.

The pipeline does not stop at the merger. There is one more important step, which we call the “Postdump Processor.” Once the data is inserted into the staging tables, this component fires a series of events. The goal is to orchestrate a sequence of jobs to penetrate and normalize the newly migrated data throughout the staging ecosystem. The Postdump Processor includes tasks like notifying the search engine to index the new listings; mapping items to existing products; uploading listing images to staging servers and updating endpoints; using staging salt to hash user credentials; and a few more. We piggybacked most of the existing async events triggered when an item is listed on eBay. A few new ones were created just for the pipeline use cases. This post-processing step is what makes the data relevant.
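
As an illustration of this fan-out, the Postdump Processor can be pictured as a loop that publishes one event per task per migrated item. The topic names and publish helper below are hypothetical; most of the real events piggyback on the existing async events fired when an item is listed.

```python
# A sketch of the post-processing fan-out. Event topic names and the publish
# helper are hypothetical placeholders.
POSTDUMP_EVENTS = [
    "search.index_listing",        # notify the search engine to index the new listing
    "catalog.map_to_product",      # map the item to an existing product
    "media.upload_images",         # upload listing images to staging servers, update endpoints
    "identity.rehash_credentials", # hash user credentials with the staging salt
]

def postdump_process(migrated_item_ids, publish):
    """publish(topic, payload) is any event-bus client; called once per item per event."""
    for item_id in migrated_item_ids:
        for topic in POSTDUMP_EVENTS:
            publish(topic, {"item_id": item_id, "source": "prod-to-staging-pipeline"})

# Example usage with a stand-in publisher:
if __name__ == "__main__":
    postdump_process([110551234567], lambda topic, payload: print(topic, payload))
```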

Discovery and Feedback Loop

Now that high-quality data was available in staging, a way to query it exclusively for all automation needs became paramount. We have existing APIs to fetch items, users, orders, transactions, etc. However, all of them were built with a customer and business intent in mind, not with how developers or quality engineers would use them in their automation scripts. Just like the difficulties of using existing APIs for data creation, there is no straightforward way, for instance, to fetch a bunch of items that have more than 10 SKUs and 40 images. It becomes an arduous process. To solve this, we created a Discovery API and UI tool (codenamed Serendipity), which makes it seamless to integrate with all automation scripts. The API only queries the migrated data, which is watermarked with a special flag during migration. The filters in the API are targeted toward how engineers write test cases, without worrying about entity relationships or microservice decoupling.
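
To illustrate how an automation script might use such a discovery endpoint, here is a hypothetical call. The endpoint path, filter names and response shape are assumptions, not the actual Serendipity API.

```python
# A hypothetical sketch of calling a discovery endpoint from an automation script.
# The URL, filter names and response shape are assumptions, not the actual API.
import json
from urllib import parse, request

def discover_items(min_skus=10, min_images=40, limit=5,
                   base_url="https://staging.example.internal/discovery/v1/items"):
    params = parse.urlencode({
        "min_skus": min_skus,      # items with at least N SKUs
        "min_images": min_images,  # and at least N images
        "migrated_only": "true",   # only data watermarked by the migration pipeline
        "limit": limit,
    })
    with request.urlopen(f"{base_url}?{params}") as resp:
        return json.load(resp)["items"]

# A test case would then simply do:
#   items = discover_items(min_skus=10, min_images=40)
#   run_checkout_flow(items[0]["item_id"])
```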

The final aspect of the pipeline is to create a healthy feedback loop between its two ends — production ID curation and staging discovery. We achieved this by adding observability to the Discovery API. When a fetch returns null or low results, it immediately signals the dataset curation system to migrate those items in the following pipeline run by executing the corresponding production SQLs. Similarly, when new product requirements come in, developers can request those filters in discovery, which translates to SQLs on the curation side. A self-sustaining pipeline, indeed.
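
The feedback half of the loop can be sketched as a thin wrapper around discovery: when a fetch comes back empty or thin, signal the curation side so the next run backfills matching listings. The threshold and the request_curation hook below are assumptions for illustration.

```python
# A sketch of the discovery-to-curation feedback loop. The threshold and the
# request_curation hook are hypothetical.
LOW_RESULT_THRESHOLD = 3

def discover_with_feedback(filters, discover, request_curation):
    """discover(filters) -> list of items; request_curation(filters) enqueues a
    curation SQL for the next pipeline run."""
    items = discover(filters)
    if len(items) < LOW_RESULT_THRESHOLD:
        # Signal the production-side curation system so the next run migrates
        # more listings matching these filters.
        request_curation(filters)
    return items
```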

Expansion

What started as a proof of concept with one domain, 11 tables and a few thousand items has expanded to the whole marketplace. Today, we have over a million high-quality listings in staging, along with their associated upstream/downstream dependencies. They serve the automation needs of the majority of our applications. Every day, 25,000 items/orders are migrated from production to staging, and the data is spread across 200+ tables, 7,000 categories and 20 different DB hosts. Beginning this year, we expanded the pipeline to NoSQL databases, including MongoDB, Cassandra, Couchbase and eBay’s open-sourced NoSQL offering Nudata. The pipeline architecture is the same for NoSQL, with the curated listing IDs used as keys for subsetting.

The pipelines themselves are parallelized at a macro level (multiple pipelines running on dedicated machines, creating redundancy on failures) and at a micro level (each component is multi-threaded where possible to execute faster). The pipeline runs every four hours and, on average, takes 65 minutes end-to-end. We have dedicated pipelines for migrating new tables, so they do not impact the daily runs. Purging happens on item expiry, similar to production. There are also daily purge jobs to clean up the auxiliary data.
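
For the micro-level parallelism, a sketch of the idea is below: independent per-table work fanned out over a thread pool within one pipeline component. The migrate_table function is a hypothetical stand-in for the real subset, anonymize and merge steps.

```python
# A sketch of micro-level parallelism: migrating independent tables concurrently
# within one pipeline component. migrate_table is a hypothetical stand-in.
from concurrent.futures import ThreadPoolExecutor, as_completed

def migrate_table(table: str) -> str:
    # Placeholder for the subset -> anonymize -> merge work on a single table.
    return f"{table}: done"

def migrate_all(tables, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(migrate_table, t): t for t in tables}
        for future in as_completed(futures):
            print(future.result())

if __name__ == "__main__":
    migrate_all(["items", "bids", "users", "orders"])
```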

Not all test cases are covered by data migration. There are use cases for which data creation is required. These mostly involve transient data, primarily associated with users, and migrating data for such scenarios would be unnecessary overhead. For this, we created a new tool called the data creation platform, which again integrates seamlessly with automation scripts.

Please watch this space for a detailed article on the data track.

Conclusion

That’s a wrap for part one. In this post, we started with the dichotomy of whether staging should exist or not. We explained why we decided to pursue staging, outlined the problem statements and discussed how we addressed the first problem: data. In the next and final post, which will be published on Dec. 8, we will go over how we brought stability to the infrastructure, turned the vicious cycle into a virtuous one and, finally, talk about a system we put in place that will prevent us from regressing.
