End-to-end test probes with Playwright

What are automated end-to-end tests? Do you need them at all? In this blog post we dive into the ugly side of automated end-to-end testing: what we struggled with at Zalando, what worked well for us, and our latest solution, end-to-end test probes.

Automated end-to-end tests continue to polarise the industry, with some leaders advocating for them and others rightfully questioning their return on investment and recommending investing in monitoring and alerting systems instead.

Tweet on end-to-end testing from @GergelyOrosz on May 19th, 2024

Of course, the right approach always depends on your product and the impact of your application being unavailable for even a short period of time. At Zalando, the disruption of a critical customer journey can quickly add up to millions in lost revenue, so there is obvious value for us in ensuring the high quality of our releases, and automated end-to-end tests are one of the best tools for the job. We release new versions of the Zalando website multiple times a day in a completely autonomous manner, and each release goes through an automated quality assurance pipeline that includes end-to-end tests written with Cypress.

What are automated end-to-end tests?

Automated end-to-end tests simulate real user interactions with an application to ensure that the entire application stack works correctly from the user interface to the backend. These tests typically run in a headless browser environment and are thus easily integrated into continuous integration and delivery (CI/CD) pipelines. By automating these tests, teams can efficiently detect and address issues early, guard against regressions, and maintain application quality as the code base evolves.

Investing in automated end-to-end tests

Investing in these tests really paid off for Zalando and helped us find bugs early on that would otherwise have caused major incidents. It has not all been smooth sailing though, as we experienced what Gergely was complaining about: the tests were taxing to maintain and, most frustrating of all, they were still flaky. They had a success rate of around 80%, and with around 120 builds a day that meant an average of 24 builds a day failing as false positives, causing unnecessary friction.

We doubled down on our investment in these tests. This included creating a better test setup context, since content on Zalando is highly dynamic and our product pages are highly contextual: some show products that are not yet released in order to build anticipation, and for those we obviously could not trigger the add-to-cart flow. We also improved our selectors and added a mechanism to detect when our pages are hydrated with React after server-side rendering, as Cypress would otherwise fail by eagerly executing test scripts against a non-interactive UI. Our efforts increased the tests' reliability to the 95% range and we felt pretty good about it.
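To give a feel for what such a hydration check can involve, here is a minimal sketch of a polling helper. It is not the mechanism we actually shipped: it assumes the application exposes some signal once React has finished hydrating, represented here by a hypothetical `isReady` callback, and it simply polls that signal before a test is allowed to interact with the page.

```typescript
// Minimal sketch: poll a readiness check until it passes or a timeout expires.
// `isReady` is a hypothetical callback; in a browser test it might read a
// marker that the application sets once React hydration has completed.
type ReadyCheck = () => boolean | Promise<boolean>;

async function waitUntilReady(
  isReady: ReadyCheck,
  timeoutMs: number = 5000,
  intervalMs: number = 50,
): Promise<number> {
  const startedAt = Date.now();
  let attempts = 0;
  for (;;) {
    attempts += 1;
    if (await isReady()) {
      return attempts; // number of checks it took, handy when debugging
    }
    if (Date.now() - startedAt >= timeoutMs) {
      throw new Error(`Page not interactive after ${timeoutMs} ms`);
    }
    // Sleep before the next check so we do not busy-loop.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A test would await `waitUntilReady(...)` right after navigation and only then start clicking, so that a slow hydration shows up as a clear timeout error rather than as a confusing failed interaction.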

A new class of issues

You can imagine our disappointment when we had a major incident due to front-end interactivity issues where React hydration crashed on a large number of our product detail pages, preventing users from selecting product sizes and adding products to their shopping carts. The issue was large enough to have a business impact, but just not large enough to trigger an automated alert. How did this regression sneak in? It turned out that the incident was triggered by new and incomplete content published to our headless CMS, which broke the front-end API contract with our API gateway and ultimately led to broken interactivity. We had React error boundaries in place, but it turned out that these weren't working for the eagerly hydrated part of our product pages.

So we were almost back to square one: no matter how much we had invested in our end-to-end test automation, external factors could still lead to broken pages. Obviously, we will tighten up our monitoring and alerting as part of the incident process, which seeks to systematically address contributing factors, but we also wanted to catch such interactivity issues more consistently. An idea came to mind: why not run our automated end-to-end tests periodically and alert when they fail? However, remember that we had only achieved a 95% success rate with our end-to-end tests. If we were to run them every 30 minutes to ensure that our website was working as expected, and page our on-call team upon failures, alerts would trigger several times a day and possibly at night, leading to incident fatigue for the on-call team, a state we did not want to be in. So we needed to further increase the reliability of our end-to-end tests if this was to become a viable solution.
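The arithmetic behind that concern is simple enough to write down. The sketch below uses only the figures quoted in this post and assumes, for simplicity, that every failure on a healthy system counts as a false alarm; the function name is made up for illustration.

```typescript
// Expected number of false alarms per day for a check that runs periodically.
// Simplifying assumption: every failure on a healthy system is a false alarm.
function expectedFalseAlarmsPerDay(runsPerDay: number, successRate: number): number {
  return runsPerDay * (1 - successRate);
}

// Figures from the text: around 120 builds a day at roughly 80% reliability.
const failingBuilds = expectedFalseAlarmsPerDay(120, 0.8);

// A probe every 30 minutes means 48 runs a day; at 95% reliability:
const runsPerDay = (24 * 60) / 30;
const falsePages = expectedFalseAlarmsPerDay(runsPerDay, 0.95);

console.log(failingBuilds.toFixed(1)); // about 24 failing builds a day
console.log(falsePages.toFixed(1)); // about 2.4 false pages a day
```

Two or more spurious pages every day, some of them at night, is exactly the kind of noise that teaches an on-call team to ignore alerts.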

A simpler and better approach

We went back to the drawing board: what we needed was higher resiliency, and one of the ways to achieve this is often through simplification. We decided that for the end-to-end test probes we would run a cron job with scenarios covering critical customer journeys. We started with a few scenarios: one test would cover landing on our home page, browsing to a gender page and clicking on a product; another would cover landing on our catalog page, applying a filter and clicking on a product; and a final one would cover landing on a product page, selecting a size, adding the product to the cart and starting the checkout process. By focusing on a smaller number of features and interactions, we were able to reduce the likelihood of false positives.
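One way to build intuition for why fewer interactions help is a deliberately simplified model. The sketch below assumes that each interaction in a scenario succeeds independently with the same probability; that independence, and the 99% per-step figure, are illustrative assumptions rather than anything we measured.

```typescript
// Illustrative model (an assumption, not a measurement): if each interaction
// in a scenario succeeds independently with probability `stepReliability`,
// the whole scenario passes only when every single step does.
function scenarioReliability(stepReliability: number, steps: number): number {
  return Math.pow(stepReliability, steps);
}

// Hypothetical figures: 99% reliable steps in a short probe vs. a long test.
const shortProbe = scenarioReliability(0.99, 5);
const longTest = scenarioReliability(0.99, 20);

console.log(shortProbe.toFixed(3)); // 0.951
console.log(longTest.toFixed(3)); // 0.818
```

Under these assumptions, a scenario with a quarter of the steps goes from failing roughly one run in five to roughly one run in twenty, before any other improvement is made.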

Around the same time, we also held our internal Zalando Engineering Conference, and one of the talks was about scaling automated end-to-end testing. Playwright, an end-to-end testing solution developed by Microsoft, was presented as a great fit thanks to its strong focus on resilient testing. Indeed, Playwright features:

"auto-wait" (no artificial timeouts)
"auto-retry" (web assertions), eliminating key causes for flaky tests
rich tooling options (tracing, time-travel) to debug and fix issues if failures occur
a unified API which works across all modern browsers
TypeScript support out of the box

This was very compelling so we decided to use Playwright for these end-to-end test probes.

It was easy to get up and running with Playwright, especially for our now simple scenarios. We used fixtures to set up independent test contexts for each scenario, for example to pick a good product candidate for the product page landing test and to disable our cookie consent banner. Playwright's API was simple to pick up: it uses promises natively and augments standard CSS selectors, which let us hit the ground running. Here is the final code for our catalog landing test, which is only a few lines of code:

    import { test, expect } from "@playwright/test";

    // catalogLink and title are assumed to be defined elsewhere in the test suite
    test("Test catalog landing journey for zalando", async ({ page }) => {
      // navigate to catalog page
      const catalogNav = await page.goto(catalogLink);
      expect(catalogNav?.status()).toBe(200);
      await expect(page).toHaveURL(title);

      // we only wait to simulate a "real user behavior"
      // with playwright this is not necessary
      await page.waitForTimeout(1000);

      await page.getByRole("button", { name: /farbe/i }).click();
      await page.locator("label[for=colors-BLACK]").click();
      await page.getByText(/speichern/i).click();

      await expect(page.getByTestId("is-loading")).toBeVisible();
      await expect(page.getByTestId("is-loading")).not.toBeVisible();

      await page
        .locator("article[role=link]")
        .locator('a[href$=".html"]')
        .first()
        .click();

      await page.waitForLoadState("domcontentloaded");
      await expect(page).toHaveURL(/\.html/i);
    });

We set up the tests to run as a cron job every 30 minutes. Instead of paging immediately when they failed, we first ran them in a "shadow" mode to validate their reliability: failures raised a low-priority alert that emailed the team. It did trigger a couple of times, especially over the weekend. Each time we captured HTML reports as logs so that we could understand the issue, improve our selectors, implement local retry loops with expect.toPass and even cover tricky edge cases where selectors matched non-visible content, thanks to the pseudo-classes such as :visible that Playwright adds on top of CSS selectors. After a few weeks, we stopped getting alerts in shadow mode and enabled paging when those tests failed. So far they have only paged us once, and that was during an incident where the page was actually not working.
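The shadow-mode rollout boils down to a small routing decision. The sketch below is an illustration rather than our production alerting code; the mode names, action names and the `routeAlert` function are all hypothetical.

```typescript
// Sketch of routing probe results depending on the rollout phase.
// "shadow" and "live" are hypothetical names for the two phases in the text.
type ProbeMode = "shadow" | "live";
type AlertAction = "none" | "email-team" | "page-on-call";

interface ProbeResult {
  scenario: string;
  passed: boolean;
}

function routeAlert(result: ProbeResult, mode: ProbeMode): AlertAction {
  if (result.passed) {
    return "none";
  }
  // In shadow mode a failure only produces a low-priority email, so the team
  // can judge whether the probe is trustworthy before it wakes anyone up.
  return mode === "shadow" ? "email-team" : "page-on-call";
}

const failedProbe: ProbeResult = { scenario: "catalog-landing", passed: false };
console.log(routeAlert(failedProbe, "shadow")); // email-team
console.log(routeAlert(failedProbe, "live")); // page-on-call
```

Keeping the mode as an explicit setting means that promoting a probe from shadow to paging is a one-line change once it has been quiet for long enough.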

Outlook

It has been quite a journey to get to where we are now, but we feel pretty good about our setup, which we could not have achieved without focusing on simplicity and betting on Playwright's reliability. If production downtime is as damaging to your business as it is to ours, we believe that implementing end-to-end test probes could be a useful addition to your toolkit. Our main advice would be to keep these tests focused on your critical customer journeys, write good selectors and iterate in a shadow mode before alerting in production.

We are planning to increase the number of scenarios for the end-to-end probes to include more of our Critical Business Operations (CBOs), and we are also looking at extending this idea to our mobile apps.

We're hiring! Do you like working in an ever-evolving organization such as Zalando? Consider joining our teams as a Frontend Engineer!