Rendering Engine Tales: Road to Concurrent React

Integrating React's Concurrent features into Zalando's web framework. In this post we go over our solution design, early benchmarks, and some useful tips about common hydration mismatch errors.

Rene Eichhorn

Software Engineer

Ghasem Bakhtiari

Senior Software Engineer

Jan Brockmeyer

Senior Software Engineer


Welcome back to our web platform blog series! It's been a while since we last talked about our approach to large-scale front-end development at Zalando. We are excited now to reconnect and share with you some substantial enhancements we've made to the streaming and rendering architecture of our Rendering Engine framework.

The first post of this new series will recap how Rendering Engine works, its relationship with Concurrent React, and our journey with it including design and implementation challenges as well as successes gained so far.
Additionally, it covers the main hydration mismatch errors we faced during this upgrade, our solutions and recommendations for avoiding them, and some extra tips and tricks for debugging this type of issue.

Intro

"Rendering Engine" is the web framework that is maintained by and currently used in Zalando to render the Fashion Store website, and is designed for building any web application with similar needs.

You might know Rendering Engine (RE) from our previous blog posts about Micro Frontends at Zalando and our journey through them from Project Mosaic with its fragments and Tailor, to Interface Framework (part 2).

In a nutshell, RE is a web framework best suited for creating a website that:

  • Uses React to render the UI
  • Inherently implements universal rendering (server side / client side) with high emphasis on server rendering and page load performance
  • Has its page content, layout and UI steering driven largely by the backend, in a nestable approach
  • Relies on a backend that can be a recommendation engine, a CMS-like system able to define the shape and content of pages, or any other similar system.

The building blocks of RE's language for defining what to render are Entities. Each Entity is a block of content that has a specific identity from a business-logic perspective, and can have other Entities nested inside. For example, in the context of a fashion store, an Entity could be a Product, a Collection of products, an Outfit, etc. When organized in tree-like structures, Entities can be used to define the full layout and contents of pages. The backend defines each Entity by specifying a type, an id, and optional extra data in the form of hints. We'll skip how RE handles defining layouts from the backend for the time being.

So by considering Entities to be responsible for describing "what to render" (by the backend), then specifying "how to render" is the responsibility of what we call a Renderer (by the client).
Each Renderer is a self-contained TypeScript module powered by multiple RE features provided during server- and client-side rendering. Each Renderer is responsible to render a specific type of Entity, while each Entity-type can be represented by multiple Renderers depending on the extra hints data.

This mapping between Entities and Renderers is defined via Rendering Rules. These configurations are passed to RE and include "selectors" for matching the incoming Entity definitions from the backend. They also support nested and per-page rules.
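To make the relationship between Entities, Renderers and Rendering Rules more concrete, here is a minimal sketch. The type names, the rule format, and the "first matching rule wins" behavior are assumptions made for illustration; they are not RE's actual API.

```typescript
// Hypothetical shapes: the real RE types and rule format are not shown in this post.
interface Entity {
  type: string;                   // e.g. "product", "collection", "outfit"
  id: string;
  hints?: Record<string, string>; // optional extra data sent by the backend
  children?: Entity[];            // Entities can be nested
}

interface RenderingRule {
  selector: { type: string; hints?: Record<string, string> };
  renderer: string;               // name of the Renderer module to use
}

// A rule matches when the type is equal and every hint in the selector
// is present with the same value on the Entity.
function matches(rule: RenderingRule, entity: Entity): boolean {
  if (rule.selector.type !== entity.type) return false;
  const wanted = rule.selector.hints ?? {};
  return Object.entries(wanted).every(([key, value]) => entity.hints?.[key] === value);
}

// Assumption: the first matching rule wins, so specific rules are listed first.
function resolveRenderer(rules: RenderingRule[], entity: Entity): string | undefined {
  return rules.find((rule) => matches(rule, entity))?.renderer;
}

const rules: RenderingRule[] = [
  { selector: { type: "product", hints: { layout: "compact" } }, renderer: "ProductCardCompact" },
  { selector: { type: "product" }, renderer: "ProductCard" },
  { selector: { type: "collection" }, renderer: "CollectionCarousel" },
];

const compactProduct: Entity = { type: "product", id: "sku-1", hints: { layout: "compact" } };
const plainProduct: Entity = { type: "product", id: "sku-2" };

const compactRenderer = resolveRenderer(rules, compactProduct); // "ProductCardCompact"
const plainRenderer = resolveRenderer(rules, plainProduct);     // "ProductCard"
```

In this sketch the same Entity type resolves to different Renderers depending on its hints, which mirrors the idea that one Entity type can be represented by multiple Renderers.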

There are a handful of other features built into this framework including monitoring, experimentation, tracking, a different rendering output for server driven mobile apps, etc. but for now this introduction should do.

React 18's Concurrent Rendering

(and how it fits Rendering Engine like a glove)

Performance has always been one of the key focus areas of Rendering Engine from its beginnings. Aside from being built with performance in mind and going through many micro improvements over the years, it also comes with some performance features built inside, including but not limited to streaming, lazy-loading, partial streaming and partial hydration (yes, almost the same concept as in Concurrent React!).

Although these performance related features have proven to be very important in the success of the Fashion Store website, their code's maintenance, improvements and required education as well as knowledge sharing come with a cost.

But more importantly, we anticipated that React's built-in support for these features would most likely bring even more performance gains.

Additionally, React's concurrent rendering APIs seamlessly integrate with the architecture of RE because its Renderers serve as ideal candidates for being encapsulated within a Suspense boundary. This enables them to function as individual blocks that can be server-rendered, streamed, hydrated, and client-rendered "concurrently". Especially since many of them have already been using Rendering Engine's own partial hydration/streaming features!

As a result, we have been very excited about the concurrent React 18 for quite a while and as soon as the opportunity arrived, we started the migration and refactoring of Rendering Engine's core functionalities to use the concurrent features.

Needless to say, this migration task has also had its challenges and costs! So now that we have finished some important milestones and are close to completion, we thought it is a good chance to start sharing our challenges, successes and learnings with you.

Design challenges with Concurrent Rendering

Rendering Engine at its core includes logic for handling the resolution of server's specified Entity definitions or layout into the corresponding Renderers, fetching their data as well as handling all the other aforementioned features like experimentation, tracking, etc. And only after that, it hands over the UI rendering responsibilities to React.
These happen gradually (and if needed, recursively) in a way that makes sure that Renderers remain independent while getting their data and rendering/streaming their final html, which makes way for performance gains.

So initially, with React 18 we thought of moving as much of this logic as possible (from data fetching to experimentation, tracking, etc.) to the React concurrent APIs such as Suspense and useTransition, through custom hooks. This is often referred to as the "Render-As-You-Fetch" pattern. The aim was to reduce complexity and the required effort, among other things.

But after a trial phase and implementing a proof of concept, we faced some issues, the main ones being:

  • In cases where keeping the correct order of the content during streaming/hydration is important, the closest available solution would be to use the SuspenseList API. But it still seems to be experimental, with some limitations.
  • The useTransition API does not consider nested Suspense boundaries, causing bad UX in some scenarios.
  • By utilizing hooks to initiate requests or other async operations, the timing of fetch operations becomes coupled with the order of rendering, which may not be optimal for performance.
  • Progressive hydration and streaming require all the data needed for client-side rendering to be available as early as possible. This implies that, in addition to the HTML generated by components, it is crucial to stream their data to prevent redundant requests from being made by the client.
    • During the trial phase, the streaming and caching layer needed to address this wasn't yet handled by React. And as of now, the latest supporting feature is still not final.

Chosen technical design

Due to the limitations mentioned above, we finally decided to go with a mixed solution.

In this approach, the concurrent streaming, hydration, rendering and basically all the Concurrent benefits are still achieved via fully utilizing React: by wrapping every Renderer in a Suspense boundary, and handling changes through concurrent APIs.
But at the same time, we created an "Application State" layer which encapsulates the main logic and Renderers data outside of React components/hooks in a central place, which dictates to the Suspense boundaries their state.

This way, the full power of orchestrating when to suspend a component (Renderer) depending on its place in the tree, handling the order of the suspended components, and deciding how to manage a transition considering the nested Suspense boundaries, would all be available and customizable in this Application State layer.
We will share the details of the technical solution for ordered streaming/hydration in another post.

In other words, every time RE finds the matching Renderer and resolves all its corresponding data for an Entity definition (through the "resolveEntity" step), the output is written to the Application State layer. In the meantime, React is rendering the Renderer components, which are wrapped with Suspense.
To access data from the Application State, the suspendable Renderers use the "Connector hook".
The Connector hook reads from the application state which either returns the data that was asked for, or creates a promise that will be resolved once the data has been written. The promise is then used to suspend the component and React will automatically re-render once the Promise has been resolved.
Imagine Redux's useSelector hook, but instead of immediately returning selected data you get a Promise that only resolves once a reducer has made the data available.
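The read-or-suspend behavior described above can be sketched without React. The class and function names below are hypothetical and not RE's real API. The sketch assumes the convention used by Suspense-enabled data sources with React 18, where a component signals "not ready" by throwing a promise that the nearest Suspense boundary waits on.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not RE's actual API.
interface Waiter {
  promise: Promise<void>;
  resolve: () => void;
}

class ApplicationState {
  private data = new Map<string, unknown>();
  private pending = new Map<string, Waiter>();

  // Called by the framework once the data for an Entity has been resolved.
  write(key: string, value: unknown): void {
    this.data.set(key, value);
    const waiter = this.pending.get(key);
    if (waiter) {
      this.pending.delete(key);
      waiter.resolve(); // lets a suspended reader retry
    }
  }

  // Returns the data if it is present. Otherwise throws a promise that
  // resolves as soon as `write` is called for the same key.
  read<T>(key: string): T {
    if (this.data.has(key)) return this.data.get(key) as T;
    let waiter = this.pending.get(key);
    if (!waiter) {
      let resolve: () => void = () => {};
      const promise = new Promise<void>((r) => {
        resolve = () => r();
      });
      waiter = { promise, resolve };
      this.pending.set(key, waiter);
    }
    throw waiter.promise;
  }
}

// A "Connector hook" could then be a thin wrapper used inside a Renderer.
function useConnector<T>(state: ApplicationState, key: string): T {
  return state.read<T>(key);
}
```

Because the pending promise is stored per key, repeated reads of missing data throw the same promise instead of creating a new one on every render attempt.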

[Figure: Rendering Engine architecture using Concurrent React]

Benefits gained from Concurrent Rendering

As we are still going through the changes and final steps of the full-fledged concurrent mode described above, the full benefits of it are yet to be observed.

To date, we have achieved some performance improvements, mainly by using the new streaming and hydration root APIs.

Performance improvements from renderToPipeableStream and hydrateRoot APIs

As one of the milestones, after the plain version upgrade and handling of breaking changes, we changed only RE's internal streaming and hydration code to use the new React 18 APIs: renderToPipeableStream instead of renderToNodeStream, and hydrateRoot instead of hydrate.
We rolled out this change through an A/B test covering all pages of our e-commerce website, and in the end we observed these mild performance (and business metric) improvements:

Overall

  • INP: -5.69%
  • FID: -8.81%
  • LCP: -2.43%
  • FCP: -0.23%
  • Bounce rate: -0.24%

Per page: (some of the frequently visited pages)

Metric    | Home page | Catalog page (list of products and search) | Product Details page
INP       | -2.92%    | -6.76%                                     | -6.09%
FID       | -2.98%    | -17.11%                                    | -6.06%
Exit Rate | -0.43%    | -0.06%                                     | -0.06%

Needless to say, this shows great promise, and we are now even more excited about the results of the next steps.

Technical challenges: Rise of the Hydration Mismatch errors!

As also noted in documentation around React 18, the new React APIs are much more sensitive to existing hydration mismatch issues. After the migration to the new streaming and hydration APIs, we therefore started receiving many more hydration error logs (via Sentry) for the Zalando Fashion Store.
So during this migration, we've been finding and fixing these issues to prevent negative user impact as much as possible. After fixing dozens of different types of issues deep inside hundreds of Renderers, we were able to considerably reduce the number of hydration mismatch errors occurring in the wild. That being said, there are still some more errors to fix, which are harder to reproduce and find due to the dynamic nature of the page content in Fashion Store.
Nevertheless, below you can find the most common issues we found so far, and how we were able to fix them.

After that, we also briefly share some tips and tricks about the debugging process. Because - as you may also know if you have faced these errors in your projects - debugging them is not always a straightforward task, and to be honest, React's error logs (especially coming from the production environment) aren't very helpful!

Main types of issues we faced, and suggested solutions

Before going through details of each type, in some cases we realized that based on product requirements, one might actually not need to render some content on SSR (Server Side Rendering) and only the CSR (Client Side Rendering) would be enough.
Hence the obvious fix might be to just skip rendering on SSR and only show the content once the app is mounted on the user's browser.

To do that, we can rely on React hooks and lifecycle methods to ensure the app/component has been mounted on the browser. For example:

Instead of

  //...
  const { dataThatDiffersBetweenClientAndServer } = props;
  return (
    <div>{dataThatDiffersBetweenClientAndServer}</div>
  );

Do

  //...
  // Becomes true only after the component has mounted in the browser.
  const [isMounted, setIsMounted] = React.useState(false);
  React.useEffect(() => {
    setIsMounted(true);
  }, []);
  const { dataThatDiffersBetweenClientAndServer } = props;

  return (
    <div>{isMounted ? dataThatDiffersBetweenClientAndServer : "some fallback" /* or null */}</div>
  );

There are similar cases where due to the basic differences between the SSR and the CSR, like some data only being available on client side, one might need to render different content or elements on the two. For example, based on the exact specifications of the user's device, you want to display an app download banner.

For these scenarios, the suggestion would again be to simply wait until the initial hydration phase is finished on the client side, and then render the different content.

Note: in such cases, be mindful of layout shifts that can happen as a result of some element popping into the view.

With that out of the way, let's dive into the list of issues.

1. Timers

This is a common and somewhat expected source of hydration mismatch issues simply because if you're calculating and rendering the distance between two specific points in time (usually from past/future to now), it will result in slightly different values when calculated on SSR compared to a few moments later on CSR.

As also mentioned in React docs, in such cases where the mismatch is unavoidable, the suggestion is to simply tell React that the difference is expected and that React should ignore the mismatch during hydration. The way to do this is by passing the prop suppressHydrationWarning={true} to the element that contains such a mismatch. Keep in mind that this prop only works one level deep, so you have to pass it to the closest element wrapping the mismatching text. For example:

Instead of

  //...
  const timeDistance = targetDate.getTime() - Date.now();
  return (
    <div>{timeDistance}</div>
  );

Do

  //...
  const timeDistance = targetDate.getTime() - Date.now();
  return (
    <div suppressHydrationWarning={true}>{timeDistance}</div>
  );

2. Localization of dates and different time-zones

Converting date values from raw formats (e.g. the ISO 8601 string 2023-01-01T20:00:00.000Z) to human-readable strings can be a tricky cause of hydration mismatch errors: if the timezone used for the conversion differs between the server and the client, the resulting values can differ as well.

So for example if the timezone is not specified while using the localization APIs (e.g. Intl.DateTimeFormat or Date.prototype.toLocaleString), then the host timezone will be used and if the SSR server has a different timezone than the user, it will lead to different localized date values in the end.
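The effect is easy to reproduce in isolation. The sketch below simulates a server and a client in different timezones by passing the zones explicitly. The formatDate helper and the chosen zones and locale are assumptions made for illustration.

```typescript
// Hypothetical helper: formats a date for a given locale and explicit timezone.
function formatDate(date: Date, locale: string, timeZone: string): string {
  return new Intl.DateTimeFormat(locale, {
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
    hour: "2-digit",
    minute: "2-digit",
    timeZone,
  }).format(date);
}

const someDate = new Date("2023-01-01T20:00:00.000Z");

// Stand-ins for "host timezone of the SSR server" and "host timezone of the user".
const renderedOnServer = formatDate(someDate, "en-GB", "UTC");
const renderedOnClient = formatDate(someDate, "en-GB", "Asia/Tokyo");

// The two strings differ (even the calendar day is different here),
// which is exactly what React reports as a text mismatch during hydration.
const wouldMismatch = renderedOnServer !== renderedOnClient;

// With one agreed timezone on both sides, the output is identical.
const agreedOnServer = formatDate(someDate, "en-GB", "Europe/Berlin");
const agreedOnClient = formatDate(someDate, "en-GB", "Europe/Berlin");
const wouldMatch = agreedOnServer === agreedOnClient;
```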

It's hard to decide what the best solution is in these cases especially because as of now it is not possible to know the exact local timezone of the user on SSR based on http headers (in the initial request).
On top of that, the question of which timezone to use for displaying dates is ultimately a product decision.

But if a specific universal timezone is approved and provided (for example the website's domain's matching timezone), then specifying that universal timezone to the conversion APIs on both the client and server code can fix this issue. Meaning:

Instead of

  //...
  return (
    <div>
      {someDate.toLocaleString(locale)}
      {new Intl.DateTimeFormat(locale).format(someDate)}
    </div>
  );

Do

  //...
  return (
    <div>
      {someDate.toLocaleString(locale, { timeZone: universalTimezone })}
      {new Intl.DateTimeFormat(locale, { timeZone: universalTimezone }).format(someDate)}
    </div>
  );

That being said, depending on the situation and product requirements, an alternative approach would be to just move the conversion to the backend so that the client simply receives dates in the localized format - which has passed through timezone transformation (and localisation).

3. Localization of numbers

(and a Safari bug for "de-AT" locale!)

Similar to converting dates and importance of timezones, when converting raw numbers to localized human-readable strings (e.g. 12345 to "12,345") if the locale is not specified, then the host's locale will be used and it can lead to different results. So it's important to always pass a universal locale to these APIs which is consistent during server and client rendering:

Instead of

  //...
  return (
    <div>
      {someNumber.toLocaleString()}
      {new Intl.NumberFormat().format(someNumber)}
    </div>
  );

Do

  //...
  return (
    <div>
      {someNumber.toLocaleString(universalLocale)}
      {new Intl.NumberFormat(universalLocale).format(someNumber)}
    </div>
  );
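As a small illustration of why the locale has to be explicit, the sketch below formats the same number for two locales. The formatCount helper is a hypothetical name; the point is only that the output depends on the locale, so relying on the host default can differ between server and browser.

```typescript
// Hypothetical helper: always formats with an explicitly provided locale.
function formatCount(value: number, locale: string): string {
  return new Intl.NumberFormat(locale).format(value);
}

const english = formatCount(12345, "en-US"); // "12,345"
const german = formatCount(12345, "de-DE");  // "12.345"

// Same input, different text: if one side fell back to its host locale,
// the server-rendered and client-rendered strings would not match.
const differsByLocale = english !== german;
```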

But in very specific cases, we observed that the localization APIs behave differently between SSR and CSR, which again leads to different values and thus hydration mismatches!

We particularly encountered this issue with the Safari browser, where for the de-AT locale the localization APIs (like Intl.NumberFormat or toLocaleString) generate values like "2.345", while other browsers including Chrome and Firefox, as well as Node.js, generate values like "2 345" for the same locale!

So an alternative approach in these cases would be to receive the final localized values from the backend and show that to the user without needing any more modifications, thus eliminating the mismatches.

4. Invalid HTML nesting

This issue might be a new cause of hydration mismatches in React 18. It results from invalid HTML, such as nesting a <div> inside a <p>, or a <button> inside a <button>. We couldn't find clear documentation from React explaining why HTML validity issues lead to hydration mismatch errors (aside from community discussions). Regardless, adding markup validation steps (such as an ESLint plugin) can help to avoid them.

Either way, in such cases the obvious goal is to use semantically correct HTML elements while nesting. For example:

Instead of

  //...
  return (
    <div>
      <p><div>Some text</div></p>
      <button><button>Button text</button></button>
    </div>
  );

Do

  //...
  return (
    <div>
      <p><span>Some text</span></p>
      <button><span>Button text</span></button>
    </div>
  );
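A likely explanation for these mismatches is that the browser's HTML parser repairs invalid markup (for example, by closing the <p> before the <div>), so the resulting DOM no longer matches the tree React rendered. The sketch below shows how a simple check for the two cases above could look. It is a deliberately simplified, hypothetical checker that covers only a few content-model rules and is not a replacement for a real markup validator.

```typescript
// Simplified, hypothetical checker: it knows only two rules from the HTML spec.
interface MarkupNode {
  tag: string;
  children?: MarkupNode[];
}

// A <p> may only contain phrasing content, so these elements cannot appear inside it.
const BLOCK_TAGS = new Set(["div", "p", "ul", "ol", "table", "section", "article"]);
// Interactive elements must not contain other interactive elements.
const INTERACTIVE_TAGS = new Set(["a", "button"]);

function findInvalidNesting(node: MarkupNode, ancestors: string[] = []): string[] {
  const problems: string[] = [];

  if (BLOCK_TAGS.has(node.tag) && ancestors.includes("p")) {
    problems.push(`<${node.tag}> inside <p>`);
  }

  const interactiveAncestor = ancestors.find((tag) => INTERACTIVE_TAGS.has(tag));
  if (INTERACTIVE_TAGS.has(node.tag) && interactiveAncestor) {
    problems.push(`<${node.tag}> inside <${interactiveAncestor}>`);
  }

  // Walk the children, remembering the current tag as an ancestor.
  for (const child of node.children ?? []) {
    problems.push(...findInvalidNesting(child, [...ancestors, node.tag]));
  }
  return problems;
}

// The invalid example from above, expressed as a tree.
const invalidTree: MarkupNode = {
  tag: "div",
  children: [
    { tag: "p", children: [{ tag: "div" }] },
    { tag: "button", children: [{ tag: "button" }] },
  ],
};

const problems = findInvalidNesting(invalidTree);
// ["<div> inside <p>", "<button> inside <button>"]
```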

Some debugging tips & tricks

Soon after receiving the new hydration mismatch logs in our error tracking system (Sentry), it was clear that the most important first step in debugging them is finding out whether we can reproduce them or not!
Because of how React reports hydration errors in its production bundle, the error messages in Sentry contain little detail. Including the componentStack from hydrateRoot's onRecoverableError callback in the logs comes in quite handy (especially after cleaning the stack a bit to make it more readable). But due to code minification in the production bundle of your application, you still have to use the provided line/column numbers together with source maps to find the closest components.

On top of that, if a website has dynamic content served to each user like Zalando Fashion Store, it may be even harder to reproduce the exact page (with the same content) that was receiving a specific error.

Another issue we encountered was that the onRecoverableError callback is usually called multiple times by React for a single hydration mismatch problem, both polluting our Sentry logs as well as making the debugging process harder.
This seems to be due to the way hydration phase works, in which React compares a list of available server rendered DOM nodes with a list of client rendered React elements ("fibers") and tries to match them together and basically hydrate the nodes. And when matching and hydration fails for a specific node instance and errors are logged, it tries to hydrate the next one. What we observed here was that (at least in some cases) because of the previous mismatching node/fiber, the order of the lists becomes broken, and that leads to all the next ones failing as well. And that means a lot of other hydration mismatch error logs which aren't necessarily correct.
To mitigate this in the production environment, we modified our error tracking code to only send the first hydration error log to Sentry. We also found this to be very helpful to keep in mind during development debugging.
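A minimal sketch of such a "first error only" reporter is shown below. The function name, the send callback (a stand-in for the error tracking client) and the maxStackLines option are assumptions made for illustration; only the (error, errorInfo) callback shape with an optional componentStack follows React 18's onRecoverableError.

```typescript
// Hypothetical helper: `send` stands in for whatever error tracking client is used.
interface RecoverableErrorInfo {
  componentStack?: string;
}

function createFirstErrorReporter(
  send: (error: unknown, componentStack: string) => void,
  maxStackLines = 10,
): (error: unknown, errorInfo: RecoverableErrorInfo) => void {
  let alreadyReported = false;

  return function onRecoverableError(error, errorInfo) {
    // Follow-up errors are often just a consequence of the first mismatch.
    if (alreadyReported) return;
    alreadyReported = true;

    // Light clean-up: trim whitespace, drop empty lines, cap the stack length.
    const stack = (errorInfo.componentStack ?? "")
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.length > 0)
      .slice(0, maxStackLines)
      .join("\n");

    send(error, stack);
  };
}

// Usage with a fake transport that just collects what would be sent.
const sent: Array<{ error: unknown; stack: string }> = [];
const reporter = createFirstErrorReporter((error, stack) => sent.push({ error, stack }), 2);

reporter(new Error("first mismatch"), { componentStack: "\n    at Price\n    at ProductCard\n    at Page" });
reporter(new Error("follow-up mismatch"), { componentStack: "\n    at Footer" });
// Only the first error is collected, with a stack of at most two lines.
```

The returned function could be passed as the onRecoverableError option of hydrateRoot, so that the deduplication lives in one place.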

But in case reproducing the error locally is possible, then we found these steps to be helpful:

  • Work on the first error log, and after it's fixed, check if any other one remains.
  • Based on the log and the componentStack, find the closest component(s) causing the issue.
  • In some cases the cause of the issue is obvious in the specified component's source code - for example the issue number 4 mentioned above (Invalid HTML nesting).
    • With HTML nesting issues, the log usually contains the text validateDOMNesting(...).
  • In other cases where the cause is not very obvious, what we found helpful was to open the React dev bundle (react-dom/umd/react-dom.development.js) and set breakpoints in the places that log the hydration errors (usually the checkForUnmatchedText or throwOnHydrationMismatch functions).
    • Then, by loading the page, try to find out which exact React fiber causes the issue, and based on that find the component/element. Don't be afraid to go higher in the stack and set more breakpoints!
    • In some cases we realized that the fiber is the same element that caused the issue, but in others, it's more confusing as the fiber is something that was rendered after a mismatching (usually missing) node instance that was the actual cause of the issue.
    • Here it also helps to check different variables like fiber, nextInstance, current, etc. including their received props.

Conclusion

The migration to React 18 and its concurrent features was of extra importance for our Rendering Engine framework due to its unique architecture. And despite the challenges, the results have been promising so far, especially since we observed improvements over Fashion Store website’s Core Web Vitals and bounce rate.

Additionally, the upgrade shone a light on the hidden hydration mismatch issues scattered across different components, which led us to not only fix many of them, but also collect and internally document them along with recommendations and debugging tips for further reference.

Next Steps

We are planning to share more detailed posts in the future about the architecture and technical specs of Rendering Engine - especially in light of the Concurrent features.
Additionally, we aim to share the effects of the new features and the final architecture on Zalando Fashion Store's performance.

Next up, we're excited to start using React Server Components which have shown great promise so far. Stay tuned!


