How to Measure a Design System at Scale

The Uber Rider app launches features simultaneously on a global scale, changing details across hundreds of screens using thousands of feature flags. It is no longer possible for any designer, engineer, quality assurance specialist, or product manager to fully visualize every single user flow. Uber needs an observability system of similar scale for measuring design quality to prevent a subpar user experience, especially when it comes to adopting the existing UI libraries and accessibility best practices packaged under Uber’s design system, Base. Without such an observability system (let’s call it Design System Observability), Uber might learn only from complaints and media coverage, when it is already too late, that end users were suffering through confusing onboarding rides, inconsistent layouts, and frustrating VoiceOver/TalkBack sessions.

Design System Observability consists of two main components: an eye and an ear.

It is often hard to tell with the naked eye how the final implementation in the actual app differs from the design spec handoff.

Figure 1: Designer-eye challenge. Without the markings, very few could tell that the “Confirm your pickup spot” sheet and the back button implementations (left) did not match the Base design handoff (right).

Figure 2: All Base UI library components were marked as green while one-off custom components were marked as red. This helped both designers and non-designers see what can be made with Base.

At Uber, we strive to reuse components that have already been built. Hence, it has become critical for us to measure this reuse as consistently as any other engineering quality metric, such as test coverage, downtime, or latency. Base is Uber’s design system, with components shared across design and code. This provides a consistent user experience, with benefits such as a reduced learning curve and accessibility coming for free. Based on internal research, using Base components instead of custom components yields 3X faster development, 4X fewer visual parity issues, and 50% less code. Moreover, future changes like theme and typography updates take only a few lines of code and roll out in a matter of weeks, not months. Thus, it is important to visualize and measure Base adoption.

Figure 3: A deterministic counter rewards usage of Base components and discourages usage of one-off custom components in a consistent manner. Non-Base denotes a layout that looks like Base but is not built with Base.

Users turn on Base Counter, see different elements on the screen get highlighted, and understand what can be improved. This is a first-of-its-kind deterministic measurement with visual highlights for design system adoption. Instead of relying on a small group of experts who understand design quality, thousands of people from different functions can now measure their own screens and open work items to improve them, making quality at scale possible.

Figure 4: The counting algorithm traverses view trees looking for known component classes.

There are three major steps in the Base Counter tooling: Trigger, Counting Algorithm, and Decorator.

Trigger for counting: We use an internal framework to capture screen-level navigation and trigger the start of the Base Counter. Periodic screen updates automatically trigger Base counting and help us quantify the Base stats for a screen without manual intervention. Using screen changes as the trigger ensures that the Base Counter runs only when required.

Counting algorithm: Once the tool receives a trigger, it starts counting the Base stats for the screen. We first need to figure out which screen is currently being shown to the user. Starting from the application window, we find the topmost view controller on display. From there, we run a postorder depth-first search (DFS) traversal of the view tree, starting at the root view of that view controller.
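To make the traversal concrete, here is a minimal sketch in Python rather than in the platform code the tool actually runs in. The `View` class, the `BASE_COMPONENT_CLASSES` registry, and the rule that every node counts as either Base or custom are all assumptions made for illustration; the production tool likely applies more nuanced rules, for example around plain container views.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical registry of class names that count as Base components.
BASE_COMPONENT_CLASSES = {"BaseButton", "BaseLabel", "BaseSheet"}


@dataclass
class View:
    """Minimal stand-in for a platform view node (hypothetical)."""
    class_name: str
    subviews: list[View] = field(default_factory=list)


@dataclass
class BaseStats:
    base: int = 0
    custom: int = 0
    visited: list[str] = field(default_factory=list)  # visit order, for inspection


def count_base(root: View) -> BaseStats:
    """Walk the view tree in postorder and tally Base vs. custom nodes."""
    stats = BaseStats()

    def visit(view: View) -> None:
        # Postorder: handle all subviews before the view itself.
        for child in view.subviews:
            visit(child)
        stats.visited.append(view.class_name)
        if view.class_name in BASE_COMPONENT_CLASSES:
            stats.base += 1
        else:
            stats.custom += 1

    visit(root)
    return stats
```

For a root `UIView` holding a `BaseButton` and a `CustomBanner` that wraps a `BaseLabel`, this sketch visits the leaves first and the root last, and reports two Base nodes and two custom nodes.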

Decorator: Decorators were built to give developers visualizations and to keep the architecture extensible, so that it can support future use cases as they arise. Once view nodes are identified, the coloring node decorator colors them. All decorators implement the decorator protocol.
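The decorator idea can be sketched as follows. This is an illustration in Python under stated assumptions, not the tool's actual interface: the `NodeDecorator` protocol, its `decorate` signature, the dictionary-based node, and the `LoggingDecorator` extension are all hypothetical. Only the green and red color convention comes from Figure 2.

```python
from typing import Protocol


class NodeDecorator(Protocol):
    """Anything with a matching decorate() method satisfies this protocol."""

    def decorate(self, node: dict, is_base: bool) -> None: ...


class ColoringDecorator:
    """Marks Base nodes green and one-off custom nodes red."""

    def decorate(self, node: dict, is_base: bool) -> None:
        node["highlight"] = "green" if is_base else "red"


class LoggingDecorator:
    """Hypothetical later extension: records each node instead of drawing."""

    def __init__(self) -> None:
        self.seen: list[str] = []

    def decorate(self, node: dict, is_base: bool) -> None:
        kind = "base" if is_base else "custom"
        self.seen.append(f"{node['class_name']}:{kind}")


def apply_decorators(
    node: dict, is_base: bool, decorators: list[NodeDecorator]
) -> None:
    """Run every registered decorator over an already-classified node."""
    for decorator in decorators:
        decorator.decorate(node, is_base)
```

Because the traversal only depends on the protocol, a new visualization or export can be added by writing another decorator, without touching the counting code.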

Figure 5: Visitor implementations are used to perform the postorder DFS traversal.

Figure 6: The same deterministic counters were used across design and engineering workflows. Everyone sharing the same goal of 100% is the key to quality at scale.

Our teammates often ask each other questions like: “Have you heard about that new home screen launch for the India market?” or “When did that Uber shortcuts experience launch?”

In apps like Uber, it’s common to have a wide variety of user experiences coexisting, due to its massive user base spread across different regions. The feature teams are also spread across Uber offices around the world and work independently. Each feature team conducts A/B experiments to determine the design that offers the best user engagement. Additionally, legal requirements may necessitate displaying certain screens in specific regions of the world. This makes it challenging to objectively tag a screen with a single Base stats metric at scale, as changes in its UI composition result in different Base stats for the same screen.

We experimented with various aggregation techniques and ultimately decided to use the mode of all the baseline metrics as the single metric for tagging a screen. The mode automatically captures what the largest share of users actually sees on that screen.
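A small sketch shows why the mode is a natural fit. Suppose a screen has two variants scoring 100 and 60: a mean of 80 would describe a screen that no user ever sees, while the mode reports the score of the most commonly viewed variant. The function name, the integer percentage scores, and the tie-breaking rule below are assumptions for illustration.

```python
from collections import Counter


def screen_score(observed_scores: list[int]) -> int:
    """Return the most frequently observed score for one screen.

    Ties are broken by taking the lowest tied score, an assumed
    conservative choice that is not described in the article.
    """
    if not observed_scores:
        raise ValueError("at least one observation is required")
    counts = Counter(observed_scores)
    top_frequency = max(counts.values())
    return min(score for score, n in counts.items() if n == top_frequency)
```

With observations `[80, 80, 100, 80, 60]`, the function returns 80, the score of the variant seen most often.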

With the Base Counter tooling, we have what we need to calculate the score for a screen, but we would still have to run the count manually on each one. We need a pipeline to measure thousands of different screens.

We deployed two complementary approaches to automation. The first gathers broad statistics through analytics events that are triggered by default for all internal testers. The second uses testing frameworks to take screenshots and run in-depth analysis, including which components were used and which known custom components we want to track.

Figure 7: Broad statistics were aggregated across all users, providing us with crowdsourced data that reflected real-world usage.

Figure 8: Existing E2E test suites were utilized to take regular screenshots and calculate detailed design metrics.

With each screen having a defined quality score, we can track it on daily CD builds. Any violation results in an automated Jira ticket assigned to the Design and Engineering managers who own the screen. By combining this automated process with a human Feature Review Readiness process, in which stakeholders verify the scores, we ensure that future development only improves design quality and never degrades it.
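The daily check can be pictured as a comparison between each screen's recorded score and the score from the latest build. The sketch below is an assumption-laden illustration: it treats a violation as any drop below the recorded baseline, which the article does not define, and it only builds plain ticket descriptions rather than calling any real ticketing API. All names are hypothetical.

```python
def find_violations(baseline: dict[str, int], today: dict[str, int]) -> list[dict]:
    """Return one ticket description per screen whose score dropped.

    Screens without a recorded baseline are skipped, since there is
    nothing to regress against yet (assumed behavior).
    """
    tickets = []
    for screen in sorted(today):
        if screen not in baseline:
            continue
        before, after = baseline[screen], today[screen]
        if after < before:
            tickets.append(
                {
                    "screen": screen,
                    "summary": f"Base score for '{screen}' dropped from {before} to {after}",
                }
            )
    return tickets
```

A screen that improves or holds steady produces no ticket, so the check acts as a ratchet: scores are allowed to move up, and any move down is surfaced to the owners.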

The new eye and ear have helped every product team of Uber Rider drive towards the same goal.

Figure 9: Real impact: many engineering teams improved Uber screens in seamless coordination thanks to Design System Observability (measured in June 2024, Rider iOS app).

Figure 10: After hearing about an accidental design degradation caused by an infrastructure migration in early 2024, our engineering teams jumped in to fix it quickly.

We believe Design System Observability is a must-have for any technology organization that needs both velocity and quality. Here are our learnings:

While all teams expressed enthusiasm for applying a Design System, most have competing priorities. Hence, it is important to have a shared and trackable OKR.
There’s a lack of shared understanding about what a Design System truly encompasses. Some believe that simply using a text style or a color style is enough, while others attempt to recreate a component’s appearance without leveraging the existing code. Defining Design System metrics means defining an organization’s design quality expectations.
Always assume people have good intentions and that degradation is caused by a lack of guardrails. The earlier an issue is caught in the product development pipeline, the fewer days it takes to fix.

At the end of the day, all Uber teams want to improve user experience, but often get lost in finding design resources or reaching out to other teams. These metrics were excellent conversation starters for both the Design and Engineering teams to improve their checkpoint and handoff experiences.

Our biggest win has been elevating design metrics to be as important as engineering and business metrics. It was a journey of steady progress, starting with building awareness around Base, our design system. We hosted Base Race challenges to encourage adoption, and eventually secured executive buy-in for an adoption push. Trust was built through countless hours of manual audits with domain experts. We addressed pushback, refined our methodology, and aligned more managers. Today, what once took hours of manual work each month is an automated system accessible to everyone, highlighting the impact of our design system journey.

Figure 11: The Design System metric is now a trackable business-impact metric.

The work described above covers the Rider Android and iOS apps, and we will apply the same approach to the broader Uber product portfolio. In the future, as Uber constantly evolves different technology stacks within Android, iOS, and Web, Design System Observability will be an effective system to stop reinventing UI components, launch accessible features, and roll out high-quality design to thousands of screens quickly. We also have many partners in the company, from content designers to researchers, who are interested in building on top of our system to deliver best practices at scale. More metrics to come.

Special thanks to Mohit Gupta, Reshma Naik, Anukalp Katyal, Arun Babu A S P, Lucia Pineda, and other colleagues for their technical contributions to the Design System Observability infrastructure.

Last but not least, this project would not have happened without the love and attention to detail of our Design System team at base.uber.com and the support of the Uber Design organization.