Improving Uber Eats Home Feed Recommendations via Debiased Relevance Predictions
Uber Eats’ mission is to make eating effortless, at any time, for anyone. The Uber Eats home feed is an important tool for fulfilling this goal, as it aims to provide a magical food browsing experience by leveraging machine learning techniques to build a personalized list of stores for each user. For example, if a user frequently orders sushi dishes, the feed will adapt by showing them more Japanese restaurants, especially those with highly rated sushi dishes. The personalized recommendations may also contain suggestions for similar-but-novel options, like seafood or other Asian cuisines.
To achieve a high-quality personalized feed, a metric we prioritize is accurately estimating conversion rates (abbreviated as CVR in the remainder of this blog post): the probability of an eater ordering from a particular store after it is shown to them on the home feed. To estimate this quantity, we rely on an ML model trained on user interaction data such as impressions, clicks, and orders. However, the interaction data itself does not always perfectly reflect our users’ preferences, as it suffers from a wide range of statistical biases. Since ML models are only as good as the data they are trained on, these biases can have a strong detrimental effect on the quality of the rankings our models generate. When we use the term bias in this blog, we are referring to statistical bias.
In the recommendation systems literature, multiple biases impacting ranking quality have been thoroughly studied, such as position bias, trust bias, quality-of-context bias, selection bias, neighboring bias, and feedback loop bias; see [1-3, 5, 9] and references therein for details.
In this blog post, we focus on tackling arguably the most important of these biases: position bias. Position bias refers to the phenomenon in which users tend to order more from higher-ranked stores than from lower-ranked ones, irrespective of how relevant each store truly is to the user. As we discuss in the subsequent sections, position bias is prevalent in Uber Eats’ home feed ranking problem, and without special treatment, a CVR model trained on biased data fails to capture users’ real intentions, since the model cannot distinguish the impact of bias from genuine preference in users’ ordering behavior.
Position bias is a phenomenon that can be described with relative ease, but it is not always as straightforward to accurately measure or visualize. Typically, scholars and practitioners measure position bias by modifying the ranked recommendations in a systematic manner and comparing the results to the unmodified ones (see [6, 7]). At Uber Eats, we follow a similar approach: on a small percentage of our traffic, we randomly permute the order of stores across a significant portion of the home feed so that we can rigorously measure position bias. Since our top recommendations are generally highly relevant to each user, this rearrangement of the feed for a small percentage of traffic does not negatively impact the overall discovery experience.
Due to this selective randomization, for this small percentage of traffic the expected true user-store relevance is identical across the top positions of our feed, rather than decreasing with rank. Therefore, if we plot the empirical CVR (defined as the number of orders divided by the number of impressions) for each vertical position, any deviation we observe from a flat horizontal line can be attributed to preexisting biases in our data, including position bias.
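This per-position diagnostic is straightforward to compute from logged data. The sketch below is illustrative (the function name and the toy log format are our own, not from a production pipeline): it aggregates orders and impressions by vertical position and returns the empirical CVR curve.

```python
from collections import defaultdict

def empirical_cvr_by_position(impressions):
    """Compute empirical CVR (orders / impressions) per vertical position.

    `impressions` is an iterable of (position, ordered) pairs, where
    `ordered` is 1 if the impression led to an order and 0 otherwise.
    """
    counts = defaultdict(lambda: [0, 0])  # position -> [orders, impressions]
    for position, ordered in impressions:
        counts[position][0] += ordered
        counts[position][1] += 1
    return {pos: orders / total for pos, (orders, total) in sorted(counts.items())}

# On randomized traffic, any deviation of this curve from a flat line
# reflects position bias rather than relevance differences.
logs = [(0, 1), (0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 0)]
print(empirical_cvr_by_position(logs))  # {0: 0.666..., 1: 0.5, 2: 0.0}
```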
Figure 1 shows empirical CVR plotted against store vertical position for this small proportion of traffic where store position is randomly permuted. As can be seen, even when the expected store relevance is identical across vertical positions, there is still a clear user preference to place orders from stores at the top of the feed compared to lower ranked ones, demonstrating the significant impact of position bias in user behavior and in our data. Consequently, our relevance ranking models may be suffering from a positive feedback loop, whereby stores near the top of the feed get more orders, encouraging the model to learn that they are relevant, which then ranks them highly in future sessions (naturally, the opposite effect is observed for stores that are low ranked). Based on this observation and the magnitude of the measured bias, we deduced that there is clear headroom to improve the quality of our CVR model predictions by employing an approach to remove or reduce the impact of position bias in our data.
Figure 1: Smoothed empirical CVR vs. vertical position on randomized traffic
Our ranking algorithm relies heavily on an accurate prediction of each user’s preference for each store, which we represent as the conversion rate for a [user, store, context] impression. Ideally, we would like to estimate an unobservable quantity, True CVR, which is the ground truth measure of how relevant a specific store is to the user in the given context, regardless of any confounding factors that may be present. However, in the impression data we collect, impressions occur at different positions and surfaces even for the same user-store pair, and each of these impressions suffers from position bias to a different extent. Consequently, when we train our CVR model on this biased data, the model learns a Biased CVR as opposed to the desired True CVR. Our goal is to bridge this gap and allow our models to learn the True CVR instead.
To build a debiased ML model for predicting True CVR, we must determine a way to model the relationship between True CVR, Biased CVR, and position bias. The problem boils down to understanding why users are less likely to order from lower positions, even for the same store. As we thought about this problem, we realized that there is a missing step in the chain from an impression to an order: examination. An impression occurs when a store is shown to the user, or rendered, in the Uber Eats app or web browser. However, even if the store is physically rendered in pixels, that does not mean that the user saw those pixels or inspected their content. One step down the intent ladder, an examination occurs when a user deliberately chooses to inspect the listing with the intent to potentially place an order from the store shown to them. An examination does not require a click; it could entail simply looking at the store name, cuisine type, or image, or inspecting which dishes are available inside the store. Crucially, not every impression leads to an examination, and not every examination leads to an order. We describe this phenomenon using an examination model, in which we break down a user’s entire ordering process into 3 stages, as shown in Figure 2:
Figure 2: The Examination Model
There are many theories and frameworks explaining how impressions convert to examinations; see [2, 4]. For our specific use case, we believe the two most crucial mechanisms are the sequential nature of recommendations and attention decay. Because recommendations are presented sequentially and users tend to read from top to bottom, higher-ranked stores are more likely to be examined first, and once a user finds a suitably relevant store, those ranked lower are much less likely to be examined. Attention decay refers to users’ tendency to inspect top stores more carefully than lower-ranked ones, gradually paying less attention to the recommendations as they scroll down. Though not the only factors, these two phenomena largely drive the observed position bias that determines whether an impression converts to an examination.
The vertical position of a store impression plays a crucial role in position bias, but, perhaps surprisingly, position itself is not the only factor determining its magnitude. Position bias can be influenced by many secondary factors. For example, different operating systems (OS) and devices have different user interfaces and layouts, changing how vertical position is perceived and how it affects position bias. Similarly, whether a store is presented as a single store card or as part of a collection of store cards in a carousel also influences how willing users are to direct their attention to each recommendation. Figure 3 below illustrates how device OS and feed item type impact observed position bias as measured on randomized traffic. The fact that position bias is a function of so many variables, some of which may be difficult to observe, suggests that a one-size-fits-all or heuristics-based approach to dealing with position bias may not be sufficient.
Figure 3: Position Bias as a function of Vertical Position, Device OS and Feed Item Type
So far we have described how position bias may affect whether an impression leads to an examination. We also explained why we would ideally like to predict True CVR, the ground truth probability of a conversion after the store is examined. In contrast, Biased CVR is the empirical CVR we observe in our training data, which is generated by user impressions. Putting these terms into a more rigorous framework, we obtain the following relationship:

Biased CVR = P(Examination) × True CVR     (1)
Here, P(Examination), or the probability of examination, is how position bias manifests itself in our data and affects our models’ predictions. Moving forward, we will refer to this probability as Position Bias.
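A small numerical illustration of equation (1), with hypothetical numbers (the examination probabilities below are made up for illustration): the same store, with the same underlying relevance, produces very different observed CVRs depending on where it is shown.

```python
# Hypothetical numbers illustrating equation (1): the CVR we observe at
# each position is the true CVR scaled by the probability that the
# impression is actually examined at that position.
true_cvr = 0.10  # P(Order | Examination): a property of the user-store pair
p_examination = {0: 0.9, 1: 0.6, 2: 0.3}  # Position Bias, decaying with rank

biased_cvr = {pos: p * true_cvr for pos, p in p_examination.items()}
print(biased_cvr)  # the same store looks less relevant at lower positions
```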
Now that we have extensively described what position bias looks like in Uber Eats’ ranking ecosystem and demonstrated its impact, we are ready to describe how to handle it in the next section.
To accurately estimate True CVR, we would ideally train our models solely on examination data. Unfortunately, we have no way of identifying whether an impression led to an examination, so this is not possible. Instead, we need a way to debias the impression data so that the resulting CVR estimates are as close as possible to True CVR.
Our team has made several efforts over the years to mitigate the effect of position bias in feed ranking. These included training models only on data collected from the randomized segment of our traffic, using the inverse propensity weighting (IPW) framework to appropriately weight data points during training, and using the vertical position directly as a feature, among other approaches. These attempts helped us steadily build valuable insights about the position bias problem in our use case and develop increasingly effective methods to deal with it. For instance, we learned that position bias is not only a function of vertical position, but is also influenced by additional factors such as device OS and feed item type, as established in the previous section. Similarly, we found that our treatment of position bias should not significantly change the magnitude of CVR predictions, as this can generate downstream issues, in particular related to model calibration.
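For context, the IPW idea mentioned above can be sketched in a few lines. This is a generic illustration of inverse propensity weighting, not our production implementation; the function name, the per-position propensity table, and the clipping threshold are all assumptions for the example.

```python
def ipw_weight(position, propensity_by_position, clip=10.0):
    """Inverse propensity weight for an impression at a given position.

    `propensity_by_position` holds P(Examination) estimates per position,
    e.g. measured on randomized traffic. Weights are clipped to limit the
    variance introduced by very small propensities.
    """
    propensity = propensity_by_position[position]
    return min(1.0 / propensity, clip)

# Impressions at lower positions receive larger training weights,
# counteracting the fact that they are examined less often.
propensities = {0: 0.9, 1: 0.6, 2: 0.3}
weights = [ipw_weight(pos, propensities) for pos in [0, 1, 2]]
```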
Based on our past learnings, we deduced that we needed a model that leverages the interaction of the position feature with other features during training, but does not require it during online inference. With some minor modifications, the position-bias-aware learning framework described in [8] is a good candidate for our use case. Inspired by the approach in [8], we built a deep learning CVR model with a position bias side tower, which allows us to simultaneously estimate True CVR and Position Bias under the examination model shown in Figure 2. The architecture of this model naturally follows equation (1) and is depicted in Figure 4. The model comprises two separate deep neural network (DNN) towers that estimate the probability of examination P(Examination = 1) and True CVR, respectively. The output logits of these towers are summed before being passed through a sigmoid, which, in probability space, is akin to multiplying the probabilities, but is computationally more robust. We pose the problem as a binary classification task and train the model with binary cross-entropy loss on the biased impression data, where each row is an impression of a store by a user and the label indicates whether the impression resulted in an order.
Figure 4: Deep learning CVR model with position bias side tower. The illustration depicts the overall structure, but is not representative of the specific architecture used in production.
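The forward pass described above can be sketched as follows. This is a deliberately minimal NumPy illustration, not the production architecture: linear projections stand in for the two DNN towers, and the feature split (user/store/context features for the CVR tower, position-related features for the bias tower) follows the description in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(relevance_feats, bias_feats, w_cvr, w_bias):
    """Forward pass of a two-tower CVR model (illustrative linear 'towers').

    The CVR tower sees user/store/context features; the bias side tower
    sees only position-related features (vertical position, device OS,
    feed item type). Their output logits are summed before a single
    sigmoid, mirroring the product of Position Bias and True CVR in
    equation (1).
    """
    cvr_logit = relevance_feats @ w_cvr   # stands in for the deep CVR tower
    bias_logit = bias_feats @ w_bias      # stands in for the bias side tower
    train_prob = sigmoid(cvr_logit + bias_logit)  # trained against order labels
    serve_prob = sigmoid(cvr_logit)       # online serving: True CVR tower only
    return train_prob, serve_prob
```

During training, the summed-logit output is fit with binary cross-entropy against the order labels; at serving time only the CVR tower's output is used for ranking, so position never enters online inference.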
While we do not have data that directly measures True CVR or Position Bias, if the examination model given in equation (1) holds, the DNN can learn the individual components of equation (1), True CVR and Position Bias, as the outputs of its two towers. Once training is complete, during online serving we rank stores according to the True CVR estimates predicted by the model, which reflect users’ preferences more accurately because they are not corrupted by position bias.
One key learning from our attempts to train this model is that if the two towers share common features, the bias tower may learn information related to True CVR, which makes the CVR tower’s predictions less accurate. To resolve this, we applied various regularization techniques to the bias tower, such as L1 regularization and dropout. Since the CVR tower’s architecture is much more complex than the bias tower’s, we found that once the bias tower is properly regularized, it is difficult for it to learn information related to True CVR. At the same time, the bias tower has exclusive access to features that are much more related to Position Bias than to True CVR, which allows it to learn Position Bias far more easily than the CVR tower could. This configuration of features helps each tower learn only the information relevant to its intended task, and is key to successfully generating debiased True CVR predictions.
We trained a CVR model following the deep learning architecture described in the previous section and found that it does a good job of extracting the impact of position bias from our training data: offline analyses conducted on randomized traffic show that the True CVR predictions of the position-debiased model are not correlated with the vertical positions of the observations. In contrast, the prediction scores of our production model were correlated with vertical position, since, in the absence of position bias treatment, the production model was learning to predict Biased CVR as opposed to True CVR.
Encouraged by these positive early signals, we launched an experiment comparing our newly built position-debiased CVR model against our existing production CVR model. The experiment confirmed our promising offline observations: the new model generated a much more relevant Uber Eats home feed, with users placing statistically significantly more orders from the home feed and resorting to the search functionality less often. Furthermore, this substantial increase in home feed orders translated into a large topline business impact. We observed a statistically significant increase in orders per user on our platform, indicating that the new model better understands what our users truly want and generates a more appetizing feed overall, leading to more orders from our users. Based on these findings, we rolled the model out to all users and concluded the experiment successfully.
In this blog post, we presented how we at Uber Eats understand position bias using the examination model, and how we estimate it by training a DL model with a position bias side tower. Removing the effect of position bias from our biased data helps our models more accurately estimate the true relevance, generating much more relevant recommendations for our users, and increasing user engagement.
During this effort, we primarily focused on position bias as a leading source of bias in our training data. However, we underline that even though position bias is one of the most significant sources of bias in our ranking model, it is not the only source. In particular, the neighboring bias and the selection bias are two other examples of biases we believe our model suffers from. In the future, we plan to continue enhancing our users’ content discovery experience by tackling these and other bias problems.
[1] Wang, Xuanhui, et al. “Learning to rank with selection bias in personal search.” Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016.
[2] Joachims, Thorsten, et al. “Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search.” ACM Transactions on Information Systems (TOIS) 25.2 (2007): 7-es.
[3] Sinha, Ayan, David F. Gleich, and Karthik Ramani. “Deconvolving feedback loops in recommender systems.” Advances in Neural Information Processing Systems 29 (2016).
[4] Klöckner, K., et al. “Depth- and breadth-first processing of search result lists.” CHI ’04 Extended Abstracts on Human Factors in Computing Systems, ACM, 2004.
[5] Chen, Jiawei, et al. “Bias and debias in recommender system: A survey and future directions.” ACM Transactions on Information Systems 41.3 (2023): 1-39.
[6] Wang, Xuanhui, et al. “Learning to rank with selection bias in personal search.” Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016, pp. 115-124.
[7] Joachims, Thorsten, Adith Swaminathan, and Tobias Schnabel. “Unbiased learning-to-rank with biased feedback.” Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017, pp. 781-789.
[8] Guo, Huifeng, et al. “PAL: A position-bias aware learning framework for CTR prediction in live recommender systems.” Proceedings of the 13th ACM Conference on Recommender Systems, 2019, pp. 452-456.
[9] Gu, Yulong, et al. “Deep multifaceted transformers for multi-objective ranking in large-scale e-commerce recommender systems.” Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 2493-2500.
The cover image was created by and is credited to another party and is obtained from https://openverse.org/, which indicates the creator. It is licensed under CC BY 2.0. No changes have been made.