How and why we built a custom gradient boosted tree package
To make fast and accurate travel-time predictions, Lyft built a gradient boosted tree (GBT) package from the ground up. It is slower to train than off-the-shelf packages, but it can be customized to handle space and time more effectively and to yield less volatile predictions.
Image by ejaugsburg from Pixabay
Introduction
Machine learning runs at the core of what we do at Lyft. Examples include predicting travel time between two locations, modeling the probability of a ride being canceled, forecasting supply and demand, and many more. These models enable us to match riders and drivers more efficiently, incentivize drivers to be where they can get more rides, and improve the ride experience. Together, they allow us to provide better service to our customers at a lower price.
One important machine learning model we run at Lyft predicts travel time. In this blog post, I describe how my team developed a custom gradient boosted decision tree package to improve the accuracy and efficiency of this model. We’ll also learn a bit about gradient boosted decision trees and how to use location data in your machine learning models.
Location, location, location!
One of the key features used in many of our models at Lyft is location (for example, passenger location, driver location, or rider destination). In the travel time estimation model, the main features are the time, start location, and end location of the ride. Location is typically represented by two float values, latitude and longitude, which can be fed directly into a model. However, doing so in models such as logistic regression, Gaussian processes, or neural networks typically leads to suboptimal results, because these models find it hard to extract the key information hidden in raw latitudes and longitudes. For Lyft, that information could be the neighborhood or venue the request is coming from; to describe these you would have to split space into regions of very different shapes and sizes (think of rural vs. urban neighborhoods). Neural networks trained on batches of data seem to struggle to find fine-grained spatial splits that are relevant to only a small subset of the data. To overcome this, it’s better to cluster locations in a way that makes sense for your problem and provide the network or linear model with the cluster ID as a feature (common space partitionings we use at Lyft are geohashes and S2 cells). However, these clusters are not tailored to the problem at hand, and if you need space to be covered by clusters of very different granularities, this approach becomes very tricky.
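To illustrate, here is a minimal sketch of turning raw coordinates into such cluster IDs. The s2sphere and pygeohash packages, and the specific geohash precision and S2 level, are illustrative choices rather than the exact ones we use in production.

```python
import pygeohash as pgh
from s2sphere import CellId, LatLng

def location_features(lat: float, lng: float) -> dict:
    """Map a (lat, lng) pair to fixed-granularity cluster IDs."""
    s2_cell = CellId.from_lat_lng(LatLng.from_degrees(lat, lng)).parent(15)
    return {
        "geohash7": pgh.encode(lat, lng, precision=7),  # cells of roughly 150m x 150m
        "s2_level15": s2_cell.to_token(),               # S2 cell roughly 300m across
    }

# Example: a pickup location in downtown San Francisco.
print(location_features(37.7749, -122.4194))
```

The drawback described above is visible in this sketch: the granularity (precision 7, level 15) is fixed globally, so one cell size has to fit both rural and urban areas.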
What are GBTs and why use them?
In order to overcome this problem and take advantage of the spatial features in our predictive models, we typically use gradient boosted decision trees (GBTs) at Lyft. A GBT is a powerful model that can learn how to split space (as well as other features) very efficiently based on the data it’s trained on. A GBT is an ensemble of simpler models called decision trees (or regression trees for regression problems). A decision tree splits the dataset into finer and finer subsets (called leaves) based on the provided features, and then fits a constant that minimizes the loss function for the samples in each leaf (see Figure 1 below). Unlike neural networks, decision trees are not trained on batches of data but on the entire dataset at once, which allows them to perform accurate, fine-grained splits of the data. Training a GBT is essentially a gradient descent process: at each step we fit a decision tree to estimate the gradient of the loss function, and then advance our estimate of the target by the prediction of that tree multiplied by a learning rate. The resulting model is a sum of decision trees, meaning that the final prediction is the sum of the predictions from each tree multiplied by the learning rate. If you’re interested, there are many online resources about GBTs; I’ve found Jason Brownlee’s post super helpful and full of useful links for further reading.
Figure 1: Decision tree training and concepts
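To make the boosting loop above concrete, here is a toy sketch that uses squared-error loss (whose negative gradient is simply the residual) and scikit-learn regression trees as the weak learners. It illustrates the idea only and is not taken from any production code.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbt(X, y, n_trees=100, learning_rate=0.1, max_depth=4):
    base = y.mean()                           # start from a constant estimate
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - prediction             # negative gradient of the L2 loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                 # each tree approximates one gradient step
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_gbt(base, trees, X, learning_rate=0.1):
    # The final prediction is the base value plus the sum of the tree
    # predictions, each scaled by the learning rate.
    return base + learning_rate * sum(tree.predict(X) for tree in trees)
```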
When using GBTs, you can simply provide the location coordinates to the model with no preprocessing and get decent results. With modern implementations, such as LightGBM, XGBoost, or CatBoost, you can train GBTs very quickly on very large datasets, and unlike neural networks they perform well with very little tuning. On the other hand, GBTs struggle with very high cardinality features such as user ID, and they lack an equivalent of the transfer learning that makes neural networks such a powerful tool in fields like NLP and computer vision. Nonetheless, their ability to split space efficiently and the little tuning they require make them the go-to model for many teams at Lyft.
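As an example of how little preprocessing this needs, here is a minimal LightGBM sketch; the file name, column names, and hyperparameters are hypothetical.

```python
import lightgbm as lgb
import pandas as pd

df = pd.read_parquet("rides.parquet")  # hypothetical training data
features = ["src_lat", "src_lng", "dst_lat", "dst_lng", "hour_of_week"]

# Raw latitudes and longitudes go in as-is; the trees learn the spatial splits.
model = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, num_leaves=63)
model.fit(df[features], df["travel_time_seconds"])
eta_seconds = model.predict(df[features].head())
```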
My team tested both neural networks and GBTs for estimating travel time. For the reasons mentioned above, the neural network model underperformed the GBT model, and we ended up using LightGBM. After shipping the initial model, we immediately thought of ways to improve its accuracy. This was crucial for us because travel time is a key input to every matching and pricing decision our system makes. When you make a ride request, Lyft prices the ride based on the estimated travel time to your destination and then matches you with a driver based on the estimated travel time of nearby drivers to your location. This means that every 1% of travel time estimation accuracy (typically measured in mean absolute error) has a tremendous impact on the efficiency of Lyft’s marketplace and the quality of the user experience we provide. When working with GBTs, we realized that existing packages lack a few features we deem desirable for tackling spatio-temporal problems (see below). With this realization, we embarked on a journey to create our own in-house GBT framework, which we call GeoBoost.
GeoBoost - a custom GBT library for spatial and temporal features
GeoBoost is a Python-based library, with a set of Cython functions for fast training and inference. The key feature of this library is its modular structure of splitters and predictors. A splitter is a class that splits the samples in a branch of the tree into two leaves. Typically, GBT libraries split continuous features by choosing a float threshold and categorical features by choosing a single class. By generalizing the splitter class we could easily add different kinds of splits better suited to spatio-temporal data, such as diagonal or Gaussian-based splits for space and periodicity-aware splits for time-of-day or day-of-week features (see Figure 2). A predictor is a class that fits the value returned by each leaf by minimizing the loss function over the samples in that leaf. By generalizing this class we could easily test fitting a linear function of some variables instead of a constant.
Figure 2: Spatial and temporal splits in GeoBoost vs. standard GBT packages
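To make the splitter abstraction more concrete, here is a hypothetical sketch of a diagonal split and a periodicity-aware split; GeoBoost’s actual classes and signatures are internal and certainly differ in detail.

```python
import numpy as np

class DiagonalSplit:
    """Split on a linear combination of two coordinates (e.g. lat and lng),
    so the decision boundary can be a diagonal line rather than axis-aligned."""
    def __init__(self, direction, threshold):
        self.direction = np.asarray(direction)   # e.g. [0.7, 0.7]
        self.threshold = threshold

    def goes_left(self, X):
        # X has one column per coordinate used by this split.
        return X @ self.direction <= self.threshold

class PeriodicSplit:
    """Periodicity-aware split for features like hour of day: the 'left'
    branch is an interval on a circle, so 22:00-02:00 is a single region."""
    def __init__(self, start, end, period=24.0):
        self.start, self.end, self.period = start, end, period

    def goes_left(self, x):
        x = np.asarray(x) % self.period
        if self.start <= self.end:
            return (x >= self.start) & (x < self.end)
        return (x >= self.start) | (x < self.end)   # interval wraps around midnight

# Example: a night-time split keeps 11pm and 1am in the same leaf,
# which a single threshold on the hour value cannot do.
night = PeriodicSplit(start=22, end=2)
print(night.goes_left([23.5, 1.0, 12.0]))   # [ True  True False]
```

In a full implementation, training would search over candidate directions, thresholds, and intervals and score each by the reduction in loss, just as standard GBT libraries do for axis-aligned thresholds.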
Thanks to this modular structure we’ve been able to add several features that don’t exist in the standard GBT libraries:
- Special splits for space and time: As described above, using the modular splitter class, the GBT is able to perform more accurate splits of the spatial and temporal features. Thanks to these, the model is able to converge faster (with fewer trees) and obtain higher accuracy.
- Fuzzy splits: We added the ability to split between the two branches in a fuzzy manner. This means that samples that lie close to the splitting threshold are sent to both branches with two different weights that sum to 1 (sketched after this list). This feature dramatically reduced the variance in the model’s predictions when the start and end locations are perturbed slightly. This kind of spatial volatility can cause our dispatch system to make suboptimal decisions, such as swapping the driver assigned to a ride unnecessarily. The fuzzy splits decreased the spatial volatility by 80% (measured by the standard deviation of the predictions).
- Learning a linear function at every leaf: In our case, it made sense to learn a linear function of some variables at every leaf (also sketched after this list). This was achieved using a special predictor class, and enabled our GBT to converge faster (using fewer trees) and yield higher accuracy.
- Efficient treatment of high-cardinality features: We added a special splitter that handles high-cardinality features, like the S2 cells (trapezoid-like, roughly fixed-size partitions of space) that the predicted route passes through. In a typical GBT architecture this would require adding many columns to the dataset; using a special splitter class, we’ve been able to use the list of cell IDs directly in a sparse manner.
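Below is a sketch of the fuzzy-split idea from the list above: a sample near the threshold contributes to both branches, with weights that sum to 1, so the prediction varies smoothly as a feature crosses the threshold. The linear ramp used for the weights is an illustrative choice, not necessarily what GeoBoost implements.

```python
import numpy as np

def fuzzy_weights(feature_value, threshold, width):
    """Return (left_weight, right_weight) per sample; the weights sum to 1.

    Far below the threshold the sample goes fully left, far above it fully
    right; within +/- width/2 of the threshold it is shared between branches.
    """
    t = np.clip((feature_value - threshold) / width + 0.5, 0.0, 1.0)
    return 1.0 - t, t

def fuzzy_predict(x, threshold, width, left_value, right_value):
    # At inference time the two leaf values are blended by the branch weights.
    w_left, w_right = fuzzy_weights(x, threshold, width)
    return w_left * left_value + w_right * right_value

# A small perturbation of the input now causes a small change in the output,
# instead of a jump from left_value to right_value at the threshold.
print(fuzzy_predict(np.array([0.49, 0.50, 0.51]), threshold=0.5, width=0.1,
                    left_value=300.0, right_value=360.0))   # [324. 330. 336.]
```

And here is a sketch of fitting a linear function, rather than a constant, in a leaf. With squared-error loss this reduces to an ordinary least-squares fit on the residuals of the samples that fall into the leaf; the use of numpy’s lstsq is an illustrative choice rather than GeoBoost’s actual implementation.

```python
import numpy as np

def fit_linear_leaf(X_leaf, residuals_leaf):
    """Fit residual ~= X_leaf @ coef + intercept for the samples in one leaf."""
    A = np.hstack([X_leaf, np.ones((len(X_leaf), 1))])   # add an intercept column
    coef, *_ = np.linalg.lstsq(A, residuals_leaf, rcond=None)
    return coef

def predict_linear_leaf(coef, X):
    A = np.hstack([X, np.ones((len(X), 1))])
    return A @ coef

# Example: within a leaf, travel time may still grow with trip distance, which
# a constant leaf value cannot capture but a linear term can.
rng = np.random.default_rng(0)
distance = rng.uniform(1, 10, size=(200, 1))
residuals = 60.0 * distance[:, 0] + rng.normal(0.0, 5.0, size=200)
print(fit_linear_leaf(distance, residuals))   # approximately [60, 0]
```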
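Both sketches are simplified to a single feature and a single leaf; inside a real GBT these pieces would be applied at every node of every tree during training and inference.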
It is worth noting that, as a result of using more sophisticated splitters and predictors, GeoBoost trains more slowly than XGBoost and LightGBM, which are highly optimized to train quickly on very large datasets. In our case, training time wasn’t a big factor, and our datasets were limited to millions of rows and tens of features.
You can see the impact of adding Gaussian and diagonal splits in space in Figure 3. In this figure we trained a GBT to predict travel time in the New York metropolitan area using very basic features (source latitude and longitude, destination latitude and longitude, hour of week, and hour of day). We trained both LightGBM and GeoBoost on a small set of 5 trees and examined the spatial splits obtained on the destination coordinates. As you can see in the figure, GeoBoost makes more meaningful splits that are better aligned with the map information. This is reflected in Figure 4, where we measure the mean absolute error (MAE) of the travel time prediction as we continue the training to 100 trees. GeoBoost converges much faster, which indicates that the more complex splits are useful in reducing the error. When doing a proper parameter search with a validation set, we find that GeoBoost reaches 1% lower RMSE (when trained on L2 loss). When combined with the other new features we considered, such as fuzzy splits and fitting a linear regression model in some of the leaves, we were able to push this gap a little further while reducing the size of the models by a factor of 4.
Figure 3: Spatial splits of the destination location obtained from a 5-tree GBT in LightGBM and GeoBoost, when trained with the same parameters to predict travel time in the New York metro. The colors are assigned randomly to each leaf.
Figure 4: Travel time MAE (divided by the minimum obtained value) for LightGBM and GeoBoost, when trained with the same parameters to predict travel time in the New York metro.
In addition to working on the Python/Cython library, we created a Golang-based serving library to serve the model artifacts trained in Python. This enabled us to serve the models we trained as part of our Golang-based service, yielding an almost 10x speed-up compared with running the machine learning models in a separate Python service and calling it from the Golang service. This, together with the reduction in the number of trees, helped us reduce the cost of serving our models.
What’s next?
To summarize, we’ve created a modular GBT library, called GeoBoost, to improve our travel time predictions. By customizing the splitters and predictors for spatio-temporal problems, we’ve achieved an increase in accuracy, a reduction in the spatial volatility of the predictions, and a reduction in serving costs. The next step for my team will be to extend GeoBoost to other use cases within Lyft, which might require more unique features.
If you find this interesting, don’t hesitate to reach out via email.
If you want to join us at Lyft and help push the boundaries of machine learning in ride-sharing and transportation, check out our job posting or university program pages.