EXPEDIA GROUP TECHNOLOGY — DATA

Underfitting and Decision Trees

Understanding model complexity

Ben Dundee
Expedia Group Technology
Nov 13, 2020 · 10 min read

Photo by Jeremy Bishop on Unsplash

In this article, I’d like to present two very simple examples of models based on decision trees. These two examples helped my team understand model complexity in tree models, and helped us drive a nice A/B test winner for the business. I hope that you’ll get some insight into what day-to-day work is like as a data scientist, while gaining some intuition about the way that tree models work.

My team owns the core ranking algorithm at Vrbo, part of Expedia Group™. (If you don’t know Vrbo, we’re a vacation rental marketplace that helps families travel better together.) Our current production system serves millions of guests and helps generate billions of dollars in revenue. If that wasn’t cool enough, we get to help people go on vacation, which may be more important now than ever. Obligatory plug: we’re hiring and you should totally apply.

Our current models are based on LambdaMART, as implemented by LightGBM. If you don’t know about the open source gradient boosted decision tree (GBDT) libraries, LightGBM, xgboost, and CatBoost are all highly optimized and used in production systems at scale around the world. If you’re new to the discipline of “learning to rank”, this paper by Chris Burges is a fantastic primer.

At Vrbo, we believe that families and groups are generally under-served by our competitors, and that the flexibility and space offered by vacation rentals make them a good match for these groups. So it was a big surprise to us when a model update focusing specifically on this use case turned out to be conversion negative in an A/B test.

The model we tested introduced a bunch of new features that we validated with guests, so it should have been an easy win. In diagnosing the problem, we began to suspect underfitting, and we constructed a simple example that showed this explicitly. Our hypothesis was that we weren’t using deep enough trees, and as a result weren’t fully capturing important interactions.

Let’s go fishing

I grew up in Galveston, Texas, and I spent a lot of time fishing. The best days to catch fish were sunny and calm. If I want to go fishing when it’s sunny AND calm, I need to answer yes to two questions. Of the four states in the universe (sunny/calm, sunny/windy, cloudy/calm, cloudy/windy), only one leads to a “go fishing” decision. Hopefully you can see that this is a decision tree:

A decision tree to work out if I should go fishing
Who am I kidding? The correct decision is always “go fishing”.
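
As code, the decision is just two nested questions. Here’s a toy sketch (the function and argument names are my own, made up for illustration):

def should_go_fishing(is_sunny: bool, is_calm: bool) -> bool:
    # Only the sunny AND calm branch leads to "go fishing".
    if is_sunny:
        if is_calm:
            return True
        return False  # sunny but windy
    return False      # cloudy, calm or not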

Now imagine we want to model this decision. We’d build a data set and invoke our tree models. But what should the trees look like? Suppose we only ever allowed the trees to split once by setting the appropriate hyperparameters. The system that we’re trying to model has two questions (Is it sunny? Is it calm?), but the model we’re using only admits a single decision point. We’ll never be able to model this decision because the model complexity doesn’t admit an interaction between the “is sunny” and “is calm” features — textbook underfitting!

A tree with one split cannot model a system with two interacting features.

A less simple example

If you’re not familiar with learning to rank, the main point is that we want to build a model that scores more relevant examples (evidenced in our dataset by guest interactions, or clicks) higher than less relevant examples. If we order the examples by their scores (descending), the most relevant things should end up at the top of the list. To make this more rigorous, we could also compute a ranking metric like MRR, MAP, or NDCG. I’ll leave the calculation of these metrics on this data as an exercise for the reader.
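
If you want a starting point for that exercise, here is a minimal sketch of reciprocal rank, the per-query ingredient of MRR (the function name and the binary click label are my own conventions, not from any library):

def reciprocal_rank(labels_sorted_by_score):
    # 1 / (1-based position of the first clicked item); 0.0 if nothing was clicked.
    for rank, label in enumerate(labels_sorted_by_score, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0

print(reciprocal_rank([1, 0, 0]))  # perfect ranking -> 1.0
print(reciprocal_rank([0, 0, 1]))  # click in third place -> 0.333...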

In order to demonstrate this in a ranking problem, we’ll do the following:

  1. Start with a “known” decision tree;
  2. Generate a data set from this tree (no variance, to make it clean);
  3. Attempt to recover the decision tree using LightGBM.

The goal is to engineer a situation where features clearly interact, and see how the model performs as we experiment with hyperparameters.

Our fake data set will consist of three features. Two features (x1 and x2) are continuous but not independent, and one (x3) is categorical. The indicator y tells us if a click has occurred (y=1) or not (y=0). Our rules are:

IF x3 == 1:
* IF x1 > 1.0 THEN y = 1
* IF x1 <= 1.0 THEN y = 0
IF x3 == 0:
* IF x1 > 1.0 THEN y = 0
* IF x1 <= 1.0 THEN y = 1
For x2:
* IF x1 <= 1 THEN x2 >= 1
* IF x1 > 1 THEN x2 < 1
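
Written out as code, the labelling rule is an XOR-like interaction between x1 and x3. A quick sketch (x2 simply mirrors x1, so it carries the same information):

def true_label(x1, x3):
    # Neither x1 nor x3 alone determines y; only their combination does.
    if x3 == 1:
        return 1 if x1 > 1.0 else 0
    else:
        return 1 if x1 <= 1.0 else 0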

To generate a ranking dataset, we need a query identifier (qid), a position (corresponding to the order of results that are presented to a guest), some values for our x’s, and a label indicating the guest’s action (which is a proxy for relevance). In practice, this data comes from a set of server logs that report what was shown to a guest, and how they interacted.

The following Python code will generate a dataset with an arbitrary number of examples per query (NUM_DOCS) and queries (NUM_QUERIES):

from lightgbm import LGBMRanker, plot_tree
import pandas as pd
import numpy as np

NUM_QUERIES = 100000
NUM_DOCS = 3

def less_than_one():
    return np.random.random()

def grtr_than_one():
    return np.random.random() + 1.0

def generate_data(seed=42):
    if seed:
        np.random.seed(seed)
    data = []
    for i in range(NUM_QUERIES):
        cat_val = np.random.randint(0, 2)               # x_3: same for every item in the query
        click_pos = np.random.randint(1, NUM_DOCS + 1)  # position that receives the click
        # generate features for each position, following the rules above
        for j in range(1, NUM_DOCS + 1):
            if j == click_pos:
                if cat_val == 0:
                    data.append([i, j, 1, less_than_one(), grtr_than_one(), cat_val])
                else:
                    data.append([i, j, 1, grtr_than_one(), less_than_one(), cat_val])
            else:
                if cat_val == 0:
                    data.append([i, j, 0, grtr_than_one(), less_than_one(), cat_val])
                else:
                    data.append([i, j, 0, less_than_one(), grtr_than_one(), cat_val])
    return data

tr_data = generate_data()
df_train = pd.DataFrame(tr_data, columns=["qid", "pos", "y", "x_1", "x_2", "x_3"])

Running this code, you should generate a dataset looking something like this:

A picture showing the dataset result
A bona fide “learning to rank” dataset: query ID (qid) allows grouping by query, pos labels the position as presented to the guest, and y is an indicator of a click event. For example: The first query (qid = 0) presented a guest with three items, and the guest clicked on the item in position 2.
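
If you want to inspect the frame yourself (the exact feature values depend on the seed), something like this shows the first query:

print(df_train[df_train["qid"] == 0])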

We will generate four test examples:

# Test data: cover all four cases described above
# Expect: higher scores are rows 2, 3
te_data = [
    # x_1   x_2   x_3   expected action
    [0.77, 1.33, 1],  # no click
    [0.61, 1.72, 0],  # click
    [2.01, 0.01, 1],  # click
    [1.43, 0.77, 0],  # no click
]
df_test = pd.DataFrame(te_data, columns=["x_1", "x_2", "x_3"])

If the model we train scores examples 2 and 3 higher than 1 and 4, we will declare victory. We don’t actually care what the scores are; we only care that sorting by score (descending) yields a list with the best items (2 and 3) at the top.

Modeling

The two parameters we care about for this example are the number of trees (the number of gradient boosting rounds, n_estimators in the code below) and the number of splits per tree (which we control with max_depth).

_X = df_train[["x_1", "x_2", "x_3"]]
_qids = df_train["qid"]
_grps = df_train.groupby("qid").count()["pos"]
_y = df_train["y"]

NUM_TREES = 1
NUM_SPLITS = 1

ranker = (
    LGBMRanker(min_child_samples=1,
               n_estimators=NUM_TREES,
               max_depth=NUM_SPLITS)
    .fit(X=_X, y=_y, group=_grps,
         categorical_feature=["x_3"])
)

Now, how did we do?

ranker.predict(df_test)
>>> array([ 0.00011688, 0.00011688, -0.00287402, 0.00011688])

Recall that, by construction, we expect the two middle examples to have the highest scores, so these predictions are pretty lousy. What gives? Luckily, the trees are simple enough that we can look at them directly:

A picture showing the incorrect decision

The decision tree didn’t even get the decision boundary correct with the one feature it picked up. This result is robust to changing the seed and to using larger or smaller datasets. The model simply doesn’t have enough complexity to fit the data because, with only one split, it cannot possibly model the interaction.
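
The tree plots in this post were presumably produced with LightGBM’s plot_tree helper (it’s imported at the top of the data-generation snippet). A minimal sketch for reproducing them, assuming matplotlib and the graphviz package are installed:

import matplotlib.pyplot as plt

# Draw the first (and, here, only) tree in the fitted model.
plot_tree(ranker, tree_index=0)
plt.show()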

Improving the model: Changing the number of splits

A single split obviously can’t capture the complexity of the ground truth rules. The decision tree that we’re trying to model contains two decisions, so naively we might assume that setting NUM_SPLITS to 2 would be sufficient.
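
Since we’ll be refitting with several combinations of these two hyperparameters from here on, a small helper keeps the experiments tidy. This is just a sketch that reuses the training frames and df_test defined above; fit_and_score is my own name, not part of LightGBM:

def fit_and_score(num_trees, num_splits):
    # Refit the ranker with the given hyperparameters and score the test rows.
    ranker = (
        LGBMRanker(min_child_samples=1,
                   n_estimators=num_trees,
                   max_depth=num_splits)
        .fit(X=_X, y=_y, group=_grps,
             categorical_feature=["x_3"])
    )
    return ranker, ranker.predict(df_test)

# The configurations explored below:
_, scores_1x2 = fit_and_score(num_trees=1, num_splits=2)  # one tree, two splits
_, scores_1x3 = fit_and_score(num_trees=1, num_splits=3)  # one tree, three splits
_, scores_2x1 = fit_and_score(num_trees=2, num_splits=1)  # two trees, one split each
_, scores_2x2 = fit_and_score(num_trees=2, num_splits=2)  # two trees, two splits each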

A picture showing two splits in the decision tree
Two splits is not enough to capture the correct dynamics, even though it should be.

This doesn’t quite work! We see that the third example has the highest score (nice!), but the second and fourth examples are tied, so we’re not there yet:

>>> array([-0.00680034, 0.00642991, 0.2 , 0.00642991])

I believe this is due to the way LightGBM chooses splits: you can change the structure of the trees (but not their performance on the test data) by using more or less data, or by changing the random seed. If anyone has insight here, please comment below!

Two splits isn’t enough, so let’s try a third. This is more complexity than is theoretically needed, but the model now scores all documents correctly. Interestingly, the model has learned that it only needs one of the two continuous variables (x1 and x2). It also learns (eventually!) the correct decision boundaries:

A picture of a decision tree with 3 splits
The model with three splits eventually learns the correct decision boundaries. The first split is spurious, but does not generate a different decision boundary.

When we score our four test examples, we see that the second and third have the highest scores, meaning they’ll be ranked at the top of the list. If you’re calculating ranking metrics, you should find that they are all maximized when the examples are ordered such that 2 and 3 appear before either of 1 or 4.

This simple model captures the correct interaction between our categorical feature (x3) and one of the continuous features (x2), and it drops the collinear x1 as redundant. So…victory!

What about adding trees?

Let’s try changing the number of trees (LGBM calls them “estimators”). The first case to check is NUM_TREES = 2 and NUM_SPLITS = 1. I won’t post the picture here, but you’ll see that the model gets the decision boundary for the continuous variable correct, yet the predictions are still wrong!

>>> array([ 0.01926256, 0.01926256, -0.0140477 , -0.01962937])

We’re failing to capture the interaction between the continuous variables and the categorical variable and our predictions reflect that — the model simply does not have the complexity to capture this interaction. Regardless of the number of estimators we introduce, the model never discovers the interaction between x1/x2 and x3.

This is an important point: naively, I assumed that complexity ~ NUM_SPLITS x NUM_TREES, and this is wrong! If the number of splits can’t capture the nonlinear interaction between features, growing your forest isn’t going to help.
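
You can poke at this claim with the helper sketched earlier, for example by boosting a whole forest of single-split trees. Because an ensemble of one-split trees can only add up individual feature effects, the x1/x3 interaction should still never appear:

# Many stumps are still an additive model; per the argument above,
# rows 2 and 3 should still fail to come out on top.
_, scores_many_stumps = fit_and_score(num_trees=50, num_splits=1)
print(scores_many_stumps)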

The final case we’ll check is NUM_TREES = 2 and NUM_SPLITS = 2. In this case, we get the decision boundary correct, and we correctly capture the interaction with the categorical feature:

A picture of 2 decision trees with 2 splits each

As a result, ranking our test data by their model scores produces the optimal ranked list:

>>> array([-0.20314017, 0.20579197, 0.39647031, -0.19308841])

One thing that’s interesting to note here: the second decision tree is the “correct” one. The minimal model was buried in the actual model we built. This is analogous to what we saw above with NUM_SPLITS = 3 and NUM_TREES = 1. The correct decision rules were embedded in that tree, but weren’t quite as clear as they are here.

Wrapping up

What have we learned? For me, the first lesson (and one that I learn over and over again!) is the power of simple examples. Limiting the complexity in the data and decision rules allowed us to understand how the algorithm was treating features. To paraphrase Einstein — make things as simple as possible, but no simpler.

Second, giving the model enough complexity is important, and you can’t think of it as NUM_SPLITS x NUM_TREES. A model with 1 tree and 4 splits will absolutely behave differently from a model with 4 trees and 1 split. In fact, if you set NUM_SPLITS = 1 and NUM_TREES = 4, you’ll get 4 trees that split on the same feature. I think the correct intuition is: splits help you model interactions, adding trees helps you model those interactions more accurately.

As your feature set grows, it’s likely that there are more interactions than you can model in a single tree, and if you enable feature subsampling, LightGBM won’t even look at the full set of features for each tree. In that case, you will need to add trees, so that each new tree has a chance of discovering a new interaction. In practice, LightGBM has a heuristic (leaf-wise growth) that lets it determine the number of splits dynamically. This is what we used in our winning model, regularizing with the min_child_samples parameter until we had acceptable offline performance.
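
As a rough illustration of that pattern, the configuration looks something like the sketch below. The values here are LightGBM defaults plus guesses for illustration, not our production settings:

from lightgbm import LGBMRanker

ranker = LGBMRanker(
    n_estimators=200,       # enough trees to give interactions a chance to appear (guess)
    max_depth=-1,           # no hard per-tree split budget (the LightGBM default)
    num_leaves=31,          # leaf-wise growth cap (the LightGBM default)
    min_child_samples=100,  # regularization knob tuned against offline metrics (value is a guess)
)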

Third, if we set NUM_SPLITS = NUM_TREES = 2, the first tree doesn’t seem to change anything. The second tree is absolutely correct (given enough data). I’m not sure if this speaks to optimum values of NUM_SPLITS and NUM_TREES, but it’s worth calling out that we found two models that performed equally well on our test data. It’s also interesting to point out that the model had to be slightly more complex than the system it was modeling. Perhaps there’s a way to get LightGBM to split correctly the first time: there are parameters that let you force splits on features (forced_splits) and penalize splits (feature_penalty), which allow you to communicate your knowledge (or, more likely, your opinion!) to the model.

We didn’t look at overfitting in this example because our dataset had zero noise. It would be an interesting follow-up to build an example to see how overfitting works — I suspect that NUM_TREES and min_child_samples will be more important in that case.
