Building Airbnb Categories with ML and Human-in-the-Loop

By: Mihajlo Grbovic, Ying Xiao, Pratiksha Kadam, Aaron Yin, Pei Xiong, Dillon Davis, Aditya Mukherji, Kedar Bellare, Haowei Zhang, Shukun Yang, Chen Qian, Sebastien Dubois, Nate Ney, James Furnary, Mark Giangreco, Nate Rosenthal, Cole Baker, Bill Ulammandakh, Sid Reddy, Egor Pakhomov

Figure 1. Browsing listings by categories: Castles, Desert, Design, Beach & Countryside

25 Years of Online Travel Search

Online travel search hasn’t changed much in the last 25 years. The traveler enters her destination, dates, and the number of guests into a search interface, which dutifully returns a list of options that best meet the criteria. Eventually, Airbnb and other travel sites made improvements to allow for better filtering, ranking, personalization and, more recently, to display results slightly outside of the specified search parameters–for example, by accommodating flexible dates or by suggesting nearby locations. Taking a page from the travel agency model, these websites also built more “inspirational” browsing experiences that recommend popular destinations, showcasing these destinations with captivating imagery and inventory (think digital “catalog”).

Figure 2. Airbnb Destination Recommendation Example

The biggest shortcoming of these approaches is that the traveler must have a specific destination in mind. Even travelers who are flexible get funneled to a similar set of well-known destinations, reinforcing the cycle of mass tourism.

Introducing Airbnb Categories

In our recent release, we flipped the travel search experience on its head by having the inventory dictate the destinations, not the other way around. In this way, we sought to inspire the traveler to book unique stays in places they might not think to search for. By leading with our unique places to stay, grouped together into cohesive “categories”, we inspired our guests to find some incredible places to stay off the beaten path.

Figure 3. Unique, travel-worthy inventory in lesser-known destinations that users are unlikely to search for

Though our goal was an intuitive browsing experience, it required considerable work behind the scenes to pull this off. In this three-part series, we will pull back the curtain on the technical aspects of the Airbnb 2022 Summer Launch.

Part I (this post) is a high-level introduction to how we applied machine learning to build out the listing collections and to solve different tasks related to the browsing experience–specifically, quality estimation, photo selection and ranking.
Part II of the series focuses on ML Categorization of listings into categories. It explains the approach in more detail, including signals and labels that we used, tradeoffs we made, and how we set up a human-in-the-loop feedback system.
Part III focuses on ML Ranking of Categories depending on the search query. For example, we taught the model to show the Skiing category first for an Aspen, Colorado query versus Beach/Surfing for a Los Angeles query. That post will also cover our approach for ML Ranking of listings within each category.

Grouping Listings into Categories

Airbnb has thousands of unique, high-quality listings, many of which have received design and architecture awards or have been featured in travel magazines or movies. However, these listings are sometimes hard to discover because they are in a little-known town or because they are not ranked highly enough by the search algorithm, which optimizes for bookings. While these unique listings may not always be as bookable as others due to lower availability or higher price, they are great for inspiration and for helping guests discover hidden destinations where they may end up booking a stay influenced by the category.

To showcase these special listings we decided to group them into collections of homes organized by what makes them unique. The result was Airbnb Categories, collections of homes revolving around some common themes including the following:

Categories that revolve around a location or a place of interest (POI) such as Coastal, Lake, National Parks, Countryside, Tropical, Arctic, Desert, Islands, etc.
Categories that revolve around an activity such as Skiing, Surfing, Golfing, Camping, Wine tasting, Scuba, etc.
Categories that revolve around a home type such as Barns, Castles, Windmills, Houseboats, Cabins, Caves, Historical, etc.
Categories that revolve around a home amenity such as Amazing Pools, Chef’s Kitchen, Grand Pianos, Creative Spaces, etc.

We defined 56 categories and outlined the definition for each category. Now all that was left to do was to assign our entire catalog of listings to categories.

With the Summer launch just a few months away, we knew that we could not manually curate all the categories, as it would be very time consuming and costly. We also knew that we could not generate all the categories in a rule-based manner, as this approach would not be accurate enough. Finally, we knew we could not produce an accurate ML categorization model without a training set of human-generated labels. Given all of these limitations, we decided to combine the accuracy of human review with the scale of ML models to create a human-in-the-loop system for listing categorization and display.

Rule-Based Candidate Generation

Before we could build a trained ML model for assigning listings to categories, we had to rely on various listing- and geo-based signals to generate the initial set of candidates. We named this technique weighted sum of indicators. It consists of building out a set of signals (indicators) that associate a listing with a specific category. The more indicators the listing has, the better the chances of it belonging to that category.

Figure 4. Rule-based weighted sum of indicators approach to produce candidates for human review

For example, let’s consider a listing that is within 100 meters of a Lake POI, with the keyword “lakefront” mentioned in the listing title and guest reviews, lake views appearing in listing photos and several kayaking activities nearby. All this information together strongly indicates that the listing belongs to the Lakefront category. The weighted sum of these indicators adds up to a high score, which means that this listing-category pair would be a strong candidate for human review. When rule-based candidate generation produced a large set of candidates, we used this score to prioritize listings for human review and maximize the initial yield.
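
The scoring idea above can be sketched in a few lines of Python. This is a minimal illustration, not the production implementation: the indicator names, the weight values and the `prioritize_for_review` helper are all assumptions made up for this example.

```python
# Hypothetical indicator weights for the Lakefront category.
# Both the names and the values are illustrative assumptions.
LAKEFRONT_WEIGHTS = {
    "near_lake_poi": 3.0,        # listing is within 100 meters of a Lake POI
    "keyword_in_title": 2.0,     # "lakefront" appears in the listing title
    "keyword_in_reviews": 1.5,   # "lakefront" appears in guest reviews
    "lake_view_in_photos": 2.0,  # lake views detected in listing photos
    "kayaking_nearby": 1.0,      # kayaking activities close to the listing
}


def indicator_score(indicators: dict, weights: dict) -> float:
    """Weighted sum of the indicators that fired for a listing-category pair."""
    return sum(w for name, w in weights.items() if indicators.get(name, False))


def prioritize_for_review(candidates: dict, weights: dict) -> list:
    """Return (listing_id, score) pairs with a positive score, highest first."""
    scored = [(lid, indicator_score(ind, weights)) for lid, ind in candidates.items()]
    scored = [(lid, score) for lid, score in scored if score > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = {
    "listing_a": {name: True for name in LAKEFRONT_WEIGHTS},  # every indicator fires
    "listing_b": {"keyword_in_title": True},                  # a single weak signal
    "listing_c": {},                                          # no signal at all
}
review_queue = prioritize_for_review(candidates, LAKEFRONT_WEIGHTS)
```

Here `listing_a` lands at the front of the review queue, `listing_b` follows, and `listing_c` is never proposed as a Lakefront candidate.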

Human Review

The manual review of candidates consists of several tasks. Given a listing candidate for a particular category or several categories, an agent would:

Confirm/reject the category or categories assigned to the listing by comparing it to the category definition.
Pick the photo that best represents the category. Listings can belong to multiple categories, so it is sometimes appropriate to pick a different photo to serve as the cover image for different categories.
Determine the quality tier of the selected photo. Specifically, we defined four quality tiers: Most Inspiring, High Quality, Acceptable Quality, and Low Quality. We use this information to rank the higher quality listings near the top of the results to achieve the “wow” effect with prospective guests.
Add any POIs that are missing from our database. Some of the categories rely on signals related to Places of Interest (POI) data, such as the locations of lakes or national parks.

Candidate Expansion

Although the rule-based approach can generate many candidates for some categories, for others (e.g., Creative Spaces, Amazing Views) it may produce only a limited set of listings. In those cases, we turn to candidate expansion. One such technique leverages pre-trained listing embeddings. Once a human reviewer confirms that a listing belongs to a particular category, we can find similar listings via cosine similarity. Very often the 10 nearest neighbors are good candidates for the same category and can be sent for human review. We detailed one of the embedding approaches in our previous blog post and have developed new ones since then.

Figure 5. Listing similarity via embeddings can help find more listings that are from the same category

Other expansion techniques include keyword expansions, POI data expansions, etc.
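
The embedding-based expansion described above can be sketched as a nearest-neighbor lookup by cosine similarity. This is a toy illustration under stated assumptions: the two-dimensional vectors, the listing ids and the `expand_candidates` helper are hypothetical, and a real system would use the pre-trained listing embeddings with an efficient nearest-neighbor index rather than a full scan.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def expand_candidates(seed_id: str, embeddings: dict, k: int = 10) -> list:
    """Return the ids of the k listings most similar to a human-confirmed seed."""
    seed = embeddings[seed_id]
    scored = [
        (lid, cosine_similarity(seed, vec))
        for lid, vec in embeddings.items()
        if lid != seed_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [lid for lid, _ in scored[:k]]


# Toy embeddings; real listing embeddings have many more dimensions.
embeddings = {
    "confirmed_seed": [1.0, 0.0],
    "close": [0.9, 0.1],
    "unrelated": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}
neighbors = expand_candidates("confirmed_seed", embeddings, k=2)
```

The returned neighbors are only candidates: they still go to human review before being tagged with the category.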

Training ML Models

Once we collected enough human-generated labels, we trained a binary classification model that predicts whether or not a listing belongs to a specific category. We then used a holdout set to evaluate performance of the model using a precision-recall (PR) curve. Our goal here was to evaluate if the model was good enough to send highly confident listings directly to production.

Figure 6 shows a trained ML model for the Lakefront category. On the left we can see the feature importance graph, indicating which signals contribute most to the decision of whether or not a listing belongs to the Lakefront category. On the right we can see the holdout set PR curves of different model versions.

Figure 6. Lakefront ML model feature importance and performance evaluation

Sending confident listings to production: using a PR curve we can set a threshold that achieves 90% precision on a downsampled holdout set that mimics the true listing distribution. Then we can score all unlabeled listings and send the ones above that threshold to production, with the expectation that about 90% of them truly belong to the category. In this particular case, we can achieve 76% recall at 90% precision, meaning that with this technique we can expect to capture 76% of the true Lakefront listings in production.
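
To make the threshold selection concrete, here is a minimal pure-Python sketch that walks the holdout set from the highest score down and keeps the lowest threshold that still meets the target precision, which also maximizes recall at that precision. The function name and the toy data are assumptions for illustration; the post does not say which tooling was used to compute the PR curve.

```python
from typing import Optional, Sequence, Tuple


def threshold_for_precision(
    scores: Sequence[float],
    labels: Sequence[int],
    target_precision: float = 0.90,
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for the lowest threshold meeting the target.

    A listing is predicted positive when its score >= threshold.
    Returns None if no threshold reaches the target precision.
    """
    total_positives = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    true_positives = 0
    for count, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        # Tied scores share one threshold, so evaluate only at the end of a tie.
        if count < len(ranked) and ranked[count][0] == score:
            continue
        precision = true_positives / count
        recall = true_positives / total_positives if total_positives else 0.0
        if precision >= target_precision:
            best = (score, precision, recall)
    return best


# Toy holdout set: model scores and human labels (1 = belongs to the category).
holdout_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
holdout_labels = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
result = threshold_for_precision(holdout_scores, holdout_labels, target_precision=0.90)
```

On this toy data the selected threshold is 0.7, where precision is 1.0 and recall is 4 out of 6 positives. Lowering the threshold any further would let in enough false positives to drop precision below the target.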

Figure 7. Basic ML + Human in the Loop setup for tagging listings with categories

Selecting listings for human review: given the expectation of 76% recall, to cover the rest of the Lakefront listings we also need to send listings below the threshold for human evaluation. When prioritizing the below-threshold listings, we considered the photo quality score for the listing and the current coverage of the category to which the listing was tagged, among other factors. Once a human reviewer confirmed a listing’s category assignment, that tag would be made available to production. Concurrently, we send the tags back to our ML models for retraining, so that the models improve over time.
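
The split between production and human review can be sketched as a simple routing step. This is an assumption-laden illustration: the `ScoredListing` fields and the ordering rule (photo quality first, then model score) are hypothetical simplifications, and the actual prioritization also weighed category coverage and other factors.

```python
from typing import List, NamedTuple, Tuple


class ScoredListing(NamedTuple):
    listing_id: str
    model_score: float    # category classifier score
    photo_quality: float  # hypothetical photo quality score in [0, 1]


def route(listings: List[ScoredListing], threshold: float) -> Tuple[List[str], List[str]]:
    """Split listings into (sent to production, queued for human review)."""
    to_production = [item.listing_id for item in listings if item.model_score >= threshold]
    below = [item for item in listings if item.model_score < threshold]
    # Assumed priority: better photos first, then higher model score.
    below.sort(key=lambda item: (item.photo_quality, item.model_score), reverse=True)
    return to_production, [item.listing_id for item in below]


scored = [
    ScoredListing("a", 0.95, 0.5),
    ScoredListing("b", 0.60, 0.9),
    ScoredListing("c", 0.70, 0.4),
    ScoredListing("d", 0.20, 0.9),
]
production, review_queue = route(scored, threshold=0.9)
```

Listings confirmed from the review queue would then be released to production and added to the training set for the next retraining round.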

ML models for quality estimation and photo selection. In addition to the ML Categorization models described above, we also trained a Quality ML model that assigns one of the four quality tiers to the listing, as well as a Vision Transformer Cover Image ML model that chooses the listing photo that best represents the category. In the current implementation the Cover Image ML model takes the category information as the input signal, while the Quality ML model is a global model for all categories. The three ML models work together to assign category, quality and cover photo. Listings with these assigned attributes are sent directly into production under certain circumstances and also queued for review.

Figure 8. Human vs. ML flow to production

Two New Ranking Algorithms

The Airbnb Summer release introduced categories both to the homepage (Figure 9 left), where we show categories that are popular near you, and to location searches (Figure 9 right), where we show categories that are related to the searched destination. For example, in the case of a Lake Tahoe location search we show Skiing, Cabins, Lakefront, Lake House, etc., and Skiing should be shown first if searching in winter.

In both cases, this created a need for two new ranking algorithms:

Category Ranking (green arrow in Figure 9 left): rank categories from left to right by taking into account user origin, season, category popularity, inventory, bookings and user interests.
Listing Ranking (blue arrow in Figure 9 left): given all the listings assigned to the category, rank them from top to bottom by taking into account the assigned listing quality tier and whether a given listing was sent to production by humans or by ML models.

Figure 9. Listing Ranking Logic for Homepage and Location Category Experience
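
As a rough illustration of the Listing Ranking inputs described above, the sketch below sorts listings by quality tier and then by how they reached production. The tie-breaking rule (human-confirmed listings ahead of ML-tagged ones within a tier) and the dictionary field names are assumptions made for this example; the actual ranking logic is covered in Part III.

```python
# The four quality tiers from the human review task, best first.
TIER_ORDER = {
    "Most Inspiring": 0,
    "High Quality": 1,
    "Acceptable Quality": 2,
    "Low Quality": 3,
}


def rank_listings(listings: list) -> list:
    """Order listings by quality tier, then by source.

    Assumption for illustration: within a tier, human-confirmed listings
    are placed ahead of listings tagged only by the ML models.
    """
    return sorted(
        listings,
        key=lambda item: (TIER_ORDER[item["tier"]], 0 if item["source"] == "human" else 1),
    )


category_listings = [
    {"id": "w", "tier": "High Quality", "source": "ml"},
    {"id": "x", "tier": "Most Inspiring", "source": "ml"},
    {"id": "y", "tier": "High Quality", "source": "human"},
    {"id": "z", "tier": "Most Inspiring", "source": "human"},
]
ranked_ids = [item["id"] for item in rank_listings(category_listings)]
```

With this ordering the human-confirmed Most Inspiring listing appears first, followed by the ML-tagged one in the same tier, and then the High Quality listings in the same human-before-ML order.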

Putting it all together

To summarize, we presented how we create categories from scratch, first using rules that rely on listing signals and POIs and then with ML with humans in the loop to constantly improve the category. Figure 10 describes the end-to-end flow as it exists today.

Figure 10. Logic for Category Creation and Improvement over time

Our approach was to define an acceptable delivery; prototype several categories to an acceptable level; scale the rest of the categories to the same level; then revisit the acceptable delivery and improve the product over time.

In Part II, we’ll explain in greater detail the models that categorize listings into categories.

Acknowledgments

We would like to thank everyone involved in the project. Building Airbnb Categories holds a special place in our careers as one of those rare projects where people with different backgrounds and roles came together to work jointly to build something unique.

Interested in working at Airbnb? Check out our open roles here.