Airbnb’s AI-powered photo tour using Vision Transformer

By: Pei Xiong, Aaron Yin, Jian Zhang, Lifan Yang, Lu Zhang, Dean Chen

Introduction

In recent years, the integration of artificial intelligence with travel platforms has transformed how people search for and book accommodations. As a leading global marketplace for unique travel experiences and accommodations, Airbnb constantly strives to enhance the guest experience by providing informative content about the variety of homes shared by our hosts. One of the ways we help guests better understand what a listing offers before they book is through our AI-powered photo tour feature.

The AI-powered photo tour in the Listings tab, which helps hosts better organize their listing photos, uses a fine-tuned vision transformer to assess a diverse set of listing images and to identify and classify photos into specific rooms and spaces. In this blog post, we dive into the inner workings of the photo tour, including model selection, pretraining, fine-tuning techniques, and the trade-offs between computational cost and scalability. We also discuss how we improved model accuracy despite having limited training data.

Figure 1: Photo Tour product powered by ML

Methodology

Room Classification

Room-type classification is the first component of the photo tour. The goal is to accurately categorize images into the 16 room types defined in the Airbnb product, such as ‘Bedroom’, ‘Full bathroom’, ‘Half bathroom’, ‘Living room’, and ‘Kitchen’, giving users a comprehensive understanding of the available spaces. The challenge lies in the diversity of room layouts and lighting conditions, and in the need for models that generalize well across varied environments.

We conducted experiments with several state-of-the-art models, including Vision Transformer (ViT) variants (ViT-base and ViT-large, at different input resolutions). We also evaluated ConvNeXt V2, a recently proposed convolutional neural network with performance comparable to ViT, and MaxViT, a hybrid that combines the strengths of Vision Transformers and CNNs. At the beginning of this project, we tested these approaches on an image classification task with Airbnb’s host-provided data and found that ViT outperformed the others, so we chose ViT for the studies that follow.

Image Similarity

Another key component of the photo tour is image clustering, which groups the images of the same room into a cluster. A prerequisite is the ability to measure the similarity between two images, which indicates the probability that the two images belong to the same room. This is a supervised classification problem in which the input is a pair of images and the output is a binary label of 0 or 1. As shown in Figure 2, we employed a Siamese network that processes two images simultaneously by applying the same image embedding model to each image and then computing the cosine similarity of the resulting embeddings.

Figure 2: An illustration of Siamese network for image similarity
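The scoring step of the Siamese setup described above can be sketched in a few lines of Python. This is a minimal illustration, not the production model: the toy embeddings, the file names, and the helper names cosine_similarity and same_room_score are all assumptions, and a lookup table stands in for the shared ViT encoder.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for degenerate all-zero embeddings
    return dot / (norm_a * norm_b)


def same_room_score(embed: Callable[[str], Vector], image_a: str, image_b: str) -> float:
    """Siamese scoring: the SAME embedding function is applied to both inputs."""
    return cosine_similarity(embed(image_a), embed(image_b))


# Hypothetical stand-in for the shared image encoder: a table of toy embeddings.
toy_embeddings = {
    "bedroom_1_a.jpg": [0.9, 0.1, 0.0],
    "bedroom_1_b.jpg": [0.8, 0.2, 0.1],
    "kitchen.jpg": [0.0, 0.2, 0.9],
}
embed = toy_embeddings.__getitem__

score_same = same_room_score(embed, "bedroom_1_a.jpg", "bedroom_1_b.jpg")
score_diff = same_room_score(embed, "bedroom_1_a.jpg", "kitchen.jpg")
print(f"same room: {score_same:.3f}, different rooms: {score_diff:.3f}")
```

Because both images pass through one encoder, the score is symmetric in its inputs. A threshold on the score (or a loss trained against the 0/1 label) would then turn it into the same-room decision.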

Accuracy Improvement

Our analysis found that the volume of training data is key to higher prediction accuracy. Doubling the training data volume typically reduces the error rate by ≈5% on average, with the effect being more pronounced in the earlier stages.

Figure 3: Correlation between data volume and accuracy

Unfortunately, it is very expensive to acquire high-quality training data as it requires human labeling. Therefore, we needed to find other ways to improve model accuracy with a limited amount of training data. We followed these steps to improve model accuracy:

Step 1 - Pre-training: We started from a model pre-trained on ImageNet and continued training it on a large amount of host-provided data, which has lower label accuracy and covers only some of our class labels. This provided a baseline model for transfer learning in the following steps.

Step 2 - Multi-task training: We fine-tuned the model from the previous step using both higher-accuracy training data for the target task (e.g., room-type classification) and additional training data labeled for another related task (e.g., object detection). This provided additional training data and produced multiple different models for the following steps.

Step 3 - Ensemble learning: We built an ensemble from multiple Step 2 models, which were obtained by training with different auxiliary tasks and by using different versions of ViT (e.g., ViT-base vs. ViT-large, and/or input image sizes of 224 vs. 384). This approach gave us a diverse set of models, from which we selected the best performers to construct the final ensemble model.

Step 4 - Distillation: Although the ensemble model has higher accuracy than any individual model, it requires more computational resources and thus increases the latency and cost of our product. We trained a distilled model to imitate the behavior of the ensemble model; it achieves similar accuracy at a computational cost that is several-fold lower.

Pre-training and Traditional Fine-tuning

Our pretraining process used the vast repository of Airbnb listing photos, comprising millions of images, to train a Vision Transformer (ViT) model. While pretraining on Airbnb listing photos provides a substantial advantage, the dataset also has limitations. The human-labeled dataset contained inaccurate and mislabeled examples, which materially affected the model’s ability to discern patterns. Another notable limitation is that the pre-training dataset covers only four of the 16 room types.

Therefore, expanding the coverage of fine-tuning to include the additional classes was imperative. We developed detailed, updated labeling guidelines and generated a human-labeled dataset covering all 16 room types. Iterative fine-tuning gradually extended the model to all 16 room types, yielding a more comprehensive and versatile model.

Multi-task Learning

Acquiring high-quality human-labeled training data is a challenge because labeling is costly and time-consuming. Despite this, we had already accumulated a large repository of labeled data across various other tasks, including room-type classification, image quality prediction, same-room classification, category classification, and object detection. By fully utilizing this extensive and diversely labeled dataset, we significantly improved the prediction accuracy in our tasks. To achieve this, we implemented multi-task training that incorporates additional label classes from existing tasks, as demonstrated in Figure 4. Each learner is a vision transformer. Rather than predicting only a single set of labels, different learners also learn other label types, such as amenities and ImageNet21k labels, which further boosts overall performance, as shown in Table 1.

Figure 4: Multi-task learning illustration
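One way to picture the multi-task objective is as a weighted sum of per-task losses computed from task-specific heads on top of a shared backbone. The sketch below is a simplified illustration under stated assumptions: the task names, the weights, and the function multi_task_loss are hypothetical, every task is treated as single-label classification (amenity prediction in practice may well be multi-label), and the post does not specify how the task losses were actually combined.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], label: int) -> float:
    """Negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])


def multi_task_loss(
    head_logits: Dict[str, List[float]],
    labels: Dict[str, int],
    task_weights: Dict[str, float],
) -> float:
    """Weighted sum of per-task losses.

    Tasks with no label for this image are skipped, so images labeled
    for only one task can still contribute to training.
    """
    total = 0.0
    for task, logits in head_logits.items():
        if task in labels:
            total += task_weights[task] * cross_entropy(logits, labels[task])
    return total


# Hypothetical head outputs for one image (shared backbone, two heads).
head_logits = {"room_type": [2.0, 0.5, 0.1], "amenity": [0.2, 1.5]}
task_weights = {"room_type": 1.0, "amenity": 0.5}  # assumed weights

loss_both = multi_task_loss(head_logits, {"room_type": 0, "amenity": 1}, task_weights)
loss_room_only = multi_task_loss(head_logits, {"room_type": 0}, task_weights)
print(f"both tasks: {loss_both:.4f}, room type only: {loss_room_only:.4f}")
```

Skipping unlabeled tasks is what lets data labeled for a different task act as extra training signal for the shared backbone.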

Ensemble Learning

Ensemble learning is a powerful technique in machine learning that leverages diverse models with similar accuracies to achieve better accuracy and generalization.

We applied ensemble learning on diverse models with different architectures, model sizes, and auxiliary tasks such as amenities and ImageNet21k class predictions. Upon aggregating the predictions of the individual models, we observed a notable increase in the overall accuracy compared to any single model. The observed improvement is credited to the ensemble’s capability to address and reduce both misclassifications and inaccuracies of individual models, leading to more accurate predictions, despite the limited human-labeled training data.
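As an illustration of aggregating predictions, the sketch below averages the class probabilities of several models and picks the highest-scoring class. Averaging is an assumption made here for simplicity, since the post does not say which aggregation rule was used. The class names, the probabilities, and the function names are likewise hypothetical.

```python
from typing import List


def average_ensemble(per_model_probs: List[List[float]]) -> List[float]:
    """Average class probabilities across models (one row per model)."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [
        sum(probs[c] for probs in per_model_probs) / n_models
        for c in range(n_classes)
    ]


def predict(probs: List[float]) -> int:
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


classes = ["Bedroom", "Living room", "Kitchen"]

# Hypothetical outputs of three models for the same photo.
per_model_probs = [
    [0.55, 0.40, 0.05],  # model A votes Bedroom
    [0.35, 0.55, 0.10],  # model B votes Living room
    [0.50, 0.35, 0.15],  # model C votes Bedroom
]

ensemble_probs = average_ensemble(per_model_probs)
ensemble_class = classes[predict(ensemble_probs)]
print(ensemble_class, [round(p, 3) for p in ensemble_probs])
```

In this toy case one model disagrees with the other two, and the averaged distribution still favors the class that most models support, which is the error-cancelling effect the paragraph above describes.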

Knowledge Distillation

While ensemble learning offers substantial gains in accuracy, it requires more computational resources because multiple large models are involved in each inference task. To prioritize model efficiency without compromising performance, we turned to knowledge distillation, a technique for transferring knowledge from a sophisticated ensemble of models to a more compact single model.

Our distillation process transfers the knowledge encoded in both the hard targets and the soft targets of a complex ensemble to a smaller, simpler model. Hard targets are the ground-truth labels, while soft targets are the ensemble’s probabilistic predictions, which enable the smaller model to capture the nuanced decision boundaries learned by the ensemble. The overall training objective is a weighted combination of two losses. The first is the cross-entropy loss computed against the hard targets. The second is the Kullback-Leibler divergence between the soft targets from the ensemble and the predictions of the student model. A distillation coefficient determines the weight assigned to the distillation loss.
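A common way to write such an objective, assumed here rather than taken from the post, is loss = (1 - alpha) * CE(hard target, student) + alpha * KL(soft target || student), where alpha plays the role of the distillation coefficient. The sketch below implements that assumed form for a single example. The function names and the numbers are hypothetical, and temperature scaling, which is often used in distillation, is left out for brevity.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_probs: List[float], hard_label: int) -> float:
    """Loss against the ground-truth (hard) target."""
    return -math.log(student_probs[hard_label])


def kl_divergence(teacher_probs: List[float], student_probs: List[float]) -> float:
    """KL(teacher || student); terms with zero teacher mass contribute nothing."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )


def distillation_loss(
    student_logits: List[float],
    hard_label: int,
    teacher_probs: List[float],
    alpha: float,
) -> float:
    """Assumed form: (1 - alpha) * CE(hard) + alpha * KL(teacher || student)."""
    student_probs = softmax(student_logits)
    hard_loss = cross_entropy(student_probs, hard_label)
    soft_loss = kl_divergence(teacher_probs, student_probs)
    return (1.0 - alpha) * hard_loss + alpha * soft_loss


# Hypothetical single training example.
student_logits = [1.0, 2.0, 0.5]
teacher_probs = [0.2, 0.7, 0.1]  # soft target from the ensemble
loss = distillation_loss(student_logits, hard_label=1, teacher_probs=teacher_probs, alpha=0.5)
print(f"distillation loss: {loss:.4f}")
```

With alpha set to 0 the student trains only on ground-truth labels, and with alpha set to 1 it only imitates the ensemble; values in between trade off the two signals.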

Remarkably, our distilled model achieved performance metrics on par with the ensemble models, despite its significantly reduced inference time and resource requirements. This outcome demonstrates the efficacy of knowledge distillation in preserving the ensemble’s collective intelligence within a more streamlined model.

Golden Evaluation

As part of the preparations for the launch of our end-to-end Photo Tour, we employed a rigorous evaluation process called “Golden Evaluation”. It mimics the actual user experience by calculating the minimum number of changes required to make the Photo Tour generated by our model identical to the human-labeled ground truth. In contrast to the training data, which is evenly distributed across classes, the golden evaluation operates at the Airbnb listing level, aiming to replicate the user’s perspective. We sampled listings, each containing an average of 25–30 photos, and measured accuracy by the minimum number of corrections required to make the assignments consistent with the human labels. A correction is a change in room assignment, in which a photo’s predicted room is modified to match the consensus room label provided by multiple human labelers. For example, if a photo of bedroom 1 is falsely assigned to the living room, one correction is required to move it from the living room to bedroom 1.

There are photos that cannot be properly assigned to a named space. We classified miscellaneous photos, including close-up shots, images containing humans or animals, and photos of nearby shopping areas, restaurants, and parks, into a category labeled “Others”. Furthermore, if a photo shows an empty part of a room such that its room location cannot be judged, it may be designated “Unassigned”; such photos do not count in the accuracy calculation. This scenario occurs infrequently (as shown in Table 3) and is primarily used to let users decide in the most ambiguous cases. This evaluation served as the final launch criterion. Ultimately, we reduced the error rate to 5.28%, passing Airbnb’s internal evaluation standard, and Photo Tour was launched as a showcase feature in the November 2023 product launch.
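The correction counting behind the Golden Evaluation can be sketched as follows. This is a simplified illustration under several assumptions: room names in the prediction are taken to be already aligned with the ground truth (so "Bedroom 1" means the same room on both sides), "Unassigned" is assumed to appear on the prediction side, and the error rate is assumed to be corrections divided by counted photos, which the post does not state explicitly. The function name and the sample listing are hypothetical.

```python
from typing import Dict, Tuple

UNASSIGNED = "Unassigned"


def listing_corrections(
    predicted: Dict[str, str], ground_truth: Dict[str, str]
) -> Tuple[int, int]:
    """Return (corrections, counted_photos) for one listing.

    A correction is one photo whose predicted room differs from the
    consensus human label. Photos predicted as "Unassigned" are left
    for the user to place and are excluded from the count.
    """
    corrections = 0
    counted = 0
    for photo_id, true_room in ground_truth.items():
        predicted_room = predicted[photo_id]
        if predicted_room == UNASSIGNED:
            continue
        counted += 1
        if predicted_room != true_room:
            corrections += 1
    return corrections, counted


# Hypothetical listing with five photos.
ground_truth = {
    "p1": "Bedroom 1",
    "p2": "Kitchen",
    "p3": "Living room",
    "p4": "Others",
    "p5": "Living room",
}
predicted = {
    "p1": "Living room",  # wrong: one correction needed
    "p2": "Kitchen",
    "p3": UNASSIGNED,     # excluded from the calculation
    "p4": "Others",
    "p5": "Living room",
}

corrections, counted = listing_corrections(predicted, ground_truth)
error_rate = corrections / counted
print(f"{corrections} correction(s) over {counted} counted photos: {error_rate:.2%}")
```

Summing corrections and counted photos over all sampled listings before dividing would give a listing-level aggregate comparable in spirit to the reported error rate.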

Conclusion

Our exploration of using Vision Transformers to improve our photo tour product has been successful and rewarding. By incorporating pretraining, multi-task learning, ensemble learning, and knowledge distillation, we’ve significantly enhanced model accuracy. Pretraining provided a strong foundation, while multi-task learning enriched the model’s ability to interpret diverse visuals. Ensemble learning combined model strengths for robust predictions, and knowledge distillation enabled efficient deployment without sacrificing accuracy.

The AI-powered photo tour was launched as part of Airbnb’s 2023 Winter Release. Since then, we have been diligently monitoring the performance of this product and continue to refine our models further for an even more seamless user experience.

Acknowledgments

We would like to thank everyone involved in the project. A special thanks to the entire Airbnb user, listing, and platform team for their relentless efforts in developing and launching the product, ensuring its continued excellence. Additionally, we extend our gratitude to the Airbnb Machine Learning Infra team for their crucial support in building a robust infrastructure that photo tour relies upon.

If this type of work interests you, check out some of our related roles!