Visual Discovery at Pinterest
Andrew Zhai¹*, Dmitry Kislyuk¹*, Yushi Jing¹*, Michael Feng¹,
Eric Tzeng¹,², Jeff Donahue¹,², Yue Li Du¹, Trevor Darrell²
¹ Visual Discovery, Pinterest
² University of California, Berkeley
{andrew,dkislyuk,jing,m,etzeng,jdonahue,shirleydu}@pinterest.com
trevor@eecs.berkeley.edu
ABSTRACT
Over the past three years Pinterest has experimented with
several visual search and recommendation services, includ-
ing Related Pins (2014), Similar Looks (2015), Flashlight
(2016) and Lens (2017). This paper presents an overview
of our visual discovery engine powering these services, and
shares the rationales behind our technical and product de-
cisions such as the use of object detection and interactive
user interfaces. We conclude that this visual discovery en-
gine significantly improves engagement in both search and
recommendation tasks.
Keywords
visual search, recommendation systems, convnets, object de-
tection
1. INTRODUCTION
Visual search and recommendations [5], collectively re-
ferred to as visual discovery in this paper, is a growing re-
search area driven by the explosive growth of online pho-
tos and videos. Commercial visual search systems such as
Google Goggles and Amazon Flow are designed to retrieve
photos with the exact same object instance as the query im-
age. On the other hand, recommendation systems such as
those deployed by Google Similar Images [37], Shopping [4]
and Image Swirl [18] display a set of visually similar pho-
tos alongside the query image without the user making an
explicit query.
At Pinterest we experimented with different types of visual
discovery systems over a period of three and a half years.
We benefited from the confluence of two recent developments
– first, advances in computer vision, especially the use of
convolutional networks and GPU acceleration, have led to
significant improvements in tasks such as image classifica-
tion and object detection. Second, a substantial number
of users prefer using discovery systems to browse (finding
inspirational or related content) rather than to search (finding
answers). With hundreds of millions of users (who are
often browsing for ideas in fashion, travel, interior design,
recipes, etc.), Pinterest is a unique platform to experiment
with various types of visual discovery experiences.
Figure 1: Pinterest Flashlight: Users can select any object in the image (e.g. lamp, desk, shelf) as a visual search query.
Figure 2: If objects are automatically detected, Flashlight displays a clickable “dot” for faster navigation.
* indicates equal contribution.
© 2017 International World Wide Web Conference Committee
(IW3C2), published under Creative Commons CC BY 4.0 License.
WWW’17 Companion, April 3–7, 2017, Perth, Australia.
ACM X-XXXXX-XX-X/XX/XX.
Our previous work [17] presented preliminary results show-
ing that convnet features and object detection can be used
effectively to improve user engagement in visual search systems.
We subsequently launched Pinterest Flashlight [41],
an interactive visual discovery tool that allows users to select
any object in the image (e.g. lamp, desk, shelf) as a visual
query, as shown in Figure 1. If the system is certain about
the location of an object, either through automatic object
detection [10] or collaborative filtering, Flashlight displays a
clickable “dot” on the image for faster navigation, as shown
in Figure 2. Recently we launched Pinterest Lens [40], a visual
discovery experience accessed via a mobile phone camera,
as shown in Figure 3. In this release we also applied
object detection to the billions of images on Pinterest so
that query objects can be matched against other visually
similar objects within the catalog images.
Figure 3: By indexing objects instead of whole images in our visual search system, Pinterest Lens finds objects within the images.
This paper gives a general architectural overview of our vi-
sual discovery system and shares the lessons we learned from
scaling and integrating visual features into products at Pin-
terest. For example, we investigate the performance of pop-
ular classification networks for retrieval tasks and evaluate
a binarization transformation [1] to improve the retrieval
quality and efficiency of these classification features. This
feature transformation enables us to drastically reduce the
retrieval latency while improving the relevance of the results
in large-scale settings. We also describe how we apply object
detection to multiple visual discovery experiences including
how to use detection as a feature in a general image recom-
mendation system and for query normalization in a visual
search system.
The rest of the paper is organized as follows: Sections 2
and 3 give a brief introduction to the Pinterest image collection
essential to our visual discovery experience and survey
related work. Sections 4 and 5 present the visual features
and object detectors used in our system. Section 6 describes
how visual features can be used to enhance a generic image
recommendation system. Sections 7 and 8 present our experience
launching Pinterest Flashlight and our latest application,
Pinterest Lens.
2. PINTEREST IMAGES
Our work benefited significantly from having access to bil-
lions of catalog photos. On Pinterest, hundreds of millions
of users organize images and videos around particular topics
into boards as shown in Figure 4, which results in a very
large-scale and hand-curated collection with a rich set of
metadata. Most Pin images are well annotated: when a
person bookmarks an image onto a board, a Pin is created
around the image and a brief text description is supplied
by the user. When the same Pin is subsequently saved to
a new board by a different user, the original Pin gains additional
metadata that the new user provides. This data
structure continues to expand every time an image is shared
and saved. Such richly annotated datasets gave us the ability
to generate training sets that can potentially scale to
hundreds of thousands of object categories on Pinterest.
Figure 4: Pinterest users collect images into varieties of themed boards.
3. RELATED WORK
Visual discovery systems: Amazon Flow, Google and
Bing Similar Images are examples of widely used visual search
and recommendation systems. Various works [15] [8] have
been proposed to improve the ranking of the image search
results using visual features. In recent years, research has
also focused on domain-specific image retrieval systems such
as fashion recommendation [38] [23] [31], product recommen-
dation [21] [19] and discovery of food images [2]. Compared
with existing commercial visual discovery engines, our sys-
tem focuses more on interactive retrieval of objects within
the images.
Representation learning: Over the past few years, con-
vnet architectures such as AlexNet [20], GoogLeNet [34],
VGG and ResNet have continuously pushed the state of the
art on large-scale image classification challenges. Though
trained for classification in one particular domain, the vi-
sual features extracted by these models perform well when
transferred to other classification tasks [29], as well as re-
lated localization tasks like object detection and semantic
segmentation [24]. This paper presents empirical evaluation
of widely used classification-based image features in retrieval
settings.¹
Detection: For web-scale services, the part-based model [7]
was a well-studied approach to detection, but recently deep
learning methods have become more prevalent, with appli-
cations such as face detection [36], street number detec-
tion [11], and text detection [13]. Recent research focuses
on application of detection architectures such as Faster R-
CNN [27], YOLO [42] and the Single Shot Detector [22]. In
this work we present, to the best of our knowledge, the first
end-to-end object detection system for large-scale visual dis-
covery systems.
¹ Recent work has also demonstrated the effectiveness of
learning embeddings or distance functions directly from
ranking labels such as relative comparisons [28] [16] and
variations of pairwise comparisons [3] [33], or using bilinear
features [9] for fine-grained recognition. We will report
our empirical findings in future work.
Table 1: Precision@K on Pinterest evaluation dataset (H. = Hamming distance).
Model          Layer              Type    Dist.  P@1    P@5    P@10
AlexNet        fc6                raw     L2     0.093  0.045  0.027
AlexNet        fc6                raw     L1     0.099  0.047  0.027
AlexNet        fc7                raw     L2     0.104  0.052  0.031
AlexNet        fc7                raw     L1     0.106  0.052  0.031
AlexNet        fc8                raw     L2     0.088  0.047  0.028
AlexNet        fc8                raw     L1     0.090  0.046  0.028
GoogLeNet      loss3/classifier   raw     L2     0.095  0.050  0.032
GoogLeNet      loss3/classifier   raw     L1     0.098  0.050  0.032
VGG16          fc6                raw     L2     0.108  0.051  0.030
VGG16          fc6                raw     L1     0.118  0.057  0.035
VGG16          fc7                raw     L2     0.116  0.058  0.036
VGG16          fc7                raw     L1     0.113  0.060  0.038
VGG16          fc8                raw     L2     0.104  0.054  0.034
VGG16          fc8                raw     L1     0.106  0.054  0.034
ResNet101      pool5              raw     L2     0.160  0.080  0.050
ResNet101      pool5              raw     L1     0.149  0.073  0.045
ResNet101      fc1000             raw     L2     0.133  0.068  0.042
ResNet101      fc1000             raw     L1     0.139  0.067  0.041
ResNet152      pool5              raw     L2     0.170  0.083  0.050
ResNet152      pool5              raw     L1     0.152  0.077  0.047
ResNet152      fc1000             raw     L2     0.149  0.073  0.045
ResNet152      fc1000             raw     L1     0.148  0.073  0.044
AlexNet        fc6                binary  H.     0.129  0.065  0.039
AlexNet        fc7                binary  H.     0.110  0.054  0.033
AlexNet        fc8                binary  H.     0.089  0.046  0.027
VGG16          fc6                binary  H.     0.158  0.081  0.049
VGG16          fc7                binary  H.     0.133  0.068  0.044
VGG16          fc8                binary  H.     0.110  0.055  0.035
ResNet101      fc1000             binary  H.     0.125  0.062  0.039
ResNet101      pool5              binary  H.     0.055  0.025  0.014
ResNet152      fc1000             binary  H.     0.133  0.065  0.041
ResNet152      pool5              binary  H.     0.057  0.026  0.015
VGG16 (Pin.)   fc6                binary  H.     0.169  0.089  0.056
4. FEATURE REPRESENTATION
We adopted and evaluated several popular classification
models such as AlexNet [20], GoogLeNet [35], VGG16 [32],
and the ResNet variants ResNet101 and ResNet152 [12]. In addition to
the raw features, we also explored binarized [1] representations
of these features, which are of interest to us due to their
much smaller memory footprint. We also compare the Euclidean
(L2) and Manhattan (L1) distance metrics for the
raw features. For these features, training and inference are
done through the open-source Caffe [14] framework on multi-GPU
machines. We are separately investigating an on-device
visual discovery experience to augment the server-based approach
using the mobile implementation of TensorFlow [43].
Beyond the base models trained for ImageNet classification [6],
we also fine-tune on our own Pinterest training
dataset, generated in a fashion similar to the evaluation
dataset while ensuring no overlap. This is because Pinterest
images, many of which are high-quality stock photography
and professional product images, have different statistics
than ImageNet images, many of which are personal photos
focused on individual generic objects. Models are fine-tuned
by replacing the softmax classification layer of a pre-trained
base model with a new classification layer trained to classify
Pinterest images, while the intermediate and lower-level
layers are initialized from the pre-trained weights.
The training dataset consists of millions of images dis-
tributed over roughly 20,000 classes while the evaluation
dataset, visualized in Figure 6, consists of 100,000 images
distributed over the same classes. Both datasets were col-
lected from a corpus of Pinterest images that are labeled
with annotations. We limit the set of annotations by taking
the top 100,000 text search queries on Pinterest and randomly
sampling 20,000 of them. After filtering, we retain a
corpus of images containing only annotations matching the
sampled queries and randomly sample from this corpus to
create balanced training and evaluation datasets. The query
annotation used to generate the image becomes its class label.
Figure 5: VGG16 fc6 layer raw features (left) vs binary features (right). PASCAL VOC 2011 classification images are shown as colored dots. White noise images are shown as black dots. One can see that binarization separates out the noise from the Pascal images while retaining the label clusters.
Table 2: Applying binary transformation to the VGG16 fc6 feature improves precision@k on PASCAL VOC 2011 images (H. = Hamming distance).
Type    Dist.  P@1    P@5    P@10
raw     L2     0.588  0.544  0.506
raw     L1     0.658  0.599  0.566
binary  H.     0.752  0.718  0.703
To evaluate our visual models, from our evaluation dataset
we use 2,000 images as query images while the rest are in-
dexed by our visual search system using the image represen-
tation being evaluated. We then retrieve a list of results for
each query image, sorted by distance. A visual search result
is assumed to be relevant to a query image if the two images
share the same class label, an approach that is commonly
used for offline evaluation of visual search systems [26] in ad-
dition to human evaluation. From the list of results with the
associated class labels, we compute Precision @ K metrics
to evaluate the performance of the visual models.
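
The Precision@K computation described above can be sketched in a few lines. This is a minimal illustration rather than the evaluation code used at Pinterest: the function names and the toy inputs are hypothetical, and it assumes each query already comes with the class labels of its results sorted by distance.

```python
from typing import List, Sequence, Tuple


def precision_at_k(query_label: str, result_labels: Sequence[str], k: int) -> float:
    """Fraction of the top-k results that share the query's class label.

    If fewer than k results were retrieved, the missing slots count as
    non-relevant (an assumption of this sketch).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for label in result_labels[:k] if label == query_label)
    return hits / k


def mean_precision_at_k(queries: List[Tuple[str, Sequence[str]]], k: int) -> float:
    """Average P@K over (query_label, ranked_result_labels) pairs."""
    if not queries:
        raise ValueError("at least one query is required")
    return sum(precision_at_k(q, results, k) for q, results in queries) / len(queries)
```

A result counts as relevant only when its label equals the query label exactly, which mirrors the class-label criterion in the text and ignores near-synonymous annotations.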
The results of these experiments for P@1, P@5, and P@10
performance are shown in Table 1. We see that when using
raw features, intermediate features (pool5, fc6, fc7) result
in better retrieval performance than more semantic features
(fc8, fc1000). Among the raw feature results, ResNet152
pool5 features using L2 distance perform the best. For scal-
ability, however, we are most interested in performance us-
ing binary features as our system must scale to billions of
images. When binarized, VGG16 fc6 features perform best
among the features we tested. These binarized fc6 features
are 16x smaller than the raw pool5 features, with a trade-off
in performance. Fine-tuning the VGG16 weights on Pinter-
est data, however, makes up for this performance difference.
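
The size comparison above can be checked with a small sketch of binarization and Hamming-distance matching. Several things here are assumptions not stated in the text: features are binarized by thresholding at zero, fc6 of VGG16 has 4096 dimensions and pool5 of ResNet has 2048 (the sizes in the publicly released architectures), and raw features are stored as 32-bit floats. The function names are hypothetical.

```python
from typing import Sequence


def binarize(features: Sequence[float]) -> int:
    """Pack a feature vector into an integer bit string.

    Bit i is set when features[i] is strictly positive. Thresholding at
    zero is an assumption made for this sketch.
    """
    bits = 0
    for i, value in enumerate(features):
        if value > 0:
            bits |= 1 << i
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two packed codes differ."""
    return bin(a ^ b).count("1")


def storage_bytes(dims: int, bits_per_dim: int) -> int:
    """Bytes needed to store one feature vector."""
    return dims * bits_per_dim // 8


# Assumed sizes: 2048-d pool5 as 32-bit floats vs 4096-d fc6 at 1 bit each.
RAW_POOL5_BYTES = storage_bytes(2048, 32)
BINARY_FC6_BYTES = storage_bytes(4096, 1)
```

Under these assumptions a raw pool5 vector takes 8192 bytes and a binarized fc6 vector takes 512 bytes, which is consistent with the 16x figure quoted above.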
We conjecture that the drop in performance of the ResNet
pool5 feature when binarized is due to how the pool5 features
are average pooled from ReLU activations. ReLU activations
are either zero or positive, and averaging these activations
will bias towards positive pool5 features that are
each binarized to one. We see that for ResNet101 pool5 features
on our evaluation dataset, on average 83.86% of the
features are positive. Additionally, we speculate that intermediate
features of AlexNet and VGG16 are better suited
for the current binarization scheme. By training the intermediate
features with a ReLU layer, the network will learn
features that ignore the magnitude of negative activations.
We see that this is also exactly what the binarization post-processing
does for negative features.
One interesting observation is that binarization improves
the retrieval performance of AlexNet and VGG16. We reproduce
this behavior in Table 2 on the Pascal VOC 2011 [25]
classification dataset. In this experiment, we built our index
using fc6 features extracted from VGG16 from the train
and validation images of the Pascal classification task. Each
image is labeled with a single class, with images that have
multiple conflicting classes removed. We randomly pick 25
images per class to be used as query images and measure
precision@K with K = 1, 5, 10. With the Pascal dataset, we
can also qualitatively see that one advantage of binarization
is to create features that are more robust to noise. We see
this effect in Figure 5, where binarization is able to cleanly
separate noisy images from real images while the raw features
cluster both together.
Figure 6: Visualization of binarized embeddings of Pinterest images extracted from fine-tuned VGG16 FC6 layer.
5. OBJECT DETECTION
One feature that is particularly relevant to Pinterest is the
presence of certain object classes (such as shoes, chairs, tables,
bags, etc.). Extracting objects from images not only allows
us to build novel discovery experiences (e.g. Section 8),
but also improves user engagement metrics as detailed in
Sections 6 and 7. This section covers our iterations of object
detection at Pinterest, starting from deformable parts
models described in our prior work [17] to Faster R-CNN
and Single Shot Detection (SSD) described below.
Faster R-CNN: One approach of interest for object detection
is Faster R-CNN, given its state-of-the-art detection
performance, reasonable latency for real-time applications [27],
and favorable scalability with a high number of
categories (since the vast majority of parameters are shared).
We experimented with Faster R-CNN models using both
VGG16 [32] and ResNet101 [12] as base architectures, and
the ResNet101 variant is currently one of the detection models,
along with SSD, used in production at Pinterest.
When training Faster R-CNN models, we depart from the
original presentation of the method in a few ways.
First, we train our models using direct end-to-end optimization,
rather than the alternating optimization presented in
the original paper. Additionally, for the ResNet101 architecture
of Faster R-CNN, we found that significant computation
savings could be achieved by decreasing the number of
proposals considered from 300 to 100 (or dynamically thresholded
based on the objectness score of the ROI) during inference
without impacting precision or recall. The shared base
convolution features are pre-trained on a fine-tuned Pinterest
classification task, and then trained for 300k iterations
using an internal Pinterest detection dataset.
Single Shot Detection: Although Faster R-CNN is con-
siderably faster than previous methods while achieving state-
of-the-art precision, we found that the latency and GPU
memory requirements were still limiting. For example, the
productionized implementation of Faster R-CNN used in
Flashlight, as described in Section 7, relies on aggressive
caching, coupled with dark reads during model swaps for
cache warming. We also index objects into our visual search
system to power core functionalities of Lens, as described in
Section 8, which requires running our detector over billions
of images. Any speedups we can get in such use cases can
lead to drastic cost savings. As a result, we have also exper-
imented with a variant of the Single Shot Multibox Detector
(SSD) [22], a method known to give strong detection results
in a fraction of the time of Faster R-CNN-based models.
A more comprehensive evaluation on the speed/accuracy
trade-off of object detection is described in [44].
Because speed is one of our primary concerns, we perform
detection with SSD at a relatively small resolution of 290 ×
290 pixels. Our VGG-based architecture resembles that of
the original authors very closely, so we refer the reader to
the original work for more details. Here, we outline a few
key differences between our model and that of the original
authors.
Figure 7: Sample object detection results from the Faster R-CNN model (labeled as red) and SSD model (labeled as blue).
Table 3: Object detection performance.
             Faster R-CNN         SSD
             precision  recall    precision  recall
Fashion      0.449      0.474     0.473      0.387
Home decor   0.413      0.466     0.515      0.360
Vehicles     0.676      0.625     0.775      0.775
Overall      0.426      0.470     0.502      0.371
Latency      272 ms               59 ms
First, we train our models with a 0.7 IoU threshold instead
of a 0.5 threshold, to ensure that the resulting detections
tightly bound the objects in question. The original SSD architecture
uses additional convolutional layers with strides
of 2 to detect larger objects; however, we found that this led
to poor localization performance under our desired 0.7 IoU
threshold. Instead, our model uses a stride of 1 for all additional
convolutions, which greatly improves the quality of
bounding box localization. Due to the size of our detection
dataset (74,653 images), we also found that randomly sampling
positive and negative anchors in a 1:3 ratio, as in the
original paper, led to poor convergence. Instead, we make
use of the recent Online Hard Example Mining (OHEM)
method to sample anchors by always selecting the anchors
for which the model incurs the largest loss [30]. However,
despite our success with OHEM, we note that OHEM actually
led to overfitting on a smaller detection dataset (19,994
images), indicating that OHEM is perhaps most useful on
large and difficult datasets.
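
The anchor selection idea behind OHEM can be illustrated with a short sketch. It covers only the selection step and assumes per-anchor losses have already been computed in a forward pass; the function name and the `num_keep` parameter are hypothetical and are not taken from [30] or from the Pinterest implementation.

```python
from typing import List, Sequence


def select_hard_anchors(losses: Sequence[float], num_keep: int) -> List[int]:
    """Return the indices of the num_keep anchors with the largest loss.

    Indices come back hardest first. Ties keep their original order
    because Python's sort is stable.
    """
    if num_keep < 0:
        raise ValueError("num_keep must be non-negative")
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return ranked[:num_keep]
```

Only the selected anchors would contribute to the gradient step, in contrast to sampling positives and negatives at a fixed ratio regardless of how hard they are.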
With these changes, we were able to train an SSD-based
model that can perform object detection on 290×290 images
in just 93 ms on an NVIDIA GRID K520 (an older gen-
eration GPU supported by Amazon Web Services), while
achieving an F1 score on our internal evaluation dataset
comparable to the previously described ResNet Faster R-
CNN model.
Table 3 provides precision, recall, and latency figures on
a Pinterest detection evaluation dataset, using our Faster
R-CNN and SSD implementations. This dataset contains
79 categories, broadly categorized by Fashion, Home Decor,
and Vehicles. This evaluation was performed at a 0.7 IoU
threshold, and although the precision results look favorable
for SSD, we did observe, qualitatively, that the localization
quality was worse (one reason being that SSD uses a smaller,
warped input size). Nevertheless, SSD, along with the newer
iteration of the YOLO model, owing to their simpler and
more streamlined architectures (and superior latency), warrant
close investigation as the favored object detection models
for production applications.
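
To make the 0.7 IoU criterion and the F1 comparison concrete, here is a minimal sketch. The box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions for illustration, and the matching rule is simplified to a single detection against a single ground-truth box of the same category.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def area(box: Box) -> float:
    """Area of a box, treating degenerate boxes as empty."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(detected: Box, truth: Box, threshold: float = 0.7) -> bool:
    """A detection counts as correct when its IoU with the truth meets the threshold."""
    return iou(detected, truth) >= threshold


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A detection covering only the top half of a ground-truth box has an IoU of 0.5, so it would pass a 0.5 threshold but fail the stricter 0.7 one. Plugging the overall precision and recall from Table 3 into `f1_score` gives roughly 0.45 for Faster R-CNN and 0.43 for SSD.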
6. PINTEREST RELATED PINS
Related Pins is a pin recommendation system that lever-
ages the vast amount of human-curated content on Pinter-
est to provide personalized recommendations of pins based
on a given query pin. It is most heavily used on the pin
closeup view shown in Figure 8, which is known as the
Related Pins feed. Additionally, Related Pins recommendations
have been incorporated into several other parts of
Pinterest, including the home feed, pin pages for unauthenticated
visitors, emails, and certain curated pin collections
(such as the Explore tab).
Figure 8: Related Pins is an item-to-item recommendation system. The results are displayed below the currently viewed image.
User engagement on Pinterest is defined by the following
actions. A user closeups on a pin by clicking to see more
details about the pin. The user can then click to visit the
associated Web link; if they remain on-site for an extended
period of time, it is considered a long click. Finally, the user
can save pins onto their own boards. We are interested in
driving “Related Pins Save Propensity” which is defined as
the number of users who have saved a Related Pins recom-
mended pin divided by the number of users who have seen
a Related Pins recommended pin. Liu et al. [39] presented
a more detailed architecture overview and evolution of the
related pins feature. This section focuses on how convnet
features and object detection can improve the engagement
of Related Pins.
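
As a small worked example of the “Related Pins Save Propensity” definition above, the metric is a ratio of distinct users. The event format below, `(user_id, action)` pairs with the actions `"impression"` and `"save"`, is a hypothetical schema chosen for illustration, not Pinterest's logging format.

```python
from typing import Iterable, Tuple


def save_propensity(events: Iterable[Tuple[str, str]]) -> float:
    """Users who saved a recommended pin divided by users who saw one.

    Each user is counted once, however many pins they saw or saved.
    A save only counts if the same user also has an impression.
    """
    seen = set()
    saved = set()
    for user_id, action in events:
        if action == "impression":
            seen.add(user_id)
        elif action == "save":
            saved.add(user_id)
    if not seen:
        return 0.0
    return len(saved & seen) / len(seen)
```

Counting distinct users rather than raw events keeps a few very active users from dominating the metric.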
Convnet features for recommendations
Related Pins are powered through collaborative filtering (via
image-board co-occurrences) for candidate generation and
the commonly used Rank-SVM to learn a function where
various input features are used to generate a score for re-
ranking. To conduct the experiment, we set up a series
of A/B experiments, where we selected five million popular
Pins on Pinterest as queries, and re-ranked their recommen-
dations using different sets of features. The control group
re-ranked Related Pins using a linear model with the exist-
ing set of features. The treatment group re-ranked using
fine-tuned VGG fc6 and fc8 visual similarity features along
with indicator variables (in addition to the features used in
control).
Across the 5M query Pins, the treatment saw a 3.2% increase
in engagement (click and repin).² After expanding
the treatment to 100M query Pins, we observed a net gain
of 4.0% in engagement with Related Pins, and subsequently
launched this model into production. Similar experiments
with a fine-tuned AlexNet model yielded worse results (only
0.8% engagement gain) as expected from our offline evaluation
of our visual feature representation.
² This metric was measured across a 14 day period in Sep. 2015.
When broken down by category, we noted that the engage-
ment gain was stronger in predominantly visual categories,
such as art (8.8%), tattoos (8.0%), illustrations (7.9%), and
design (7.7%), and lower in categories which primarily rely
on text, such as quotes (2.0%) and fitness planning (0.2%).
Given the difference in performance among categories, we
performed a follow-up experiment where we introduced a
cross feature between the category vector of the query and
the scalar fc6 visual similarity feature (between the query
and candidate) to capture the dependency between category
and usefulness of the visual features in our linear model.
This introduces 32 new features to the model, one for each of
our site-wide categories (these features are sparse, since the
Pinterest category vector thresholds most values to zero).
The result from this was a further 1.2% engagement increase
in addition to the gains from the initial visual re-ranking
model.
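
The cross feature described above can be sketched as follows. This reading, an element-wise product of the 32-dimensional category vector with the scalar fc6 similarity, is an assumption about how the cross was formed; the function name and constant are hypothetical.

```python
from typing import List, Sequence

NUM_CATEGORIES = 32  # one entry per site-wide category, as stated in the text


def category_cross_features(
    category_vector: Sequence[float], visual_similarity: float
) -> List[float]:
    """Cross the query's category vector with a scalar visual similarity.

    Produces one feature per category, so a linear model can learn a
    separate weight on visual similarity for each category.
    """
    if len(category_vector) != NUM_CATEGORIES:
        raise ValueError("expected one weight per category")
    return [weight * visual_similarity for weight in category_vector]
```

Because most entries of the category vector are zero, most of the 32 crossed features are zero as well, which matches the sparsity noted in the text.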
Object detection for recommendation
To validate the effectiveness of our detection system in pro-
duction, one of our first experiments was to further improve
the use of visual features in Related Pins (described in the
previous section). Our primary observation was that users
are sometimes interested in the objects in the Pin’s image,
instead of the full image, and we therefore speculated that
object detection could help compute targeted visual similar-
ity features. For the below experiments, we focus on fashion
queries as only our fine-tuned fashion Faster R-CNN model
was available at the time of this experiment.
After applying non-maximum suppression (NMS) to the
proposals generated by our Faster R-CNN detector, we considered
query Pins where the largest proposal either occupies
at least 25% of the Pin’s image or has a confidence that
passes a threshold of 0.9. We categorize these images
as containing a dominant visual object, and using the
best-performing fine-tuned VGG re-ranking variant from the
previous section as our control, we experimented with the
following treatments:
• Variant A: if a dominant visual object is detected in
the query Pin, we compute visual features (VGG fc6)
on just that object.
• Variant B: same as variant A, but we also hand-tune
the ranking model by increasing the weight given to
visual similarity by a factor of 5. The intuition behind
this variant is that when a dominant visual object is
present, visual similarity becomes more important for
recommendation quality.
• Variant C: Both control and treatment have the same
visual features. If the query contains a dominant visual
object, however, we increase the weight of the visual
similarity features by a factor of 5, as in variant
B. In this variant, we assume that the presence of detected
visual objects such as bags or shoes indicates
that visual similarity is more important for this query.
Table 4: Engagement results when adding cross features and object detection to visual similarity feature, measured over a 7 day period in Oct. 2015.
Features                              Queries        Engagement
FT-VGG (control)                      -              -
FT-VGG + category cross features      5M             +1.2%
FT-VGG + object detection variant A   315k fashion   +0.5%
FT-VGG + object detection variant B   315k fashion   +0.9%
FT-VGG + object detection variant C   315k fashion   +4.9%
Results for these variants are listed in Table 4. Variants
A and B of the object detection experiments show that directly
computing visual features from object bounding boxes
slightly increases engagement. We suspect that the lack of
a more significant performance increase is due to our
detector returning tight bounding boxes that do not provide
enough context for our image representation models, as
we are matching a query object to whole image candidates.
Variant C shows a more significant 4.9% engagement gain
over the VGG similarity feature control, demonstrating that
the presence of visual objects alone indicates that visual similarity
should be weighed more heavily.
Our object detection system underwent significant improvements
following the promising results of this initial
experiment, and was applied to substantially improve engagement
on Flashlight, as described in the next section.
7. PINTEREST FLASHLIGHT
Flashlight, as shown in Figure 9, is a visual search tool that
lets users search for objects within any image on Pinterest.
It supports various applications including Shop-the-Look,³
which launched in 2017 and is shown in Figure 10.
The input to Flashlight is an object proposal, which is
generated either (1) using the detection features described in
Section 5 or (2) directly from our users via the flexible cropping
mechanism as shown in Figure 9. Visual search on the
object is then powered by the retrieval features described in
Section 4 for candidate generation, with lightweight reranking
using the full image metadata as context. Flashlight
returns both image results and clickable tf-idf weighted annotations
to allow users to narrow down the results. These
annotations are generated by aggregating the annotations of
the image results.
We implemented Flashlight as an iteration of Similar Looks,
a fashion-specific visual search tool described in our prior
work [17]. We learned from Similar Looks that being able
to find similar results of objects within an image can increase
engagement. However, after running the experiment,
we saw challenges in launching the Similar Looks product
into wider deployment. First, due to the offline nature of
our older parts-based detection pipeline, the coverage of objects
was too low. Only 2.4% of daily active users saw a detected
object, and user research revealed that the inconsistency,
with some images having detected objects while others did
not, confused users, who expected interactive dots on every
image, including new content.
The cropping mechanism was our solution to this problem,
giving users the flexibility to manually select any object
in any image and get real-time visually similar results
for the selected object. By not restricting what a user can
crop, any object can be searched. A few months after the
initial launch, we introduced object detection to Flashlight,
generating clickable dots to simplify the user interface. At
the time of publication, more than 1 billion instances of
objects had been detected and shown to users.
³ Shop-the-Look combines visual search results with expert
curation to improve the accuracy of the results.
Figure 9: Pinterest Flashlight supports interactive retrieval of objects in the images through the use of a
“cropper” tool (left) and query-refinement suggestions (top right).
Table 5: Object Detection in Flashlight (Faster R-CNN ResNet101), detection threshold 0.7
Visual Thresh.   Annot. Thresh.   Categ. Conformity   Engagement Gain
0.0              0                0.0                 -3.8%
1.0              0                0.0                 -5.3%
0.0              1000             0.0                 -5.2%
1.0              1000             0.0                 +2.1%
1.0              1000             0.8                 +4.9%
Figure 10: Shop-the-Look uses a variant of Flash-
light returning only buyable content along with
manual tagging to achieve a high quality visual shop-
ping experience.
Convnet Features for Retrieval
The initial Flashlight launch did not include automatic object
detection. We introduced a semi-transparent search icon
on the top right of a Pin image that users can press to use
the cropping mechanism. As the user manipulates the crop
bounding box, real-time visual search results and annotations
are generated.
Initially after the launch we saw that 2% of users who
viewed a pin closeup pressed the search button, which
corresponds to roughly 4 million visual search requests per
day. Because search is done on crops of an image, when
retrieving visually similar results, the re-ranking function
normally applied after the candidate retrieval step is very
minimal⁴ as we have no metadata for the crop. Therefore,
the visual search results returned are mostly determined by
the initial nearest neighbor retrieval on our deep learning
embeddings. Even though this new feature is powered mostly
by deep learning, we are able to achieve
2/3 the engagement rate of text search, an existing product
whose quality has been in development for many years.
Footnote 4: Some components of the re-ranking function, such as near-dupe image scoring, still apply.
Object Detection for Retrieval
One crucial improvement to Flashlight that we implemented
is real-time object detection using Faster R-CNN, which
offers several advantages to the product. First, we’re able to
simplify the user interface of Flashlight by placing
predetermined object dots for categories we detect so users can
click on a dot instead of manually moving a cropping box.
Second, a useful goal for any retrieval system is to construct
an aggregated user engagement signal, one use case being to
boost known highly engaged results for a query. This
aggregation was infeasible in our initial version of Flashlight as
most queries are unique due to the use of a manual cropper.
Figure 11: Examples of object detection false positives
(shown as selected object), which are successfully
suppressed using the category conformity suppression method.
Though intuitively object detection seemed to be a simple
engagement improvement to Flashlight, as users would have
a much easier interface to use, we learned otherwise through
our A/B experiments. Upon initial launch of the object
detection experiment, where the control displayed a default
bounding box and the treatment displayed a clickable dot over
each detected object (as in Figure 2), we found that
engagement metrics decreased (specifically, we were interested
in “Flashlight Save Propensity”, similar to the Related Pins
metric used previously). After investigating the reason
for the poor performance, we saw two significant issues, as
shown in Figure 11: (a) bad object detections (manifested as
either low localization quality or object “hallucination”, left
example), and (b) irrelevant retrieval results, which could
either be a symptom of the first failure case (right example)
or a standalone issue (middle example).
Figure 12: Pinterest Lens: users can take a picture with
their phone camera to get objects and ideas related
to the photo (e.g. a strawberry image can lead to
chocolate strawberry recipes).
To address these problems, we developed a method of
ranking object detections with the confidence of the visual
search system, as we found that detector confidence alone
is not sufficient to suppress poor detections. In particular,
we looked into three signals returned by our visual search
system: visual Hamming distance of the top result (from
our 4096-dimensional binarized convnet features), top
annotation score for annotations returned by our visual search
system (aggregated tf-idf scored annotations from the visual
search results), and category conformity (maximum portion
of visual search results that are labeled with the same
category). We list our results in Table 5, where we impose a
minimum threshold on each signal. Our best variation
results in a 4.9% increase in user engagement on Flashlight.
We found that category conformity in particular was
critical in our improved engagement with Flashlight when object
detection was enabled. A low category conformity score
indicates irrelevant visual search results, which we use as a
proxy to suppress both of the error types shown in Figure
11.
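As a rough illustration of the suppression rule, the sketch below computes category conformity from the category labels of the visual search results and then applies a minimum threshold to each signal. The function names are hypothetical, and the text does not specify how the visual Hamming distance is converted into a thresholded score, so `visual_score` is an assumed placeholder for that signal. The default thresholds mirror the best-performing row of Table 5.

```python
from collections import Counter


def category_conformity(result_categories):
    """Maximum fraction of visual search results that share one category label."""
    if not result_categories:
        return 0.0
    top_count = Counter(result_categories).most_common(1)[0][1]
    return top_count / len(result_categories)


def keep_detection(visual_score, annotation_score, conformity,
                   min_visual=1.0, min_annotation=1000, min_conformity=0.8):
    """Show a detected object only if every signal meets its minimum threshold.

    `visual_score` is an assumed stand-in for the thresholded visual signal;
    its exact definition is not given in the text.
    """
    return (visual_score >= min_visual
            and annotation_score >= min_annotation
            and conformity >= min_conformity)
```

A detection whose results are spread evenly over four categories has a conformity of 0.25 and would be suppressed, while one whose results are 80% from a single category would be kept, provided the other two signals also pass.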
8. PINTEREST LENS
Pinterest Lens, as shown in Figure 12, is a new discovery
experience accessed via a mobile phone camera. Unlike
Flashlight, Lens is not optimized to return visually similar
results, but instead is implemented to return a diverse set
of engaging results semantically relevant to the query. For
example, a photo of blueberries would yield not only
visually similar blueberries but also recipes
for various types of blueberry desserts. In this section, we
describe the technology used in Lens; we will report
metrics in a future publication, as the product launch was very
recent at the time of this writing.
Figure 13: Lens is divided into two components.
The query understanding layer (left) first computes
visual features. The blending layer (right) then
fetches results from multiple content sources.
Figure 14: Given a query object, we find visually
similar objects contained in larger scenes (whole im-
ages) by indexing object embeddings.
The overall Lens architecture is separated into two logical
components, as shown in Figure 13. The first component
is the query understanding layer, where we compute a
variety of visual and semantic features for the input image,
such as annotations, objects, and salient colors. The second
component is result blending, as Lens results come from multiple
sources. Visually similar results come from Flashlight,
semantically similar results are computed using the annotations
to call Pinterest image search, and contextual results
(e.g. living room designs containing the chair in the photo
taken by the user’s camera) come from our object search
system explained below. Not all sources are triggered per
request. The blender will dynamically change blending ratios
and content sources based on our derived features from
the query understanding layer. For instance, image search
would not be triggered if our annotations are low confidence.
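The source-triggering behaviour of the blender can be sketched as follows. This is a simplified, hypothetical illustration: the source names, the `(label, confidence)` annotation layout, the 0.5 confidence cutoff, and the assumption that Flashlight is always queried are not taken from the paper. Only the two gating rules (image search depends on annotation confidence, and object search requires a detected object) come from the text.

```python
def select_sources(annotations, detected_objects, min_annotation_confidence=0.5):
    """Choose which content sources to query for a single Lens request.

    annotations: list of (label, confidence) pairs from query understanding.
    detected_objects: list of objects found in the query image.
    """
    sources = ["flashlight"]  # assumed: visually similar results are always fetched
    if any(conf >= min_annotation_confidence for _, conf in annotations):
        sources.append("image_search")  # skipped when annotations are low confidence
    if detected_objects:
        sources.append("object_search")  # only meaningful when an object was detected
    return sources
```

A real blender would also adjust blending ratios between the selected sources; that step is omitted here because the text does not describe how the ratios are derived.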
Object Search
Object search is a visual search system where, instead of
indexing only whole images as per traditional systems, we also
index objects. One use case of such a system is to retrieve
results which contain a queried object. For example, in
Figure 3, if a user takes a picture of a Roman numeral clock,
they may be interested in furniture which complements this
clock or living room designs containing this clock.
To build this system, we use our SSD object detector, as
described in Section 5, to extract objects from billions of
images on Pinterest for indexing. This results in a corpus
with more than a billion objects. SSD is also run during
query time on the given input image as object search is only
triggered if the query image contains an object. Given a
query object, we compute visually similar objects and then
map these objects back to their whole images (scenes) to
return to the user. Figure 14 describes an end-to-end view
of the pipeline. The idea of matching query with objects
was previously explored in [17] [3]; to the best of our knowl-
edge this is the first production system to serve this class of
recommendations.
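The retrieval step of this pipeline, finding visually similar objects and mapping them back to their scenes, might look like the toy sketch below. It assumes object embeddings are stored as binary codes (represented here as Python integers) paired with the ID of the image they were cropped from, and it uses a brute-force Hamming scan. The index layout and function names are illustrative assumptions, and a production system over a billion objects would need a distributed approximate index instead.

```python
def hamming_distance(code_a, code_b):
    """Number of differing bits between two binary codes stored as integers."""
    return bin(code_a ^ code_b).count("1")


def search_scenes(query_code, object_index, top_k=3):
    """Return up to `top_k` distinct scene (whole-image) IDs, nearest objects first.

    object_index: list of (object_code, image_id) pairs, one per indexed object.
    """
    ranked = sorted(object_index,
                    key=lambda entry: hamming_distance(query_code, entry[0]))
    scenes = []
    for _, image_id in ranked:
        if image_id not in scenes:  # several objects may map to the same scene
            scenes.append(image_id)
        if len(scenes) == top_k:
            break
    return scenes
```

Deduplicating by image ID matters because a single scene can contain several objects that all match the query.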
9. CONCLUSIONS
This paper presents an overview of our visual discovery
engine powering various visual discovery experiences at Pin-
terest, and shares the rationales behind our technical and
product decisions such as the use of binarized features, ob-
ject detection, and interactive user interfaces. By sharing
our experiences, we hope visual search becomes more widely
incorporated into today’s commercial applications.
10. ACKNOWLEDGEMENTS
Visual discovery is a collaborative effort at Pinterest. We’d
like to thank Maesen Churchill, Jamie Favazza, Naveen Gavini,
Yiming Jen, Eric Kim, David Liu, Vishwa Patel, Albert
Pereta, Steven Ramkumar, Mike Repass, Eric Sung, Sarah
Tavel, Michelle Vu, Kelei Xu, Jiajing Xu, Zhefei Yu, Cindy
Zhang, and Zhiyuan Zhang for the collaboration on the
product launch, Hui Xu, Vanja Josifovski and Evan Sharp
for the support.
Additionally, we’d like to thank David Tsai, Stephen Hol-
iday, Zhipeng Wu and Yupeng Liao for contributing to the
visual search architecture at VisualGraph prior to the ac-
quisition by Pinterest.
11. REFERENCES
[1] P. Agrawal, R. Girshick, and J. Malik. Analyzing the
performance of multilayer neural networks for object
recognition. 2014.
[2] K. Aizawa and M. Ogawa. Foodlog: Multimedia tool
for healthcare applications. IEEE MultiMedia,
22(2):4–8, 2015.
[3] S. Bell and K. Bala. Learning visual similarity for
product design with convolutional neural networks.
ACM Trans. Graph., 34(4):98:1–98:10, July 2015.
[4] L. Bertelli, T. Yu, D. Vu, and B. Gokturk. Kernelized
structural svm learning for supervised object
segmentation. In Computer Vision and Pattern
Recognition (CVPR), 2011 IEEE Conference on,
pages 2153–2160. IEEE, 2011.
[5] R. Datta, D. Joshi, J. Li, and J. Wang. Image
retrieval: Ideas, influences, and trends of the new age.
ACM Computing Survey, 40(2):5:1–5:60, May 2008.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and
L. Fei-Fei. ImageNet: A Large-Scale Hierarchical
Image Database. In CVPR09, 2009.
[7] P. F. Felzenszwalb, R. B. Girshick, and D. A.
McAllester. Cascade object detection with deformable
part models. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages
2241–2248, 2010.
[8] A. Frome, Y. Singer, and J. Malik. Image Retrieval
and Classification Using Local Distance Functions. In
B. Schölkopf, J. Platt, and T. Hoffman, editors,
Advances in Neural Information Processing Systems
19, pages 417–424. MIT Press, Cambridge, MA, 2007.
[9] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell.
Compact bilinear pooling. arXiv preprint
arXiv:1511.06062, 2015.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik.
Rich feature hierarchies for accurate object detection
and semantic segmentation. arXiv preprint
arXiv:1311.2524, 2013.
[11] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and
V. D. Shet. Multi-digit number recognition from street
view imagery using deep convolutional neural
networks. CoRR, abs/1312.6082, 2013.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual
learning for image recognition. arXiv preprint
arXiv:1512.03385, 2015.
[13] M. Jaderberg, K. Simonyan, A. Vedaldi, and
A. Zisserman. Reading text in the wild with
convolutional neural networks. CoRR, abs/1412.1842,
2014.
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev,
J. Long, R. Girshick, S. Guadarrama, and T. Darrell.
Caffe: Convolutional architecture for fast feature
embedding. arXiv preprint arXiv:1408.5093, 2014.
[15] Y. Jing and S. Baluja. Visualrank: Applying pagerank
to large-scale image search. IEEE Transactions on
Pattern Analysis and Machine Intelligence (T-PAMI),
30(11):1877–1890, 2008.
[16] Y. Jing, M. Covell, D. Tsai, and J. M. Rehg. Learning
query-specific distance functions for large-scale web
image search. IEEE Transactions on Multimedia,
15:2022–2034, 2013.
[17] Y. Jing, D. Liu, D. Kislyuk, A. Zhai, J. Xu, and
J. Donahue. Visual search at pinterest. In Proceedings
of the International Conference on Knowledge
Discovery and Data Mining (SIGKDD).
[18] Y. Jing, H. Rowley, J. Wang, D. Tsai, C. Rosenberg,
and M. Covell. Google image swirl: a large-scale
content-based image visualization system. In
Proceedings of the 21st International Conference on
World Wide Web, pages 539–540. ACM, 2012.
[19] M. H. Kiapour, X. Han, S. Lazebnik, A. C. Berg, and
T. L. Berg. Where to buy it: matching street clothing
photos in online shops. In International Conference on
Computer Vision, 2015.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
classification with deep convolutional neural networks.
In Advances in Neural Information Processing Systems
(NIPS), pages 1097–1105. 2012.
[21] S. Liu, Z. Song, M. Wang, C. Xu, H. Lu, and S. Yan.
Street-to-shop: Cross-scenario clothing retrieval via
parts alignment and auxiliary set. In Proceedings of
the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2012.
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed,
C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox
detector. 2016. To appear.
[23] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang.
Deepfashion: Powering robust clothes recognition and
retrieval with rich annotations. In Proceedings of
IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2016.
[24] J. Long, E. Shelhamer, and T. Darrell. Fully
convolutional networks for semantic segmentation.
arXiv preprint arXiv:1411.4038, 2014.
[25] M. Everingham, L. Van Gool, C. K. I. Williams, J.
Winn, and A. Zisserman. The PASCAL Visual Object
Classes Challenge 2011 (VOC2011) Results.
http://www.pascalnetwork.org/challenges/VOC/
voc2011/workshop/index.html.
[26] H. Müller, W. Müller, D. M. Squire,
S. Marchand-Maillet, and T. Pun. Performance
evaluation in content-based image retrieval: Overview
and proposals. Pattern Recognition Letter,
22(5):593–601, 2001.
[27] S. Ren, K. He, R. Girshick, and J. Sun. Faster
R-CNN: Towards real-time object detection with
region proposal networks. In Neural Information
Processing Systems (NIPS), 2015.
[28] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet:
A unified embedding for face recognition and
clustering. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2015.
[29] A. Sharif Razavian, H. Azizpour, J. Sullivan, and
S. Carlsson. Cnn features off-the-shelf: an astounding
baseline for recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern
Recognition Workshops, pages 806–813, 2014.
[30] A. Shrivastava, A. Gupta, and R. Girshick. Training
region-based object detectors with online hard
example mining. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR),
June 2016.
[31] E. Simo-Serra and H. Ishikawa. Fashion style in 128
floats: Joint ranking and classification using weak
data for feature extraction. In The IEEE Conference
on Computer Vision and Pattern Recognition
(CVPR), June 2016.
[32] K. Simonyan and A. Zisserman. Very deep
convolutional networks for large-scale image
recognition. CoRR, abs/1409.1556, 2014.
[33] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese.
Deep metric learning via lifted structured feature
embedding. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2016.
[34] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4,
inception-resnet and the impact of residual
connections on learning. CoRR, abs/1602.07261, 2016.
[35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich. Going deeper with convolutions. arXiv
preprint arXiv:1409.4842, 2014.
[36] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf.
Deepface: Closing the gap to human-level performance
in face verification. In Proceedings of the IEEE
Conference on Computer Vision and Pattern
Recognition, pages 1701–1708, 2014.
[37] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang,
J. Philbin, B. Chen, and Y. Wu. Learning fine-grained
image similarity with deep ranking. In Proceedings of
the 2014 IEEE Conference on Computer Vision and
Pattern Recognition, CVPR ’14, pages 1386–1393,
Washington, DC, USA, 2014. IEEE Computer Society.
[38] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L.
Berg. Retrieving similar styles to parse clothing. IEEE
Trans. Pattern Anal. Mach. Intell., 37(5):1028–1040,
2015.
[39] D. C. Liu, S. Rogers, R. Shiau, K. Ma, Z. Zhong, D.
Kislyuk, J. Liu, and Y. Jing. Related pins at pinterest,
the evolution of a real-world recommender system. In
Proceedings of the International Conference on World
Wide Web (WWW), 2017.
[40] Introducing the future of visual discovery on pinterest.
https://engineering.pinterest.com/blog/introducingfuture-
visual-discovery-pinterest. Published:
2017-02-08.
[41] Our crazy fun new visual search tool.
https://blog.pinterest.com/en/our-crazy-fun-
newvisual-search-tool. Published:
2015-11-08.
[42] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi.
You only look once: Unified, real-time object
detection. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2016.
[43] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z.
Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M.
Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G.
Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M.
Kudlur, J. Levenberg, D. Mané, R. Monga, S.
Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens,
B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V.
Vanhoucke, V. Vasudevan, F. B. Viégas, O.
Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu,
and X. Zheng. Tensorflow: Large-scale machine
learning on heterogeneous distributed systems. CoRR,
abs/1603.04467, 2016.
[44] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara,
A. Fathi, I. Fischer, Z. Wojna, Y. Song, S.
Guadarrama, et al. Speed/accuracy trade-offs for
modern convolutional object detectors.
arXiv:1611.10012, 2016.