DeepFont: Identify Your Font from An Image
Zhangyang Wang 1, Jianchao Yang 3, Hailin Jin 2, Eli Shechtman 2, Jonathan Brandt 2, Thomas S. Huang 1, Aseem Agarwala 4
1 University of Illinois at Urbana-Champaign, 2 Adobe Research, 3 Snapchat Inc, 4 Google Inc
{zwang119, t-huang1}@illinois.edu, jianchao.yang@snapchat.com, {hljin, elishe, jbrandt}@adobe.com, aseem@agarwala.org
ABSTRACT
As font is one of the core design concepts, automatic font
identification and similar font suggestion from an image or
photo has been on the wish list of many designers. We
study the Visual Font Recognition (VFR) problem [4], and
advance the state-of-the-art remarkably by developing the
DeepFont system. First of all, we build up the first avail-
able large-scale VFR dataset, named AdobeVFR, consisting
of both labeled synthetic data and partially labeled real-
world data. Next, to combat the domain mismatch between
available training and testing data, we introduce a Convo-
lutional Neural Network (CNN) decomposition approach,
using a domain adaptation technique based on a Stacked
Convolutional Auto-Encoder (SCAE) that exploits a large
corpus of unlabeled real-world text images combined with
synthetic data preprocessed in a specific way. Moreover, we
study a novel learning-based model compression approach,
in order to reduce the DeepFont model size without sacrific-
ing its performance. The DeepFont system achieves an ac-
curacy of higher than 80% (top-5) on our collected dataset,
and also produces a good font similarity measure for font
selection and suggestion. We also achieve around 6 times
compression of the model without any visible loss of recognition accuracy.

Categories and Subject Descriptors
I.4.7 [Image Processing and Computer Vision]: Feature measurement; I.4.10 [Image Processing and Computer Vision]: Image Representation; I.5 [Pattern Recognition]: Classifier design and evaluation

General Terms
Algorithms, Experimentation

Keywords
Visual Font Recognition; Deep Learning; Domain Adaptation; Model Compression

1. INTRODUCTION
Typography is fundamental to graphic design. Graphic
designers have the desire to identify the fonts they encounter
in daily life for later use. While they might take a photo of
the text of a particularly interesting font and seek out an ex-
pert to identify the font, the manual identification process
is extremely tedious and error-prone. Several websites allow
users to search and recognize fonts by font similarity, includ-
ing Identifont, MyFonts, WhatTheFont, and Fontspring. All
of them rely on tedious human interactions and high-quality manual pre-processing of images, and their accuracies are still unsatisfactory. On the other hand, the majority of font selection interfaces in existing software are simple linear lists,
while exhaustively exploring the entire space of fonts using
an alphabetical listing is unrealistic for most users.
Effective automatic font identification from an image or
photo could greatly ease the above difficulties, and facili-
tate font organization and selection during the design pro-
cess. Such a Visual Font Recognition (VFR) problem is
inherently difficult, as pointed out in [4], due to the huge
space of possible fonts (online repositories provide hundreds
of thousands), the dynamic and open-ended properties of
font classes, and the very subtle and character-dependent
difference among fonts (letter endings, weights, slopes, etc.).
More importantly, while the popular machine learning tech-
niques are data-driven, collecting real-world data for a large
collection of font classes turns out to be extremely difficult.
Most attainable real-world text images do not have font label
information, while the error-prone font labeling task requires
font expertise that is out of reach of most people. The few
previous approaches [1, 9, 12, 16, 17, 20] are mostly from the
document analysis standpoint, which only focus on a small
number of font classes, and are highly sensitive to noise,
blur, perspective distortions, and complex backgrounds. In
[4] the authors proposed a large-scale, learning-based solu-
tion without dependence on character segmentation or OCR.
The core algorithm is built on local feature embedding, local
feature metric learning and max-margin template selection.
However, their results suggest that the robustness to real-world variations is unsatisfactory, and a higher recognition accuracy is still needed.
Inspired by the great success achieved by deep learning
models [10] in many other computer vision tasks, we develop a VFR system for the Roman alphabet, based on Convolutional Neural Networks (CNNs), named DeepFont. Without any dependence on character segmentation or content text, the DeepFont system obtains impressive performance on our collected large real-world dataset, covering
an extensive variety of font categories.

Figure 1: (a)(b) Successful VFR examples with the DeepFont system. The top row shows query images from the VFR real test dataset. Below each query, the results (left column: font classes; right column: images rendered with the corresponding font classes) are listed in descending order of likelihood. The correct results are marked by red boxes. (c) More correctly recognized real-world images with DeepFont.

Our technical contributions are listed below:
• AdobeVFR Dataset A large set of labeled real-world
images as well as a large corpus of unlabeled real-world
data are collected for both training and testing, which is the first of its kind and will be publicly released soon.
We also leverage a large training corpus of labeled syn-
thetic data augmented in a specific way.
• Domain Adapted CNN It is very easy to generate
lots of rendered font examples but very hard to obtain
labeled real-world images for supervised training. This
real-to-synthetic domain gap caused poor generaliza-
tion to new real data in previous VFR methods [4].
We address this domain mismatch problem by lever-
aging synthetic data to obtain effective classification
features, while introducing a domain adaptation tech-
nique based on Stacked Convolutional Auto-Encoder
(SCAE) with the help of unlabeled real-world data.
• Learning-based Model Compression We introduce
a novel learning-based approach to obtain a losslessly
compressible model, for a high compression ratio with-
out sacrificing its performance. An exact low-rank con-
straint is enforced on the targeted weight matrix.
Fig. 1 shows successful VFR examples using DeepFont. In (a)(b), given the real-world query images, the top-5 font recognition results are listed, within which the ground-truth font classes are marked out (note that the texts are input manually for rendering purposes only; the font recognition process does not need any content information). More real-world examples are displayed in (c).
Table 1: Comparison of All VFR Datasets
Dataset name     Source   Label?   Purpose   Size        Class
VFRWild325 [4]   Real     Y        Test      325         93
VFR real test    Real     Y        Test      4,384       617
VFR real u       Real     N        Train     197,396     /
VFR syn train    Syn      Y        Train     2,383,000   2,383
VFR syn val      Syn      Y        Test      238,300     2,383
Although accompanied by high levels of background clutter, size and ratio variations, as well as perspective distortions, they are all correctly recognized by the DeepFont system.
2. DATASET

2.1 Domain Mismatch between Synthetic and Real-World Data
To apply machine learning to the VFR problem, we require
realistic text images with ground truth font labels. How-
ever, such data is scarce and expensive to obtain. More-
over, the training data requirement is vast, since there are
hundreds of thousands of fonts in use for Roman characters
alone. One way to overcome the training data challenge is to
synthesize the training set by rendering text fragments for
all the necessary fonts. However, to attain effective recog-
nition models with this strategy, we must face the domain
mismatch between synthetic and real-world text images [4].
For example, it is common for designers to edit the spacing,
aspect ratio or alignment of text arbitrarily, to make the
text fit other design components. The result is that charac-
ters in real-world images are spaced, stretched and distorted
in numerous ways. For example, Fig. 2 (a) and (b) depict
typical examples of character spacing and aspect ratio differ-
ences between (standard rendered) synthetic and real-world
images. Other perturbations, such as background clutter,
perspective distortion, noise, and blur, are also ubiquitous.
2.2 The AdobeVFR Dataset
Collecting and labeling real-world examples is notoriously hard, and thus a labeled real-world dataset has long been absent. A small dataset, VFRWild325, was collected in [4],
consisting of 325 real-world text images and 93 classes. How-
ever, the small size puts its effectiveness in jeopardy.
Chen et al. in [4] selected 2,420 font classes to work on.
We remove some script classes, ending up with a total of
2,383 font classes. We collected 201,780 text images from
various typography forums, where people post these images
seeking help from experts to identify the fonts. Most of them
come with hand-annotated font labels which may be inaccu-
rate. Unfortunately, only a very small portion of them fall
into our list of 2,383 fonts. All images are first converted
into gray scale. Those images with our target class labels
are then selected, and inspected by independent experts to verify that their labels are correct. Images with verified labels are then manually cropped with tight bounding boxes and normalized proportionally in size to an identical height of 105 pixels. Finally, we obtain 4,384 real-world test im-
ages with reliable labels, covering 617 classes (out of 2,383).
Compared to the synthetic data, these images typically have
much larger appearance variations caused by scaling, back-
ground clutter, lighting, noise, perspective distortions, and
compression artifacts. Removing the 4,384 labeled images
from the full set, we are left with 197,396 unlabeled real-
world images which we denote as VFR real u.
To create a sufficiently large set of synthetic training data, we follow the same approach as [4] to render long English words sampled from a large corpus, and generate tightly cropped,
gray-scale, and size-normalized text images. For each class,
we assign 1,000 images for training, and 100 for validation,
which are denoted as VFR syn train and VFR syn val, re-
spectively. The entire AdobeVFR dataset, consisting of VFR real test, VFR real u, VFR syn train and VFR syn val, is made publicly available at http://www.atlaswang.com/deepfont.html.
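For concreteness, the following is a minimal sketch of how one such size-normalized, tightly cropped, grayscale synthetic image could be rendered with Pillow. It is our own illustration, not the released data-generation code: the word, font file, and point size are placeholder assumptions.

from PIL import Image, ImageDraw, ImageFont, ImageOps

def render_word(word, ttf_path, pt_size=80, target_height=105):
    # Render white text on a black canvas so getbbox() finds the glyphs.
    font = ImageFont.truetype(ttf_path, pt_size)
    canvas = Image.new("L", (pt_size * (len(word) + 2), pt_size * 3), color=0)
    ImageDraw.Draw(canvas).text((10, 10), word, fill=255, font=font)
    tight = canvas.crop(canvas.getbbox())      # tight bounding box around the text
    tight = ImageOps.invert(tight)             # back to black text on white
    w, h = tight.size
    return tight.resize((max(1, round(w * target_height / h)), target_height))

img = render_word("Typography", "SomeFont.ttf")  # "SomeFont.ttf" is hypothetical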
The AdobeVFR dataset is the first large-scale benchmark
set consisting of both synthetic and real-world text images,
for the task of font recognition. To the best of our knowledge, VFR real test is so far the largest available set of real-world text images with reliable font label information (12.5 times larger than VFRWild325). The AdobeVFR dataset is extremely fine-grained, with highly subtle categorical variations, making it a new and challenging dataset for object recognition. Moreover, the substantial mismatch between synthetic and real-world data makes the AdobeVFR dataset an ideal subject for general domain adaptation and transfer learning research. It also promotes the new problem area of under-
standing design styles with deep learning.
Figure 2: (a) Different character spacing between a pair of synthetic and real-world images. (b) Different aspect ratios between a pair of synthetic and real-world images.

2.3 Synthetic Data Augmentation: A First Step to Reduce the Mismatch
Before feeding synthetic data into model training, it is
popular to artificially augment training data using label-
preserving transformations to reduce overfitting. In [10], the
authors applied image translations and horizontal reflections
to the training images, as well as altering the intensities of
their RGB channels. The authors in [4] added moderate
distortions and corruptions to the synthetic text images:
• 1. Noise: a small Gaussian noise with zero mean and standard deviation 3 is added to the input.
• 2. Blur: a random Gaussian blur with standard deviation from 2.5 to 3.5 is applied to the input.
• 3. Perspective Rotation: a randomly-parameterized affine transformation is applied to the input.
• 4. Shading: the input background is filled with a
gradient in illumination.
The above augmentations cover standard perturbations for
general images, and are adopted by us. However, as a very
particular type of images, text images have various real-
world appearances caused by specific handlings. Based on
the observations in Fig. 2 , we identify two additional font-
specific augmentation steps to our training data:
• 5. Variable Character Spacing: when rendering
each synthetic image, we set the character spacing (by
pixel) to be a Gaussian random variable of mean 10
and standard deviation 40, bounded by [0, 50].
• 6. Variable Aspect Ratio: before cropping each image into an input patch, the image, with height fixed, is squeezed in width by a random ratio, drawn from a uniform distribution between 5/6 and 7/6.
Note that these steps are not useful for the method in [4]
because it exploits very localized features. However, as we
show in our experiments, these steps lead to significant per-
formance improvements in our DeepFont system. Overall,
our data augmentation includes steps 1-6.
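As an illustration of how steps 1-4 and 6 could be applied to an already-rendered grayscale image, here is a minimal sketch (our own approximation, not the original training code). Step 5, variable character spacing, is a render-time parameter and is therefore omitted; the shear and shading magnitudes are assumptions, while the noise, blur, and aspect-ratio parameters follow the text above.

import numpy as np
from scipy.ndimage import gaussian_filter, affine_transform

def augment(img, rng=np.random.default_rng()):
    x = img.astype(np.float32)                       # black text on white, 0-255
    # 1. Noise: zero-mean Gaussian noise with standard deviation 3.
    x = x + rng.normal(0.0, 3.0, x.shape)
    # 2. Blur: random Gaussian blur with standard deviation in [2.5, 3.5].
    x = gaussian_filter(x, sigma=rng.uniform(2.5, 3.5))
    # 3. Perspective rotation: a small randomly-parameterized affine transform.
    shear = rng.uniform(-0.1, 0.1)                   # assumed magnitude
    x = affine_transform(x, np.array([[1.0, shear], [0.0, 1.0]]), cval=255.0)
    # 4. Shading: fill the background with a gradient in illumination.
    h, w = x.shape
    x = x + np.linspace(0.0, rng.uniform(0.0, 30.0), w)[None, :]   # assumed strength
    # 6. Variable aspect ratio: squeeze the width by a ratio drawn from U(5/6, 7/6).
    ratio = rng.uniform(5 / 6, 7 / 6)
    cols = np.clip((np.arange(int(w * ratio)) / ratio).astype(int), 0, w - 1)
    x = x[:, cols]                                   # nearest-neighbor width rescale
    return np.clip(x, 0, 255).astype(np.uint8)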
To give a visual impression, we take the real-world image in Fig. 2 (a), and synthesize a series of images in Fig. 3, all with the same text but with different data augmentation settings. Specifically, (a) is synthesized with no data augmenta-
tion; (b) is (a) with standard augmentation 1-4 added; (c)
is synthesized with spacing and aspect ratio customized to
be identical to those of Fig. 2 (a); (d) adds standard aug-
mentation 1-4 to (c). We input images (a)-(d) through the
trained DeepFont model. For each image, we compare its
layer-wise activations with those of the real image Fig. 2 (a) fed through the same model, by calculating the normalized MSEs. Fig. 3 (e) shows that those augmentations, especially the spacing and aspect ratio changes, reduce the gap between the feature hierarchies of real-world and synthetic data to a large extent. A few synthetic patches after full data augmentation 1-6 are displayed in Fig. 4. It is observable that they possess a much more visually similar appearance to real-world data.

Figure 3: The effects of data augmentation steps. (a)-(d): synthetic images of the same text under different augmentation settings: (a) no augmentation, (b) augmentations 1-4, (c) augmentations 5-6, (d) augmentations 1-6. (e) compares the relative differences of (a)-(d) with the real-world image in Fig. 2 (a), measured by layer-wise network activations through the same DeepFont model.
Figure 4: Examples of synthetic training 105 × 105
patches after pre-processing steps 1-6.
3. DOMAIN ADAPTED CNN

3.1 Domain Adaptation by CNN Decomposition and SCAE
Although data augmentation helps reduce the domain mismatch, enumerating all possible real-world degradations is impossible, and may further introduce degradation bias in training. In this section, we propose a learning framework to leverage both synthetic and real-world data, using multi-layer CNN decomposition and SCAE-based domain adaptation. Our approach extends the domain adaptation method in [7] to extract low-level features that represent both the synthetic and real-world data. We employ a Convolutional Neural Network (CNN) architecture, which is further decomposed into two sub-networks: a “shared” low-level sub-network which is learned from the composite set of synthetic and real-world data, and a high-level sub-network that learns a deep classifier from the low-level features.
The basic CNN architecture is similar to the popular ImageNet structure [10], as in Fig. 5. The numbers along the network pipeline specify the dimensions of the outputs of the corresponding layers. The input is a 105 × 105 patch sampled from a “normalized” image. Since a square window may not capture sufficient discriminative local structures, and is unlikely to catch high-level combinational features when two or more graphemes or letters are joined as a single glyph (e.g., ligatures), we introduce a squeezing operation that scales the width of the height-normalized image to be of a constant ratio relative to the height (2.5 in all our experiments). Note that the squeezing operation is equivalent to producing “long” rectangular input patches; it is independent from the variable aspect ratio operation introduced in Section 2.3, as they serve different purposes.
When the CNN model is trained fully on a synthetic dataset, it witnesses a significant performance drop when testing on real-world data, compared to when applied to another synthetic validation set. This also happens with other models such as in [4], which uses training and testing sets of similar properties to ours. It alludes to discrepancies between the distributions of synthetic and real-world examples. We propose to decompose the N CNN layers into two sub-networks to be learned sequentially:
• Unsupervised cross-domain sub-network C u, which consists of the first K layers of the CNN. It accounts for extracting low-level visual features shared by both the synthetic and real-world data domains. C u will be trained in an unsupervised way, using unlabeled data from both domains. It constitutes the crucial step that further minimizes the low-level feature gap, beyond the previous data augmentation efforts.
• Supervised domain-specific sub-network C s , which
consists of the remaining N − K layers. It accounts for
learning higher-level discriminative features for classi-
fication, based on the shared features from C u . C s
will be trained in a supervised way, using labeled data
from the synthetic domain only.
We show an example of the proposed CNN decomposition in
Fig. 5. The C u and C s parts are marked by red and green
colors, respectively, with N = 8 and K = 2. Note that the low-level shared features are assumed to be independent of class labels. Therefore, to address the open-ended problem of font classes, one may keep re-using the C u sub-network and only re-train the C s part.
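To make the decomposition concrete, the sketch below shows one way to express it in PyTorch for N = 8 and K = 2: the first two convolutional layers (C u) are imported from the SCAE and frozen, while the remaining layers (C s) are trained on labeled synthetic data. Layer widths and kernel sizes are placeholders; the paper's exact configuration is not reproduced here.

import torch.nn as nn

class DeepFontLike(nn.Module):
    def __init__(self, num_classes=2383):
        super().__init__()
        # C_u: first K = 2 conv layers, initialized from the SCAE encoder and frozen.
        self.c_u = nn.Sequential(
            nn.Conv2d(1, 64, 11, stride=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # C_s: remaining N - K layers, trained supervised on synthetic data only.
        self.c_s = nn.Sequential(
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(6), nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(),   # fc6
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),          # fc7
            nn.Linear(4096, num_classes),                            # fc8
        )
        for p in self.c_u.parameters():
            p.requires_grad = False      # C_u stays fixed during supervised training

    def forward(self, x):
        return self.c_s(self.c_u(x))

Under this decomposition, extending to new font classes amounts to re-training C s (for example, replacing the last linear layer) while leaving the shared C u untouched.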
Learning C u from SCAE Representative unsupervised
feature learning methods, such as the Auto-Encoder and the
Denoising Auto-Encoder, perform a greedy layer-wise pre-
training of weights using unlabeled data alone followed by
supervised fine-tuning ([3]). However, they rely mostly on
fully-connected models and ignore the 2D image structure.
In [13], a Convolutional Auto-Encoder (CAE) was proposed
to learn non-trivial features using a hierarchical unsuper-
vised feature extractor that scales well to high-dimensional
inputs.

Figure 5: The CNN architecture in the DeepFont system, and its decomposition marked by different colors (N = 8, K = 2).

The CAE architecture is intuitively similar to the conventional auto-encoders in [18], except that their
weights are shared among all locations in the input, preserv-
ing spatial locality. CAEs can be stacked to form a deep
hierarchy called the Stacked Convolutional Auto-Encoder
(SCAE), where each layer receives its input from a latent
representation of the layer below. Fig. 6 plots the SCAE
architecture for our K = 2 case.
Figure 6: The Stacked Convolutional Auto-Encoder
(SCAE) architecture.
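A minimal sketch of such a K = 2 stacked convolutional auto-encoder is given below; the encoder mirrors the hypothetical C u block sketched earlier, and the decoder shape and training loop are our own assumptions rather than the paper's exact recipe.

import torch
import torch.nn as nn

class SCAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(        # becomes C_u after training
            nn.Conv2d(1, 64, 11, stride=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(        # mirrors the encoder for 105x105 inputs
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 64, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 11, stride=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def pretrain(scae, loader, epochs=1):
    # Unsupervised reconstruction with MSE loss, learning rate 0.01 (not annealed).
    opt = torch.optim.SGD(scae.parameters(), lr=0.01, momentum=0.9)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for patches in loader:               # unlabeled 105x105 patches from both domains
            loss = mse(scae(patches), patches)
            opt.zero_grad()
            loss.backward()
            opt.step()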
Training Details We first train the SCAE on both synthetic and real-world data in an unsupervised way, with a learning rate of 0.01 (we do not anneal it during training).
Mean Squared Error (MSE) is used as the loss function. Af-
ter SCAE is learned, its Conv. Layers 1 and 2 are imported
to the CNN in Fig. 5, as the C u sub-network and fixed. The
C s sub-network, based on the output by C u , is then trained
in a supervised manner. We start with the learning rate at
0.01, and follow a common heuristic to manually divide the
learning rate by 10 when the validation error rate stops de-
creasing with the current rate. The “dropout” technique is
applied to fc6 and fc7 layers during training. Both C u and
C s are trained with a default batch size of 128, momentum
of 0.9 and weight decay of 0.0005. The network training is
implemented using the CUDA ConvNet package [10], and
runs on a workstation with 12 Intel Xeon 2.67GHz CPUs
and 1 GTX680 GPU. It takes around 1 day to complete the
entire training pipeline.
Testing Details We adopt multi-scale multi-view testing
to improve robustness. Each test image is first normalized to 105 pixels in height, then squeezed in
width by three different random ratios, all drawn from a
uniform distribution between 1.5 and 3.5, matching the ef-
fects of squeezing and variable aspect ratio operations during
training. Under each squeezed scale, five 105 × 105 patches
are sampled at different random locations. That constitutes fifteen test patches in total from one test image, each with a different aspect ratio and view. As
every single patch could produce a softmax vector through
the trained CNN, we average all fifteen softmax vectors to
determine the final classification result of the test image.
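The testing rule above can be summarized by the following sketch: three random squeeze ratios, five random 105 × 105 crops per ratio, and an average over the fifteen softmax vectors. The model handle, the tensor conversion helper, and the interpretation of the squeeze ratio as width relative to height are assumptions of this illustration.

import random
import torch
import torch.nn.functional as F

def predict(model, img, to_tensor, num_ratios=3, crops_per_ratio=5):
    votes = []
    h = img.size[1]                             # PIL image, height normalized to 105
    for _ in range(num_ratios):
        ratio = random.uniform(1.5, 3.5)
        view = img.resize((int(h * ratio), h))  # squeeze/stretch the width
        for _ in range(crops_per_ratio):
            x0 = random.randint(0, max(0, view.size[0] - 105))
            patch = view.crop((x0, 0, x0 + 105, 105))
            with torch.no_grad():
                logits = model(to_tensor(patch).unsqueeze(0))
            votes.append(F.softmax(logits, dim=1))
    return torch.cat(votes).mean(dim=0)         # averaged class posterior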
3.2 Connections to Previous Work
We are not the first to look into an essentially “hierarchical” deep architecture for domain adaptation. In [15], the proposed transfer learning approach relies on the unsupervised learning of representations. Bengio et al. hypothesized in [2] that more levels of representation can give rise to more abstract, more general features of the raw input, and that the lower layers of the predictor constitute a hierarchy of features that can be shared across variants of the input distribution. The authors in [7] used data from the union of all domains to learn their shared features, which is different from many previous domain adaptation methods that focus on learning features in an unsupervised way from the target domain only. However, their entire network hierarchy is learned in an unsupervised fashion, except for a simple linear classifier trained on top of the network, i.e., K = N − 1.
In [19], the CNN learned a set of filters from raw images
as the first layer, and those low-level filters are fixed when
training higher layers of the same CNN, i.e., K = 1. In
other words, they either adopt a simple feature extractor
(K = 1), or apply a shallow classifier (K = N − 1). Our
CNN decomposition is different from prior work in that:
• Our feature extractor C u and classifier C s are both deep sub-networks with more than one layer (both K
and N − K are larger than 1), which means that both
are able to perform more sophisticated learning. More
evaluations can be found in Section 5.2.
• We learn “shared-feature” convolutional filters rather than fully-connected networks such as in [7]; the former are more suitable for visual feature extraction.
A domain mismatch between synthetic and real-world data in the lower-level statistics can occur in many other scenarios, such as real-world face recognition from rendered images or sketches, recognizing characters in real scenes with synthetic training data, and human pose estimation with synthetic images generated from 3D human body models. We conjecture that our framework is applicable to those scenarios as well,
where labeled real-world data is scarce but synthetic data
can be easily rendered.
4. LEARNING-BASED MODEL COMPRESSION
The architecture in Fig. 5 contains a huge number of pa-
rameters. It is widely known that the deep models are heav-
ily over-parameterized [5] and thus those parameters can be
compressed to reduce storage by exploring their structure.
For a typical CNN, about 90% of the storage is taken up by the densely connected layers, which shall be our focus for model compression.
One way to shrink the number of parameters is matrix factorization [6]. Given the parameter matrix W ∈ R^{m×n}, we factorize it using the singular value decomposition (SVD):

    W = U S V^T,    (1)

where U ∈ R^{m×m} and V ∈ R^{n×n} are two dense orthogonal matrices and S ∈ R^{m×n} is a diagonal matrix. To restore an approximate W, we can utilize Ũ, S̃ and Ṽ, which denote the submatrices corresponding to the top k singular vectors in U and V along with the top k eigenvalues in S:

    W̃ = Ũ S̃ Ṽ^T.    (2)

The compression ratio, given m, n, and k, is k(m+n+1)/(mn), which is very promising when m, n ≫ k. However, the quality of the SVD approximation is controlled by the decay of the eigenvalues of S. Even though it is verified in Fig. 7 that the eigenvalues of weight matrices usually decay fast (the 6-th largest eigenvalue is already less than 10% of the largest one in magnitude), the truncation inevitably leads to information loss, and potential performance degradation, compared to the uncompressed model.
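As a quick sanity check of this ratio (our own arithmetic, with m and n inferred from the fc6 parameter count reported later in Table 6):

# m * n = 36864 * 4096 = 150,994,944, the uncompressed fc6 size in Table 6.
m, n = 36864, 4096
for k in (5, 10, 50, 100):
    compressed = k * (m + n + 1)        # store the top-k singular triplets
    print(k, compressed, compressed / (m * n))
# k = 5 gives 204,805 parameters, matching the "fc6 size" column of Table 6.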
Figure 7: The plots of the eigenvalues of the fc6 layer weight matrix in Fig. 5, in (a) standard scale and (b) logarithmic scale. This densely connected layer takes up 85% of the total model size.
Instead of first training a model and then lossy-compressing its parameters, we propose to directly learn a losslessly compressible model (the term “lossless” refers to the fact that there is no further loss after the model is trained). Given the parameter matrix W of a certain network layer, our goal is to make sure that its rank is exactly no more than a small constant k. In terms of implementation, in each iteration, an extra hard thresholding operation [11] is executed on W after it is updated by a conventional back-propagation step:

    W_k = U T_k(S) V^T,    (3)

where T_k keeps the largest k eigenvalues in S while setting the others to zero. W_k is the best rank-k approximation of W, similarly to (2). However, different from (2), the proposed method incorporates the low-rank approximation into model training and jointly optimizes them as a whole, guaranteeing a rank-k weight matrix that is ready to be compressed losslessly by applying (1). Note that there are other alternatives, such as vector quantization methods [8], that have been applied to compressing deep models with appealing performance. We will investigate utilizing them together to further compress our model in the future.
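A sketch of the hard-thresholding step in Eq. (3) is given below: after each gradient update, the fc6 weight matrix is projected back onto the set of rank-k matrices via a truncated SVD. The training-loop integration is omitted, and the update shown in the comment is only a schematic SGD step.

import numpy as np

def project_rank_k(W, k):
    # Thin SVD; S is returned as a vector of singular values in descending order.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S[k:] = 0.0                    # T_k: keep the k largest values, zero the rest
    return (U * S) @ Vt            # best rank-k approximation, losslessly compressible

# Inside a training iteration (schematic):
#   W -= lr * grad_W               # conventional back-propagation update
#   W = project_rank_k(W, k=100)   # extra hard-thresholding step of Eq. (3)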
5. EXPERIMENTS

5.1 Analysis of Domain Mismatch
We first analyze the domain mismatch between synthetic and real-world data, and examine how our synthetic data augmentation can help. First we define five dataset variations generated from VFR syn train and VFR real u. These are denoted by the letters N, S, F, R and FR and are explained in Table 2.

Table 2: Comparison of Training and Testing Errors (%) of Five SCAEs (K = 2)
Methods   Training Data                                   Train   Test N   Test R
SCAE N    N: VFR syn train, no data augmentation          0.02    3.54     31.28
SCAE S    S: VFR syn train, standard augmentation 1-4     0.21    2.24     19.34
SCAE F    F: VFR syn train, full augmentation 1-6         1.20    1.67     15.26
SCAE R    R: VFR real u, real unlabeled dataset           9.64    5.73     10.87
SCAE FR   FR: combination of data from F and R            6.52    2.02     14.01

We train five separate SCAEs, all of the same architecture as in Fig. 6, using the above five training data variants. The training and testing errors are all measured by relative MSEs (normalized by the total energy) and compared in Table 2. The testing errors are evaluated on both the unaugmented synthetic dataset N and the real-world dataset R. Ideally, the better the SCAE captures the features of a domain, the smaller its reconstruction error will be on that domain.
As revealed by the training errors, real-world data contains rich visual variations and is more difficult to fit. The sharp performance drop from N to R of SCAE N indicates that the convolutional features for synthetic and real data are quite different. This gap is reduced in SCAE S, and further in SCAE F, which validates the effectiveness of adding the font-specific data augmentation steps. SCAE R fits the real-world data best, at the expense of a larger error on N. SCAE FR achieves the overall best reconstruction performance on both synthetic and real-world images.
Fig. 8 shows an example patch from a real-world font image of highly textured characters, and its reconstruction outputs from all five models. The gradual visual variations across the results confirm the existence of a mismatch between synthetic and real-world data, and verify the benefit of data augmentation as well as of learning shared features.

Figure 8: A real-world patch (a) and its reconstruction results from the five SCAE models: (b) SCAE N, (c) SCAE S, (d) SCAE F, (e) SCAE R, (f) SCAE FR.
5.2 Analysis of Network Structure
Fixing Network Depth N . Given a fixed network com-
plexity (N layers), one may ask about how to best decom-
pose the hierarchy to maximize the overall classification per-
formance on real-world data. Intuitively, we should have
sufficient layers of lower-level feature extractors as well as
enough subsequent layers for good classification of labeled
data. Thus, the depth K of C u should neither be too small
nor too large.
Table 3: Top-5 Testing Errors (%) for Different CNN Decompositions (Varying K, N = 8)
K               0       1       2       3       4       5
Train           8.46    9.88    11.23   12.54   15.21   17.88
VFR real test   20.72   20.31   18.21   18.96   22.52   25.97

Table 4: Top-5 Testing Errors (%) for Different CNN Decompositions (Varying K, N = K + 6)
K               1       2       3       4
Train           11.46   11.23   10.84   10.86
VFR real test   21.58   18.21   18.15   18.24

Figure 9: The reconstruction results of a real-world patch using SCAE FR with different K values: (a) K = 1, (b) K = 2, (c) K = 4, (d) K = 5.

Table 3 shows that while the classification training error increases with K, the testing error does not vary monotonically. The best performance is obtained with K = 2 (K = 3 is
slightly worse), where smaller or larger values of K give sub-
stantially worse performance. When K = 5, all layers are
learned using SCAE, leading to the worst results. Rather
than learning all hidden layers by unsupervised training, as
suggested in [7] and other DL-based transfer learning work,
our CNN decomposition reaches its optimal performance
when higher-layer convolutional filters are still trained by
supervised data. A visual inspection of reconstruction re-
sults of a real-world example in Fig. 9, using SCAE FR with
different K values, shows that a larger K causes less informa-
tion loss during feature extraction and leads to a better re-
construction. But in the meantime, the classification result
may turn worse since noise and irrelevant high frequency de-
tails (e.g. textures) might hamper recognition performance.
The optimal K = 2 corresponds to a proper “content-aware” smoothing, filtering out “noisy” details while keeping rec-
ognizable structural properties of the font style.
Fixing C s or C u Depth. We investigate the influence of K (the depth of C u) when the depth of C s (i.e., N − K) is kept fixed. Table 4 reveals that a deeper C u contributes little to the results. Similar trends are observed when we fix K and adjust N (and thus the depth of C s). Therefore, we choose N = 8, K = 2 as the default setting.
5.3 Recognition Performance on VFR Datasets
We implemented and evaluated the local feature embedding-based algorithm (LFE) in [4] as a baseline, and include the four different DeepFont models specified in Table 5. The first two models are trained in a fully supervised manner on F, without any decomposition applied. For each of the latter two models, its corresponding SCAE (SCAE FR for DeepFont CAE FR, and SCAE R for DeepFont CAE R) is first trained, and its first two convolutional layers are then exported to form C u. All trained models are evaluated in terms of top-1 and top-5 classification errors on the VFR syn val dataset for validation purposes. Benefiting from a large learning capacity, it is clear that the DeepFont models fit synthetic data significantly better than LFE. Notably, the top-5 errors of all DeepFont models (except for DeepFont CAE R) reach zero on the validation set, which is quite impressive for such a fine-grained classification task.
We then compare DeepFont models with LFE on the orig-
inal VFRWild325 dataset in [4]. As seen from Table 5, while
DeepFont S fits synthetic training data best, its performance
is the poorest on real-world data, showing a severe over-
fitting. With two font-specific data augmentations added
in training, the DeepFont F model adapts better to real-
world data, outperforming LFE by roughly 8% in top-5 er-
ror. An additional gain of 2% is obtained when unlabeled
real-world data is utilized in DeepFont CAE FR. Next, the
DeepFont models are evaluated on the new VFR real test
dataset, which is more extensive in size and class coverage.
A large margin of around 5% in top-1 error is gained by
DeepFont CAE FR model over the second best (DeepFont
F), with its top-5 error as low as 18.21%. We will use Deep-
Font CAE FR as the default DeepFont model.
Although SCAE R has the best reconstruction result on
real-world data on which it is trained, it has large training
and testing errors on synthetic data. Since our supervised
training relies fully on synthetic data, an effective feature
extraction for synthetic data is also indispensable. The er-
ror rates of DeepFont CAE R are also worse than those of
DeepFont CAE FR and even DeepFont F on the real-world
data, due to the large mismatch between the low-level and
high-level layers in the CNN.
Table 5: Comparison of Training and Testing Errors on Synthetic and Real-world Datasets (%)
Methods            Training Data    Training   VFR syn val       VFRWild325        VFR real test
                   C u      C s     Error      Top-1    Top-5    Top-1    Top-5    Top-1    Top-5
LFE                /        /       /          26.50    6.55     44.13    30.25    57.44    32.69
DeepFont S         /        F       0.84       1.03     0        64.60    57.23    57.51    50.76
DeepFont F         /        F       8.46       7.40     0        43.10    22.47    33.30    20.72
DeepFont CAE FR    FR       F       11.23      6.58     0        38.15    20.62    28.58    18.21
DeepFont CAE R     R        F       13.67      8.21     1.26     44.62    29.23    39.46    27.33
Figure 10: Failure VFR examples using DeepFont.
Another interesting observation is that all methods get
similar top-5 errors on VFRWild325 and VFR real test, show-
ing their statistical similarity. However, the top-1 errors of
DeepFont models on VFRWild325 are significantly higher
than those on VFR real test, with a difference of up to 10%.
In contrast, the top-1 error of LFE is more than 13% higher on VFR real test than on VFRWild325. For the small VFRWild325, the recognition result is easily affected by “bad” examples (e.g., low resolution or highly compressed images) and class bias (less than 4% of all classes are covered). On
the other hand, the larger VFR real test dataset dilutes the
possible effect of outliers, and examines a lot more classes.
Figure 11: Examples of font similarity. For each example, the top is the query image, and the renderings with the most similar fonts are returned below.
Fig. 10 lists some failure cases of DeepFont. For example, the top left image contains extra “fluff” decorations along the text boundaries, which are nonexistent in the original fonts, making the algorithm incorrectly map it to some “artistic” fonts. Others are affected by 3-D effects and strong obstacles in the foreground and background. Being considerably difficult to adapt to, those examples fail mostly because there are neither specific augmentation steps handling their effects, nor enough examples in VFR real u from which to extract corresponding robust features.
5.4 Evaluating Font Similarity using DeepFont
There are a variety of font selection tasks with different
goals and requirements. One designer may wish to match a
font to the style of a particular image. Another may wish
to find a free font which looks similar to a commercial font
such as Helvetica. A third may simply be exploring a large
set of fonts such as Adobe TypeKit or Google Web Fonts.
Exhaustively exploring the entire space of fonts using an
alphabetical listing is unrealistic for most users. The authors in [14] proposed to select fonts based on online crowdsourced attributes and to explore font similarity, which enables a user to discover other visually similar fonts given a specific font. Such a font similarity measure is very helpful for font selection, organization, browsing, and suggestion.
Based on our DeepFont system, we are able to build up
measures of font similarity. We use the 4096 × 1 outputs of
the fc7 layer as the high-level feature vectors describing font
visual appearances. We then extract such features from all samples in the VFR syn val dataset, obtaining 100 feature vectors per class. Next, for each class, the 100 feature vectors are averaged into a representative vector. Finally, we calculate the Euclidean distance between the representative vectors of two font classes as their similarity measure. Visualized ex-
amples are demonstrated in Fig. 11. For each example, the
top is the query image of a known font class; the most simi-
lar fonts obtained by the font similarity measures are sorted
below. Note that although the returned fonts can belong to different font families than the query, they share visual similarities that are identifiable by human perception.
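A sketch of this similarity construction is given below: per-class mean fc7 features and pairwise Euclidean distances. Here extract_fc7 is a placeholder for a forward pass that returns the 4096-dimensional fc7 activation of one patch.

import numpy as np

def class_representatives(patches_by_class, extract_fc7):
    reps = {}
    for font, patches in patches_by_class.items():            # e.g. 100 patches per class
        feats = np.stack([extract_fc7(p) for p in patches])   # shape (100, 4096)
        reps[font] = feats.mean(axis=0)                       # representative vector
    return reps

def most_similar(query_font, reps, topn=10):
    q = reps[query_font]
    dists = {f: np.linalg.norm(q - v) for f, v in reps.items() if f != query_font}
    return sorted(dists, key=dists.get)[:topn]                # closest fonts first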
Although not numerically verified as in [14], the DeepFont results are qualitatively better when we look at the top-10 most similar fonts for a wide range of query fonts; the authors of [14] agreed in personal communication with us.
5.5 DeepFont Model Compression
Since the fc6 layer takes up 85% of the total model size, we
first focus on its compression. We start from a well-trained
DeepFont model (DeepFont CAE FR), and continue tuning
it with the hard thresholding (3) applied to the fc6 parame-
ter matrix W in each iteration, until the training/validation errors reach a plateau again.
Table 6 compares the DeepFont models compressed us-
ing conventional matrix factorization (denoted as the “lossy”
method), and the proposed learning based method (denoted
as the “lossless” method), under different compression ratios
(fc6 and total size counted by parameter numbers). The
last column of Table 6 lists the top-5 testing errors (%) on
VFR real test. We observe a consistent margin of the “lossless” method over its “lossy” counterpart, which becomes more significant as k decreases (more than 1% when k = 5). Notably, when k = 100, the proposed “lossless” compression suffers no visible performance loss, while still maintaining a good compression ratio of 5.79.

Table 6: Performance Comparisons of Lossy and Lossless Compression Approaches (sizes in parameter counts)
          fc6 size      Total size    Ratio   Error (Lossy)          Error (Lossless)
default   150,994,944   177,546,176   NA      18.21 (uncompressed)
k=5       204,805       26,756,037    6.64    20.67                  19.23
k=10      409,610       26,960,842    6.59    19.25                  18.87
k=50      2,048,050     28,599,282    6.21    19.04                  18.67
k=100     4,096,100     30,647,332    5.79    18.68                  18.21
In practice, it takes around 700 megabytes to store all the parameters in our uncompressed DeepFont model, which is too large to be embedded in or downloaded by most customer software. More aggressively, we reduce the output sizes of both fc6 and fc7 to 2048, and further apply the proposed compression method (k = 10) to the fc6 parameter matrix. The obtained “mini” model, with only 9,477,066 parameters and a high compression ratio of 18.73, takes less than 40 megabytes of storage. Portable even on mobile devices, it manages to keep a top-5 error rate of around 22%.
6. CONCLUSION
In this paper, we develop the DeepFont system to remark-
ably advance the state-of-the-art in the VFR task. A large
set of labeled real-world data as well as a large corpus of un-
labeled real-world images is collected for both training and
testing, which is the first of its kind and will be made pub-
licly available soon. While relying on the learning capacity
of CNN, we need to combat the mismatch between available
training and testing data. The introduction of SCAE-based
domain adaption helps our trained model achieve a higher
than 80% top-5 accuracy. A novel lossless model compres-
sion is further applied to promote the model storage effi-
ciency. The DeepFont system not only is effective for font
recognition, but can also produce a font similarity measure
for font selection and suggestion.
7. REFERENCES
[1] C. Avilés-Cruz, R. Rangel-Kuoppa, M. Reyes-Ayala,
A. Andrade-Gonzalez, and R. Escarela-Perez.
High-order statistical texture analysis: font
recognition applied. PRL, 26(2):135–145, 2005.
[2] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[3] Y. Bengio, P. Lamblin, D. Popovici, and
H. Larochelle. Greedy layer-wise training of deep
networks. NIPS, 19:153, 2007.
[4] G. Chen, J. Yang, H. Jin, J. Brandt, E. Shechtman,
A. Agarwala, and T. X. Han. Large-scale visual font
recognition. In CVPR, pages 3598–3605. IEEE, 2014.
[5] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al.
Predicting parameters in deep learning. In NIPS,
pages 2148–2156, 2013.
[6] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and
R. Fergus. Exploiting linear structure within
convolutional networks for efficient evaluation. In
NIPS, pages 1269–1277, 2014.
[7] X. Glorot, A. Bordes, and Y. Bengio. Domain
adaptation for large-scale sentiment classification: A
deep learning approach. In ICML, 2011.
[8] Y. Gong, L. Liu, M. Yang, and L. Bourdev.
Compressing deep convolutional networks using vector
quantization. arXiv preprint arXiv:1412.6115, 2014.
[9] M.-C. Jung, Y.-C. Shin, and S. N. Srihari. Multifont
classification using typographical attributes. In
ICDAR, pages 353–356. IEEE, 1999.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
Imagenet classification with deep convolutional neural
networks. In NIPS, pages 1097–1105, 2012.
[11] Z. Lin, M. Chen, and Y. Ma. The augmented lagrange
multiplier method for exact recovery of corrupted
low-rank matrices. arXiv preprint arXiv:1009.5055, 2010.
[12] H. Ma and D. Doermann. Gabor filter based
multi-class classifier for scanned document images. In
ICDAR, volume 2, pages 968–968. IEEE, 2003.
[13] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber.
Stacked convolutional auto-encoders for hierarchical
feature extraction. In ICANN, pages 52–59. 2011.
[14] P. O’Donovan, J. Lībeks, A. Agarwala, and
A. Hertzmann. Exploratory font selection using
crowdsourced attributes. ACM TOG, 33(4):92, 2014.
[15] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng.
Self-taught learning: transfer learning from unlabeled
data. In ICML, pages 759–766. ACM, 2007.
[16] R. Ramanathan, K. Soman, L. Thaneshwaran,
V. Viknesh, T. Arunkumar, and P. Yuvaraj. A novel
technique for english font recognition using support
vector machines. In ARTCom, pages 766–769, 2009.
[17] H.-M. Sun. Multi-linguistic optical font recognition
using stroke templates. In ICPR, volume 2, pages
889–892. IEEE, 2006.
[18] P. Vincent, H. Larochelle, Y. Bengio, and P.-A.
Manzagol. Extracting and composing robust features
with denoising autoencoders. In ICML, pages
1096–1103. ACM, 2008.
[19] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng.
End-to-end text recognition with convolutional neural
networks. In ICPR, pages 3304–3308. IEEE, 2012.
[20] Y. Zhu, T. Tan, and Y. Wang. Font recognition based
on global texture analysis. IEEE TPAMI, 2001.