A Bi-directional LSTM Approach for Polyphone Disambiguation in Mandarin Chinese
Changhao Shan 1, Lei Xie 1, Kaisheng Yao 2
1 School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2 Microsoft Corporation, Redmond, WA 98052, USA
{chshan, lxie}@nwpu-aslp.org, kaisheny@microsoft.com
Abstract
Polyphone disambiguation in Mandarin Chinese aims to pick the correct pronunciation from several candidates for a polyphonic character. It serves as an essential component in human language technologies such as text-to-speech synthesis. Since the pronunciation of most polyphonic characters can be easily decided from their contexts in the text, in this paper we address the polyphone disambiguation problem as a sequential labeling task. Specifically, we propose to use a bidirectional long short-term memory (BLSTM) neural network that encodes both the past and future observations on the input character sequence to predict the pronunciations. We also empirically study the impacts of (1) modeling different lengths of context, (2) the number of BLSTM layers and (3) the granularity of part-of-speech (POS) tags used as features. Our results show that a deep BLSTM is able to achieve state-of-the-art performance in polyphone disambiguation.
Index Terms: Polyphone disambiguation, Grapheme-to-phoneme conversion, Sequence tagging, Bi-directional LSTM, Text-to-Speech
1. Introduction
Grapheme-to-phoneme (G2P) conversion aims to predict the pronunciation of a word given its orthography, i.e., a series of characters or graphemes. It is an essential component in human language technologies, especially in speech synthesis and speech recognition. In an alphabetic language such as English, the main problem a G2P module faces is to generate pronunciations for out-of-vocabulary (OOV) words [1, 2, 3, 4, 5, 6, 7]. However, in Chinese, a character-based language, most characters have only one fixed pronunciation and each character is pronounced as a tonal syllable. Hence the main difficulty in Chinese G2P conversion is polyphone disambiguation, which aims to pick the correct pronunciation from several candidates for a polyphonic character [8].
A variety of approaches have been proposed to address the polyphone disambiguation problem. They can be categorized into knowledge-based and learning-based approaches. A rich pronunciation dictionary and human rules are essential in a knowledge-based system. The dictionary is designed to list as many words with polyphonic characters as possible, together with their pronunciations [9]. But the dictionary cannot cover all the polyphonic cases in the language, so some rules are crafted by language experts to handle the remaining cases. At runtime, the pronunciation of a polyphone is first searched in the dictionary; if it is not found, the manual rules are consulted to determine the pronunciation.
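To make the procedure concrete, here is a minimal Python sketch of this lookup-then-rules flow; the dictionary entries and the fallback rule are invented for illustration and are not from the paper.

# Knowledge-based disambiguation: dictionary lookup first, hand-written
# rules as a fallback. All entries below are hypothetical examples.
POLYPHONE_DICT = {
    ("朝阳", "ns"): "chao2",  # place name (Chaoyang)
    ("朝阳", "n"): "zhao1",   # morning sun
    ("朝夕", "n"): "zhao1",   # morning and night
}

def rule_fallback(char, pos):
    # Illustrative expert rule: choose a default reading for '朝' by POS.
    if char == "朝":
        return "chao2" if pos == "ns" else "zhao1"
    return None  # no rule available

def knowledge_based_pronounce(char, word, pos):
    pron = POLYPHONE_DICT.get((word, pos))            # 1) dictionary lookup
    return pron if pron else rule_fallback(char, pos)  # 2) rule fallback

print(knowledge_based_pronounce("朝", "朝阳", "ns"))  # -> chao2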
The knowledge-based approach heavily relies on human expertise, while the learning-based approach aims to automatically learn a polyphone disambiguation model from a set of data. Similar to G2P conversion in English, a joint n-gram model can be used for polyphone disambiguation. Because of the relatively small pronunciation cardinality of polyphonic characters, i.e., two to four, n-gram statistics can be reliably estimated from a training set, leading to reasonable performance. Such n-gram models are usually implemented as a weighted finite state transducer (WFST). Polyphone disambiguation can also be treated as a classification task: based on a set of features extracted for a polyphone, its pronunciation is predicted from a set of candidates through a decision tree (DT) [10] or a maximum entropy (Maxent) model [11]. A study has shown that a Maxent model outperforms DT in polyphone disambiguation [11]. A learning-based approach can be integrated with a knowledge-based approach to form a hybrid approach [12], where most of the polyphones are disambiguated by a learned model, but the pronunciations of some polyphonic characters are determined by human rules.
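As a point of reference for the classification view, the following Python sketch realizes a Maxent-style classifier with scikit-learn's logistic regression; the toy features and samples are our own illustration, not the feature set of [11].

# Classification view of polyphone disambiguation: sparse context
# features -> logistic regression (a Maxent model).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training samples: context features and pronunciation labels.
X_raw = [
    {"word": "朝阳", "pos": "ns"},
    {"word": "朝阳", "pos": "n"},
    {"word": "朝夕", "pos": "n"},
]
y = ["chao2", "zhao1", "zhao1"]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_raw), y)
print(clf.predict(vec.transform([{"word": "朝阳", "pos": "ns"}])))  # expected: ['chao2']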
Since the pronunciation of most polyphonic characters can be easily decided from their contexts, in this paper we address the polyphone disambiguation task as a sequential labeling (or tagging) task that models the important contextual information. Specifically, we propose to use a bidirectional long short-term memory (BLSTM) neural network to determine the pronunciations of polyphones in Mandarin Chinese. Our approach is motivated by the recent tremendous success of BLSTM models in English G2P conversion [1] [13] and various sequential learning tasks [14, 15, 16, 17]. LSTM uses specifically designed gates to control information flow and thus has exceptional context modeling ability [18]. BLSTM is composed of a forward LSTM and a backward LSTM, and thus it can model both the past and the future contexts. However, the LSTM models used in English G2P are not readily applicable to polyphone disambiguation in Mandarin. First, besides the substantial differences between the two languages, we need to determine the best context to model for this task. Second, recent studies suggest that using multiple hidden layers can learn hierarchical features and boost performance; we would like to investigate whether a deep BLSTM architecture can benefit this task. Our results show that a deep bidirectional LSTM is able to achieve state-of-the-art performance in polyphone disambiguation.
2. Features
In Chinese, the pronunciation of a polyphonic character can usually be determined by the word that contains it and its neighboring words. For the examples in Table 1, we can easily get the pronunciation of the polyphone '朝' once the word that contains it is given. Hence the identity of the word that contains the polyphone is an important feature. However, in Table 1, we still cannot make a distinction between "朝阳(zhao1)" and "朝阳(chao2)". But the two words can be discriminated (and the pronunciation of the polyphone '朝' can be determined) if we have the part-of-speech (POS) information: if the word is tagged as 'ns', the polyphonic character '朝' in the word "朝阳" is pronounced as 'chao2'; if tagged as 'n', '朝' in "朝阳" is pronounced as 'zhao1'. Thus the POS tag is an important feature for polyphone disambiguation. Therefore, we simply employ the identity of the word that contains the polyphone and its POS tag as features.

Table 1: Examples showing that the word that contains the polyphone and its POS tag are important features for polyphone disambiguation. The polyphone is in bold font.

Word | Pronunciation | POS | English translation
朝夕 | zhao1 | n  | morning and night
朝阳 | chao2 | ns | Chaoyang (a place name)
朝阳 | zhao1 | n  | morning sun

Figure 1: Recurrent neural network unrolled in time.
Table 2 shows that the left and right contexts are also quite useful for polyphone disambiguation. In this example, the character '转' can be used as two verbs with different meanings and different pronunciations. But the pronunciations of the two '转' can be easily discriminated if we observe their contexts (different POS tags). This also motivates us to use a recurrent network to model the contexts in the polyphone disambiguation task.

Table 2: Examples showing that contexts are useful for polyphone disambiguation. The polyphone is in bold font.

Sequence | Pronunciation | English translation
玩/v 转/v 北京/ns | zhuan4 | play around in Beijing
汉字/nz 转/v 拼音/n | zhuan3 | convert Hanzi to Pinyin
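As a concrete illustration of these features, the Python sketch below extracts the word identity, its POS tag and the neighboring POS tags from a word-segmented, POS-tagged sentence; the function and its feature names are our own illustrative choices.

# Extract the Section-2 features for the word containing a polyphone
# from a (word, POS) sequence produced by a segmenter/tagger.
def extract_features(tagged_words, idx):
    """tagged_words: list of (word, pos); idx: word containing the polyphone."""
    word, pos = tagged_words[idx]
    left_pos = tagged_words[idx - 1][1] if idx > 0 else "<s>"
    right_pos = tagged_words[idx + 1][1] if idx + 1 < len(tagged_words) else "</s>"
    return {"word": word, "pos": pos, "left_pos": left_pos, "right_pos": right_pos}

sent = [("汉字", "nz"), ("转", "v"), ("拼音", "n")]  # 'convert Hanzi to Pinyin'
print(extract_features(sent, 1))
# {'word': '转', 'pos': 'v', 'left_pos': 'nz', 'right_pos': 'n'}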
3. Model
3.1. LSTM
We treat polyphone disambiguation as a sequence tagging task. In the G2P conversion task [1, 2, 3, 4, 5, 6, 7], a neural network accepts a sequence of characters and outputs a sequence of pronunciations. In polyphone disambiguation of Mandarin Chinese, the input is a sequence of characters with one or more polyphonic characters inside; the output for a polyphonic character is its predicted pronunciation, while the output for a non-polyphonic character is a NULL symbol ('-'), as shown in Table 3.
Table 3: Treating polyphone disambiguation as a sequence tagging task. The input is a sequence of characters. If a character is a polyphone, the output is its predicted pronunciation; otherwise, the output is the NULL symbol '-'.

Input character      | 在 | 首 | 都  | 北 | 京
Output pronunciation | -  | -  | du1 | -  | -
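A minimal Python sketch of these tagging targets, using the sentence of Table 3 (the position-to-pronunciation mapping shown is an illustration of the labeling scheme):

# Build per-character targets: the polyphone gets its pronunciation,
# every other character gets the NULL symbol '-'.
def make_targets(chars, gold_prons):
    """gold_prons maps character positions of polyphones to pronunciations."""
    return [gold_prons.get(i, "-") for i in range(len(chars))]

chars = list("在首都北京")
print(make_targets(chars, {2: "du1"}))  # ['-', '-', 'du1', '-', '-']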
Figure 2: A single LSTM memory cell with different gates to
control information flow.
Specifically, we use an LSTM recurrent neural network (RNN) to do the sequence tagging. Allowing cyclical connections in a feed-forward neural network, we obtain a recurrent network, as shown in Figure 1. RNNs, especially those networks with LSTM cells, have recently produced promising results on a variety of tasks including language modeling [19] [20], speech recognition [21] and other sequential labeling tasks [14, 15, 16, 17, 22, 23, 24]. LSTM [18] uses purpose-built memory cells to store information, which are designed to model a long range of context. An LSTM layer is composed of a set of recurrently connected memory blocks, and each block consists of one or more self-connected memory cells and three multiplicative gates, i.e., an input gate, a forget gate and an output gate, as shown in Figure 2. The three gates are designed to capture long-range contextual information by controlling the flow of information into and out of the cell. For LSTM, the recurrent hidden layer function is implemented as follows:
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
h_t = o_t \odot \tanh(c_t)

where x_t is the input feature vector; \sigma is the element-wise logistic sigmoid function; i, f, o and c denote the input gate, forget gate, output gate and memory cell respectively, all of which are the same size as the LSTM output vector h; W_{xi} is the input-input gate matrix, W_{hc} is the hidden-cell matrix, and so on; \odot is the element-wise product.
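To make the recurrence concrete, here is a NumPy sketch of a single LSTM step implementing the equations above (peephole terms W_{ci}, W_{cf}, W_{co} included); the dimensions and the parameter initialization are arbitrary toy choices.

# One step of the LSTM recurrence defined by the equations above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    i_t = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] @ c_prev + p["bi"])
    f_t = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] @ c_prev + p["bf"])
    c_t = f_t * c_prev + i_t * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    o_t = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] @ c_t + p["bo"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Toy dimensions: 4-dim input, 3-dim hidden/cell state.
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
p = {k: 0.1 * rng.standard_normal((n_h, n_in if k.startswith("Wx") else n_h))
     for k in ["Wxi", "Whi", "Wci", "Wxf", "Whf", "Wcf",
               "Wxc", "Whc", "Wxo", "Who", "Wco"]}
p.update({k: np.zeros(n_h) for k in ["bi", "bf", "bc", "bo"]})
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_h), np.zeros(n_h), p)
print(h.shape, c.shape)  # (3,) (3,)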
3.2. Bidirectional LSTM
One shortcoming of LSTM is that it is unidirectional: it only makes use of previous context. But in polyphone disambiguation, the past and future contexts are both important (as can be seen in Section 4.3). Thus the bidirectional LSTM (BLSTM) architecture is used for polyphone disambiguation, as shown in Figure 3. A BLSTM consists of a forward LSTM and a backward LSTM, and the outputs of the two sub-networks are then combined [21]. Given an input sequence (x_1, x_2, ..., x_n), the forward LSTM reads it from left to right, while the backward LSTM reads it in reversed order. The two networks have different parameters. BLSTM can thus utilize both past and future inputs at any specific time step.

Figure 3: A bi-directional LSTM network reads "在 首都 北京" with the forward LSTM and the time-reversed sequence "北京 首都 在" with the backward LSTM.
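A sketch of the bidirectional wrapper, reusing lstm_step and the parameter dictionary format from the previous sketch; concatenating the two directions per time step is one common way to combine the sub-network outputs.

# Run a forward and a backward LSTM (separate parameters) over the same
# sequence and concatenate their hidden states position by position.
import numpy as np

def run_lstm(xs, params, n_h):
    """xs: list of input vectors; returns the list of hidden states."""
    h, c, outs = np.zeros(n_h), np.zeros(n_h), []
    for x in xs:
        h, c = lstm_step(x, h, c, params)  # from the sketch above
        outs.append(h)
    return outs

def blstm(xs, fwd_params, bwd_params, n_h):
    fwd = run_lstm(xs, fwd_params, n_h)
    bwd = run_lstm(xs[::-1], bwd_params, n_h)[::-1]  # re-align in time
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]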
Figure 4: The flow diagram of the LSTM-based polyphone disambiguation system.
3.3. Polyphone disambiguation using BLSTM
Figure 4 shows the flow diagram of the LSTM-based polyphone disambiguation system. Given an input character sequence, we first perform word segmentation and POS tagging. In this study, we put the word containing the polyphonic character at the center of the sequence and consider the word's left and right neighboring words as contexts. In the example shown in Figure 4, "首都" is the centering word containing the polyphone '都', while its left and right neighbors are the words "在" and "北京". From the POS tag sequence, we then generate a token sequence, in which the center word is separated into several tokens if it has multiple characters. In the example, "首都/n" is divided into "首/n" and "都/n". In the token sequence, the left and right words are treated as single tokens (not separated even if they have multiple characters) and we only use their POS tags as features, since we found that the POS tags of neighboring words are more useful for polyphone disambiguation. The token sequence is then represented as a feature vector sequence. Each feature vector has a character identity sub-vector, a polyphone identity sub-vector and a POS tag sub-vector. Finally, the feature vector sequence is fed into the BLSTM network, resulting in a pronunciation sequence with the prediction for the polyphone. The network output is a posterior vector over all possible pronunciations of the considered polyphones (in this study, 79 polyphones with 186 pronunciations) and a NULL label. We pick the pronunciation with the highest posterior as the result.
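The following Python sketch encodes a token sequence in the way just described: each token is the concatenation of three one-hot sub-vectors. The tiny vocabularies and the <pos-only> placeholder token are our own illustrative stand-ins for the real inventories.

# Encode tokens as [character one-hot | polyphone one-hot | POS one-hot].
import numpy as np

CHARS = ["<pos-only>", "首", "都"]   # token identities (toy inventory)
POLYS = ["<none>", "都"]             # polyphone identity (toy inventory)
POS   = ["p", "n", "ns"]             # POS tags (toy inventory)

def one_hot(size, idx):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

def encode_token(char, poly, pos):
    return np.concatenate([one_hot(len(CHARS), CHARS.index(char)),
                           one_hot(len(POLYS), POLYS.index(poly)),
                           one_hot(len(POS), POS.index(pos))])

# "在/p 首都/n 北京/ns": neighbor words contribute only their POS tags,
# while the centering word 首都 is split into per-character tokens.
tokens = [("<pos-only>", "<none>", "p"),
          ("首", "<none>", "n"),
          ("都", "都", "n"),
          ("<pos-only>", "<none>", "ns")]
X = np.stack([encode_token(*t) for t in tokens])
print(X.shape)  # (4, 8)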
4. Experiments
4.1. Dataset
We choose the 79 most frequently used polyphones for the polyphone disambiguation experiments. We crawl 174899 sentences from the Internet and manually label the pronunciations (i.e., Pinyin) of the 79 polyphonic characters that appear in these sentences. We manually divide the corpus into a training set with 167221 sentences (179410 polyphonic characters) and a test set with 7678 sentences (10500 polyphonic characters).

4.2. Experimental setups
We use the NLPIR toolkit [25] to perform POS tagging on the text. The Kaldi toolkit [26] is adopted to implement the neural networks. To speed up training, we use data parallelism with 512 character sequences per mini-batch. Our networks have a hybrid structure with one feed-forward layer sitting on top of K BLSTM layers, where the best K is empirically selected through experiments. Our empirical experiments show that this network structure ensures that training converges steadily. The number of nodes in each hidden layer is set to 512. The hidden activation function is the sigmoid and the output layer activation function is the softmax. We use cross-entropy as the error function in training. The initial learning rate is set to 0.01, but we halve the learning rate if there is no improvement in the cross-entropy score. We also implement a joint n-gram approach [2] (n=2) and a Maxent approach [11] for comparison. Their training and test sets are kept the same as those used for the BLSTM model.
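A minimal sketch of the learning-rate schedule just described (halve when cross-entropy stops improving); the zero improvement threshold is our assumption.

# Newbob-style schedule: halve the learning rate when the cross-entropy
# fails to improve between epochs.
def next_learning_rate(lr, prev_ce, curr_ce, threshold=0.0):
    return lr * 0.5 if prev_ce - curr_ce <= threshold else lr

lr = 0.01
for prev, curr in [(2.00, 1.60), (1.60, 1.59), (1.59, 1.61)]:
    lr = next_learning_rate(lr, prev, curr)
    print(lr)  # 0.01, 0.01, 0.005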
4.3. Modeling context
We investigate the effects of different contextual inputs, and the results are summarized in Table 4. In this experiment, we use two LSTM/BLSTM layers (K = 2) and test different contexts. BLSTM can model both past and future contexts, while forward and backward (unidirectional) LSTMs can model past-only and future-only contexts, respectively. From the results, we can clearly see that, without the past and future inputs (0 words), the accuracy of BLSTM degrades to the same level as the joint n-gram approach, and the unidirectional LSTMs have even lower accuracy. The best performance is achieved with a context of 1 word, while further expanding the context to 2 words results in a clear accuracy degradation. BLSTM shows consistently better accuracy than the unidirectional LSTMs. This observation shows that the use of both past and future contexts is essential in polyphone disambiguation.
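For clarity, the context windows tested in Table 4 can be formed as in the short Python sketch below, using the running example sentence.

# Take the word containing the polyphone plus up to n words on each side.
def context_window(words, center_idx, n):
    lo = max(0, center_idx - n)
    hi = min(len(words), center_idx + n + 1)
    return words[lo:hi]

words = ["在", "首都", "北京"]
for n in range(3):
    print(n, context_window(words, 1, n))
# 0 ['首都']
# 1 ['在', '首都', '北京']
# 2 ['在', '首都', '北京']   (window clipped at the sentence boundary)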
Table 4: Polyphone disambiguation accuracy (%) for LSTM with different contextual inputs.

Context | Forward LSTM | Backward LSTM | BLSTM
0 words | 85.07 | 87.26 | 89.64
1 word  | 88.56 | 90.78 | 93.83
2 words | 87.42 | 89.98 | 92.96

Table 5: Polyphone disambiguation accuracy for BLSTM with different numbers of layers.

Layers | Acc (%)
1 | 93.71
2 | 93.83
3 | 93.73
4.4. Number of BLSTM Layers
We further test networks with different numbers of BLSTM layers. In this experiment, the input context is set to ±1 words. Results are shown in Table 5. We notice that a one-layer BLSTM can already achieve good performance, but the best performance is achieved by a deeper network with two BLSTM layers. Further deepening the network to 3 layers has a negative effect. We believe that this may be caused by the limited training data.
4.5. POS tagging granularity

As discussed in Section 2, POS tags are critical features for polyphone disambiguation. So we examine the impact of POS tagging granularity on the BLSTM-based polyphone disambiguation task. In the experiments, we use a 2-layer BLSTM (i.e., K = 2) and a context of ±1 words. We study two sets of POS tags as the network input. Table 6 provides the accuracy for two POS tagging tools, i.e., LTP [27] and NLPIR [25]. The LTP tagger outputs 28 different POS tags while NLPIR outputs 90 different POS tags. We can clearly observe that the BLSTM with NLPIR tags outperforms the BLSTM with LTP tags. This means that a finer granularity in POS tagging leads to better polyphone disambiguation performance.

Table 6: Polyphone disambiguation accuracy for two POS granularities (28 vs. 90) in the BLSTM-based approach.

Tool (POS granularity) | Acc (%)
LTP (28)   | 93.65
NLPIR (90) | 93.83

4.6. Comparison with other approaches

The polyphone disambiguation accuracy of different approaches is summarized in Table 7. For BLSTM, K is set to 2 and the context is set to ±1 words. From Table 7, we clearly see that the BLSTM approach significantly outperforms the joint n-gram approach [2] and the Maxent approach [11] (with features similar to those used by the BLSTM). The differences are significant at the 95% confidence level with paired t-tests. The relative accuracy improvements are 4.7% and 5.5% compared with the joint n-gram approach and the Maxent approach, respectively.

Table 7: Polyphone disambiguation accuracy for BLSTM and other approaches for comparison.

Approach | Acc (%)
Joint n-gram [2] | 89.60
Maxent [11]      | 88.96
BLSTM            | 93.83
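The significance test and the relative-improvement arithmetic can be reproduced as in the Python sketch below; the per-sentence correctness scores are hypothetical placeholders, while the accuracies come from Table 7.

# Paired t-test over matched per-sentence scores, plus the relative
# improvement computation behind the 4.7% / 5.5% figures.
from scipy.stats import ttest_rel

blstm  = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # hypothetical per-sentence scores
maxent = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1]
t, pval = ttest_rel(blstm, maxent)
print(f"paired t-test: t={t:.2f}, p={pval:.3f}")

for name, base in [("joint n-gram", 89.60), ("Maxent", 88.96)]:
    print(name, f"{100 * (93.83 - base) / base:.1f}%")  # 4.7% and 5.5%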
5. Conclusion
In this paper, we address the polyphone disambiguation problem as a sequential labeling task. Specifically, we propose to use a bidirectional long short-term memory (BLSTM) neural network that encodes both the past and future observations on the input character sequence to predict the pronunciations. Our conclusions are as follows. 1) By modeling both past and future contexts of the inputs, bidirectional LSTM significantly outperforms unidirectional LSTM in polyphone disambiguation. 2) A 2-layer BLSTM model achieves superior performance. 3) A finer granularity in POS tagging leads to better performance. We have observed relative accuracy improvements of 4.7% and 5.5% compared with the joint n-gram approach and the Maxent approach, respectively. In the future, it may be interesting to investigate whether a simple recurrent neural network can achieve similar performance.
6. Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 61571363).
7. References
[1] K. Rao, F. Peng, H. Sak, and F. Beaufays, "Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks," in ICASSP. IEEE, 2015, pp. 4225–4229.
[2] J. R. Novak, N. Minematsu, and K. Hirose, "Failure transitions for joint n-gram models and G2P conversion," in INTERSPEECH, 2013, pp. 1821–1825.
[3] M. Bisani and H. Ney, "Joint-sequence models for grapheme-to-phoneme conversion," Speech Communication, vol. 50, no. 5, pp. 434–451, 2008.
[4] S. F. Chen et al., "Conditional and joint models for grapheme-to-phoneme conversion," in INTERSPEECH, 2003.
[5] L. Galescu and J. F. Allen, "Bi-directional conversion between graphemes and phonemes using a joint n-gram model," in 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis, 2001.
[6] J. R. Novak, N. Minematsu, K. Hirose, C. Hori, H. Kashioka, and P. R. Dixon, "Improving WFST-based G2P conversion with alignment constraints and RNNLM n-best rescoring," in INTERSPEECH, 2012, pp. 2526–2529.
[7] D. Caseiro, I. Trancoso, L. Oliveira, and C. Viana, "Grapheme-to-phone using finite state transducers," in Proc. 2002 IEEE Workshop on Speech Synthesis, vol. 2, 2002, pp. 1349–1360.
[8] Z.-R. Zhang, M. Chu, and E. Chang, "An efficient way to learn rules for grapheme-to-phoneme conversion in Chinese," in International Symposium on Chinese Spoken Language Processing, 2002.
[9] D. Gou and W. Luo, "Processing of polyphone character in Chinese TTS system," Chinese Information, no. 1, pp. 33–36.
[10] F. Liu and Y. Zhou, "Polyphone disambiguation based on tree-guided TBL," Computer Engineering and Applications, vol. 47, no. 12, pp. 137–140, 2011.
[11] F. Liu, Q. Shi, and J. Tao, "Maximum entropy based homograph disambiguation," in NCMMSC2007, 2007.
[12] M. Fan, G. Hu, and R. Wang, "Multi-level polyphone disambiguation for Mandarin grapheme-phoneme conversion," Computer Engineering and Applications, vol. 42, no. 2, pp. 167–170, 2006.
[13] K. Yao and G. Zweig, "Sequence-to-sequence neural net models for grapheme-to-phoneme conversion," in INTERSPEECH, 2015.
[14] K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi, "Spoken language understanding using long short-term memory neural networks," in SLT. IEEE, 2014, pp. 189–194.
[15] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.
[16] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in ICASSP. IEEE, 2013, pp. 6645–6649.
[17] T. Mikolov and G. Zweig, "Context dependent recurrent neural network language model," in SLT, 2012, pp. 234–239.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[19] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in INTERSPEECH, vol. 2, 2010, p. 3.
[20] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Černocký, "Strategies for training large scale neural network language models," in ASRU. IEEE, 2011, pp. 196–201.
[21] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
[22] K. Yao, B. Peng, G. Zweig, D. Yu, X. Li, and F. Gao, "Recurrent conditional random field for language understanding," in ICASSP. IEEE, 2014, pp. 4077–4081.
[23] C. Ding, L. Xie, J. Yan, W. Zhang, and Y. Liu, "Automatic prosody prediction for Chinese speech synthesis using BLSTM-RNN and embedding features," in ASRU. IEEE, 2015, pp. 98–102.
[24] O. Tilk and T. Alumäe, "LSTM for punctuation restoration in speech transcripts," in INTERSPEECH, 2015.
[25] L. Zhou and D. Zhang, "NLPIR: A theoretical framework for applying natural language processing to information retrieval," Journal of the American Society for Information Science and Technology, vol. 54, no. 2, pp. 115–123, 2003.
[26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on ASRU. IEEE Signal Processing Society, Dec. 2011.
[27] W. Che, Z. Li, and T. Liu, "LTP: A Chinese language technology platform," in Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations. Association for Computational Linguistics, 2010, pp. 13–16.