A Bi-directional LSTM Approach for Polyphone Disambiguation in Mandarin Chinese
Changhao Shan 1, Lei Xie 1, Kaisheng Yao 2
1 School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2 Microsoft Corporation, Redmond, WA 98052, USA
{chshan, lxie}@nwpu-aslp.org, kaisheny@microsoft.com
Abstract
Polyphone disambiguation in Mandarin Chinese aims to pick the correct pronunciation from several candidates for a polyphonic character. It serves as an essential component in human language technologies such as text-to-speech synthesis. Since the pronunciation of most polyphonic characters can be easily decided from their contexts in the text, in this paper we address the polyphone disambiguation problem as a sequential labeling task. Specifically, we propose to use a bidirectional long short-term memory (BLSTM) neural network that encodes both the past and future observations on the input character sequence to predict the pronunciations. We also empirically study the impacts of (1) modeling different lengths of context, (2) the number of BLSTM layers and (3) the granularity of part-of-speech (POS) tags used as features. Our results show that a deep BLSTM is able to achieve state-of-the-art performance in polyphone disambiguation.
Index Terms: Polyphone disambiguation, Grapheme-to-phoneme conversion, Sequence tagging, Bi-directional LSTM, Text-to-Speech
1. Introduction
Grapheme-to-phoneme (G2P) conversion aims to predict the pronunciation of a word given its orthography, i.e., a series of characters or graphemes. It is an essential component in human language technologies, especially in speech synthesis and speech recognition. In an alphabetic language such as English, the main problem a G2P module faces is to generate pronunciations for out-of-vocabulary (OOV) words [1, 2, 3, 4, 5, 6, 7]. However, in Chinese, a character-based language, most characters have only one fixed pronunciation and each character is pronounced as a tonal syllable. Hence the main difficulty in Chinese G2P conversion is polyphone disambiguation, which aims to pick the correct pronunciation from several candidates for a polyphonic character [8].
A variety of approaches have been proposed to address the polyphone disambiguation problem. They can be categorized into knowledge-based and learning-based approaches. A rich pronunciation dictionary and human rules are essential in a knowledge-based system. The dictionary is designed to list as many words with polyphonic characters as possible, together with their pronunciations [9]. But the dictionary cannot cover all the polyphonic cases in the language, so some rules are crafted by language experts to handle the remaining cases. At runtime, the pronunciation of a polyphone is first searched in the dictionary; if it is not found, the manual rules are consulted to determine the pronunciation.
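To make the procedure concrete, here is a minimal Python sketch of this lookup-then-rules flow; the dictionary entries and the fallback rule are invented for illustration and are not from the paper.

# Knowledge-based disambiguation: dictionary lookup first, hand-written
# rules as a fallback. All entries below are hypothetical examples.
POLYPHONE_DICT = {
    ("朝阳", "ns"): "chao2",  # place name (Chaoyang)
    ("朝阳", "n"): "zhao1",   # morning sun
    ("朝夕", "n"): "zhao1",   # morning and night
}

def rule_fallback(char, pos):
    # Illustrative expert rule: choose a default reading for '朝' by POS.
    if char == "朝":
        return "chao2" if pos == "ns" else "zhao1"
    return None  # no rule available

def knowledge_based_pronounce(char, word, pos):
    pron = POLYPHONE_DICT.get((word, pos))            # 1) dictionary lookup
    return pron if pron else rule_fallback(char, pos)  # 2) rule fallback

print(knowledge_based_pronounce("朝", "朝阳", "ns"))  # -> chao2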
The knowledge-based approach heavily relies on human expertise, while the learning-based approach aims to automatically learn a polyphone disambiguation model from a set of data. Similar to G2P conversion in English, a joint n-gram model can be used for polyphone disambiguation. Because of the relatively small pronunciation cardinality of polyphonic characters, i.e., two to four, n-gram statistics can be reliably estimated from a training set, leading to reasonable performance. Such n-gram models are usually implemented as a weighted finite state transducer (WFST). Polyphone disambiguation can also be treated as a classification task: based on a set of features extracted for a polyphone, its pronunciation is predicted from a set of candidates through a decision tree (DT) [10] or a maximum entropy (Maxent) model [11]. A study has shown that a Maxent model outperforms DT in polyphone disambiguation [11]. A learning-based approach can be integrated with a knowledge-based approach to form a hybrid approach [12], where most of the polyphones are disambiguated by a learned model, but the pronunciations of some polyphonic characters are determined by human rules.
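As a point of reference for the classification view, the following Python sketch realizes a Maxent-style classifier with scikit-learn's logistic regression; the toy features and samples are our own illustration, not the feature set of [11].

# Classification view of polyphone disambiguation: sparse context
# features -> logistic regression (a Maxent model).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training samples: context features and pronunciation labels.
X_raw = [
    {"word": "朝阳", "pos": "ns"},
    {"word": "朝阳", "pos": "n"},
    {"word": "朝夕", "pos": "n"},
]
y = ["chao2", "zhao1", "zhao1"]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_raw), y)
print(clf.predict(vec.transform([{"word": "朝阳", "pos": "ns"}])))  # expected: ['chao2']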
Since the pronunciation of most polyphonic characters can be easily decided from their contexts, in this paper we address the polyphone disambiguation task as a sequential labeling (or tagging) task that models the important contextual information. Specifically, we propose to use a bidirectional long short-term memory (BLSTM) neural network to determine the pronunciations of polyphones in Mandarin Chinese. Our approach is motivated by the recent tremendous success of BLSTM models in English G2P conversion [1] [13] and various sequential learning tasks [14, 15, 16, 17]. LSTM uses specifically designed gates to control information flow and thus has exceptional context modeling ability [18]. BLSTM is composed of a forward LSTM and a backward LSTM, and thus it can model both the past and the future contexts. However, the LSTM models used in English G2P are not readily applicable to polyphone disambiguation in Mandarin. First, besides the substantial differences between the two languages, we need to determine the best context to model for this task. Second, recent studies suggest that using multiple hidden layers can learn hierarchical features and boost performance; we would like to investigate whether a deep BLSTM architecture can benefit this task. Our results show that a deep bidirectional LSTM is able to achieve state-of-the-art performance in polyphone disambiguation.
2. Features
In Chinese, the pronunciation of a polyphonic character can usually be determined by the word that contains it and its neighboring words. For the examples in Table 1, we can easily get the pronunciation of the polyphone '朝' once the word that contains it is given. Hence the identity of the word that contains the polyphone is an important feature. However, in Table 1, we still cannot make a distinction between "朝阳(zhao1)" and "朝阳(chao2)". But the two words can be discriminated (and the pronunciation of the polyphone '朝' can be determined) if we have the part-of-speech (POS) information: if the word is tagged as 'ns', the polyphonic character '朝' in the word "朝阳" is pronounced as 'chao2'; if tagged as 'n', '朝' in "朝阳" is pronounced as 'zhao1'. Thus the POS tag is an important feature for polyphone disambiguation. Therefore, we simply employ the identity of the word that contains the polyphone and its POS tag as features.

Table 1: Examples showing that the word that contains the polyphone and its POS tag are important features for polyphone disambiguation. The polyphone is in bold font.

Word | Pronunciation | POS | English translation
朝夕 | zhao1 | n  | morning and night
朝阳 | chao2 | ns | Chaoyang (a place name)
朝阳 | zhao1 | n  | morning sun

Figure 1: Recurrent neural network unrolled in time.
Table 2 shows that the left and right contexts are also quite useful for polyphone disambiguation. In this example, the character '转' can be used as two verbs with different meanings and different pronunciations. But the pronunciations of the two '转' can be easily discriminated if we observe their contexts (different POS tags). This also motivates us to use a recurrent network to model the contexts in the polyphone disambiguation task.

Table 2: Examples showing that contexts are useful for polyphone disambiguation. The polyphone is in bold font.

Sequence | Pronunciation | English translation
玩/v 转/v 北京/ns | zhuan4 | play around in Beijing
汉字/nz 转/v 拼音/n | zhuan3 | convert Hanzi to Pinyin
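As a concrete illustration of these features, the Python sketch below extracts the word identity, its POS tag and the neighboring POS tags from a word-segmented, POS-tagged sentence; the function and its feature names are our own illustrative choices.

# Extract the Section-2 features for the word containing a polyphone
# from a (word, POS) sequence produced by a segmenter/tagger.
def extract_features(tagged_words, idx):
    """tagged_words: list of (word, pos); idx: word containing the polyphone."""
    word, pos = tagged_words[idx]
    left_pos = tagged_words[idx - 1][1] if idx > 0 else "<s>"
    right_pos = tagged_words[idx + 1][1] if idx + 1 < len(tagged_words) else "</s>"
    return {"word": word, "pos": pos, "left_pos": left_pos, "right_pos": right_pos}

sent = [("汉字", "nz"), ("转", "v"), ("拼音", "n")]  # 'convert Hanzi to Pinyin'
print(extract_features(sent, 1))
# {'word': '转', 'pos': 'v', 'left_pos': 'nz', 'right_pos': 'n'}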
3. Model
3.1. LSTM
We treat polyphone disambiguation as a sequence tagging task. In the G2P conversion task [1, 2, 3, 4, 5, 6, 7], a neural network accepts a sequence of characters and outputs a sequence of pronunciations. In polyphone disambiguation of Mandarin Chinese, the input is a sequence of characters with one or more polyphonic characters inside; the output for a polyphonic character is its predicted pronunciation, while the output for a non-polyphonic character is a NULL symbol ('-'), as shown in Table 3.
Table 3: Treating polyphone disambiguation as a sequence tagging task. The input is a sequence of characters. If a character is a polyphone, the output is its predicted pronunciation; otherwise, the output is the NULL symbol '-'.

Input character      | 在 | 首 | 都  | 北 | 京
Output pronunciation | -  | -  | du1 | -  | -
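A minimal Python sketch of these tagging targets, using the sentence of Table 3 (the position-to-pronunciation mapping shown is an illustration of the labeling scheme):

# Build per-character targets: the polyphone gets its pronunciation,
# every other character gets the NULL symbol '-'.
def make_targets(chars, gold_prons):
    """gold_prons maps character positions of polyphones to pronunciations."""
    return [gold_prons.get(i, "-") for i in range(len(chars))]

chars = list("在首都北京")
print(make_targets(chars, {2: "du1"}))  # ['-', '-', 'du1', '-', '-']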
Figure 2: A single LSTM memory cell with different gates to
control information flow.
Specifically, we use an LSTM recurrent neural network (RNN) to do the sequence tagging. Allowing cyclical connections in a feed-forward neural network, we obtain a recurrent network, as shown in Figure 1. RNNs, especially those networks with LSTM cells, have recently produced promising results on a variety of tasks including language modeling [19] [20], speech recognition [21] and other sequential labeling tasks [14, 15, 16, 17, 22, 23, 24]. LSTM [18] uses purpose-built memory cells to store information, which are designed to model a long range of context. An LSTM layer is composed of a set of recurrently connected memory blocks, and each block consists of one or more self-connected memory cells and three multiplicative gates, i.e., an input gate, a forget gate and an output gate, as shown in Figure 2. The three gates are designed to capture long-range contextual information by controlling the flow of information into and out of the cell. For LSTM, the recurrent hidden layer function is implemented as follows:
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
h_t = o_t \odot \tanh(c_t)

where x_t is the input feature vector; \sigma is the element-wise logistic sigmoid function; i, f, o and c denote the input gate, forget gate, output gate and memory cell respectively, all of which are the same size as the LSTM output vector h; W_{xi} is the input-input gate matrix, W_{hc} is the hidden-cell matrix, and so on; \odot is the element-wise product.
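To make the recurrence concrete, here is a NumPy sketch of a single LSTM step implementing the equations above (peephole terms W_{ci}, W_{cf}, W_{co} included); the dimensions and the parameter initialization are arbitrary toy choices.

# One step of the LSTM recurrence defined by the equations above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    i_t = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] @ c_prev + p["bi"])
    f_t = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] @ c_prev + p["bf"])
    c_t = f_t * c_prev + i_t * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    o_t = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] @ c_t + p["bo"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Toy dimensions: 4-dim input, 3-dim hidden/cell state.
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
p = {k: 0.1 * rng.standard_normal((n_h, n_in if k.startswith("Wx") else n_h))
     for k in ["Wxi", "Whi", "Wci", "Wxf", "Whf", "Wcf",
               "Wxc", "Whc", "Wxo", "Who", "Wco"]}
p.update({k: np.zeros(n_h) for k in ["bi", "bf", "bc", "bo"]})
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_h), np.zeros(n_h), p)
print(h.shape, c.shape)  # (3,) (3,)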
3.2. Bidirectional LSTM
One shortcoming of LSTM is that it is unidirectional: it only makes use of previous context. But in polyphone disambiguation, the past and future contexts are both important (as can be seen in Section 4.3). Thus the bidirectional LSTM (BLSTM) architecture is used for polyphone disambiguation, as shown in Figure 3. A BLSTM consists of a forward LSTM and a backward LSTM, and the outputs of the two sub-networks are then combined [21]. Given an input sequence (x_1, x_2, ..., x_n), the forward LSTM reads it from left to right, while the backward LSTM reads it in reversed order. The two networks have different parameters. BLSTM can thus utilize both past and future inputs at any specific time step.

Figure 3: A bi-directional LSTM network reads "在 首都 北京" with the forward LSTM and the time-reversed sequence "北京 首都 在" with the backward LSTM.
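A sketch of the bidirectional wrapper, reusing lstm_step and the parameter dictionary format from the previous sketch; concatenating the two directions per time step is one common way to combine the sub-network outputs.

# Run a forward and a backward LSTM (separate parameters) over the same
# sequence and concatenate their hidden states position by position.
import numpy as np

def run_lstm(xs, params, n_h):
    """xs: list of input vectors; returns the list of hidden states."""
    h, c, outs = np.zeros(n_h), np.zeros(n_h), []
    for x in xs:
        h, c = lstm_step(x, h, c, params)  # from the sketch above
        outs.append(h)
    return outs

def blstm(xs, fwd_params, bwd_params, n_h):
    fwd = run_lstm(xs, fwd_params, n_h)
    bwd = run_lstm(xs[::-1], bwd_params, n_h)[::-1]  # re-align in time
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]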
Figure 4: The flow diagram of the LSTM-based polyphone disambiguation system.
3.3. Polyphone disambiguation using BLSTM
Figure 4 shows the flow diagram of the LSTM-based polyphone disambiguation system. Given an input character sequence, we first perform word segmentation and POS tagging. In this study, we put the word containing the polyphonic character at the center of the sequence and consider the word's left and right neighboring words as contexts. In the example shown in Figure 4, "首都" is the centering word containing the polyphone '都', while its left and right neighbors are the words "在" and "北京". From the POS tag sequence, we then generate a token sequence, in which the center word is separated into several tokens if it has multiple characters. In the example, "首都/n" is divided into "首/n" and "都/n". In the token sequence, the left and right words are treated as single tokens (not separated even if they have multiple characters) and we only use their POS tags as features, since we found that the POS tags of neighboring words are more useful for polyphone disambiguation. The token sequence is then represented as a feature vector sequence. Each feature vector has a character identity sub-vector, a polyphone identity sub-vector and a POS tag sub-vector. Finally, the feature vector sequence is fed into the BLSTM network, resulting in a pronunciation sequence with the prediction for the polyphone. The network output is a posterior vector over all possible pronunciations of the considered polyphones (in this study, 79 polyphones with 186 pronunciations) and a NULL label. We pick the pronunciation with the highest posterior as the result.
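The following Python sketch encodes a token sequence in the way just described: each token is the concatenation of three one-hot sub-vectors. The tiny vocabularies and the <pos-only> placeholder token are our own illustrative stand-ins for the real inventories.

# Encode tokens as [character one-hot | polyphone one-hot | POS one-hot].
import numpy as np

CHARS = ["<pos-only>", "首", "都"]   # token identities (toy inventory)
POLYS = ["<none>", "都"]             # polyphone identity (toy inventory)
POS   = ["p", "n", "ns"]             # POS tags (toy inventory)

def one_hot(size, idx):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

def encode_token(char, poly, pos):
    return np.concatenate([one_hot(len(CHARS), CHARS.index(char)),
                           one_hot(len(POLYS), POLYS.index(poly)),
                           one_hot(len(POS), POS.index(pos))])

# "在/p 首都/n 北京/ns": neighbor words contribute only their POS tags,
# while the centering word 首都 is split into per-character tokens.
tokens = [("<pos-only>", "<none>", "p"),
          ("首", "<none>", "n"),
          ("都", "都", "n"),
          ("<pos-only>", "<none>", "ns")]
X = np.stack([encode_token(*t) for t in tokens])
print(X.shape)  # (4, 8)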
4. Experiments
4.1. Dataset
We choose the 79 most frequently used polyphones for the polyphone disambiguation experiments. We crawl 174899 sentences from the Internet and manually label the pronunciations (i.e., Pinyin) of the 79 polyphonic characters that appear in these sentences. We manually divide the corpus into a training set with 167221 sentences (179410 polyphonic characters) and a test set with 7678 sentences (10500 polyphonic characters).

4.2. Experimental setups
We use the NLPIR toolkit [25] to perform POS tagging on the text. The Kaldi toolkit [26] is adopted to implement the neural networks. To speed up training, we use data parallelism with 512 character sequences per mini-batch. Our networks have a hybrid structure with one feed-forward layer sitting on top of K BLSTM layers, where the best K is empirically selected through experiments. Our empirical experiments show that this network structure ensures that training converges steadily. The number of nodes in each hidden layer is set to 512. The hidden activation function is the sigmoid and the output layer activation function is the softmax. We use cross-entropy as the error function in training. The initial learning rate is set to 0.01, but we halve the learning rate if there is no improvement in the cross-entropy score. We also implement a joint n-gram approach [2] (n=2) and a Maxent approach [11] for comparison. Their training and test sets are kept the same as those used for the BLSTM model.
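A minimal sketch of the learning-rate schedule just described (halve when cross-entropy stops improving); the zero improvement threshold is our assumption.

# Newbob-style schedule: halve the learning rate when the cross-entropy
# fails to improve between epochs.
def next_learning_rate(lr, prev_ce, curr_ce, threshold=0.0):
    return lr * 0.5 if prev_ce - curr_ce <= threshold else lr

lr = 0.01
for prev, curr in [(2.00, 1.60), (1.60, 1.59), (1.59, 1.61)]:
    lr = next_learning_rate(lr, prev, curr)
    print(lr)  # 0.01, 0.01, 0.005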
4.3. Modeling context
We investigate the effects of different contextual inputs, and the results are summarized in Table 4. In this experiment, we use two LSTM/BLSTM layers (K = 2) and test different contexts. BLSTM can model both past and future contexts, while forward and backward (unidirectional) LSTMs can model past-only and future-only contexts, respectively. From the results, we can clearly see that, without the past and future inputs (0 words), the accuracy of BLSTM degrades to the same level as the joint n-gram approach, and the unidirectional LSTMs have even lower accuracy. The best performance is achieved with a context of 1 word, while further expanding the context to 2 words results in a clear accuracy degradation. BLSTM shows consistently better accuracy than the unidirectional LSTMs. This observation shows that the use of both past and future contexts is essential in polyphone disambiguation.
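For clarity, the context windows tested in Table 4 can be formed as in the short Python sketch below, using the running example sentence.

# Take the word containing the polyphone plus up to n words on each side.
def context_window(words, center_idx, n):
    lo = max(0, center_idx - n)
    hi = min(len(words), center_idx + n + 1)
    return words[lo:hi]

words = ["在", "首都", "北京"]
for n in range(3):
    print(n, context_window(words, 1, n))
# 0 ['首都']
# 1 ['在', '首都', '北京']
# 2 ['在', '首都', '北京']   (window clipped at the sentence boundary)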
Table 4: Polyphone disambiguation accuracy (%) for LSTM with different contextual inputs.

Context | Forward LSTM | Backward LSTM | BLSTM
0 words | 85.07 | 87.26 | 89.64
1 word  | 88.56 | 90.78 | 93.83
2 words | 87.42 | 89.98 | 92.96

Table 5: Polyphone disambiguation accuracy for BLSTM with different numbers of layers.

Layers | Acc (%)
1 | 93.71
2 | 93.83
3 | 93.73
4.4. Number of BLSTM Layers
We further test networks with different numbers of BLSTM layers. In this experiment, the input context is set to ±1 words. Results are shown in Table 5. We notice that a one-layer BLSTM can already achieve good performance, but the best performance is achieved by a deeper network with two BLSTM layers. Further deepening the network to 3 layers has a negative effect. We believe that this may be caused by the limited training data.
4.5. POS tagging granularity

As discussed in Section 2, POS tags are critical features for polyphone disambiguation. So we examine the impact of POS tagging granularity on the BLSTM-based polyphone disambiguation task. In the experiments, we use a 2-layer BLSTM (i.e., K = 2) and a context of ±1 words. We study two sets of POS tags as the network input. Table 6 provides the accuracy for two POS tagging tools, i.e., LTP [27] and NLPIR [25]. The LTP tagger outputs 28 different POS tags while NLPIR outputs 90 different POS tags. We can clearly observe that the BLSTM with NLPIR tags outperforms the BLSTM with LTP tags. This means that a finer granularity in POS tagging leads to better polyphone disambiguation performance.

Table 6: Polyphone disambiguation accuracy for two POS granularities (28 vs. 90) in the BLSTM-based approach.

Tool (POS granularity) | Acc (%)
LTP (28)   | 93.65
NLPIR (90) | 93.83

4.6. Comparison with other approaches

The polyphone disambiguation accuracy of different approaches is summarized in Table 7. For BLSTM, K is set to 2 and the context is set to ±1 words. From Table 7, we clearly see that the BLSTM approach significantly outperforms the joint n-gram approach [2] and the Maxent approach [11] (with features similar to those used by the BLSTM). The differences are significant at the 95% confidence level with paired t-tests. The relative accuracy improvements are 4.7% and 5.5% compared with the joint n-gram approach and the Maxent approach, respectively.

Table 7: Polyphone disambiguation accuracy for BLSTM and other approaches for comparison.

Approach | Acc (%)
Joint n-gram [2] | 89.60
Maxent [11]      | 88.96
BLSTM            | 93.83
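The significance test and the relative-improvement arithmetic can be reproduced as in the Python sketch below; the per-sentence correctness scores are hypothetical placeholders, while the accuracies come from Table 7.

# Paired t-test over matched per-sentence scores, plus the relative
# improvement computation behind the 4.7% / 5.5% figures.
from scipy.stats import ttest_rel

blstm  = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # hypothetical per-sentence scores
maxent = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1]
t, pval = ttest_rel(blstm, maxent)
print(f"paired t-test: t={t:.2f}, p={pval:.3f}")

for name, base in [("joint n-gram", 89.60), ("Maxent", 88.96)]:
    print(name, f"{100 * (93.83 - base) / base:.1f}%")  # 4.7% and 5.5%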
5. Conclusion
In this paper, we address the polyphone disambiguation problem as a sequential labeling task. Specifically, we propose to use a bidirectional long short-term memory (BLSTM) neural network that encodes both the past and future observations on the input character sequence to predict the pronunciations. Our conclusions are as follows. 1) By modeling both past and future contexts of the inputs, bidirectional LSTM significantly outperforms unidirectional LSTM in polyphone disambiguation. 2) A 2-layer BLSTM model achieves superior performance. 3) A finer granularity in POS tagging leads to better performance. We have observed relative accuracy improvements of 4.7% and 5.5% compared with the joint n-gram approach and the Maxent approach, respectively. In the future, it may be interesting to investigate whether a simple recurrent neural network can achieve similar performance.
6. Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 61571363).
7. References
[1] K. Rao, F. Peng, H. Sak, and F. Beaufays, "Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks," in ICASSP. IEEE, 2015, pp. 4225–4229.
[2] J. R. Novak, N. Minematsu, and K. Hirose, "Failure transitions for joint n-gram models and G2P conversion," in INTERSPEECH, 2013, pp. 1821–1825.
[3] M. Bisani and H. Ney, "Joint-sequence models for grapheme-to-phoneme conversion," Speech Communication, vol. 50, no. 5, pp. 434–451, 2008.
[4] S. F. Chen et al., "Conditional and joint models for grapheme-to-phoneme conversion," in INTERSPEECH, 2003.
[5] L. Galescu and J. F. Allen, "Bi-directional conversion between graphemes and phonemes using a joint n-gram model," in 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis, 2001.
[6] J. R. Novak, N. Minematsu, K. Hirose, C. Hori, H. Kashioka, and P. R. Dixon, "Improving WFST-based G2P conversion with alignment constraints and RNNLM n-best rescoring," in INTERSPEECH, 2012, pp. 2526–2529.
[7] D. Caseiro, I. Trancoso, L. Oliveira, and C. Viana, "Grapheme-to-phone using finite state transducers," in Proc. 2002 IEEE Workshop on Speech Synthesis, vol. 2, 2002, pp. 1349–1360.
[8] Z.-R. Zhang, M. Chu, and E. Chang, "An efficient way to learn rules for grapheme-to-phoneme conversion in Chinese," in International Symposium on Chinese Spoken Language Processing, 2002.
[9] D. Gou and W. Luo, "Processing of polyphone character in Chinese TTS system," Chinese Information, no. 1, pp. 33–36.
[10] F. Liu and Y. Zhou, "Polyphone disambiguation based on tree-guided TBL," Computer Engineering and Applications, vol. 47, no. 12, pp. 137–140, 2011.
[11] F. Liu, Q. Shi, and J. Tao, "Maximum entropy based homograph disambiguation," in NCMMSC2007, 2007.
[12] M. Fan, G. Hu, and R. Wang, "Multi-level polyphone disambiguation for Mandarin grapheme-phoneme conversion," Computer Engineering and Applications, vol. 42, no. 2, pp. 167–170, 2006.
[13] K. Yao and G. Zweig, "Sequence-to-sequence neural net models for grapheme-to-phoneme conversion," in INTERSPEECH, 2015.
[14] K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi, "Spoken language understanding using long short-term memory neural networks," in SLT. IEEE, 2014, pp. 189–194.
[15] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.
[16] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in ICASSP. IEEE, 2013, pp. 6645–6649.
[17] T. Mikolov and G. Zweig, "Context dependent recurrent neural network language model," in SLT, 2012, pp. 234–239.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[19] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in INTERSPEECH, vol. 2, 2010, p. 3.
[20] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Černocký, "Strategies for training large scale neural network language models," in ASRU. IEEE, 2011, pp. 196–201.
[21] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
[22] K. Yao, B. Peng, G. Zweig, D. Yu, X. Li, and F. Gao, "Recurrent conditional random field for language understanding," in ICASSP. IEEE, 2014, pp. 4077–4081.
[23] C. Ding, L. Xie, J. Yan, W. Zhang, and Y. Liu, "Automatic prosody prediction for Chinese speech synthesis using BLSTM-RNN and embedding features," in ASRU. IEEE, 2015, pp. 98–102.
[24] O. Tilk and T. Alumäe, "LSTM for punctuation restoration in speech transcripts," in INTERSPEECH, 2015.
[25] L. Zhou and D. Zhang, "NLPIR: A theoretical framework for applying natural language processing to information retrieval," Journal of the American Society for Information Science and Technology, vol. 54, no. 2, pp. 115–123, 2003.
[26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on ASRU. IEEE Signal Processing Society, Dec. 2011.
[27] W. Che, Z. Li, and T. Liu, "LTP: A Chinese language technology platform," in Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations. Association for Computational Linguistics, 2010, pp. 13–16.