Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT


INTERSPEECH 2019, September 15–19, 2019, Graz, Austria

Dongyang Dai 1,2, Zhiyong Wu 1,2,3,*, Shiyin Kang 4, Xixin Wu 3, Jia Jia 1,2, Dan Su 4, Dong Yu 4, Helen Meng 1,3

1 Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, Graduate School at Shenzhen, Tsinghua University, Shenzhen, China
2 Beijing National Research Centre for Information Science and Technology (BNRist), Department of Computer Science and Technology, Tsinghua University, Beijing, China
3 Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
4 Tencent AI Lab, Tencent, Shenzhen, China

ddy17@mails.tsinghua.edu.cn, {zywu,wuxx,hmmeng}@se.cuhk.edu.hk, {shiyinkang,dansu,dyu}@tencent.com, jjia@tsinghua.edu.cn

* Corresponding author

Copyright © 2019 ISCA. http://dx.doi.org/10.21437/Interspeech.2019-2292

Abstract

Grapheme-to-phoneme (G2P) conversion is an essential component of Mandarin Chinese text-to-speech (TTS) systems, where polyphone disambiguation is the core issue. In this paper, we propose an end-to-end framework to predict the pronunciation of a polyphonic character. The framework accepts a sentence containing the polyphonic character as input in the form of a Chinese character sequence, without the necessity of any preprocessing. The proposed method consists of a pre-trained bidirectional encoder representations from Transformers (BERT) model and a neural network (NN) based classifier. The pre-trained BERT model extracts semantic features from the raw Chinese character sequence, and the NN based classifier predicts the polyphonic character's pronunciation according to the BERT output. To explore the impact of contextual information on polyphone disambiguation, three different classifiers are investigated: a fully-connected network based classifier, a long short-term memory (LSTM) network based classifier and a Transformer block based classifier. Experimental results demonstrate the effectiveness of the proposed end-to-end framework for polyphone disambiguation, and show that the semantic features extracted by BERT can greatly enhance the performance.

Index Terms: polyphone disambiguation, pre-trained BERT, end-to-end framework

1. Introduction

Text-to-speech (TTS) technology has been widely used in voice assistants, car navigation, e-books and other products. For languages based on graphic symbols, like Chinese, it is necessary to convert the input character sequence into a phoneme sequence before synthesizing speech. Therefore the grapheme-to-phoneme (G2P) conversion component is essential in a Mandarin TTS system.

A Chinese character that has multiple corresponding pronunciations is called a polyphonic character. Polyphone disambiguation, which predicts the correct pronunciation of a polyphonic character, is the core issue in Chinese G2P conversion. Fig. 1 depicts the flow of Chinese G2P conversion. If the input character is not a polyphonic character, we can directly look up the dictionary to derive its pronunciation. Otherwise, we need a polyphone disambiguation model to predict its pronunciation based on its context information.

[Figure 1: Chinese G2P conversion flow. A Chinese character is checked for being polyphonic; if not (N), its pronunciation comes from the pronunciation dictionary; if so (Y), the polyphone disambiguation model uses context information. The resulting pronunciations form the phoneme sequence.]

The pronunciations of Chinese polyphonic characters are affected by the semantic context information [1] of neighboring characters, which may occur before or after the polyphonic character with different spans. The earliest Chinese polyphone disambiguation systems were based on manual rules [2, 3]. The laws for polyphone disambiguation were summarized by linguistic experts and written into computer-understandable forms. However, as the number of rules increases, a polyphonic character may be matched by multiple conflicting rules at the same time.

As the amount of data increased, more and more researchers tried statistical approaches to polyphone disambiguation. Decision trees were applied in [4] to classify the pronunciations of polyphonic characters. A study in [5] has shown that a maximum entropy (Maxent) model outperforms decision trees. However, these statistical approaches need handcrafted features, extracted from the sentence containing the polyphonic character, as the model's input. Feature engineering requires linguistic background knowledge and is expensive.

Deep neural networks (DNNs), which can learn high-level invariant features from raw data [6], provide an easier way to extract valid features and predict the pronunciation of a polyphonic character. Shan et al. [7] addressed polyphone disambiguation as a sequence labeling task and proposed a bidirectional long short-term memory (BLSTM) approach, which outperforms the maximum entropy model. This approach encodes the polyphonic character's surrounding observations, including neighboring characters and the part-of-speech (POS) of neighboring words. Only word tokenization and POS tagging are required in the preprocessing stage of Shan's approach, greatly reducing the work of feature engineering.
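The conversion flow of Fig. 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy dictionary `PRONUNCIATION_DICT`, the function names `g2p` and `first_candidate`, and the callback signature are all assumptions made for illustration, and the stand-in disambiguator ignores context entirely, which a real model would not.

```python
from typing import Callable, Dict, List

# Hypothetical toy dictionary: character -> candidate pronunciations.
# A character with more than one candidate is treated as polyphonic.
PRONUNCIATION_DICT: Dict[str, List[str]] = {
    "我": ["wo3"],
    "高": ["gao1"],
    "长": ["zhang3", "chang2"],
    "了": ["le0", "liao3"],
}

# A disambiguator receives the sentence, the position of the polyphonic
# character and its candidate pronunciations, and returns one candidate.
Disambiguator = Callable[[str, int, List[str]], str]


def g2p(sentence: str, disambiguate: Disambiguator) -> List[str]:
    """Convert a character sequence to pronunciations, following Fig. 1."""
    pronunciations = []
    for index, char in enumerate(sentence):
        candidates = PRONUNCIATION_DICT[char]  # KeyError for unknown characters
        if len(candidates) == 1:
            # Not polyphonic: a dictionary lookup is enough.
            pronunciations.append(candidates[0])
        else:
            # Polyphonic: defer to the disambiguation model.
            pronunciations.append(disambiguate(sentence, index, candidates))
    return pronunciations


def first_candidate(sentence: str, index: int, candidates: List[str]) -> str:
    """Stand-in for a real model: always picks the first candidate."""
    return candidates[0]


print(g2p("我长高了", first_candidate))
```

Keeping the disambiguator behind a callback mirrors the separation in Fig. 1: the dictionary path never changes, and only the model plugged into the polyphonic branch differs between approaches.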

However, since Shan's model [7] is trained on limited annotated data, it is difficult for it to learn enough semantic information for polyphone disambiguation. Besides, all the polyphonic characters considered share a single classifier whose one output layer lists the labels of all possible pronunciations. Such a classifier can mis-predict the pronunciation of a different polyphonic character, and it cannot handle new polyphonic characters without changing the output layer and retraining the shared classifier.

With the development of neural network research, end-to-end TTS has become a new trend [8, 9, 10]. Towards end-to-end G2P conversion in Mandarin Chinese, we propose an end-to-end framework for polyphone disambiguation consisting of a pre-trained BERT [11] and a neural network (NN) based classifier. The advantages of the proposed method are summarized as follows:

1. The proposed method predicts pronunciation in an end-to-end way, accepting a raw Chinese character sequence containing the polyphonic character as input, without the necessity of any preprocessing procedures.

2. A large amount of unlabeled data can be used to pre-train the model for extracting semantic information, which boosts the performance of polyphone disambiguation.

3. The proposed method uses non-shared output layers among different polyphonic characters, eliminating the case of mis-predicting the pronunciation of another polyphonic character. Furthermore, with this architecture, when new polyphonic characters need to be processed, only the output layers for these new characters are added and trained, without affecting the existing classifiers.

2. The proposed approach

The proposed framework consists of a pre-trained BERT and an NN based classifier. As depicted in Fig. 2, the pre-trained BERT extracts semantic features from a raw Chinese character sequence containing the polyphonic character, and the following NN based classifier predicts the polyphonic character's pronunciation according to the BERT output. In our research, we explored the performance of classifiers based on a fully-connected network, a BLSTM and Transformer blocks respectively.

[Figure 2: Model architecture. Character sequence → pre-trained BERT → semantic feature sequence → NN based classifier → labels (pronunciations).]

2.1. The pre-trained BERT

The pre-trained BERT accepts a raw Chinese character sequence as input and outputs a sequence of semantic features. BERT's architecture is shown in Fig. 3. The character embedding layer and the positional embedding layer each process the input character sequence, and their outputs are added to obtain the embedding sequence. The following Transformer blocks convert the embedding sequence to the semantic feature sequence. Because Transformer [12] and BERT are now ubiquitous, the structure of the Transformer block is not described in detail here.

[Figure 3: BERT architecture. Character sequence → character embedding layer and positional embedding layer → add → embedding sequence → N × Transformer block (multi-head attention, add & norm, feed forward, add & norm) → semantic feature sequence.]

The BERT model is pre-trained on a large amount of unlabeled data with two prediction tasks: predicting the masked input characters and predicting the next sentence. The pre-trained BERT model is expected to learn semantic representations from the raw character sequence.

2.2. The NN based classifier

Since the pre-trained BERT extracts semantic features from the raw character sequence, and the pronunciation of a polyphonic character is determined by its contextual semantics, we can directly predict a polyphonic character's pronunciation from these semantic features. Assuming that the polyphonic character is the ith element of the BERT input sequence, the NN based classifier predicts the pronunciation based on the BERT output and the value of the index i. We explored a fully-connected network based classifier, an LSTM based classifier and a Transformer block based classifier in our research.

2.2.1. Fully-connected network based classifier

First of all, we use a two-layer fully-connected network to predict the pronunciation of the polyphonic character according to the ith element of the BERT output sequence. The fully-connected network based classifier is depicted in Fig. 4 (a). The first fully-connected layer is shared by all the polyphonic characters. The second fully-connected (output) layer is not shared: each polyphonic character has a separate output layer whose number of units equals the number of its possible pronunciations. Softmax cross-entropy loss is adopted to train the classifier. The LSTM based classifier and the Transformer block based classifier use the same output layer structure and loss function.

2.2.2. LSTM based classifier

As indicated by [7], contextual information such as the POS of the words neighboring the polyphone can also affect the pronunciation of a polyphonic character.
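All three classifiers end in the same unshared output layer applied at position i, so a minimal sketch of that layer may help. The pure-Python code below is an illustration under stated assumptions, not the authors' code: the class name `PolyphoneClassifier`, the ReLU activation, the random initialisation and the tiny dimensions (8 and 6 instead of 768 and 512) are all choices made here, and random vectors stand in for the BERT output.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Return weight @ x + bias, with the weight matrix stored row by row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Vector, target: int) -> float:
    """Softmax cross-entropy loss for one labelled sample."""
    return -math.log(probs[target])


class PolyphoneClassifier:
    """Shared hidden layer plus one output layer per polyphonic character."""

    def __init__(self, feature_dim: int, hidden_dim: int, seed: int = 0):
        self._rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.w1 = self._init(hidden_dim, feature_dim)  # shared by all characters
        self.b1 = [0.0] * hidden_dim
        self.labels: Dict[str, List[str]] = {}
        self.heads: Dict[str, Tuple[Matrix, Vector]] = {}

    def _init(self, rows: int, cols: int) -> Matrix:
        return [[self._rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    def add_character(self, char: str, pronunciations: List[str]) -> None:
        """Add an output layer for a new character; existing heads are untouched."""
        size = len(pronunciations)
        self.labels[char] = list(pronunciations)
        self.heads[char] = (self._init(size, self.hidden_dim), [0.0] * size)

    def probabilities(self, features: List[Vector], i: int, char: str) -> Vector:
        """Classify from the ith feature vector only, using the head of `char`."""
        hidden = [max(0.0, h) for h in linear(features[i], self.w1, self.b1)]  # ReLU (assumed)
        weight, bias = self.heads[char]
        return softmax(linear(hidden, weight, bias))

    def predict(self, features: List[Vector], i: int, char: str) -> str:
        probs = self.probabilities(features, i, char)
        return self.labels[char][probs.index(max(probs))]


# Random vectors stand in for the BERT output of a 4-character sentence.
rng = random.Random(1)
features = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(4)]

clf = PolyphoneClassifier(feature_dim=8, hidden_dim=6)
clf.add_character("长", ["zhang3", "chang2"])
before = clf.probabilities(features, 1, "长")
clf.add_character("地", ["di4", "de0"])  # new character, new head only
after = clf.probabilities(features, 1, "长")
```

Because each head only spans the pronunciations of its own character, the softmax can never assign probability to a pronunciation of a different character, and `add_character` leaves the outputs for existing characters unchanged (`before` equals `after` here).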

[Figure 4: The NN based classifiers: (a) fully-connected network, (b) LSTM, (c) Transformer block. In (b) and (c), the BERT output passes through two BLSTM layers or two Transformer blocks respectively before the pronunciations are predicted.]

Instead of classifying according to only the ith element of the BERT output sequence, the LSTM based classifier applies a bidirectional LSTM (BLSTM) to model the contextual information before classifying. The LSTM based classifier is shown in Fig. 4 (b). The BERT output sequence is first processed by a two-layer BLSTM network to model the contextual information; then an unshared output layer predicts the pronunciation of the corresponding polyphonic character according to the ith element of the LSTM output sequence.

2.2.3. Transformer block based classifier

Due to the characteristics of recurrent networks, nearby locations have a greater impact than farther locations. In order to better analyze the impact of context information, we also use Transformer blocks to model it. From the perspective of model structure, information at any position is equally important in a Transformer block. The Transformer block based classifier is depicted in Fig. 4 (c). Two Transformer blocks model the contextual information on the BERT output; the following unshared output layer accepts the ith element of the Transformer block output as input and predicts the pronunciation of the corresponding polyphonic character.

3. Experiment and analysis

3.1. Dataset

The experiments were conducted on a dataset extracted from a TTS corpus at Tencent AI Lab. There are 331,325 sentences containing polyphonic characters in the corpus. We selected the polyphonic characters that appear in more than 2,000 sentences; these account for 83.7% of the total polyphonic samples. In our experiments, the dataset was randomly divided into 10 subsets while keeping the distribution of polyphonic characters. Eight subsets were used for training, one subset as the development set and the remaining subset as the test set. We conducted 10-fold cross-validation to get the final average result.

3.2. Settings of baseline approach

We took Shan's LSTM approach for polyphone disambiguation [7] as the baseline (LSTM baseline). The LSTM baseline approach first needs word tokenization and POS tagging on the input character sequence. The LSTM based model then accepts a character embedding sequence and a contextual POS embedding sequence as input. The character embedding is generated from the characters composing the word that contains the polyphonic character. The POS sequence considers the neighboring words in addition to the word containing the polyphonic character.

Fig. 5 depicts the LSTM baseline approach. The prediction of pronunciation is viewed as a sequence labeling task. The baseline model accepts an embedding sequence formed by concatenating the character embedding sequence and the POS embedding sequence, and outputs a label sequence corresponding to the input characters. In our experiments, we set the number of BLSTM hidden units to 512, the number of BLSTM layers to 2 and the contextual size to 1 when constructing the POS embedding sequence, identical to the settings in [7].

[Figure 5: LSTM baseline approach for polyphone disambiguation. Character embedding sequence and POS embedding sequence → concatenate → embedding sequence → two BLSTM layers → LSTM output → position-wise dense → label sequence.]

3.3. Settings of the proposed approach

In our experiments, we adopted the pre-trained BERT model provided by Google to extract semantic features from the raw Chinese character sequence (footnote 1). The details of the model are identical to the BERT-Base model described in [11], whose Transformer block output size is 768. Due to limited data, fine-tuning the pre-trained BERT model did not achieve the desired results, so during the training phase we froze the parameters of the BERT model and only updated the parameters of the NN based classifier.

In the fully-connected network based classifier, the first fully-connected layer has 512 hidden units, with a dropout rate of 0.5 during the training phase. The BLSTM in the LSTM based classifier has 512 hidden units. In the Transformer block based classifier, the dimension of the Transformer block is 512 and the number of heads in multi-head attention is 8, identical to the settings in [12].

Footnote 1: https://github.com/google-research/bert
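The data split of Section 3.1 can be sketched as follows. This is one plausible reading rather than the authors' procedure: the source does not say how the development and test subsets rotate across the 10 folds, so the rotation in `fold` (test subset k, development subset k + 1) is an assumption, as are the function names and the `(character, sentence)` sample format.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical sample format: (polyphonic character, sentence).
Sample = Tuple[str, str]


def stratified_subsets(samples: List[Sample], n_subsets: int = 10,
                       seed: int = 0) -> List[List[Sample]]:
    """Randomly split samples into subsets, keeping each character's share."""
    rng = random.Random(seed)
    by_char: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_char[sample[0]].append(sample)

    subsets: List[List[Sample]] = [[] for _ in range(n_subsets)]
    for char in sorted(by_char):
        group = by_char[char]
        rng.shuffle(group)
        # Deal the shuffled samples of this character out round-robin.
        for position, sample in enumerate(group):
            subsets[position % n_subsets].append(sample)
    return subsets


def fold(subsets: List[List[Sample]], k: int):
    """Return (train, dev, test) for fold k: 8 subsets, 1 subset, 1 subset."""
    n = len(subsets)
    dev_index = (k + 1) % n  # assumed rotation, not stated in the paper
    test = subsets[k]
    dev = subsets[dev_index]
    train = [sample
             for index, subset in enumerate(subsets)
             if index not in (k, dev_index)
             for sample in subset]
    return train, dev, test


# Toy data: 40 samples of one character and 20 of another.
samples = ([("长", f"sentence {n}") for n in range(40)]
           + [("地", f"sentence {n}") for n in range(40, 60)])
subsets = stratified_subsets(samples)
train, dev, test = fold(subsets, 0)
```

With the toy data, every subset holds four samples of the first character and two of the second, so each fold trains on 48 samples and keeps 6 each for development and testing.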

During the training phase, we adopted the Adam [14] optimizer and set the learning rate to 5e-4 for all classifiers except the Transformer block based one. When training the classifier based on Transformer blocks, we used the training strategy described in [12].

3.4. Experimental results and analysis

The experimental results are depicted in Fig. 6. All our proposed methods based on pre-trained BERT outperform the LSTM baseline. Besides, the BERT + LSTM method and the BERT + Transformer method, which model the context of the BERT output, achieve better results than the BERT + FC method. In particular, the BERT + LSTM method achieves the highest prediction accuracy.

[Figure 6: Experimental results: accuracies of different methods, (1) LSTM baseline, (2) BERT + FC, (3) BERT + LSTM, (4) BERT + Transformer block.]

Fig. 7 shows the PCA embedding of the features extracted by pre-trained BERT for the Chinese characters "长" and "地". Both are typical Chinese characters with several distinct meanings. "长" mainly has two distinct meanings, "grow" (pronounced "zhang3") and "long" (pronounced "chang2"), and the PCA embedding in Fig. 7 (a) also mainly falls into two clusters. Similarly, there are two main usages of the character "地": one stands for "ground" (pronounced "di4") and the other connects adverbials and predicates (pronounced "de0"), corresponding to the two clusters of the PCA embedding in Fig. 7 (b). Fig. 7 illustrates that BERT pre-trained on a large amount of unlabeled data can extract valid semantic features. As the experimental results support, these semantic features help to enhance the performance of pronunciation prediction.

[Figure 7: PCA embedding of semantic features corresponding to typical Chinese characters extracted by pre-trained BERT, panels (a) and (b).]

To further illustrate the impact of contextual information on polyphone disambiguation, we plot the attention weights cropped around the polyphonic character. Fig. 8 shows the cropped attention weights averaged over all the heads in the first Transformer block of the classifier on the test set. The attention weights are cropped around the polyphonic character with a neighboring contextual size of 5, which means that the size of the cropped attention weight matrix is (11, 11) and the location (5, 5) corresponds to the polyphonic character. It can be seen from Fig. 8 that the closer a position is to the polyphonic character, the greater its attention weight, which means that more adjacent contextual features have a greater effect in the polyphone disambiguation task. This explains why the LSTM based classifier performs better than the Transformer block based one: the LSTM can better preserve adjacent contextual information due to the characteristics of recurrent networks.

[Figure 8: Average attention weights cropped around the polyphonic character.]

4. Conclusions

In this paper, we proposed an end-to-end framework for Chinese polyphone disambiguation. The proposed framework accepts a raw Chinese character sequence as input, without the necessity of any preprocessing procedures such as word tokenization or POS tagging. It consists of a pre-trained BERT model for extracting semantic features from raw Chinese characters and an NN based classifier for predicting the polyphonic character's pronunciation from the BERT output.

To explore the impact of contextual information on polyphone disambiguation, three different NN based classifiers were investigated and compared with the LSTM baseline approach [7]. Experimental results demonstrate the effectiveness of the proposed end-to-end framework, and show that the BERT model pre-trained on a large amount of unlabeled data can effectively extract semantic features, which greatly enhances the performance of polyphone disambiguation. Meanwhile, contextual information can also improve the result of polyphone disambiguation: the closer the context is to the polyphonic character, the greater its influence.

5. Acknowledgements

This work is supported by the joint research fund of the National Natural Science Foundation of China - Research Grant Council of Hong Kong (NSFC-RGC) (61531166002, N CUHK404/15) and the National Natural Science Foundation of China (61521002, 61433018, 61375027). We would also like to thank the Tencent AI Lab Rhino-Bird Focused Research Program (No. JR201942) and the Tsinghua University - Tencent Joint Laboratory for their support.

6. References

[1] L. Li, "The status quo of multi-syllable words and regular method of recognizing," Economic and Social Development, vol. 8, no. 7, pp. 125–128, 2010.

[2] D. Zhang, "Research and implementation of key techniques of Chinese language switching system HJ-TTS," Ph.D. dissertation, Graduate School of the Chinese Academy of Sciences (Institute of Computing Technology), 2000.

[3] L. Cai, H. Wei, and X. Zhou, "Linguistic processing in Chinese text-to-speech conversion," Chinese Journal of Information, vol. 9, no. 1, pp. 31–36, 1995.

[4] W. Wang, S. Hwang, and S. Chen, "The broad study of homograph disambiguity for Mandarin speech synthesis," in Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP'96, vol. 3. IEEE, 1996, pp. 1389–1392.

[5] F. Liu, Q. Shi, and J. Tao, "Maximum entropy based homograph disambiguation," in The 9th National Conference on Man-Machine Speech Communication, 2007.

[6] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.

[7] C. Shan, L. Xie, and K. Yao, "A bi-directional LSTM approach for polyphone disambiguation in Mandarin Chinese," in 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 2016, pp. 1–5.

[8] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, "Char2wav: End-to-end speech synthesis," 2017.

[9] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., "Tacotron: Towards end-to-end speech synthesis," arXiv preprint arXiv:1703.10135, 2017.

[10] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, "Close to human quality TTS with Transformer," arXiv preprint arXiv:1809.08895, 2018.

[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.

[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[13] S. Li, Z. Zhao, R. Hu, W. Li, T. Liu, and X. Du, "Analogical reasoning on Chinese morphological and semantic relations," arXiv preprint arXiv:1805.06504, 2018.

[14] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.