A Mask-based Model for Mandarin Chinese Polyphone Disambiguation
Haiteng Zhang, Huashan Pan, Xiulin Li
Databaker (Beijing) Technology Co., Ltd, Beijing, China
{zhanghaiteng, panhuashan, lixiulin}@data-baker.com
Abstract
Polyphone disambiguation is an essential part of Mandarin text-to-speech (TTS) systems. However, a conventional system that models the entire Pinyin set can predict a pronunciation belonging to an unrelated polyphonic character instead of the current input one, which negatively impacts TTS performance. To address this issue, we introduce a mask-based model for polyphone disambiguation. The model takes a mask vector extracted from the context as an extra input. In our model, the mask vector not only acts as a weighting factor in Weighted-softmax to prevent such mis-predictions but also eliminates the contribution of the non-candidate set to the overall loss. Moreover, to mitigate the uneven distribution of pronunciations, we introduce a new loss called Modified Focal Loss. Experimental results show the effectiveness of the proposed mask-based model. We also empirically studied the impact of Weighted-softmax and Modified Focal Loss, finding that Weighted-softmax effectively prevents the model from predicting outside the candidate set, and that Modified Focal Loss reduces the adverse impact of the uneven distribution of pronunciations.
Index Terms: polyphone disambiguation, mask vector,
Weighted-softmax, Modified Focal Loss
1. Introduction
The Mandarin G2P (grapheme-to-phoneme) module predicts the Pinyin sequence corresponding to the input characters and consists of polyphone disambiguation, tonal modification, retroflex suffixation [1], etc. Polyphone disambiguation, which aims to predict the correct pronunciation of the given polyphonic characters, is an essential component of a Mandarin G2P conversion system. According to previous research [1, 2, 3, 4], the difficulty of Mandarin polyphone disambiguation mainly lies in heteronyms: their pronunciations cannot be determined by the word itself but require additional lexical and contextual information, such as Chinese word segmentation, POS (part-of-speech) tagging, syntactic parsing and semantics.
The earliest approaches to polyphone disambiguation mainly relied on dictionaries and rules. The pronunciations of polyphonic characters were decided by a well-designed dictionary and rules crafted by linguistic experts [1, 2]. However, the rule-based method requires a massive investment of labor to build and maintain a robust dictionary. As the amount of available data increased, statistical methods were widely applied to polyphone disambiguation. Experimental results confirmed that statistical methods such as Decision Trees (DT) and Maximum Entropy (ME) achieve reasonable performance [3, 4]. However, statistical approaches also demand considerable feature engineering effort.
The recent tremendous success of neural networks in various fields has driven polyphone disambiguation toward neural network-based models. [5] addressed the task as sequence labelling and adopted a bidirectional long short-term memory (BLSTM) architecture to predict the pronunciation of the input polyphonic characters, proving that the BLSTM could benefit the task. [6] combined multi-granularity features as input and yielded further improvement. The recent emergence of pre-trained models [7-11] led researchers to approach polyphone disambiguation with these models. With their powerful semantic representations, pre-trained models help the system achieve better performance. Bidirectional Encoder Representations from Transformers (BERT) was applied in the front-end of a Mandarin TTS system and showed that the pre-trained model outperforms previous methods [12]. A Transformer-based neural machine translation (NMT) encoder also has a positive effect on the task [13]. However, to avoid predictions belonging to an unrelated polyphonic character rather than the current input one, existing systems either model each polyphonic character separately or model the entire Pinyin set uniformly while restricting the output layer. The drawback of the former is complex maintenance due to the large number of models, while the latter only limits the prediction output and ignores the impact of the restriction on other modules during training. Besides, the unbalanced distribution among polyphones is also harmful to the task.
To address these issues, we propose a mask-based architecture for Mandarin polyphone disambiguation that employs a mask vector. In the proposed framework, features including the mask vector are taken as input. We then apply an encoding layer composed of a BLSTM and a convolutional neural network (CNN) to obtain semantic features. Weighted-softmax is later used to pick the pronunciation of the polyphonic character. In the proposed model, the mask vector plays the following roles: 1) it enriches the input features; 2) it acts as a weighting factor in Weighted-softmax to prevent the model from mis-predicting the Pinyin of other polyphonic characters; 3) the candidate constraints imposed by the mask vector propagate to the loss function and thus better guide the training process. In this way, the proposed approach not only models the entire set of polyphonic characters within one model but also eliminates mis-prediction without harming the training process. In addition, to mitigate the uneven distribution of pronunciations among polyphonic characters, we introduce a new loss function called Modified Focal Loss. Our experiments demonstrate that the proposed approach avoids predicting outside the candidate set and eases the imbalanced distribution without harming performance.
The organization of this paper is as follows. Section 1 reviews the background of polyphone disambiguation. Section 2 introduces the various input features of our model. Section 3 describes our model structure. Section 4 presents the experimental details and results. Section 5 gives the conclusions and looks to future research.
2. Features
According to research [1-4], features such as Chinese word segmentation, POS tagging, and contextual information are essential to the task. Therefore, we take the Chinese characters of the input sentence and the corresponding lexical information, such as Chinese word segmentation and POS tags, as input features. Similar to [5], we use a flag token to identify whether the current input character needs disambiguation. Meanwhile, polyphonic characters, apart from the Chinese character feature, also carry an extra feature to enrich the information provided. Additionally, we adopt a mask vector to encode the relationship between polyphonic characters and their candidate Pinyin sets. The mask vector, which consists of Boolean values, denotes the Pinyin candidates of the input polyphonic character. For instance, the polyphonic character “会” can be pronounced as “hui4” or “kuai4”, so the corresponding entries of the mask vector are assigned the value 1, while all other entries are assigned the value 0. We add two additional tokens to the mask vector to indicate monophonic characters and unlabelled polyphonic characters. The mask vector here enriches the input features; it also acts as a weighting factor in Weighted-softmax, as described in Section 3.
Finally, we convert the character sequence to embeddings as the model input, along with the auxiliary features mentioned above from the corresponding sentence.
In summary, the proposed model uses a total of six features: Chinese character, Chinese word segmentation, POS tagging, polyphones, flag token, and mask vector. Details of the various features are described below:
• Chinese Character (CC): all characters within the corpus, including monophonic and polyphonic characters;
• Chinese Word Segmentation (CWS): word segmentation results at the character level, represented by {B, M, E, S} tags;
• Part of Speech (POS): we perform POS tagging on the input sentence and assign the tags at the character level;
• Polyphones (PP): a collection of all polyphonic characters within the corpus, along with a non-polyphone token;
• Flag Token (Flag): the value range is {0, 1, 2}, denoting that the current character is disambiguation-needed, disambiguation-needless, or monosyllabic, respectively;
• Mask Vector (Mask): the dimension of the mask vector equals the size of the Pinyin set plus two special tokens, “<UN_LABEL>” and “<NO_LABEL>”. The former denotes monophonic characters, while the latter denotes polyphonic characters that do not require disambiguation.
In the example sentence “仅会在行业规范和会计制度方面进行指导” (It will only provide guidance on occupational standards and the accounting system), we assume only some of the polyphonic characters in the sentence are labelled: the first “会” (candidate set [hui4, kuai4], correct pronunciation hui4), “行” (candidate set [hang2, xing2], correct pronunciation hang2) and “和” (candidate set [he2, he4, huo4, huo2, hu2], correct pronunciation he2); other polyphonic characters, such as the second “会” and the second “行”, are not labelled. The relevant input features of the example sentence are shown in Figure 1.
Figure 1: Input features of the given sample sentence (per character: CC, CWS tag, POS tag, PP token, Flag value, and the Boolean mask rows over the Pinyin entries hui4–hu2 plus <UN_LABEL>/<NO_LABEL>).
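As an illustration of how the Mask feature is built for the example above, the following minimal sketch constructs the mask vector for one character over a toy Pinyin inventory. The candidate dictionary, inventory, and function names are our own illustrative assumptions, not resources released with this paper.

```python
# Toy inventory restricted to the example sentence; the real Pinyin set
# covers the whole corpus (this is an assumption for illustration).
PINYIN_SET = ["hui4", "kuai4", "hang2", "xing2", "he2", "he4",
              "huo4", "huo2", "hu2"]
SPECIAL = ["<UN_LABEL>", "<NO_LABEL>"]      # special tokens from Section 2
VOCAB = PINYIN_SET + SPECIAL
IDX = {p: i for i, p in enumerate(VOCAB)}

# Hypothetical candidate dictionary for the example characters.
CANDIDATES = {"会": ["hui4", "kuai4"],
              "行": ["hang2", "xing2"],
              "和": ["he2", "he4", "huo4", "huo2", "hu2"]}

def mask_vector(char: str, needs_disambiguation: bool) -> list:
    """Return a Boolean mask over VOCAB for one input character."""
    mask = [0] * len(VOCAB)
    if char not in CANDIDATES:            # monophonic character
        mask[IDX["<UN_LABEL>"]] = 1
    elif not needs_disambiguation:        # unlabelled polyphonic character
        mask[IDX["<NO_LABEL>"]] = 1
    else:                                 # enable only the candidate Pinyin
        for p in CANDIDATES[char]:
            mask[IDX[p]] = 1
    return mask

print(mask_vector("会", True))    # 1s only at hui4 / kuai4
print(mask_vector("规", False))   # 1 only at <UN_LABEL>
```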
3. Mask-based Mandarin Chinese polyphone disambiguation model
Figure 2 depicts the proposed model’s architecture, which is mainly composed of three parts:
1. Character-level Feature Embedding Layer: this layer integrates the various input features, together with the mask vector, into a low-dimensional, dense vector. First, each feature is converted into a one-hot label, which is then transformed into an embedding vector by an FNN (feedforward neural network). The embeddings of the different features are concatenated and transformed into a fixed-length vector by an MFNN (multi-layer feedforward neural network).
2. Context Feature Encoding Layer: accepting the sequence of vectors from the character-level feature embedding layer, this module first extracts the semantic information of the sentence with both a BLSTM and a 1D-CNN. FNN layers then integrate the obtained context sequence into a dense vector representing each character in the sentence. We use both a BLSTM and a 1D-CNN to jointly encode contextual information for two reasons: 1) the BLSTM has an elegant way of encoding sentence-level information, which is extremely helpful for tasks that need long-range context; 2) the 1D-CNN is effective at extracting n-gram-level contextual features that are critical for polyphone disambiguation [14, 15].
3. Restricted Output Layer: to restrict the prediction to the candidate set of the current input polyphonic character, the restricted output layer applies Weighted-softmax, which combines the mask vector with softmax to pick the highest-probability pronunciation within the candidate set. In addition, the proposed model adopts Modified Focal Loss rather than cross-entropy as the loss function.
In this work, we explore the Weighted-softmax and Modified Focal Loss modules in terms of improving the performance of polyphone disambiguation.
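For concreteness, here is a minimal PyTorch sketch of the three-part architecture. The paper does not release code, so all dimensions, layer sizes, and names below are our own illustrative assumptions, not the authors’ exact configuration.

```python
# Minimal sketch of the three-part model (illustrative assumptions throughout).
import torch
import torch.nn as nn

class MaskBasedDisambiguator(nn.Module):
    def __init__(self, feat_sizes, mask_dim, emb_dim=64, hidden=512):
        super().__init__()
        # 1) Character-level feature embedding layer: one embedding table per
        #    discrete feature (CC, CWS, POS, PP, Flag); the mask vector is
        #    projected with a linear layer instead of a lookup (assumption).
        self.embs = nn.ModuleList([nn.Embedding(n, emb_dim) for n in feat_sizes])
        self.mask_proj = nn.Linear(mask_dim, emb_dim)
        self.mfnn = nn.Sequential(
            nn.Linear(emb_dim * (len(feat_sizes) + 1), hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # 2) Context feature encoding layer: BLSTM for sentence-level context
        #    plus a 1D-CNN for local n-gram features, fused by an FNN.
        self.blstm = nn.LSTM(hidden, hidden // 2, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.cnn = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.fnn = nn.Linear(hidden * 2, hidden)
        # 3) Restricted output layer: logits over the Pinyin set; the mask is
        #    applied afterwards by Weighted-softmax (Section 3.1).
        self.out = nn.Linear(hidden, mask_dim)

    def forward(self, feats, mask):
        # feats: list of [batch, seq] LongTensors; mask: [batch, seq, mask_dim]
        x = [emb(f) for emb, f in zip(self.embs, feats)]
        x.append(self.mask_proj(mask.float()))
        h = self.mfnn(torch.cat(x, dim=-1))                  # [B, T, hidden]
        lstm_out, _ = self.blstm(h)                          # [B, T, hidden]
        cnn_out = self.cnn(h.transpose(1, 2)).transpose(1, 2)
        ctx = self.fnn(torch.cat([lstm_out, cnn_out], dim=-1))
        return self.out(ctx)                                 # unnormalized logits
```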
Figure 2: Network architecture of the proposed model. Each input feature (CC, CWS, POS, PP, Flag, Mask) is one-hot encoded and embedded by an FNN; the embeddings are concatenated and passed through an MFNN, encoded by a Bi-LSTM and a 1D-CNN whose outputs are concatenated and fused by an FNN, and finally fed to Weighted-softmax for Pinyin prediction. (Example input: 仅/d 会/v 在/p 行业规范/n 和/c 会计制度/n 方面/n 进行/v 指导/v; predicted output: 仅会(hui4)在行(hang2)业规范和(he2)会计制度方面进行指导.)
3.1. Weighted-softmax
For each polyphonic character, the candidate range of pronunciations is very limited, occupying only a small part of the entire Pinyin set. Under standard softmax, every Pinyin is assigned a non-zero probability, so the sum of the probabilities over the candidate set is less than 1. This introduces additional error and makes the overall loss larger, which in turn negatively influences the training process. To address this issue, we construct Weighted-softmax by using the mask vector as a weighting factor in softmax.
Suppose the input vector of softmax is z = {z_1, z_2, ..., z_n}, where z_i is the i-th element of z. For Weighted-softmax, the probability of each element is computed as follows:

    p_i = (m_i · e^{z_i}) / (Σ_{j=1}^{n} m_j · e^{z_j})    (1)

where the mask vector is denoted as m = {m_1, m_2, ..., m_n} and m_i is a Boolean value denoting whether element z_i is masked. With Weighted-softmax, we can ensure that no probability is allocated to non-candidate pronunciations and that the probabilities assigned to the candidate set sum to 1. In this way, we effectively prevent the model from predicting Pinyin outside the candidate set. Moreover, when calculating the loss, Weighted-softmax eliminates the influence of the non-candidate Pinyin set on the model, thus focusing training on the candidate Pinyin set.
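A minimal sketch of Eq. (1) in PyTorch follows; the tensor shapes and the toy example are our own assumptions for illustration.

```python
# Weighted-softmax (Eq. 1): the Boolean mask zeroes the probability of every
# non-candidate Pinyin, so the candidate probabilities sum to 1.
import torch

def weighted_softmax(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """p_i = m_i * exp(z_i) / sum_j m_j * exp(z_j)."""
    # Subtracting the max is for numerical stability only; it cancels in the
    # ratio and does not change Eq. (1).
    exp_z = torch.exp(logits - logits.max(dim=-1, keepdim=True).values)
    weighted = mask.float() * exp_z           # zero out non-candidates
    return weighted / weighted.sum(dim=-1, keepdim=True)

# Toy usage: two of five Pinyin are candidates.
logits = torch.tensor([2.0, 0.5, 1.0, 3.0, -1.0])
mask = torch.tensor([1, 0, 1, 0, 0])
p = weighted_softmax(logits, mask)
print(p)   # non-candidates get exactly 0; candidate probabilities sum to 1
```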
3.2. Modified Focal Loss
Due to the uneven distribution of Pinyin, paying excessive attention to frequent, easily classified examples makes the model less precise on rare, hard examples, thereby degrading system performance. Concerning this uneven distribution of pronunciations among polyphones, and inspired by [16], we introduce a novel loss named Modified Focal Loss (MFL) by adding a tunable confidence parameter β to Focal Loss. In Modified Focal Loss, β serves as a threshold to distinguish between frequent/easy examples and rare/hard examples, down-weighting the contribution of the former and up-weighting that of the latter. In this way, Modified Focal Loss enables the model to better classify rare and hard examples.
The equation of Focal Loss is as follows:

    FL(p_t) = −(1 − p_t)^γ log(p_t)    (2)

where p_t denotes the model’s predicted probability for the true label, with value range [0, 1]. We propose to add the confidence parameter to Focal Loss; the proposed Modified Focal Loss is defined as follows:

    MFL(p_t) = −(1 + β − p_t)^γ log(p_t)    (3)

Both β and γ are hyper-parameters: β is the tunable confidence parameter, with value range (0.0, 1.0), and γ is the tunable focusing parameter, with value range (0, +∞). When the system’s estimated probability for the true pronunciation is greater than β, the current input polyphonic character is considered easy to classify, and the loss of the corresponding sample is down-weighted in the overall loss. Otherwise, the input polyphonic character is considered difficult to classify, and its contribution to the overall loss is enhanced.
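A minimal sketch of Eqs. (2) and (3), with β and γ set to the values used in Section 4.2; the toy probabilities are our own illustration.

```python
# Focal Loss (Eq. 2) and Modified Focal Loss (Eq. 3).
import torch

def focal_loss(p_t: torch.Tensor, gamma: float = 0.7) -> torch.Tensor:
    return -((1.0 - p_t) ** gamma) * torch.log(p_t)

def modified_focal_loss(p_t: torch.Tensor, beta: float = 0.5,
                        gamma: float = 0.7) -> torch.Tensor:
    # For p_t > beta (easy example) the factor (1 + beta - p_t) is < 1 and
    # down-weights the loss; for p_t < beta (hard example) it is > 1 and
    # up-weights it, matching the threshold behaviour described above.
    return -((1.0 + beta - p_t) ** gamma) * torch.log(p_t)

p = torch.tensor([0.9, 0.3])          # easy vs. hard example
print(focal_loss(p))
print(modified_focal_loss(p))
```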
4. Experiment
4.1. Dataset
To verify the proposed method, experiments were conducted on a dataset from DataBaker (https://www.data-baker.com/bz_dyz_en.html). The corpus contains 692,357 sentences, each of which contains at least one polyphonic character. We split the dataset into a training set of 623,320 sentences and a test set of 69,037 sentences. Table 1 shows the statistics of the corpus.
Table 1: Statistical information of the corpus.

Character   Polyphone   Training set   Test set
量           liang4      5,402          571
量           liang2      156            20
当           dang1       9,070          1,060
当           dang4       720            80
相           xiang1      6,003          659
相           xiang4      785            74
……
Overall     -           623,320        69,037
As shown in Table 1, the frequency of different pronunciations of a polyphonic character varies greatly in both the training set and the test set. The polyphonic character “量” can be pronounced as “liang4” or “liang2”; however, the Pinyin “liang4” appears 5,402 times in the training set and 571 times in the test set, far more often than “liang2”. The same situation occurs for the polyphonic characters “当” and “相”. This reveals the uneven distribution of polyphones within the dataset.
4.2. Experimental Setting
We implemented the following five systems and used accuracy as the evaluation criterion for comparison:
1. BLSTM: strictly following the description in [5], we implemented a BLSTM model as the baseline. NLPIR is adopted for Chinese word segmentation and POS tagging of the input sentence. We set the number of BLSTM layers to 2 and the hidden size to 512. The context size around polyphonic characters is set to 1 to construct the POS sequence, following [5].
2. B-CNN: the input sequence includes Chinese character, Chinese word segmentation, POS tagging, polyphone token and flag token. The context feature encoding layer, consisting of a BLSTM and a CNN, captures long-range context features. The number of layers in the BLSTM and CNN is set to 2. The hidden size of the BLSTM is 512. The strides of the CNN are 2, 3 and 4, and the kernel number is 64. Rather than modelling context words, we treat the whole input sentence as one instance. During training, we adopt Adam as the optimizer, set the learning rate to 5.0e-4, and split the corpus into mini-batches with a batch size of 128.
3. LC-W: same as System 2, but additionally applying Weighted-softmax in the model.
4. LC-F: same as System 3, but applying Focal Loss in the model. The parameter γ is set to 0.7.
5. LC-WM: same as System 3, but additionally applying Modified Focal Loss in the model. The parameter β is set to 0.5 and the parameter γ is set to 0.7.
4.3. Results and analysis
4.3.1. Evaluation of different systems
Table 2 shows the accuracy of polyphone disambiguation for the different systems. B-CNN outperformed the BLSTM baseline model. Besides, LC-W gained a better result than B-CNN, verifying that the mask vector strengthens the input features and allows the model to focus on the candidate polyphones. LC-F achieved performance similar to LC-W. Notably, the LC-WM method achieved the best performance, showing that Modified Focal Loss can alleviate the imbalance of the polyphone distribution.
Table 2: The accuracy (%) for different systems.

System   Acc
BLSTM    95.55
B-CNN    97.44
LC-W     97.85
LC-F     97.82
LC-WM    97.92
4.3.2. Impact of Weighted-softmax
To illustrate the impact of Weighted-softmax, for the sentence “他提醒大家明天依旧要注意防晒防中暑” (He reminds everyone to protect themselves from the sun and avoid heatstroke tomorrow), we draw the estimated probability distribution of the polyphonic character “中” over part of the Pinyin set in Figure 3. The darker the location, the greater the probability of the corresponding pronunciation. Figure 3(a) displays the probability distribution of the prediction for “中” from the B-CNN system, while Figure 3(b) shows the distribution from the LC-W system. In this sentence, the true pronunciation of “中” is “zhong4”. As shown in Figure 3(a), the B-CNN system predicts the pronunciation “huan2”, which is not reasonable; moreover, probability is allocated over the entire Pinyin set rather than the candidate set. In Figure 3(b), only the probabilities of “zhong1” and “zhong4” are non-zero; both are candidates of “中”, and the system correctly predicted the pronunciation as “zhong4”.

Figure 3: Probability distribution of “中”: (a) system B-CNN; (b) system LC-W.
4.3.3. Impact of Modified Focal Loss
To illustrate the role of Modified Focal Loss, we collected the accuracy of several polyphones that suffer from the imbalanced distribution described in Section 4.1. The accuracy rates of the systems LC-W, LC-F and LC-WM are listed in Table 3.

Table 3: The accuracy (%) of polyphonic characters.

Character   Polyphone   LC-W    LC-F    LC-WM
量           liang4      98.60   99.82   99.65
量           liang2      70.00   55.00   70.00
当           dang1       98.49   98.77   99.39
当           dang4       80.00   77.50   87.50
相           xiang1      99.24   98.94   99.39
相           xiang4      94.59   94.59   95.95
The experimental results revealed that Modified Focal Loss is highly conducive to minimizing the adverse influence of the imbalanced distribution within the Pinyin set. The accuracy of “dang4” in LC-WM is 7.5% higher than that of LC-W and 10% higher than that of LC-F. In the case of “xiang4”, LC-WM is 1.36% higher than both LC-W and LC-F. Besides, LC-WM also showed slight improvements over the other systems on frequent examples such as “dang1” and “xiang1”. This indicates that Modified Focal Loss improves the model’s competency in classifying rare and hard examples without harming its performance on frequent examples.
5. Conclusions
In this paper, we proposed a mask-based architecture for Mandarin Chinese polyphone disambiguation, where the mask vector is not only a part of the input features but also a weighting factor in Weighted-softmax. Besides, we replaced the cross-entropy loss with Modified Focal Loss. The proposed architecture achieves an accuracy of 97.92%, a 2.37% improvement over the baseline model. The experimental results demonstrate that the mask vector can effectively prevent the model from predicting outside the candidate set. In addition, Modified Focal Loss can ease the distribution imbalance of the Pinyin set.
In the future, we will combine the proposed Weighted-softmax and Modified Focal Loss with pre-trained models such as ELMo and BERT for the task of polyphone disambiguation.
6. Acknowledgement
The authors would like to thank the data team for their assistance.
7. References
[1] L. Yi, L. Jian, H. Jie, and Z. Xiong, “Improved Grapheme-to-Phoneme Conversion for Mandarin TTS,” Tsinghua Science & Technology, vol. 14, no. 5, pp. 606–611, 2009.
[2] H. Dong, J. Tao, and B. Xu, “Grapheme-to-Phoneme Conversion in Chinese TTS System,” in 2004 International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 165–168, 2004.
[3] H. Zhang, J. Yu, W. Zhan, and S. Yu, “Disambiguation of Chinese Polyphonic Characters,” in The First International Workshop on MultiMedia Annotation (MMA2001), vol. 1, pp. 30–1, 2001.
[4] J. Liu, W. Qu, X. Tang, Y. Zhang, and Y. Sun, “Polyphonic Word Disambiguation with Machine Learning Approaches,” in 2010 Fourth International Conference on Genetic and Evolutionary Computing (ICGEC), pp. 244–247, 2010.
[5] C. Shan, L. Xie, and K. Yao, “A Bi-directional LSTM Approach for Polyphone Disambiguation in Mandarin Chinese,” in 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), IEEE, pp. 1–5, 2016.
[6] Z. Cai, Y. Yang, C. Zhang, X. Qin, and M. Li, “Polyphone Disambiguation for Mandarin Chinese Using Conditional Neural Network with Multi-level Embedding Features,” in Proceedings of Interspeech 2019, pp. 2110–2114, 2019.
[7] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in Proceedings of the ICLR, 2015.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[9] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep Contextualized Word Representations,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 2227–2237, 2018.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of NAACL-HLT, vol. 1, pp. 4171–4186, 2019.
[11] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. Le, “XLNet: Generalized Autoregressive Pretraining for Language Understanding,” in Proceedings of NeurIPS, vol. 32, pp. 5754–5764, 2019.
[12] D. Dai, Z. Wu, S. Kang, X. Wu, J. Jia, D. Su, and H. Meng, “Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-Trained BERT,” in Proceedings of Interspeech 2019, pp. 2090–2094, 2019.
[13] B. Yang, J. Zhong, and S. Liu, “Pre-Trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis,” in Proceedings of Interspeech 2019, pp. 4480–4484, 2019.
[14] Y. Kim, “Convolutional Neural Networks for Sentence Classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, 2014.
[15] J. R. Novak, N. Minematsu, and K. Hirose, “Failure Transitions for Joint n-gram Models and G2P Conversion,” in Proceedings of Interspeech 2013, pp. 1821–1825, 2013.
[16] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for Dense Object Detection,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988, 2017.