In recent years, BERT-style pretrained models have emerged in rapid succession. By pretraining on massive text corpora, they allow models to acquire richer linguistic knowledge. Among them, RoBERTa [5] improves on BERT with a dynamic masked language modeling (MLM) objective and larger batch sizes, and removes the next sentence prediction (NSP) task, which brought little measurable benefit; as a result, it achieves better performance on downstream tasks.

BERT-style models learn the MLM task quite thoroughly and capture correlations between tokens well. For sentence-level representations, however, the gap between high-frequency and low-frequency words causes the mean-pooled sentence embeddings to be non-smooth (anisotropic) in the embedding space [6], so directly computing similarity between these embeddings does not reliably measure textual similarity. We therefore need to design a more suitable sentence-level finetuning task, tailored to the requirements of the business scenario, so that the sentence representations become more reasonable.
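To make the anisotropy issue concrete, below is a minimal sketch of extracting mean-pooled sentence embeddings from a RoBERTa-style checkpoint with the HuggingFace transformers library and comparing two sentences by cosine similarity. The checkpoint name is illustrative, not the one used in this work; the point is that, without sentence-level finetuning, the score produced this way tends to be an unreliable similarity measure.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; swap in whichever RoBERTa-style model you use.
MODEL_NAME = "hfl/chinese-roberta-wwm-ext"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def mean_pooling(last_hidden_state, attention_mask):
    # Average the token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).float()        # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)     # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)           # avoid division by zero
    return summed / counts

sentences = ["今天天气不错", "今天天气很好"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

embeddings = mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])

# Cosine similarity between the two sentence vectors. Without sentence-level
# finetuning, this score is often skewed by the anisotropic embedding space [6].
sim = torch.nn.functional.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(sim.item())
```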
[1] Mikolov T., Chen K., Corrado G., et al. Efficient Estimation of Word Representations in Vector Space. 2013.
[2] Pennington J., Socher R., Manning C. GloVe: Global Vectors for Word Representation. In: Proceedings of EMNLP. 2014.
[3] Joulin A., Grave E., Bojanowski P., et al. Bag of Tricks for Efficient Text Classification. In: Proceedings of EACL (Volume 2, Short Papers). 2017.
[4] Kim Y. Convolutional Neural Networks for Sentence Classification. 2014.
[5] Liu Y., Ott M., Goyal N., et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019.
[6] Li B., Zhou H., He J., et al. On the Sentence Embeddings from Pre-trained Language Models. In: Proceedings of EMNLP. 2020.
[7] https://github.com/ZhuiyiTechnology/simbert
[8] Gao T., Yao X., Chen D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. 2021.