Concretely, vector recall means that given the dialogue context (Context, Q) we retrieve a set of answers (Response, A), and a basic design question is the choice of retrieval scheme (QQ vs. QA). We ultimately chose QQ retrieval: we build Context-Response pairs, encode the current Context as a vector, retrieve similar historical Contexts from the index, and then return the Responses paired with those historical Contexts as the recall results. The core reason for this choice is that the relation between a Context and its Response is not simply one of semantic similarity or relatedness; it is largely a sequential, inferential relation that is hard to handle directly with similarity- or distance-based vector retrieval. Introducing historical Contexts as a "bridge" makes the modeling considerably simpler.
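As a concrete illustration, the sketch below shows the QQ recall flow: index the historical Contexts, search with the current Context, and return the paired historical Responses as candidates. This is a minimal sketch assuming a sentence-transformers encoder and a FAISS inner-product index; the model name, example data, and top-k value are placeholders, not our production setup.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Historical Context-Response pairs (toy data).
pairs = [
    ("my order hasn't arrived yet", "Sorry about that, let me check the delivery status for you."),
    ("how do I get a refund", "You can request a refund from the order detail page."),
]

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder model

# QQ indexing: embed the historical Contexts (not the Responses).
ctx_vecs = encoder.encode([c for c, _ in pairs], normalize_embeddings=True)
index = faiss.IndexFlatIP(ctx_vecs.shape[1])   # inner product == cosine after normalization
index.add(np.asarray(ctx_vecs, dtype="float32"))

# QQ recall: embed the current Context, find similar historical Contexts,
# and return their paired Responses as the recall results.
query = encoder.encode(["where is my delivery"], normalize_embeddings=True)
_, idx = index.search(np.asarray(query, dtype="float32"), 1)
candidates = [pairs[i][1] for i in idx[0]]
print(candidates)
```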
Pretrained language models (e.g., BERT, GPT) are now widely used across NLP tasks. A large body of work has shown that, even without extra data, simply continuing pretraining on in-domain data (Domain-Adaptive Pretraining) with generic objectives such as Masked Language Model (MLM) and Sentence Order Prediction (SOP) still improves downstream performance. One can also perform Task-Specific Pretraining, so that the model learns task-relevant information and patterns ahead of fine-tuning. Moreover, since most pretraining objectives are self-supervised, they can also serve as auxiliary tasks trained jointly with the main task in a Multi-Task Learning framework.
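Below is a minimal sketch of domain-adaptive MLM pretraining with HuggingFace Transformers, assuming a BERT-style checkpoint and an in-domain text file with one utterance or context per line; the checkpoint name, file path, and hyperparameters are illustrative assumptions only.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-chinese"                      # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# In-domain dialogue text, one line per utterance/context (hypothetical path).
raw = load_dataset("text", data_files={"train": "domain_dialogs.txt"})
tokenized = raw["train"].map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked, as in standard MLM.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dapt-mlm", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```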
Token-level tasks are mostly generic NLP objectives. The simplest is the Language Model (LM) task, which predicts the next token from the preceding context. BERT's Masked Language Model (MLM) task predicts the masked tokens from the remaining tokens in the sentence. XLNet's Permutation Language Model (PLM) task randomly permutes the tokens of a sentence and autoregressively predicts the tokens at the end of the permuted order.
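To make the differences concrete, the toy sketch below (plain Python, not actual pretraining code) shows how each objective turns one token sequence into inputs and prediction targets; the function names and the simplified PLM treatment are assumptions for illustration.

```python
import random

def lm_example(token_ids):
    """Causal LM: predict token t+1 from tokens <= t."""
    return token_ids[:-1], token_ids[1:]

def mlm_example(token_ids, mask_id, mask_prob=0.15):
    """MLM: replace ~15% of tokens with [MASK]; only those positions are supervised.
    -100 marks positions ignored by PyTorch's CrossEntropyLoss."""
    inputs, targets = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            targets[i] = tok
            inputs[i] = mask_id
    return inputs, targets

def plm_example(token_ids, num_predict=2):
    """PLM (XLNet-style, simplified): sample a random factorization order and
    autoregressively predict the last `num_predict` positions of that order."""
    order = list(range(len(token_ids)))
    random.shuffle(order)
    return order[:-num_predict], order[-num_predict:]   # context positions, predicted positions
```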
The multi-way relative view [14]: ordering relations, with an emphasis on the diversity of response quality; the main effort goes into constructing data that models finer-grained degrees of better and worse. The authors use Generation or Retrieval to construct so-called Grayscale data and expect the model to learn the progressive relation "Ground-Truth Response > Grayscale Response > Random-Sampled Response"; the final loss jointly models three ordering relations: "Ground Truth > Random", "Ground Truth > Retrieval > Random", and "Ground Truth > Generation > Random". The usual modeling approach is Pairwise.
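One possible way to materialize these ordering constraints as pairwise training samples is sketched below; the data structure and function names are hypothetical and only illustrate how one (context, ground truth) pair expands into "better vs. worse" pairs covering the three relations.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class RankedSample:
    context: str
    better: str   # response that should score higher
    worse: str    # response that should score lower

def build_grayscale_pairs(context: str,
                          gt_response: str,
                          retrieved: List[str],
                          generated: List[str],
                          response_pool: List[str]) -> List[RankedSample]:
    """Expand one (context, ground truth) pair into pairwise samples encoding
    GT > retrieval > random and GT > generation > random orderings."""
    rand = random.choice(response_pool)
    samples = [RankedSample(context, gt_response, rand)]              # GT > random
    for grayscale in retrieved + generated:
        samples.append(RankedSample(context, gt_response, grayscale)) # GT > grayscale
        samples.append(RankedSample(context, grayscale, rand))        # grayscale > random
    return samples
```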
In our scenario, the typical contrast between the two approaches is shown in Figure 9 below; the difference lies in whether the recalled set is treated as hard negatives or as grayscale data.

Figure 9: Two ranking formulations (Pointwise vs. Pairwise)

The baseline model described above uses the Pointwise formulation: for each (context, response) pair it learns a score between 0 and 1, trained with a cross-entropy loss. The Pairwise formulation instead works on triplets; it does not care about the absolute scores, only that the more relevant response scores higher. Two kinds of loss functions are common. The first is the well-known RankNet [15] loss, written in Logistic form as

$\mathcal{L}_{\text{logistic}} = \log\left(1 + e^{-(s^{+} - s^{-})}\right),$

where $s^{+}$ and $s^{-}$ denote the scores of the two Responses: when $s^{+} > s^{-}$ the loss is small, and when $s^{+} < s^{-}$ the loss is large. The second is the hinge loss, written in Hinge form as

$\mathcal{L}_{\text{hinge}} = \max\left(0,\; m - (s^{+} - s^{-})\right),$

where m is the margin threshold and a positive loss indicates that a wrong answer has been ranked ahead of the correct one. Experiments show that under the Pairwise setting the Logistic form outperforms the Hinge form, and that the GT > Retrieval > Random augmentation is effective. At the same time, neither Pointwise nor Pairwise is absolutely better; which one wins depends on the scenario and the data characteristics. In practice, Pairwise works better in the online-agent CHAT scenario, Pointwise works better in the merchant IM scenario, and joint modeling (Pointwise+Pairwise or Pointwise→Pairwise) brings a slight further improvement.
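For reference, the snippet below sketches the three losses discussed here in PyTorch. It is a toy sketch: the tensor values and the margin are illustrative, not our production configuration.

```python
import torch
import torch.nn.functional as F

def pointwise_loss(scores, labels):
    """Pointwise: binary cross-entropy on the sigmoid of each (context, response) score."""
    return F.binary_cross_entropy_with_logits(scores, labels.float())

def pairwise_logistic_loss(pos_scores, neg_scores):
    """Pairwise, Logistic (RankNet-style): log(1 + exp(-(s+ - s-)))."""
    return F.softplus(neg_scores - pos_scores).mean()

def pairwise_hinge_loss(pos_scores, neg_scores, margin=0.3):
    """Pairwise, Hinge: max(0, m - (s+ - s-))."""
    return F.relu(margin - (pos_scores - neg_scores)).mean()

# Toy usage: s+ comes from (context, better response), s- from (context, worse response).
pos = torch.tensor([2.1, 0.3])
neg = torch.tensor([1.0, 0.8])
print(pairwise_logistic_loss(pos, neg), pairwise_hinge_loss(pos, neg))
```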
[1] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
[2] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "Glove: Global vectors for word representation." Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.
[3] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
[4] Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." arXiv preprint arXiv:1908.10084 (2019).
[5] Liu, Yiding, et al. "Pre-trained language model for web-scale retrieval in baidu search." Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021.
[6] Humeau, Samuel, et al. "Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring." arXiv preprint arXiv:1905.01969 (2019).
[7] Cen, Yukuo, et al. "Controllable multi-interest framework for recommendation." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.
[8] Tang, Hongyin, et al. "Improving document representations by generating pseudo query embeddings for dense retrieval." arXiv preprint arXiv:2105.03599 (2021).
[9] Whang, Taesun, et al. "An effective domain adaptive post-training method for bert in response selection." arXiv preprint arXiv:1908.04812 (2019).
[10] Mehri, Shikib, et al. "Pretraining methods for dialog context representation learning." arXiv preprint arXiv:1906.00414 (2019).
[11] Xu, Ruijian, et al. "Learning an effective context-response matching model with self-supervised tasks for retrieval-based dialogues." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 16. 2021.
[12] Li, Junlong, et al. "Task-specific objectives of pre-trained language models for dialogue adaptation." arXiv preprint arXiv:2009.04984 (2020).
[13] Qiu, Yao, et al. "Challenging Instances are Worth Learning: Generating Valuable Negative Samples for Response Selection Training." arXiv preprint arXiv:2109.06538 (2021).
[14] Lin, Zibo, et al. "The world is not binary: Learning to rank with grayscale data for dialogue response selection." arXiv preprint arXiv:2004.02421 (2020).
[15] Burges, Chris, et al. "Learning to rank using gradient descent." Proceedings of the 22nd international conference on Machine learning. 2005.
[16] Zhang, Wentao, Shuang Xu, and Haoran Huang. "Two-Level Supervised Contrastive Learning for Response Selection in Multi-Turn Dialogue." arXiv preprint arXiv:2203.00793 (2022).
[17] Li, Yuntao, et al. "Small Changes Make Big Differences: Improving Multi-turn Response Selection in Dialogue Systems via Fine-Grained Contrastive Learning." arXiv preprint arXiv:2111.10154 (2021).
[18] Wu, Lijun, et al. "R-drop: Regularized dropout for neural networks." Advances in Neural Information Processing Systems 34 (2021): 10890-10905.
[19] Karpukhin, Vladimir, et al. "Dense Passage Retrieval for Open-Domain Question Answering." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
[20] Li, Jiwei, et al. "A Persona-Based Neural Conversation Model." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016.