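Starting from the problems above, we need a generative rewriting model that does not depend on candidates, so we consider using a deep semantic translation model, NMT (neural machine translation), to address them. At the end of 2016, Google published its neural machine translation system (GNMT) [19], announcing that NMT had outperformed phrase-based statistical machine translation (SMT), the line of work that began with IBM's machine translation models around 1989. This leap was driven by the end-to-end Sequence to Sequence (Seq2Seq) model equipped with the Attention mechanism [20]. In practice, however, we found two problems with the rewrites that NMT generates: some are semantically ill-formed (rare or disfluent), and some drift semantically from the original query. As a result, only a small share of the rewrites newly added online are effective, and some lead to severe drift cases. Introducing NMT for rewriting therefore requires optimizing for these two problems in the search setting, with the goal of generating rewrites that show no intent drift and that have a real effect on recall.
Based on this analysis, our general direction is to introduce environmental signals that guide NMT toward higher-quality rewrites, which led us to investigate reinforcement learning. Reinforcement learning is a loop in which an Agent takes an Action, changes its State, receives a Reward, and thereby interacts with the Environment. We want to borrow this idea: the pre-trained NMT rewriting model serves as the Agent, and during the reinforcement learning iterations the rewrites it generates (Action) pass through the search system (Environment), where the resulting impressions and clicks (Reward) guide the optimization of the NMT model parameters (State).
After further investigation, we drew on Google's QA system [21] and the work from Zhihu [22]: both use reinforcement learning with the search system as the Environment and the rewriting model as the Agent, so that the quality of the overall search results is taken into account. In Meituan's scenarios, however, ranking depends strongly on location, user, and other ranking factors, so the feedback mechanism of treating the whole search system as the Environment and rewarding rewrites whose recalled results rank higher cannot be borrowed directly. Requesting online ranking during training would also slow training down and raise a series of engineering problems. Given how NMT actually behaves, we chose to prioritize the semantic similarity of the generated rewrites and to use search recall logs combined with a BERT semantic discrimination model as the Environment. The objective combines two signals: the overlap between the merchant sets that the original query and the rewrite reach through the search system, and their natural-language semantic similarity. The overall framework is shown in the figure below: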
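As a rough illustration of the reward described above, the following Python sketch blends merchant-set overlap with a semantic similarity score. It is a minimal sketch under stated assumptions, not the production implementation: the names `merchant_overlap`, `rewrite_reward`, `advantages`, the `recall` and `semantic_sim` callables, and the weight `alpha` are all hypothetical, and Jaccard similarity is only one possible reading of "overlap" between merchant sets.

```python
from typing import Callable, Iterable, List, Set


def merchant_overlap(original: Set[str], rewrite: Set[str]) -> float:
    """Jaccard overlap between two sets of recalled merchant IDs."""
    union = original | rewrite
    if not union:
        # Neither query recalled anything, so there is no evidence of overlap.
        return 0.0
    return len(original & rewrite) / len(union)


def rewrite_reward(
    query: str,
    rewrite: str,
    recall: Callable[[str], Iterable[str]],
    semantic_sim: Callable[[str, str], float],
    alpha: float = 0.5,
) -> float:
    """Blend recall overlap and semantic similarity into one scalar reward.

    Assumptions of this sketch: `recall` stands in for a lookup over search
    recall logs, `semantic_sim` stands in for a BERT-style discriminator
    that returns a score in [0, 1], and `alpha` is an illustrative weight.
    """
    overlap = merchant_overlap(set(recall(query)), set(recall(rewrite)))
    return alpha * overlap + (1.0 - alpha) * semantic_sim(query, rewrite)


def advantages(rewards: List[float]) -> List[float]:
    """Subtract the batch mean as a baseline, as in REINFORCE-style updates."""
    if not rewards:
        return []
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]
```

In a policy-gradient setup, each sampled rewrite's log-likelihood under the NMT model would typically be weighted by its advantage, so that rewrites scoring above the batch average are reinforced. Computing the reward from logs rather than live ranking requests matches the stated motivation of avoiding slow online calls during training.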
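This article describes our exploration and practical experience with query rewriting in Meituan's scenarios. On the topic of vertical-domain search recall, we combined real business scenarios and user needs to explore frontier techniques such as semantic discrimination models, semantic retrieval models, and graph models, and we accumulated phrase-association knowledge data for the local life services domain. For offline data, we introduced mining methods from several technical angles, including rule-based strategies, statistical translation, graph methods, and Embedding, and summarized the motivation, results, strengths, and weaknesses of each method in practice. For online models, given the structured-retrieval nature of vertical-domain search, we designed four online schemes: high-precision dictionary rewriting; fairly high-precision model rewriting (based on an SMT statistical translation model and an XGBoost ranking model); an NMT model optimized with reinforcement learning to cover long-tail queries; and vectorized recall for merchant search. Currently, about 73% of search traffic in the Meituan App and about 67% of search traffic in the Dianping App has rewrites applied. The query rewriting capability and service platform we built support in-channel search for the various business channels as well as the search advertising platform, and have delivered solid gains. At peak, the query rewriting service cluster now handles 60,000 QPS (Queries Per Second). We will continue to invest in research and development to increase our technical influence within the company and across the industry. How to better connect users with the services, merchants, and goods on the platform is a problem that requires long-term investment from many angles. We may iterate in the following directions: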
[2] Antonellis, Ioannis, Hector Garcia-Molina, and Chi-Chao Chang. "Simrank++ query rewriting through link analysis of the clickgraph (poster)." Proceedings of the 17th international conference on World Wide Web. 2008.
[3] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "Deepwalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014.
[4] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
[5] Grbovic, Mihajlo, et al. "Context-and content-aware embeddings for query rewriting in sponsored search." Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. 2015.
[6] Djuric, Nemanja, et al. "Hierarchical neural language models for joint representation of streaming documents and their content." Proceedings of the 24th international conference on world wide web. 2015.
[7] Shen, Yelong, et al. "Learning semantic representations using convolutional neural networks for web search." Proceedings of the 23rd international conference on world wide web. 2014.
[8] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
[9] Reimers, Nils, and Iryna Gurevych. "Sentence-bert: Sentence embeddings using siamese bert-networks." arXiv preprint arXiv:1908.10084 (2019).
[10] Gao, Tianyu, Xingcheng Yao, and Danqi Chen. "SimCSE: Simple Contrastive Learning of Sentence Embeddings." arXiv preprint arXiv:2104.08821 (2021).
[11] Johnson, Jeff, Matthijs Douze, and Hervé Jégou. "Billion-scale similarity search with gpus." IEEE Transactions on Big Data (2019).
[13] Liang, X., Wu, L., Li, J., et al. "R-Drop: Regularized Dropout for Neural Networks." arXiv preprint arXiv:2106.14448 (2021).
[14] Xu, Runxin, et al. "Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning." arXiv preprint arXiv:2109.05687 (2021).
[15] Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).
[17] Yin, Dawei, et al. "Ranking relevance in yahoo search." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
[18] He, Yunlong, et al. "Learning to rewrite queries." Proceedings of the 25th ACM International on Conference on Information and Knowledge Ma.
[19] Wu, Yonghui, et al. "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv preprint arXiv:1609.08144 (2016).
[20] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017): 5998-6008.
[21] Buck, Christian, et al. "Ask the right questions: Active question reformulation with reinforcement learning." arXiv preprint arXiv:1705.07830 (2017).
[23] Huang, Jui-Ting, et al. "Embedding-based retrieval in facebook search." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.
[24] Li, Sen, et al. "Embedding-based Product Retrieval in Taobao Search." arXiv preprint arXiv:2106.09297 (2021).
[25] Zhang, Han, et al. "Towards personalized and semantic retrieval: An end-to-end solution for e-commerce search via embedding learning." Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020.
[26] Yu, Lantao, et al. "Seqgan: Sequence generative adversarial nets with policy gradient." Proceedings of the AAAI conference on artificial intelligence. Vol. 31. No. 1. 2017.
[27] Luo, Xusheng, et al. "AliCoCo: Alibaba e-commerce cognitive concept net." Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2020.