Robustness of Semantic Matching Models

1. Robustness of Semantic Matching Models
    - Tang Meng (唐萌), Senior Engineer, Tencent
    - Zhang Qi (张奇), NLP Tech Lead, Tencent

2. 0 Natural language processing algorithms of all kinds are advancing rapidly and even surpass humans on many tasks

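4. 0 Yet in real applications these algorithms fall short of expectations
    - Q: "Handy tips for a child's cold and runny nose"
    - Q: "How to treat a child's cold and runny nose"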

5. 0 The research community has also noticed this problem: models are highly sensitive to small changes in the test data
    Ebrahimi et al., HotFlip: White-Box Adversarial Examples for Text Classification, 2018.

6. 0 The research community has also noticed this problem: models are highly sensitive to small changes in the test data
    - Most MRC benchmarks evaluate models only on "in-domain" test sets and add no perturbations at test time to assess robustness.
    Si et al., Benchmarking Robustness of Machine Reading Comprehension Models, ACL 2021

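9. 0 Why does this happen?
    - Question 1: Why does offline evaluation fail to reveal these problems?
    - Question 2: Why do deep neural network models have these problems?
    - Question 3: How can the robustness of deployed semantic matching algorithms be improved?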

11. 1 Datasets contain biases: WINOGRANDE (AAAI 2020 Best Paper)
    WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale
    - Winograd Schema Challenge (WSC)
      - "The trophy doesn’t fit into the brown suitcase because it’s too large." (trophy / suitcase)
      - "The trophy doesn’t fit into the brown suitcase because it’s too small." (trophy / suitcase)
    - RoBERTa large achieves 91.3% accuracy on a variant of the WSC dataset.
    - Have neural language models successfully acquired commonsense, or are we overestimating the true capabilities of machine commonsense?
    Sakaguchi et al., WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale, AAAI 2020.

12. 1 Datasets contain biases: WINOGRANDE (AAAI 2020 Best Paper)
    - Language-based bias

13. 1 Datasets contain biases: WINOGRANDE (AAAI 2020 Best Paper)
    - Removing dataset-specific biases in the WINOGRANDE dataset
      1. Crowdsourcing design
      2. Algorithmic bias reduction with AFLITE, which obtains dense representations of instances from precomputed neural network embeddings instead of manually identified lexical features:
         - RoBERTa is fine-tuned on a small subset of the dataset.
         - An ensemble of linear classifiers (logistic regressions) is used.
         - The classifiers are trained on random subsets of the data.
         - They determine whether the representation is strongly indicative of the correct answer option.
         - The corresponding instances are discarded.
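
The filtering idea behind AFLITE can be illustrated with a small, simplified sketch. It assumes the embeddings are already computed (here they are made-up 2-d vectors), uses a bias-free logistic regression written from scratch, and runs a single filtering round, whereas the published algorithm iterates. The function names, the threshold of 0.75, and the toy data are all illustrative assumptions, not values from the talk.

```python
import math
import random

def train_logreg(X, y, steps=50, lr=0.1):
    """Fit a bias-free logistic regression with batch gradient descent."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def aflite_round(embeddings, labels, n_trials=30, train_frac=0.5,
                 threshold=0.75, seed=0):
    """One simplified filtering round; returns the indices that are kept."""
    rng = random.Random(seed)
    n = len(embeddings)
    correct, seen = [0] * n, [0] * n
    for _ in range(n_trials):
        # Train a linear classifier on a random subset of the data.
        train_idx = set(rng.sample(range(n), int(n * train_frac)))
        w = train_logreg([embeddings[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        # Score every held-out instance with that classifier.
        for i in range(n):
            if i in train_idx:
                continue
            z = sum(wj * xj for wj, xj in zip(w, embeddings[i]))
            seen[i] += 1
            correct[i] += int((1 if z > 0 else 0) == labels[i])
    # Discard instances whose embedding predicts the label too reliably.
    kept = []
    for i in range(n):
        predictability = correct[i] / seen[i] if seen[i] else 0.0
        if predictability <= threshold:
            kept.append(i)
    return kept

# Toy data (assumed): 20 "biased" instances whose first feature gives away
# the label, then 20 instances whose embedding carries no label information.
biased = [([3.0, 0.0], 1) if i % 2 else ([-3.0, 0.0], 0) for i in range(20)]
neutral = [([0.0, 1.0 if i % 4 < 2 else -1.0], i % 2) for i in range(20)]
data = biased + neutral
kept = aflite_round([x for x, _ in data], [y for _, y in data])
```

In this toy setup the biased instances (indices 0 to 19) are predicted correctly whenever they are held out, so the round discards them; what remains is a harder subset on which a shortcut-reliant model would score much lower.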

15. 1 Datasets contain biases: WINOGRANDE (AAAI 2020 Best Paper)
    - Figure labels: "used as the validation and test sets"; "used as the training set"

16. 1 Dataset sampling strongly affects model training and testing: Contrast Sets
    Evaluating Models’ Local Decision Boundaries via Contrast Sets
    - (a) A two-dimensional dataset that requires a complex decision boundary to achieve high accuracy.
    - (b) If the same data distribution is instead sampled with systematic gaps (e.g., due to annotator bias), a simple decision boundary can perform well on i.i.d. test data (shown outlined in pink).
    - (c) Since filling in all gaps in the distribution is infeasible, a contrast set instead fills in a local ball around a test instance to evaluate the model’s decision boundary.
    Gardner et al., Evaluating Models’ Local Decision Boundaries via Contrast Sets, EMNLP 2020

17. 1 Dataset sampling strongly affects model training and testing: Contrast Sets
    - A stricter protocol for constructing datasets: the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets.
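
One way to score a model on contrast sets is contrast consistency: the fraction of sets on which the model gets every example right, the original instance and all of its perturbations. The sketch below is a minimal illustration; the keyword-based `toy_predict` function and the two example sets are made up for the demonstration.

```python
def contrast_consistency(contrast_sets, predict):
    """Fraction of contrast sets on which every example is predicted correctly."""
    if not contrast_sets:
        return 0.0
    consistent = sum(
        all(predict(text) == gold for text, gold in examples)
        for examples in contrast_sets
    )
    return consistent / len(contrast_sets)

def toy_predict(text):
    # Hypothetical shortcut "model": looks only for the word "great".
    return "positive" if "great" in text else "negative"

# Each contrast set holds an original instance and a perturbed one.
sets = [
    [("The food was great.", "positive"), ("The food was not great.", "negative")],
    [("A great film.", "positive"), ("A dull film.", "negative")],
]

examples = [ex for s in sets for ex in s]
accuracy = sum(toy_predict(t) == g for t, g in examples) / len(examples)
consistency = contrast_consistency(sets, toy_predict)
```

Here per-example accuracy is 0.75 while consistency is only 0.5, because the shortcut fails on the negated sentence. The gap between the two numbers is what the perturbations are designed to expose.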

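19. 1 Question 1: Why can't the paradigm of benchmark test sets plus common evaluation metrics reveal these problems?
    - Benchmark sets usually carry data bias from the way they are built
      1. Remove the dataset bias
      2. Add manual perturbations tailored to the task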

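20. 1 Question 1: Why can't the paradigm of benchmark test sets plus common evaluation metrics reveal these problems?
    Bias-removal experiment on a semantic matching model
    - Total data: 300k
    - Fine-tuning data: 100k (90k + 10k)
    - AFLITE input data: 200k
    - Data remaining after AFLITE filtering: 26k (20k + 6k)

    | Experiment | Training set | Training size | Test set | Test size | Accuracy |
    |---|---|---|---|---|---|
    | 1 | Random | 90k | Random | 20k | 90.6% |
    | 2 | Random | 90k | AFLITE-filtered | 20k | 31.9% |
    | 3 | Random | 200k − 6k | AFLITE-filtered | 6k | 40.7% |
    | 4 | AFLITE-filtered | 20k | AFLITE-filtered | 6k | 61.0% |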

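21. 1 Question 1: Why can't the paradigm of benchmark test sets plus common evaluation metrics reveal these problems?
    Adding manual perturbations tailored to the task

    | Construction method | Original query | Original title | Perturbed query | Label | Label after perturbation |
    |---|---|---|---|---|---|
    | Delete | Treatment of heart disease | How should heart disease be treated | Heart disease | 1 | 0 |
    | Add | Treatment of heart disease | How should heart disease be treated | Treatment of heart disease and hypertension; Treatment of congenital heart disease; Treatment of heart disease in young people | 1 | 0 |
    | Replace | Treatment of heart disease | How should heart disease be treated | Treatment of heart failure | 1 | 0 |
    | Based on semantic vectors | What to do about small red spots on the body | What to do about small red spots on the skin | How to treat many small red spots on the body | 1 | 1 |
    | Based on medical semantic classification | What to do about small red spots on the body | What to do about small red spots on the skin | What to do about a red rash on the body | 1 | 1 |
    | Based on medical knowledge | What to do about small red spots on the body | What to do about small red spots on the skin | What medicine to apply for small red spots on the body | 1 | 0 |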
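
The three rule-based constructions (delete, add, replace) can be sketched as simple string templates. This is only an illustration under strong assumptions: the query is taken to have the form "<intent> of <keyword>", and the `perturb` helper, the `Perturbed` record, and the modifier and replacement lists are hypothetical. Each variant is paired with the original title and receives label 0, matching the label flips in the slide's examples.

```python
from dataclasses import dataclass

@dataclass
class Perturbed:
    method: str
    query: str
    label: int  # label of the pair (perturbed query, original title)

def perturb(keyword, intent, modifiers, replacements):
    """Build rule-based variants of the query '<intent> of <keyword>'."""
    # Delete: drop the intent words and keep only the keyword.
    variants = [Perturbed("delete", keyword, 0)]
    # Add: narrow the keyword with an extra modifier.
    for m in modifiers:
        variants.append(Perturbed("add", f"{intent} of {m} {keyword}", 0))
    # Replace: swap the keyword for a related but different entity.
    for r in replacements:
        variants.append(Perturbed("replace", f"{intent} of {r}", 0))
    return variants

variants = perturb("heart disease", "treatment",
                   modifiers=["congenital"], replacements=["heart failure"])
for v in variants:
    print(v.method, "|", v.query, "|", v.label)
```

The last three constructions in the slide (semantic vectors, medical semantic classification, medical knowledge) need external models or knowledge bases and are not covered by this sketch.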

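22. 0 Why does this happen?
    - Question 1: Why does offline evaluation fail to reveal these problems?
    - Question 2: Why do deep neural network models have these problems?
    - Question 3: How can the robustness of deployed semantic matching algorithms be improved?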

23. 2 Attention heads in BERT learn rich high-level linguistic features
    What Does BERT Look At? An Analysis of BERT’s Attention
    - Attention heads exhibiting patterns: BERT’s attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence.
    - Attention heads corresponding to linguistic phenomena: certain attention heads correspond well to linguistic notions of syntax and coreference.
    - An attention-based probing classifier demonstrated that substantial syntactic information could be captured in BERT’s attention.
    Clark et al., What Does BERT Look At? An Analysis of BERT’s Attention, ACL 2019

24. 2 The Integrated Gradients attribution method
    - Integrated Gradients (IG) (Sundararajan et al., 2017) is used to isolate the question words that a deep learning system uses to produce an answer.
    - For image networks, the baseline input x' could be the black image, while for text models it could be the zero embedding vector.
    - Color coding: red = high attribution; blue = negative attribution; gray = near-zero attribution.
    Mudrakarta et al., Did the Model Understand the Question?, ACL 2018
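
To make the attribution computation concrete, here is a minimal sketch of Integrated Gradients on a toy model. It assumes a logistic model over a 3-dimensional input with an analytic gradient and a zero baseline; the weights and input are made up. IG multiplies the input-minus-baseline difference by the average gradient along the straight path from the baseline to the input, approximated here with a midpoint Riemann sum.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w):
    """Toy model: logistic function of a linear score."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def grad(x, w):
    """Analytic gradient of model(x, w) with respect to x."""
    p = model(x, w)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint Riemann-sum approximation of IG along the straight path."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point, w))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

w = [2.0, -1.0, 0.0]         # assumed weights
x = [1.0, 1.0, 1.0]          # assumed input
baseline = [0.0, 0.0, 0.0]   # zero baseline, as suggested for text embeddings
attr = integrated_gradients(x, baseline, w)
```

The first feature receives a positive attribution, the second a negative one, and the third exactly zero because its weight is zero. The attributions sum (up to discretization error) to `model(x, w) - model(baseline, w)`, the completeness property of IG.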

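25. 2 Attribution in practice
    - Model: RoBERTa-based semantic matching between user queries and articles
    - User query: "Can dyshidrotic eczema on a child's hands and feet heal on its own?"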

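26. 2 Attribution in practice
    - Model: RoBERTa-based semantic matching between queries and articles
    - User query: "What foods are good to eat for colitis?"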

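27. 2 Attribution in practice
    - Model: RoBERTa-based semantic matching between queries and articles
    - User query: "How to use salbutamol sulfate inhalation aerosol"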

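28. 2 Question 2: What have deep neural network models actually learned?
    - Pre-trained language models provide high-level linguistic features such as syntax.
    - Pre-trained language models learn similar representations for some paraphrases.
    - High-level linguistic features combine with surface-level lexical features to provide representations for downstream tasks.
    - Feature extraction plus feature combination covers the basic features that people construct by hand, as well as higher-order feature combinations that are hard to construct by hand.
    - Very strong data-fitting and generalization ability: provided the training and test sets are independent and identically distributed, the model generalizes comparatively well.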

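29. 0 Why does this happen?
    - Question 1: Why does offline evaluation fail to reveal these problems?
    - Question 2: Why do deep neural network models have these problems?
    - Question 3: How can the robustness of deployed semantic matching algorithms be improved?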

30. 3 Semantic matching model: the common approach
    - Step 1: pre-trained language models
    - Step 2: a classification objective for fine-tuning, which performs text comparison directly by processing each word uniformly

31. 3 Semantic matching model: keyword matching and intent matching

33. 3 Semantic matching model: basic modules
    - Text Semantic Matching using PLMs
    - Keyword and Intent Identification with Distant Supervision
    - Divide-and-Conquer Matching Strategy

34. 3 Semantic matching model: Text Semantic Matching using PLMs
    $$h_{cls};\, H^{a,b} = \mathrm{PLM}([w_{cls};\, S^{a,b}])$$
    $$P(y \mid S^a, S^b) = \mathrm{Softmax}(h_{cls} * W^{T})$$
    $$L_{sm} = -\log P(y \mid S^a, S^b)$$
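
The classification head on top of the PLM can be illustrated numerically. The sketch below assumes the `[CLS]` vector has already been produced by the encoder; the 3-dimensional `h_cls`, the 2-class weight matrix `W`, and the function name `matching_loss` are made-up stand-ins, not values or names from the talk.

```python
import math

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def matching_loss(h_cls, W, y):
    """Return P(y | S^a, S^b) = Softmax(h_cls W^T) and L_sm = -log P[y]."""
    logits = [sum(wj * hj for wj, hj in zip(row, h_cls)) for row in W]
    probs = softmax(logits)
    return probs, -math.log(probs[y])

h_cls = [0.5, -1.0, 2.0]                    # assumed [CLS] representation
W = [[0.1, 0.2, -0.3], [0.0, -0.5, 0.4]]    # assumed weights: 2 classes x 3 dims
probs, loss = matching_loss(h_cls, W, y=1)
```

With these numbers the logit for class 1 is larger than for class 0, so the loss for the gold label `y=1` is small; fine-tuning lowers this loss by updating both `W` and the encoder that produces `h_cls`.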

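35. 3 Semantic matching model: Keyword and Intent Identification with Distant Supervision
    - Distant supervision: based on an external knowledge base
    - Auxiliary training objective:
      $$L_{ds} = -\left[\log \sigma\left(\bar{h}_k^{a,b} W_{ds}^{T}\right) + \log \sigma\left(-\bar{h}_i^{a,b} W_{ds}^{T}\right)\right]$$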
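
A small numeric sketch of the auxiliary objective follows. It assumes that the barred vectors are mean-pooled representations of the keyword tokens and of the intent tokens, and that the keyword/intent tags come from the external knowledge base; the token vectors, the tag list, and the weight vector `w_ds` below are made up for illustration.

```python
import math

def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def distant_supervision_loss(token_vecs, is_keyword, w_ds):
    """L_ds = -[log sigma(h_k . w_ds) + log sigma(-h_i . w_ds)]."""
    h_k = mean_pool([v for v, k in zip(token_vecs, is_keyword) if k])
    h_i = mean_pool([v for v, k in zip(token_vecs, is_keyword) if not k])
    return -(log_sigmoid(dot(h_k, w_ds)) + log_sigmoid(-dot(h_i, w_ds)))

token_vecs = [[1.0, 0.0], [3.0, 0.0], [0.0, -2.0]]  # assumed token vectors
is_keyword = [True, True, False]                     # assumed distant labels
loss_good = distant_supervision_loss(token_vecs, is_keyword, [1.0, 1.0])
loss_bad = distant_supervision_loss(token_vecs, is_keyword, [-1.0, -1.0])
```

The loss is low when the projection scores keyword tokens positively and intent tokens negatively (`loss_good`), and high when the signs are reversed (`loss_bad`), which pushes the encoder to separate the two kinds of tokens.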

36. 3 Semantic matching model: Divide-and-Conquer Matching Strategy
    - Divide: split matching into two sub-problems
      $$P(y_k \mid S_k^a, S_k^b) \quad \text{(intent words masked)}$$
      $$P(y_i \mid S_i^a, S_i^b) \quad \text{(keywords masked)}$$
    - Combine the solutions to the sub-problems
      $$Q(y_k, y_i \mid S^a, S^b) = P(y_k \mid S_k^a, S_k^b)\, P(y_i \mid S_i^a, S_i^b)$$
      $$Q(y = y_n) = Q(y_k = y_n, y_i = y_n) + \sum_{y_m > y_n} Q(y_k = y_n, y_i = y_m) + \sum_{y_m > y_n} Q(y_k = y_m, y_i = y_n)$$
    - Associate: makes global matching more interpretable
      $$L_{kl} = D_{KL}\left[P(y \mid S^a, S^b) \,\|\, Q(y \mid S^a, S^b)\right]$$
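
The combination step can be checked with a few lines of arithmetic. The sketch below assumes labels are integers ordered by matching degree, so that the combined label is the lower of the keyword label and the intent label; this ordering is an assumption of the example. The probabilities are made up.

```python
import math

def combine(p_k, p_i):
    """Q(y = n): one sub-label equals n and the other is at least n."""
    q = [0.0] * len(p_k)
    for a, pa in enumerate(p_k):
        for b, pb in enumerate(p_i):
            q[min(a, b)] += pa * pb
    return q

def kl(p, q):
    """KL divergence D_KL[p || q] for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_k = [0.1, 0.9]       # assumed keyword sub-problem: P(mismatch), P(match)
p_i = [0.4, 0.6]       # assumed intent sub-problem
p_global = [0.3, 0.7]  # assumed prediction of the global matching model

q = combine(p_k, p_i)
l_kl = kl(p_global, q)
```

Here `q` is approximately `[0.46, 0.54]`: the pair counts as a match only when both keywords and intents match (0.9 × 0.6). Minimizing `l_kl` pulls the global prediction toward this composed distribution.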

37. 3 Semantic matching model: Experiment

38. 3 Semantic matching model: Robust Evaluation

39. 4 Other directions for improving model robustness: the forward pass and loss stage
    - Multi-Sample Dropout and R-Drop
    Hiroshi Inoue. Multi-Sample Dropout for Accelerated Training and Better Generalization
    Xiaobo Liang et al. R-Drop: Regularized Dropout for Neural Networks
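
As an illustration of the R-Drop idea, the sketch below computes the loss for a single example given the logits of two forward passes through the same network with different dropout masks. The logits are supplied directly as made-up numbers rather than produced by a real network, and `alpha` is an assumed weighting hyperparameter.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def r_drop_loss(logits1, logits2, y, alpha=1.0):
    """Negative log-likelihood of both passes plus a symmetric KL penalty."""
    p1, p2 = softmax(logits1), softmax(logits2)
    nll = -math.log(p1[y]) - math.log(p2[y])
    sym_kl = 0.5 * (kl(p1, p2) + kl(p2, p1))
    return nll + alpha * sym_kl

# Two passes that disagree because of different dropout masks (assumed values).
with_penalty = r_drop_loss([1.0, 2.0], [0.2, 2.5], y=1, alpha=1.0)
without_penalty = r_drop_loss([1.0, 2.0], [0.2, 2.5], y=1, alpha=0.0)
identical = r_drop_loss([1.0, 2.0], [1.0, 2.0], y=1)
```

When the two passes agree exactly, the penalty vanishes and the loss reduces to twice the ordinary cross-entropy; when they disagree, the symmetric KL term adds a cost, which regularizes the network toward dropout-invariant predictions.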

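40. 4 Other tricks for improving model robustness: the backpropagation stage
    - Adversarial training and Child-Tuning
      $$\min_{\theta} \mathbb{E}_{(x,y)\sim D}\left[\max_{r_{adv} \in S} L(\theta, x + r_{adv}, y)\right]$$
    Runxin Xu et al., Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning. 2021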
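
The inner maximization is usually approximated rather than solved exactly. The sketch below uses one common approximation, a single gradient step scaled to a fixed L2 norm (the FGM-style perturbation); the slide does not say which variant the authors use, so this choice, the toy logistic model, and the value of `epsilon` are assumptions for illustration.

```python
import math

def loss_and_input_grad(x, y, w):
    """Logistic loss for label y in {0, 1} and its gradient with respect to x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
    return loss, [(p - y) * wi for wi in w]

def fgm_perturbation(grad, epsilon=0.5):
    """r_adv = epsilon * g / ||g||: step along the loss gradient in input space."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]

w = [1.0, -2.0]           # assumed model parameters (theta)
x, y = [0.5, 0.2], 1      # assumed input embedding and label

clean_loss, g = loss_and_input_grad(x, y, w)
r_adv = fgm_perturbation(g)
x_adv = [xi + ri for xi, ri in zip(x, r_adv)]
adv_loss, _ = loss_and_input_grad(x_adv, y, w)
```

The perturbed input yields a higher loss than the clean one. In training, the gradients of the clean and adversarial losses with respect to the parameters would be accumulated before the optimizer step, which is the outer minimization.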

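41. 4 Improving the robustness of NLP algorithms is a systems-engineering effort: every stage affects model robustness
    - Stages in the figure: task modeling and four others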

42. Thanks