Algorithm Practice of Meituan's Intelligent Content Distribution
1. Meituan Tech Salon No. 080: Algorithm Practice of Meituan's Intelligent Content Distribution
Aligning Vision and Language on Temporal Dimension:
Cross-modal Video Content Retrieval
Video-Text Cross-Modal Temporal Alignment and Content Retrieval
Xing Xu, Research Fellow
University of Electronic Science and Technology of China
Center for Future Media & School of Computer Science and Engineering
December 23, 2023
2. Introduction
Massive untrimmed videos are widely spread in our daily life.
Untrimmed videos are usually redundant, but contain rich and informative fragments.
Using natural language text to retrieve key information from untrimmed videos is a
promising solution.
(Examples: videos on the Internet, mobile video apps, video surveillance)
3. Introduction
Applications of Cross-Modal Video Content Retrieval
Searching for related video content with a given text query over a single long-term video or video galleries.
The retrieved video content can be an entire video file or partial video clips.
(Figure: (a) cross-modal video file retrieval; (b) video content retrieval in a single video; (c) video content retrieval in a video gallery)
4. Introduction
Our joint work with Meituan on Cross-modal Video Content Retrieval
Video Content Retrieval in Single Video
• SDN: Semantic Decoupling Network for Temporal Language Grounding. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2022. (CAS JCR Q1 journal)
• GTLR: Graph-based Transformer with Language Reconstruction for Video Paragraph Grounding. IEEE International Conference on Multimedia and Expo (ICME), 2022. (Best Student Paper Award)
• Semi-supervised Video Paragraph Grounding with Contrastive Encoder. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. (CCF-A)
• Faster Video Moment Retrieval with Point-Level Supervision. ACM International Conference on Multimedia (ACM MM), 2023. (CCF-A)
Video Content Retrieval in Video Gallery
• Joint Searching and Grounding: Multi-Granularity Video Content Retrieval. ACM International Conference on Multimedia (ACM MM), 2023. (CCF-A)
Code available: https://github.com/CFM-MSG
5. Introduction
Previous Methods
Proposal-based Methods: a two-stage paradigm that first generates candidate moments and then conducts cross-modal alignment.
Proposal-free Methods: a one-stage paradigm that directly predicts the latent target moments (a minimal sketch of both paradigms follows the references below).
Proposal-based method: CTRL [Gao et al., ICCV 2017]
Proposal-free method: LGI [Mun et al., CVPR 2020]
[1] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. "TALL: Temporal Activity Localization via Language Query." ICCV. 2017.
[2] Jonghwan Mun, Minsu Cho, and Bohyung Han. "Local-Global Video-Text Interactions for Temporal Grounding." CVPR. 2020.
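To make the two paradigms concrete, here is a minimal Python sketch (illustrative only, not the actual pipelines of CTRL or LGI; all function names and weights are our own): the proposal-based path enumerates fixed-scale sliding-window candidates and ranks them against the query, while the proposal-free path regresses normalized start/end boundaries directly from a fused video-text feature.

```python
import numpy as np

def sliding_window_proposals(num_clips, scales=(8, 16, 32), stride_ratio=0.5):
    """Proposal-based paradigm, stage 1: enumerate fixed-scale candidate moments."""
    proposals = []
    for scale in scales:
        stride = max(1, int(scale * stride_ratio))
        for start in range(0, max(1, num_clips - scale + 1), stride):
            proposals.append((start, min(start + scale, num_clips)))
    return np.array(proposals)

def rank_proposals(clip_feats, query_feat, proposals):
    """Proposal-based paradigm, stage 2: score every candidate by the cosine
    similarity between its mean-pooled clip features and the sentence feature."""
    scores = []
    for s, e in proposals:
        seg = clip_feats[s:e].mean(axis=0)
        scores.append(seg @ query_feat /
                      (np.linalg.norm(seg) * np.linalg.norm(query_feat) + 1e-8))
    order = np.argsort(scores)[::-1]
    return proposals[order], np.asarray(scores)[order]

def regress_boundaries(fused_feat, w_start, w_end):
    """Proposal-free paradigm: two linear heads directly predict normalized
    (start, end) boundaries from a fused video-text representation."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    start, end = sigmoid(fused_feat @ w_start), sigmoid(fused_feat @ w_end)
    return min(start, end), max(start, end)

# toy usage: 128 clips with 256-d features, a 256-d query, random toy weights
rng = np.random.default_rng(0)
clips, query = rng.normal(size=(128, 256)), rng.normal(size=256)
top_moments, _ = rank_proposals(clips, query, sliding_window_proposals(128))
span = regress_boundaries(np.concatenate([clips.mean(0), query]),
                          rng.normal(size=512), rng.normal(size=512))
```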
6. Motivation
Observations
Proposal-based Methods: achieving a high recall rate through redundant candidate moments, but suffering from the fixed-scale proposals.
Proposal-free Methods: obtaining better flexibility and efficiency in temporal localization, but showing weaker recall rates.
(Figure: (a) proposal-based method; (b) proposal-free method; (c) Semantic Decoupling Network (ours))
Is there a framework that inherits their merits and overcomes their defects?
7. Our Proposed Method
Semantic Decoupling
Assumption: proposal-based methods perform well because the semantics of a video segment can be determined by its corresponding temporal scale.
By observing the proportion of valid segments in each untrimmed video, we also find that untrimmed videos contain massive amounts of irrelevant content.
(Figure: (a) semantic completeness: static objects such as "person" and "door" combine with the action "walking" into the complete event "person walks into the door to fetch his clothes"; (b) statistical results on the proportion of valid content)
We argue that richer temporal information turns static objects into actions and, eventually, into complete events. The occurrence frequency also decreases as the semantic level rises.
[3] Xun Jiang, Xing Xu*, Jingran Zhang, Fumin Shen, Zuo Cao and Heng Tao Shen, "SDN: Semantic Decoupling Network for Temporal Language Grounding", TNNLS, 2022.
8. Our Proposed Method
Contributions
We propose a novel framework, Semantic Decoupling Network (SDN), which inherits the
benefits of proposal-based and proposal-free methods and overcomes their defects.
It consists of three key components: (1) Semantic Decoupling Module (SDM), (2) Context
Modeling Block (CMB), (3) Semantic Cross-level Aggregating Module (SCAM).
9. Experiments
Datasets
ActivityNet-Caption
• More than 20k videos from YouTube
• Open domain
• Vocabulary of more than 15k words
• Long videos (117 s on average)
• Complex semantics
• Sufficient context
Charades-STA
• More than 6k videos
• Indoor domain
• Vocabulary of about 1,300 words
• Short videos (30 s on average)
• Simple semantics
• Insufficient context
10. Experiments
Overall Comparisons
We evaluate our method on three widely used datasets, i.e., Charades-STA, ActivityNet-Caption, and TACoS.
The experimental results demonstrate that our method outperformed both the proposal-based and proposal-free SOTA methods at the time.
Comparisons on TACoS
Comparisons on Charades-STA
Comparisons on ActivityNet-Caption
11. Experiments
Visualization of Retrieval Results
We visualize several event retrieval cases on both short and long videos, and compare them with the SOTA methods at the time.
Our method shows clear superiority in localizing events precisely (like proposal-free methods) and in understanding global contextual information (like proposal-based methods).
Four test cases from the Charades-STA (subfigures a, b) and TACoS (subfigures c, d) datasets.
12. Rethinking
Video Sentence Grounding (VSG)
Almost all previous methods study the task of localizing a single event in an untrimmed video given a natural-language sentence query.
(Figure: single-multi localization. Given an untrimmed video containing multiple events and a sentence query describing a single event, e.g., "Two young girls are standing in the kitchen preparing to cook.", methods such as CTRL [Gao et al., 2017], MCN [Hendricks et al., 2017], and LGI [Mun et al., 2020] localize the single target moment, here [0 s, 15.67 s].)
13. Rethinking
Video Paragraph Grounding
A new task that aims to localize multiple moments in an untrimmed video given a natural-language paragraph query (multiple sentence queries), as illustrated below.
(Figure: multi-multi localization, a setting not fully explored. Given an untrimmed video containing multiple events and a paragraph query describing multiple events, e.g., "Two young girls are standing in the kitchen preparing to cook. They then open a box of brownies … get an egg out of the fridge. After, the two continue to stir the contents …", the goal is to localize all target moments, here [0 s, 15.67 s], [15.67 s, 28.73 s], and [50.12 s, 57.59 s].)
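To make the multi-multi setting concrete, the sketch below shows what a single training sample could look like as a data structure. The class and field names are hypothetical; the sentences and timestamps are taken from the example in the figure above.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VPGSample:
    """One Video Paragraph Grounding sample: a paragraph query is grounded to
    one temporal segment per sentence (multi-multi localization)."""
    video_id: str
    sentences: List[str]                # ordered sentences of the paragraph query
    moments: List[Tuple[float, float]]  # (start_sec, end_sec) for each sentence

sample = VPGSample(
    video_id="v_example",
    sentences=[
        "Two young girls are standing in the kitchen preparing to cook.",
        "They then open a box of brownies and get an egg out of the fridge.",
        "After, the two continue to stir the contents.",
    ],
    moments=[(0.0, 15.67), (15.67, 28.73), (50.12, 57.59)],
)
assert len(sample.sentences) == len(sample.moments)
```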
14. Rethinking
Multi-Multi Localization: Chance and Challenge
Paragraph queries bring more contextual information, but they also increase the difficulty of multimodal representation and alignment.
15. Our Proposed Method
Graph-based Transformer with Language Reconstruction (GTLR)
A. Multimodal Graph Encoder, which conducts message passing among fine-grained features (a toy sketch of this message passing follows this slide's reference)
B. Event-wise Decoder, which explores contexts and achieves multi-multi localization
C. Language Reconstructor, which aligns events with their corresponding sentences
(Figure: GTLR architecture. Clip-level features of the video and word-level and sentence-level features of the paragraph, e.g., Sentence 1: "A woman is wrapping a box.", are concatenated with position embeddings and fed to (A) the Multimodal Graph Encoder, (B) the Event-wise Decoder, and (C) the Language Reconstructor.)
[4] Xun Jiang, Xing Xu*, Jingran Zhang, Fumin Shen, Zuo Cao and Xunliang Cai, “GTLR: Graph-Based Transformer with Language
Reconstruction for Video Paragraph Grounding.” ICME. 2022. (Best Student Paper Award)
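As a rough intuition for component (A), the following toy sketch runs one round of attention-based message passing over a graph whose nodes are clip-level and word-level features. It only illustrates the general idea; it is not the authors' GTLR implementation, and the optional adjacency mask is a simplification of the actual graph construction.

```python
import torch
import torch.nn.functional as F

def cross_modal_message_passing(clip_feats, word_feats, adj=None):
    """One round of attention-weighted message passing over clip and word nodes.

    clip_feats: (Nc, D), word_feats: (Nw, D), adj: optional (N, N) 0/1 edge mask.
    """
    nodes = torch.cat([clip_feats, word_feats], dim=0)        # (N, D) graph nodes
    attn = nodes @ nodes.t() / nodes.size(-1) ** 0.5          # scaled dot-product scores
    if adj is not None:
        attn = attn.masked_fill(adj == 0, float("-inf"))      # keep only graph edges
    attn = F.softmax(attn, dim=-1)
    return nodes + attn @ nodes                               # aggregate neighbors + residual

# toy usage: 64 clips and 20 words, each a 256-d feature
clips, words = torch.randn(64, 256), torch.randn(20, 256)
updated_nodes = cross_modal_message_passing(clips, words)     # (84, 256)
```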
16. Experiments
Datasets
ActivityNet-Caption
• More than 20k videos from YouTube
• Open domain
• Vocabulary of more than 15k words
• Long videos (117 s on average)
• Complex semantics
• Sufficient context
Charades-STA
• More than 6k videos
• Indoor domain
• Vocabulary of about 1,300 words
• Short videos (30 s on average)
• Simple semantics
• Insufficient context
17. Experiments
Overall Comparison
Our method outperforms conventional VSG methods by a large margin.
It also achieves great improvement over the previous VPG method.
GTLR performs better with more context information in the paragraph queries.
Overall Comparison on Two Benchmark Datasets

Method   Venue      ActivityNet-Caption                    Charades-STA
                    IoU=0.3  IoU=0.5  IoU=0.7  mIoU        IoU=0.3  IoU=0.5  IoU=0.7  mIoU
CBP      ACL'20     54.3     35.76    17.8     36.85       -        36.80    18.87    -
CPNet    AAAI'21    -        40.56    21.63    40.65       -        40.32    22.47    37.36
BPNet    AAAI'21    58.98    42.07    24.69    42.11       55.46    38.25    20.51    38.03
SV-VMR   ICME'21    61.39    45.21    27.32    -           -        38.09    19.98    -
FVMR     CVPR'21    53.12    41.48    29.12    -           -        38.16    18.22    -
DepNet   AAAI'21    72.81    55.91    33.46    -           58.47    40.23    21.08    39.11
GTLR     Ours       77.49    60.80    37.73    55.20       62.96    44.31    22.80    42.37
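For reference, the "IoU=m" and "mIoU" columns in tables like the one above follow a simple protocol, sketched below under the standard assumption that "IoU=m" denotes Recall@1 at threshold m, i.e., the fraction of queries whose top-1 predicted segment overlaps the ground truth with temporal IoU of at least m.

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, threshold):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

def mean_iou(preds, gts):
    """The mIoU column: average top-1 IoU over all queries (as a percentage)."""
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)

# toy usage: two queries with predicted and ground-truth moments
preds, gts = [(0.0, 14.0), (20.0, 30.0)], [(0.0, 15.67), (15.67, 28.73)]
print(recall_at_iou(preds, gts, 0.7), mean_iou(preds, gts))
```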
18. Experiments
Qualitative Analysis
GTLR achieves a remarkable improvement in localization precision.
Language reconstruction makes it more explainable and interpretable.
As constraints for multimodal alignment, the reconstructed words show a tendency toward semantic clustering.
Visualization of two typical grounding examples by our GTLR and the counterpart DepNet [AAAI’21]
19. Visualization Demo
Cross-modal Video Content Retrieval
Our GTLR method effectively aligns the two modalities and retrieves multiple events in parallel.
This work received the ICME 2022 Best Student Paper Award.
Query: Two young girls stand in a kitchen, with one holding a bag of something and talking and the other standing on a chair. Then, the two collaborate to …
A case from the ActivityNet-Caption dataset
ICME 2022 Best Student Paper
20. Discussion
Existing Challenges for VPG
Intra-modality contextual information is not sufficiently explored in GTLR, because the model focuses on fine-grained cross-modal interactions.
Video-paragraph pairs require a much heavier annotation cost than video-sentence pairs.
(Figure: video-paragraph annotations vs. video-sentence annotations for example pairs)
21. Our Proposed Method
Semi-supervised Video-Paragraph Transformer (SVPTR)
We first build a base model, namely the Video-Paragraph TRansformer (VPTR).
Based on the fully supervised base model VPTR, we introduce the mean-teacher framework and thus achieve semi-supervised VPG (a sketch of the teacher update follows the reference below).
[5] Xun Jiang, Xing Xu*, Jingran Zhang, Fumin Shen, Zuo Cao, Heng Tao Shen, “Semi-Supervised Video Paragraph Grounding With
Contrastive Encoder.” CVPR. 2022.
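The mean-teacher part can be pictured with the standard exponential-moving-average (EMA) update below. This is a generic sketch of the mean-teacher framework rather than the authors' code; the momentum value and the Linear stand-in for the VPTR model are assumptions.

```python
import copy
import torch

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.999):
    """EMA update: the teacher's weights slowly track the student's weights."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1.0 - momentum)

# typical setup: the teacher starts as a frozen copy of the student and is used
# to produce pseudo temporal labels for the unlabeled video-paragraph pairs
student = torch.nn.Linear(256, 2)      # stand-in for the VPTR model
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
update_teacher(student, teacher)       # called once per training step
```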
22. Experiments
Overall Comparisons
We evaluate our method under fully supervised and semi-supervised settings, respectively. It shows competitive performance with less temporally annotated data.
Our proposed SVPTR method achieves remarkable improvements over both previous VSG and VPG methods.
(Tables: semi-supervised performance on ActivityNet-Caption, TACoS, and Charades-CD-OOD; fully supervised performance on TACoS; fully supervised performance on ActivityNet-Caption)
23. Experiments
Further Analysis and Visualization
The experimental results on different proportions of labeled training data show the
effectiveness of our SVPTR method.
We also visualize the attention weights of the decoders to further illustrate the process of event-wise cross-modal alignment.
Analysis on the annotated data
Visualization of attention weights in event-wise decoder
24. The Latest Works
Exploration on Model Practice: Data and Computation
Existing methods follow a multimodal fusion pipeline to retrieve video content, which leads to heavy computational consumption.
Complete temporal annotations are subjective and hard to label, thus bringing a high data cost.
We achieve a more practical retrieval paradigm with a fair trade-off among data cost, model efficiency, and retrieval accuracy!
25. Our Proposed Method
Cheaper and Faster Moment Retrieval (CFMR)
We propose a Concept-based Multimodal Alignment module that avoids the online cross-modal interactions widely used in previous methods.
We introduce point-level supervised learning to reduce the annotation cost, and Point-guided Contrastive Learning to leverage the incomplete supervision signals (see the sketch below).
[6] Xun Jiang, Zailei Zhou, Xing Xu*, Yang Yang, Guoqing Wang, Heng Tao Shen, Faster Video Moment Retrieval with Point-Level
Supervision. ACM MM. 2023.
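A point-level label provides only a single timestamp inside the target moment instead of its full boundaries. The sketch below shows one common way such a label can be turned into a soft supervision signal over clip positions; it is illustrative only and not necessarily the exact scheme used in CFMR (the Gaussian width is an arbitrary choice).

```python
import numpy as np

def point_to_prior(point_sec, video_len_sec, num_clips, sigma_ratio=0.1):
    """Spread a single annotated timestamp into a soft Gaussian prior over clips."""
    centers = (np.arange(num_clips) + 0.5) * video_len_sec / num_clips
    sigma = sigma_ratio * video_len_sec
    prior = np.exp(-0.5 * ((centers - point_sec) / sigma) ** 2)
    return prior / prior.sum()       # normalized weights, one per clip

# a 120-second video split into 32 clips, with one annotated point at t = 42.5 s
weights = point_to_prior(42.5, 120.0, 32)
```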
26. Experiments
Retrieval Accuracy
Similarly, we evaluate our proposed CFMR method on the Charades-STA,
ActivityNet-Captions, and TACoS datasets.
During training, complete temporal annotations are inaccessible; only point-level supervision is provided.
27. Experiments
Retrieval Efficiency
At inference time, the semantic reconstructor is disabled and only the two concept encoders are deployed, so no cross-modal interaction or fusion is required.
As our method bypasses cross-modal interactions, the online computation, which can only start after receiving users' queries, is reduced significantly.
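The efficiency gain follows from moving all video encoding offline. The sketch below uses placeholder encoders and toy data (not CFMR's actual modules) to show the pattern: gallery videos are embedded once and cached, and at query time only the text is encoded, so matching reduces to dot products with no per-query cross-modal fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frame_feats):
    """Placeholder offline video (concept) encoder: mean-pool frame features."""
    return frame_feats.mean(axis=0)

def encode_text(token_feats):
    """Placeholder online text (concept) encoder: mean-pool token features."""
    return token_feats.mean(axis=0)

# offline stage: encode every gallery video once and cache the embeddings
gallery = {f"video_{i}": rng.normal(size=(200, 128)) for i in range(1000)}
video_index = {vid: encode_video(frames) for vid, frames in gallery.items()}

def retrieve(query_tokens, top_k=5):
    """Online stage: encode only the query, then rank cached video embeddings."""
    q = encode_text(query_tokens)
    scores = {vid: float(v @ q) for vid, v in video_index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(retrieve(rng.normal(size=(12, 128))))
```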
28. Experiments
Further Analysis
We further analyze our method with ablation studies, retrieval efficiency on long videos of different lengths, visualizations of the multimodal representations, and performance on out-of-distribution datasets.
(Figures: ablation studies; efficiency analysis; multimodal representation; OOD performance)
29. Experiments
Further Analysis
Finally, we also conduct a qualitative analysis to comprehensively compare retrieval performance, efficiency, and annotation cost.
30. The Latest Works
Towards Applications: Retrieving Content from Video Collections
Exploring a novel fine-grained video retrieval paradigm: a unified framework for video-text retrieval and video content retrieval.
Existing fusion-based methods are NOT compatible with video-text retrieval, as it is IMPOSSIBLE to conduct cross-modal interactions with every video in a collection.
Following the text-based video retrieval setting: NO additional temporal annotations are used.
31. Multi-Granularity Video Content Retrieval
Our Solution: Joint Search and Grounding (JSG)
Retrieving, from video collections, untrimmed videos that contain multiple events using partial text queries, while synchronously retrieving the most related video events.
We design two branches to tackle this challenging problem, i.e., the Glance Branch and the Gaze Branch, which retrieve video files and events collaboratively (see the sketch below).
[7] Zhiguo Chen, Xun Jiang, Xing Xu*, Zuo Cao, Yijun Mo, Heng Tao Shen, Joint Searching and Grounding: Multi-Granularity Video Content
Retrieval. ACM MM. 2023.
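The two-branch idea can be sketched as follows: given query-to-event similarity scores, a video-level score (the "glance" view) aggregates the event scores of each video, and the event-level result (the "gaze" view) is the best-matching moment inside each retrieved video. The max-pooling aggregation and the toy scores below are illustrative assumptions, not necessarily what JSG does.

```python
import numpy as np

def multi_granularity_retrieval(event_scores, top_k=3):
    """Rank videos by their best-matching event ('glance') and return that
    event's index inside each retrieved video ('gaze').

    event_scores: dict video_id -> np.ndarray of per-event similarities.
    """
    video_scores = {vid: float(s.max()) for vid, s in event_scores.items()}
    ranked = sorted(video_scores, key=video_scores.get, reverse=True)[:top_k]
    return [(vid, int(event_scores[vid].argmax())) for vid in ranked]

# toy gallery: four videos, each with a different number of candidate events
rng = np.random.default_rng(0)
gallery_scores = {f"video_{i}": rng.random(rng.integers(3, 8)) for i in range(4)}
print(multi_granularity_retrieval(gallery_scores))
```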
32. Experiments
Evaluate on Two Subtasks
We evaluate our JSG method on both the video-level and event-level retrieval tasks, which together comprehensively reflect the multi-granularity video retrieval performance.
Note that ALL temporal annotations are INACCESSIBLE.
33. Experiments
Further Analysis
We also conduct more ablation studies and examine the retrieval performance of the two branches, i.e., the Glance Branch and the Gaze Branch, to quantitatively analyze our proposed JSG method.
34. Experiments
Further Analysis
Here we demonstrate some retrieval cases from our experimental datasets.
35. Generative LLM-Driven Video Understanding
The prosperity of LLMs and the progress of NLP research have demonstrated that data-centric AI is a reliable path to well-performing models with strong generalization.
LLM-driven pre-trained multimodal models will become a popular multimodal learning paradigm.
Video-text alignment may be an essential topic in “Video+LLM” studies.
MiniGPT-4 [Zhu et al., arXiv’23]
BLIP-2 [Li et al., arXiv’23]
LLaVA [Liu et al., arXiv’23]
[8] Deyao Zhu, et al. "MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models." arXiv. 2023.
[9] Junnan Li, et al. "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models." arXiv. 2023.
[10] Haotian Liu, et al. “Visual Instruction Tuning”. arXiv. 2023.
36. Generative LLM-Driven Video Understanding
Leveraging generative LLMs is a booming research direction in video-language studies, particularly for video understanding.
Video-ChatGPT
[11] Maaz, Muhammad, et al. "Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models." arXiv. 2023.
37. Generative LLM-Driven Video Understanding
Leveraging generative LLMs is a booming research direction in video-language studies, particularly for video understanding.
VideoChat (AskAnything)
[12] KunChang Li, et al. "VideoChat: Chat-Centric Video Understanding." arXiv. 2023.
38. Generative LLM-Driven Video Understanding
Leveraging generative LLMs is a booming research direction in video-language studies, particularly for video understanding.
Video-LLaMA: bridging video/audio models and generative LLMs with a Q-Former.
[13] Hang Zhang, et al. "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding." EMNLP. 2023.
39. Generative LLM-Driven Video Understanding
Leveraging generative LLMs is a booming research direction in video-language studies, particularly for video understanding.
Video-LLaVA: learning unified visual representations with a LanguageBind encoder before feeding them to the LLM.
[14] Bin Zhu, et al. "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment." arXiv. 2023.
40. Generative LLM-Driven Video Understanding
Leveraging generative LLMs is a booming research direction in video-language studies, particularly for video understanding.
TimeChat: an LLM-driven video-language model for time-sensitive tasks.
[15] Shuhuai Ren, et al. "TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding." arXiv. 2023.
41. Generative LLM-Driven Video Understanding
Leveraging generative LLMs is a booming research direction in video-language studies, particularly for video understanding.
GPT-4V: what can we do if we combine video retrieval with LLMs?
GPT-4 Turbo with Vision in Microsoft Azure
Otter with Apple Vision Pro
42. Discussions for Future Work
The prosperity of LLMs and the progress of NLP research have demonstrated that data-centric AI is a reliable path to well-performing models with strong generalization.
LLM-driven pre-trained multimodal models will become a popular multimodal learning paradigm.
Task-specific research will be highly competitive, and the tasks will become more and more complicated.
Potential future research may follow two directions: 1) embracing LLMs, and 2) exploring model-agnostic theoretical evidence for multimodal video learning.
VideoChat [Li et al., arXiv’23]
Video-LLaVA [Zhu et al., arXiv’23]
TimeChat [Ren et al., arXiv’23]
[12] KunChang Li, et al. "VideoChat: Chat-Centric Video Understanding." arXiv. 2023.
[14] Bin Zhu, et al. "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment." arXiv. 2023.
[15] Shuhuai Ren, et al. "TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding." arXiv. 2023.
43. Thank you! Comments and corrections are welcome.