Algorithm Practice of Meituan's Intelligent Content Distribution
1. Meituan Tech Salon No. 080: Algorithm Practice of Meituan's Intelligent Content Distribution
Aligning Vision and Language on Temporal Dimension:
Cross-modal Video Content Retrieval
Video-Text Cross-Modal Temporal Alignment and Content Retrieval
Xing Xu, Research Fellow
University of Electronic Science and Technology of China
Center for Future Media & School of Computer Science and Engineering
December 23, 2023
2. Introduction
Massive untrimmed videos are widely spread in our daily life.
Untrimmed videos are usually redundant, but contain rich and informative fragments.
Using natural language text to retrieve key information from untrimmed videos is a
promising solution.
(Examples: videos on the Internet, mobile video apps, video surveillance)
3. Introduction
Applications of Cross-Modal Video Content Retrieval
Searching for related video content with a given text query over a single long-term video or video galleries.
The retrieved video content can be an entire video file or partial video clips.
(Figure: (a) cross-modal video file retrieval; (b) video content retrieval in a single video; (c) video content retrieval in a video gallery)
4. Introduction
Our joint work with Meituan on Cross-modal Video Content Retrieval
Video Content Retrieval in Single Video
• SDN: Semantic Decoupling Network for Temporal Language Grounding. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2022. (CAS JCR Q1 journal)
• GTLR: Graph-based Transformer with Language Reconstruction for Video Paragraph Grounding. IEEE International Conference on Multimedia and Expo (ICME), 2022. (Best Student Paper Award)
• Semi-supervised Video Paragraph Grounding with Contrastive Encoder. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. (CCF-A)
• Faster Video Moment Retrieval with Point-Level Supervision. ACM International Conference on Multimedia (ACM MM), 2023. (CCF-A)
Video Content Retrieval in Video Gallery
• Joint Searching and Grounding: Multi-Granularity Video Content Retrieval. ACM International Conference on Multimedia (ACM MM), 2023. (CCF-A)
Code available: https://github.com/CFM-MSG
5. Introduction
Previous Methods
Proposal-based Methods: a two-stage paradigm that first generates candidate moments and then conducts cross-modal alignment.
Proposal-free Methods: a one-stage paradigm that directly predicts the latent target moments (a minimal sketch of both paradigms follows the references below).
Proposal-based method: CTRL [Gao et al., ICCV 2017]
Proposal-free method: LGI [Mun et al., CVPR 2020]
[1] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. "TALL: Temporal Activity Localization via Language Query." ICCV. 2017.
[2] Jonghwan Mun, Minsu Cho, and Bohyung Han. "Local-Global Video-Text Interactions for Temporal Grounding." CVPR. 2020.
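To make the two paradigms concrete, here is a minimal Python sketch (illustrative only, not the actual pipelines of CTRL or LGI; all function names and weights are our own): the proposal-based path enumerates fixed-scale sliding-window candidates and ranks them against the query, while the proposal-free path regresses normalized start/end boundaries directly from a fused video-text feature.

```python
import numpy as np

def sliding_window_proposals(num_clips, scales=(8, 16, 32), stride_ratio=0.5):
    """Proposal-based paradigm, stage 1: enumerate fixed-scale candidate moments."""
    proposals = []
    for scale in scales:
        stride = max(1, int(scale * stride_ratio))
        for start in range(0, max(1, num_clips - scale + 1), stride):
            proposals.append((start, min(start + scale, num_clips)))
    return np.array(proposals)

def rank_proposals(clip_feats, query_feat, proposals):
    """Proposal-based paradigm, stage 2: score every candidate by the cosine
    similarity between its mean-pooled clip features and the sentence feature."""
    scores = []
    for s, e in proposals:
        seg = clip_feats[s:e].mean(axis=0)
        scores.append(seg @ query_feat /
                      (np.linalg.norm(seg) * np.linalg.norm(query_feat) + 1e-8))
    order = np.argsort(scores)[::-1]
    return proposals[order], np.asarray(scores)[order]

def regress_boundaries(fused_feat, w_start, w_end):
    """Proposal-free paradigm: two linear heads directly predict normalized
    (start, end) boundaries from a fused video-text representation."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    start, end = sigmoid(fused_feat @ w_start), sigmoid(fused_feat @ w_end)
    return min(start, end), max(start, end)

# toy usage: 128 clips with 256-d features, a 256-d query, random toy weights
rng = np.random.default_rng(0)
clips, query = rng.normal(size=(128, 256)), rng.normal(size=256)
top_moments, _ = rank_proposals(clips, query, sliding_window_proposals(128))
span = regress_boundaries(np.concatenate([clips.mean(0), query]),
                          rng.normal(size=512), rng.normal(size=512))
```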
6. Motivation
Observations
Proposal-based Methods: achieving a high recall rate through redundant candidate moments, but suffering from the fixed-scale proposals.
Proposal-free Methods: obtaining better flexibility and efficiency in temporal localization, but showing weaker recall rates.
(Figure: (a) proposal-based method; (b) proposal-free method; (c) Semantic Decoupling Network (ours))
Is there a framework that inherits their merits and overcomes their defects?
7. Our Proposed Method
Semantic Decoupling
Assumption: proposal-based methods perform well because the semantics of a video segment can be determined by its corresponding temporal scale.
By observing the proportion of valid segments in each untrimmed video, we also find that untrimmed videos contain massive amounts of irrelevant content.
(Figure: (a) semantic completeness: static objects such as "person" and "door" combine with the action "walking" into the complete event "person walks into the door to fetch his clothes"; (b) statistical results on the proportion of valid content)
We argue that richer temporal information turns static objects into actions and, eventually, into complete events. The occurrence frequency also decreases as the semantic level rises.
[3] Xun Jiang, Xing Xu*, Jingran Zhang, Fumin Shen, Zuo Cao and Heng Tao Shen, "SDN: Semantic Decoupling Network for Temporal Language Grounding", TNNLS, 2022.
8. Our Proposed Method
Contributions
We propose a novel framework, Semantic Decoupling Network (SDN), which inherits the
benefits of proposal-based and proposal-free methods and overcomes their defects.
It consists of three key components: (1) Semantic Decoupling Module (SDM), (2) Context
Modeling Block (CMB), (3) Semantic Cross-level Aggregating Module (SCAM).
9. Experiments
Datasets
ActivityNet-Caption
• More than 20k videos from YouTube
• Open domain
• Vocabulary of more than 15k words
• Long videos (117 s on average)
• Complex semantics
• Sufficient context
Charades-STA
• More than 6k videos
• Indoor domain
• Vocabulary of about 1,300 words
• Short videos (30 s on average)
• Simple semantics
• Insufficient context
10. Experiments
Overall Comparisons
We evaluate our method on three widely used datasets, i.e., Charades-STA, ActivityNet-Caption, and TACoS.
The experimental results demonstrate that our method outperformed both the proposal-based and proposal-free SOTA methods at the time.
Comparisons on TACoS
Comparisons on Charades-STA
Comparisons on ActivityNet-Caption
11. Experiments
Visualization of Retrieval Results
We visualize several event retrieval cases on both short and long videos, and compare them with the SOTA methods at the time.
Our method shows clear superiority in localizing events precisely (like proposal-free methods) and in understanding global contextual information (like proposal-based methods).
Four test cases from the Charades-STA (subfigures a, b) and TACoS (subfigures c, d) datasets.
12. Rethinking
Video Sentence Grounding (VSG)
Almost all previous methods study the task of localizing a single event in an untrimmed video given a natural-language sentence query.
(Figure: single-multi localization. Given an untrimmed video containing multiple events and a sentence query describing a single event, e.g., "Two young girls are standing in the kitchen preparing to cook.", methods such as CTRL [Gao et al., 2017], MCN [Hendricks et al., 2017], and LGI [Mun et al., 2020] localize the single target moment, here [0 s, 15.67 s].)
13. Rethinking
Video Paragraph Grounding
A new task that aims to localize multiple moments in an untrimmed video given a natural-language paragraph query (multiple sentence queries), as illustrated below.
(Figure: multi-multi localization, a setting not fully explored. Given an untrimmed video containing multiple events and a paragraph query describing multiple events, e.g., "Two young girls are standing in the kitchen preparing to cook. They then open a box of brownies … get an egg out of the fridge. After, the two continue to stir the contents …", the goal is to localize all target moments, here [0 s, 15.67 s], [15.67 s, 28.73 s], and [50.12 s, 57.59 s].)
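To make the multi-multi setting concrete, the sketch below shows what a single training sample could look like as a data structure. The class and field names are hypothetical; the sentences and timestamps are taken from the example in the figure above.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VPGSample:
    """One Video Paragraph Grounding sample: a paragraph query is grounded to
    one temporal segment per sentence (multi-multi localization)."""
    video_id: str
    sentences: List[str]                # ordered sentences of the paragraph query
    moments: List[Tuple[float, float]]  # (start_sec, end_sec) for each sentence

sample = VPGSample(
    video_id="v_example",
    sentences=[
        "Two young girls are standing in the kitchen preparing to cook.",
        "They then open a box of brownies and get an egg out of the fridge.",
        "After, the two continue to stir the contents.",
    ],
    moments=[(0.0, 15.67), (15.67, 28.73), (50.12, 57.59)],
)
assert len(sample.sentences) == len(sample.moments)
```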
14. Rethinking
Multi-Multi Localization: Chance and Challenge
Paragraph queries bring more contextual information, but they also increase the difficulty of multimodal representation and alignment.
15. Our Proposed Method
Graph-based Transformer with Language Reconstruction (GTLR)
A. Multimodal Graph Encoder, which conducts message passing among fine-grained features (a toy sketch of this message passing follows this slide's reference)
B. Event-wise Decoder, which explores contexts and achieves multi-multi localization
C. Language Reconstructor, which aligns events with their corresponding sentences
(Figure: GTLR architecture. Clip-level features of the video and word-level and sentence-level features of the paragraph, e.g., Sentence 1: "A woman is wrapping a box.", are concatenated with position embeddings and fed to (A) the Multimodal Graph Encoder, (B) the Event-wise Decoder, and (C) the Language Reconstructor.)
[4] Xun Jiang, Xing Xu*, Jingran Zhang, Fumin Shen, Zuo Cao and Xunliang Cai, “GTLR: Graph-Based Transformer with Language
Reconstruction for Video Paragraph Grounding.” ICME. 2022. (Best Student Paper Award)
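As a rough intuition for component (A), the following toy sketch runs one round of attention-based message passing over a graph whose nodes are clip-level and word-level features. It only illustrates the general idea; it is not the authors' GTLR implementation, and the optional adjacency mask is a simplification of the actual graph construction.

```python
import torch
import torch.nn.functional as F

def cross_modal_message_passing(clip_feats, word_feats, adj=None):
    """One round of attention-weighted message passing over clip and word nodes.

    clip_feats: (Nc, D), word_feats: (Nw, D), adj: optional (N, N) 0/1 edge mask.
    """
    nodes = torch.cat([clip_feats, word_feats], dim=0)        # (N, D) graph nodes
    attn = nodes @ nodes.t() / nodes.size(-1) ** 0.5          # scaled dot-product scores
    if adj is not None:
        attn = attn.masked_fill(adj == 0, float("-inf"))      # keep only graph edges
    attn = F.softmax(attn, dim=-1)
    return nodes + attn @ nodes                               # aggregate neighbors + residual

# toy usage: 64 clips and 20 words, each a 256-d feature
clips, words = torch.randn(64, 256), torch.randn(20, 256)
updated_nodes = cross_modal_message_passing(clips, words)     # (84, 256)
```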
16. Experiments
Datasets
ActivityNet-Caption
• More than 20k videos from YouTube
• Open domain
• Vocabulary of more than 15k words
• Long videos (117 s on average)
• Complex semantics
• Sufficient context
Charades-STA
• More than 6k videos
• Indoor domain
• Vocabulary of about 1,300 words
• Short videos (30 s on average)
• Simple semantics
• Insufficient context
17. Experiments
Overall Comparison
Our method outperforms conventional VSG methods by a large margin.
It also achieves great improvement over the previous VPG method.
GTLR performs better with more context information in the paragraph queries.
Overall Comparison on Two Benchmark Datasets

Method   Venue      ActivityNet-Caption                    Charades-STA
                    IoU=0.3  IoU=0.5  IoU=0.7  mIoU        IoU=0.3  IoU=0.5  IoU=0.7  mIoU
CBP      ACL'20     54.3     35.76    17.8     36.85       -        36.80    18.87    -
CPNet    AAAI'21    -        40.56    21.63    40.65       -        40.32    22.47    37.36
BPNet    AAAI'21    58.98    42.07    24.69    42.11       55.46    38.25    20.51    38.03
SV-VMR   ICME'21    61.39    45.21    27.32    -           -        38.09    19.98    -
FVMR     CVPR'21    53.12    41.48    29.12    -           -        38.16    18.22    -
DepNet   AAAI'21    72.81    55.91    33.46    -           58.47    40.23    21.08    39.11
GTLR     Ours       77.49    60.80    37.73    55.20       62.96    44.31    22.80    42.37
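For reference, the "IoU=m" and "mIoU" columns in tables like the one above follow a simple protocol, sketched below under the standard assumption that "IoU=m" denotes Recall@1 at threshold m, i.e., the fraction of queries whose top-1 predicted segment overlaps the ground truth with temporal IoU of at least m.

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, threshold):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

def mean_iou(preds, gts):
    """The mIoU column: average top-1 IoU over all queries (as a percentage)."""
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)

# toy usage: two queries with predicted and ground-truth moments
preds, gts = [(0.0, 14.0), (20.0, 30.0)], [(0.0, 15.67), (15.67, 28.73)]
print(recall_at_iou(preds, gts, 0.7), mean_iou(preds, gts))
```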
18. Experiments
Qualitative Analysis
GTLR achieves a remarkable improvement in localization precision.
Language reconstruction makes it more explainable and interpretable.
As constraints for multimodal alignment, the reconstructed words show a tendency toward semantic clustering.
Visualization of two typical grounding examples by our GTLR and the counterpart DepNet [AAAI’21]
19. Visualization Demo
Cross-modal Video Content Retrieval
Our GTLR method effectively aligns the two modalities and retrieves multiple events in parallel.
This work received the ICME 2022 Best Student Paper Award.
Query: Two young girls stand in a kitchen, with one holding a bag of something and talking and the other standing on a chair. Then, the two collaborate to …
A case from the ActivityNet-Caption dataset
ICME 2022 Best Student Paper
20. Discussion
Existing Challenges for VPG
Intra-modality contextual information is not sufficiently explored in GTLR, because the model focuses on fine-grained cross-modal interactions.
Video-paragraph pairs require a much heavier annotation cost than video-sentence pairs.
(Figure: video-paragraph annotations vs. video-sentence annotations for example pairs)
21. Our Proposed Method
Semi-supervised Video-Paragraph Transformer (SVPTR)
We first build a base model, namely the Video-Paragraph TRansformer (VPTR).
Based on the fully supervised base model VPTR, we introduce the mean-teacher framework and thus achieve semi-supervised VPG (a sketch of the teacher update follows the reference below).
[5] Xun Jiang, Xing Xu*, Jingran Zhang, Fumin Shen, Zuo Cao, Heng Tao Shen, “Semi-Supervised Video Paragraph Grounding With
Contrastive Encoder.” CVPR. 2022.
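The mean-teacher part can be pictured with the standard exponential-moving-average (EMA) update below. This is a generic sketch of the mean-teacher framework rather than the authors' code; the momentum value and the Linear stand-in for the VPTR model are assumptions.

```python
import copy
import torch

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.999):
    """EMA update: the teacher's weights slowly track the student's weights."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1.0 - momentum)

# typical setup: the teacher starts as a frozen copy of the student and is used
# to produce pseudo temporal labels for the unlabeled video-paragraph pairs
student = torch.nn.Linear(256, 2)      # stand-in for the VPTR model
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
update_teacher(student, teacher)       # called once per training step
```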
22. Experiments
Overall Comparisons
We evaluate our method under fully supervised and semi-supervised settings, respectively. It shows competitive performance with less temporally annotated data.
Our proposed SVPTR method achieves remarkable improvements over both previous VSG and VPG methods.
(Tables: semi-supervised performance on ActivityNet-Caption, TACoS, and Charades-CD-OOD; fully supervised performance on TACoS; fully supervised performance on ActivityNet-Caption)
23. Experiments
Further Analysis and Visualization
The experimental results on different proportions of labeled training data show the
effectiveness of our SVPTR method.
We also visualize the attention weights of the decoders to further illustrate the process of event-wise cross-modal alignment.
Analysis on the annotated data
Visualization of attention weights in event-wise decoder
24. The Latest Works
Exploration on Model Practice: Data and Computation
Existing methods follow a multimodal fusion pipeline to retrieve video content, which leads to heavy computational consumption.
Complete temporal annotations are subjective and hard to label, thus bringing a high data cost.
We achieve a more practical retrieval paradigm with a fair trade-off among data cost, model efficiency, and retrieval accuracy!
25. Our Proposed Method
Cheaper and Faster Moment Retrieval (CFMR)
We propose a Concept-based Multimodal Alignment module that avoids the online cross-modal interactions widely used in previous methods.
We introduce point-level supervised learning to reduce the annotation cost, and Point-guided Contrastive Learning to leverage the incomplete supervision signals (see the sketch below).
[6] Xun Jiang, Zailei Zhou, Xing Xu*, Yang Yang, Guoqing Wang, Heng Tao Shen, Faster Video Moment Retrieval with Point-Level
Supervision. ACM MM. 2023.
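A point-level label provides only a single timestamp inside the target moment instead of its full boundaries. The sketch below shows one common way such a label can be turned into a soft supervision signal over clip positions; it is illustrative only and not necessarily the exact scheme used in CFMR (the Gaussian width is an arbitrary choice).

```python
import numpy as np

def point_to_prior(point_sec, video_len_sec, num_clips, sigma_ratio=0.1):
    """Spread a single annotated timestamp into a soft Gaussian prior over clips."""
    centers = (np.arange(num_clips) + 0.5) * video_len_sec / num_clips
    sigma = sigma_ratio * video_len_sec
    prior = np.exp(-0.5 * ((centers - point_sec) / sigma) ** 2)
    return prior / prior.sum()       # normalized weights, one per clip

# a 120-second video split into 32 clips, with one annotated point at t = 42.5 s
weights = point_to_prior(42.5, 120.0, 32)
```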
26. Experiments
Retrieval Accuracy
Similarly, we evaluate our proposed CFMR method on the Charades-STA,
ActivityNet-Captions, and TACoS datasets.
During training, complete temporal annotations are inaccessible; only point-level supervision is provided.
27. Experiments
Retrieval Efficiency
At inference time, the semantic reconstructor is disabled and only the two concept encoders are deployed, so no cross-modal interaction or fusion is required.
As our method bypasses cross-modal interactions, the online computation, which can only start after receiving users' queries, is reduced significantly.
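The efficiency gain follows from moving all video encoding offline. The sketch below uses placeholder encoders and toy data (not CFMR's actual modules) to show the pattern: gallery videos are embedded once and cached, and at query time only the text is encoded, so matching reduces to dot products with no per-query cross-modal fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frame_feats):
    """Placeholder offline video (concept) encoder: mean-pool frame features."""
    return frame_feats.mean(axis=0)

def encode_text(token_feats):
    """Placeholder online text (concept) encoder: mean-pool token features."""
    return token_feats.mean(axis=0)

# offline stage: encode every gallery video once and cache the embeddings
gallery = {f"video_{i}": rng.normal(size=(200, 128)) for i in range(1000)}
video_index = {vid: encode_video(frames) for vid, frames in gallery.items()}

def retrieve(query_tokens, top_k=5):
    """Online stage: encode only the query, then rank cached video embeddings."""
    q = encode_text(query_tokens)
    scores = {vid: float(v @ q) for vid, v in video_index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(retrieve(rng.normal(size=(12, 128))))
```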
28. Experiments
Further Analysis
We further analyze our method with ablation studies, retrieval efficiency on long videos of different lengths, visualizations of the multimodal representations, and performance on out-of-distribution datasets.
(Figures: ablation studies; efficiency analysis; multimodal representation; OOD performance)
29. Experiments
Further Analysis
Finally, we also conduct a qualitative analysis to comprehensively compare retrieval performance, efficiency, and annotation cost.
30. The Latest Works
Towards Applications: Retrieving Content from Video Collections
Exploring a novel fine-grained video retrieval paradigm: a unified framework for video-text retrieval and video content retrieval.
Existing fusion-based methods are NOT compatible with video-text retrieval, as it is IMPOSSIBLE to conduct cross-modal interactions with every video in a collection.
Following the text-based video retrieval setting: NO additional temporal annotations are used.
31. Multi-Granularity Video Content Retrieval
Our Solution: Joint Search and Grounding (JSG)
Retrieving, from video collections, untrimmed videos that contain multiple events using partial text queries, while synchronously retrieving the most related video events.
We design two branches to tackle this challenging problem, i.e., the Glance Branch and the Gaze Branch, which retrieve video files and events collaboratively (see the sketch below).
[7] Zhiguo Chen, Xun Jiang, Xing Xu*, Zuo Cao, Yijun Mo, Heng Tao Shen, Joint Searching and Grounding: Multi-Granularity Video Content
Retrieval. ACM MM. 2023.
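The two-branch idea can be sketched as follows: given query-to-event similarity scores, a video-level score (the "glance" view) aggregates the event scores of each video, and the event-level result (the "gaze" view) is the best-matching moment inside each retrieved video. The max-pooling aggregation and the toy scores below are illustrative assumptions, not necessarily what JSG does.

```python
import numpy as np

def multi_granularity_retrieval(event_scores, top_k=3):
    """Rank videos by their best-matching event ('glance') and return that
    event's index inside each retrieved video ('gaze').

    event_scores: dict video_id -> np.ndarray of per-event similarities.
    """
    video_scores = {vid: float(s.max()) for vid, s in event_scores.items()}
    ranked = sorted(video_scores, key=video_scores.get, reverse=True)[:top_k]
    return [(vid, int(event_scores[vid].argmax())) for vid in ranked]

# toy gallery: four videos, each with a different number of candidate events
rng = np.random.default_rng(0)
gallery_scores = {f"video_{i}": rng.random(rng.integers(3, 8)) for i in range(4)}
print(multi_granularity_retrieval(gallery_scores))
```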
32. Experiments
Evaluate on Two Subtasks
We evaluate our JSG method on both the video-level and event-level retrieval tasks, which together comprehensively reflect the multi-granularity video retrieval performance.
Note that ALL temporal annotations are INACCESSIBLE.
33. Experiments
Further Analysis
We also conduct more ablation studies and examine the retrieval performance of the two branches, i.e., the Glance Branch and the Gaze Branch, to quantitatively analyze our proposed JSG method.
34. Experiments
Further Analysis
Here we demonstrate some retrieval cases from our experimental datasets.
35. Generative LLM-Driven Video Understanding
The prosperity of LLMs and the progress of NLP research have demonstrated that data-centric AI is a reliable path to well-performing models with strong generalization.
LLM-driven pre-trained multimodal models will become a popular multimodal learning paradigm.
Video-text alignment may be an essential topic in “Video+LLM” studies.
MiniGPT-4 [Zhu et al., arXiv’23]
BLIP-2 [Li et al., arXiv’23]
LLaVA [Liu et al., arXiv’23]
[8] Deyao Zhu, et al. "MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models." arXiv. 2023.
[9] Junnan Li, et al. "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models." arXiv. 2023.
[10] Haotian Liu, et al. “Visual Instruction Tuning”. arXiv. 2023.
36. Generative LLM-Driven Video Understanding
Leveraging generative LLMs is a booming research direction in video-language studies, particularly for video understanding.
Video-ChatGPT
[11] Maaz, Muhammad, et al. "Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models." arXiv. 2023.
37. Generative LLM-Driven Video Understanding
Leveraging generative LLMs is a booming research direction in video-language studies, particularly for video understanding.
VideoChat (AskAnything)
[12] KunChang Li, et al. "VideoChat: Chat-Centric Video Understanding." arXiv. 2023.
38. Generative LLM-Driven Video Understanding
Leveraging generative LLMs is a booming research direction in video-language studies, particularly for video understanding.
Video-LLaMA: bridging video/audio models and generative LLMs with a Q-Former.
[13] Hang Zhang, et al. "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding." EMNLP. 2023.
39. Generative LLM-Driven Video Understanding
Leveraging generative LLMs is a booming research direction in video-language studies, particularly for video understanding.
Video-LLaVA: learning unified visual representations with a LanguageBind encoder before feeding them to the LLM.
[14] Bin Zhu, et al. "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment." arXiv. 2023.
40. Generative LLM-Driven Video Understanding
Leveraging generative LLMs is a booming research direction in video-language studies, particularly for video understanding.
TimeChat: an LLM-driven video-language model for time-sensitive tasks.
[15] Shuhuai Ren, et al. "TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding." arXiv. 2023.
41. Generative LLM-Driven Video Understanding
Leveraging generative LLMs is a booming research direction in video-language studies, particularly for video understanding.
GPT-4V: what can we do if we combine video retrieval with LLMs?
GPT-4 Turbo with Vision in Microsoft Azure
Otter with Apple Vision Pro
42. Discussions for Future Work
The prosperity of LLMs and the progress of NLP research have demonstrated that data-centric AI is a reliable path to well-performing models with strong generalization.
LLM-driven pre-trained multimodal models will become a popular multimodal learning paradigm.
Task-specific research will be highly competitive, and the tasks will become more and more complicated.
Potential future research may follow two directions: 1) embracing LLMs, and 2) exploring model-agnostic theoretical evidence for multimodal video learning.
VideoChat [Li et al., arXiv’23]
Video-LLaVA [Zhu et al., arXiv’23]
TimeChat [Ren et al., arXiv’23]
[12] KunChang Li, et al. "VideoChat: Chat-Centric Video Understanding." arXiv. 2023.
[14] Bin Zhu, et al. "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment." arXiv. 2023.
[15] Shuhuai Ren, et al. "TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding." arXiv. 2023.
43. Thank you! Comments and corrections are welcome.