AIGC-Driven 3D Scene Understanding and Medical Image Analysis
1. AIGC-Driven 3D Scene Understanding and Medical Image Analysis
The Chinese University of Hong Kong, Shenzhen
Assistant Professor
Dr. Zhen Li
2. Speaker Introduction
Zhen Li, Assistant Professor, Assistant Dean of FNII
• Ph.D., The University of Hong Kong (advised by Prof. Yizhou Yu); visiting scholar at the University of Chicago (with Prof. Jinbo Xu)
• Assistant Dean, School of Science and Engineering / Future Network of Intelligence Institute (FNII), The Chinese University of Hong Kong, Shenzhen; Assistant Professor; Presidential Young Scholar
• Director of the Deep Bit Lab, CUHK-Shenzhen
• Group: 1 post-doc, 8 Ph.D. students, 2 master's students
Talent & Honors
• World champion in contact-map prediction at CASP12, used as a baseline for AlphaFold v1
• PLOS Computational Biology 2018 Innovation & Breakthrough Award (one awardee per year)
• CAST Young Elite Scientists Sponsorship Program, 2019
• Monthly 1st place in CAMEO protein model quality assessment (May 2022); 1st place in the 2022 SemanticKITTI segmentation challenge; 1st place in the CVPR 2023 HOI4D segmentation challenge; 1st place in the 2018 global weather-forecasting competition; 2nd place in ICCV 2022 Urban3D; among others
Research & Academics
• PI of an NSFC Young Scientists Fund project
• PI of the Shenzhen-Hong Kong Category A project "Deep-learning-assisted RNA and protein structure prediction and design of high-protein-affinity RNAs" (RMB 3 million)
• CCF-Tencent Rhino-Bird 2019 Outstanding Award; 2022 Rhino-Bird Special Program
• Participant in a MOST National Key R&D Program project
• Co-lead of an NSFC Key Program project and a Guangdong-Shenzhen Joint Fund Key project
3. Outline
• AIGC-driven dense captioning and visual grounding for 3D indoor scenes
• AIGC-driven high-precision 3D talking-face driving and generation
• AIGC-driven colonoscopy image generation and analysis
4. Case Overview
• A concise overview of the case in under 300 words (highlighting key points and what makes the case unique)
With the rapid development of generative models such as AIGC and ChatGPT, we have explored AIGC-driven 3D scene understanding and medical-scene analysis. Through a series of in-house algorithms and tools, we have studied downstream applications assisted by AIGC algorithms in depth, ranging from automatic dense captioning of 3D scenes and visual grounding in indoor scenes to 3D-vision-driven high-fidelity talking-face generation, and further extending to AIGC-assisted analysis of medical scenes. In this talk, we will detail the architecture design and engineering practice of our solutions for 3D scene captioning and grounding, 3D talking-face generation, and generated-image-assisted gastroscopy/colonoscopy image analysis. We will also share our reflections from applying AIGC to 3D scene understanding and medical image understanding, along with an outlook on the future evolution of AIGC.
5. Outline
• AIGC-driven dense captioning and visual grounding for 3D indoor scenes
• AIGC-driven high-precision 3D talking-face driving and generation
• AIGC-driven colonoscopy image generation and analysis
6. InstanceRefer: Cooperative Holistic Understanding for
Visual Grounding on Point Clouds through Instance Multi-
level Contextual Referring
Zhihao Yuan 1,†, Xu Yan 1,†, Yinghong Liao 1, Ruimao Zhang 1
Sheng Wang 2, Zhen Li 1,*, and Shuguang Cui 1
1 The Chinese University of Hong Kong (Shenzhen), Shenzhen Research Institute of Big Data
2 CryoEM Center, Southern University of Science and Technology
7. Background
Visual Grounding:
Visual grounding (VG) aims at localizing the desired objects or areas in an image or a
3D scene based on an object-related linguistic query
ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
8. Background
ScanRefer:
1. Exploiting object detection to generate proposal candidates;
2. Localizing the described object by fusing language features into the candidates.
ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
9. Background
ScanRefer:
Cons:
1. The object proposals in the large 3D scene are usually redundant;
2. The appearance and attribute information is not sufficiently captured;
3. The relations among proposals and the ones between proposals and background
are not fully studied.
• ScanRefer generates 114 possible candidates after filtering proposals by their objectness scores;
• Each proposal's feature is generated by the detection framework;
• There is no relation reasoning among the proposals.
ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
10. Method
InstanceRefer:
1. Instance-level candidate representation (a small number of candidates);
2. Multi-level contextual inference (attributes, object relations, and environment).
11. Method
InstanceRefer Architecture:
Language feature encoding (the same as ScanRefer).
[Architecture figure: the input description ("There is a gray and blue leather chair. Placed in a raw with other chairs in the side of the wall.") is mapped to GloVe word embeddings W and then encoded by a BiGRU into word features E]
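As a rough illustration of this language branch, the sketch below wires pretrained GloVe-style vectors into a bidirectional GRU in PyTorch; the embedding size, hidden size, and pooling choice are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    """Minimal sketch of the language branch: pretrained word vectors -> BiGRU."""
    def __init__(self, glove_weights: torch.Tensor, hidden_dim: int = 128):
        super().__init__()
        # glove_weights: (vocab_size, 300) matrix of pretrained GloVe vectors (frozen here).
        self.embedding = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.bigru = nn.GRU(
            input_size=glove_weights.size(1),
            hidden_size=hidden_dim,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, token_ids: torch.Tensor):
        # token_ids: (batch, seq_len) integer indices into the GloVe vocabulary.
        w = self.embedding(token_ids)   # word embeddings W: (batch, seq_len, 300)
        e, _ = self.bigru(w)            # word features E: (batch, seq_len, 2 * hidden_dim)
        sentence_feat = e.mean(dim=1)   # simple pooled sentence feature
        return e, sentence_feat

# toy usage with random stand-in "GloVe" weights
glove = torch.randn(1000, 300)
encoder = LanguageEncoder(glove)
tokens = torch.randint(0, 1000, (2, 12))
word_feats, sent_feat = encoder(tokens)
```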
12. Method
InstanceRefer Architecture:
Extracting instances through panoptic segmentation (predict instance and semantics).
[Architecture figure: the input point cloud P is processed by panoptic segmentation into semantics S and instance masks I (e.g., table, chairs), from which the individual instances are extracted; the language branch is as before]
13. Method
InstanceRefer Architecture:
Eliminating irrelevant instances by the target category (inferred from the language).
[Architecture figure: the language branch predicts the target category ("chair"); the instances extracted by panoptic segmentation are filtered by this category, leaving only the chair candidates]
14. Method
InstanceRefer Architecture:
Generating the visual feature of each candidate by multi-level referring (three novel modules are proposed).
[Architecture figure: each candidate is encoded by the Attribute Perception (AP), Relation Perception (RP), and Global Localization Perception (GLP) modules, yielding multi-level visual context features]
15. Method
InstanceRefer Architecture:
Scoring each candidate by matching language and visual features (the candidate with the largest score is regarded as the output).
[Architecture figure: attention pooling aggregates the word features; each candidate's multi-level visual context is matched against the language feature to produce a similarity score Q (e.g., 0.95 for the correct chair vs. 0.03 and 0.31 for other candidates), and the highest-scoring candidate is returned]
16. Method
Specific Modules:
(a) Attribute Perception (AP) Module.
• It constructs a four-layer Sparse Convolution (SparseConv) network as the feature extractor;
• After an average pooling, the global attribute perception feature is obtained.
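The AP idea can be sketched as follows; note that a shared point-wise MLP stands in for the actual four-layer SparseConv backbone, so the layer choices and dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AttributePerception(nn.Module):
    """Sketch of the AP idea: encode a candidate's points, then average-pool them into one
    global attribute feature. A shared point-wise MLP is used here as a stand-in for the
    four-layer SparseConv feature extractor described on the slide."""
    def __init__(self, in_dim: int = 6, feat_dim: int = 128):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, candidate_points: torch.Tensor) -> torch.Tensor:
        # candidate_points: (num_points, in_dim), e.g. xyz + rgb of one candidate instance.
        point_feats = self.point_mlp(candidate_points)  # (num_points, feat_dim)
        return point_feats.mean(dim=0)                  # global attribute feature: (feat_dim,)
```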
17. Method
Specific Modules:
(b) Relation Perception (RP) Module.
• It uses k-nearest neighbors to construct a graph, whose node features are the semantics obtained by panoptic segmentation and whose edges encode the nodes' semantics and relative positions;
• A dynamic graph convolution network (DGCNN) is exploited to update each node's features.
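A rough sketch of this graph construction and an EdgeConv-style node update is given below; the edge-feature layout (neighbor semantics plus relative positions) follows the slide, while the dimensions and aggregation details are assumptions.

```python
import torch
import torch.nn as nn

def knn_edge_features(node_feats: torch.Tensor, centers: torch.Tensor, k: int = 8):
    """Sketch of the RP graph construction.
    node_feats: (N, C) semantic features per candidate; centers: (N, 3) instance centers."""
    dists = torch.cdist(centers, centers)                  # (N, N) pairwise distances
    idx = dists.topk(k + 1, largest=False).indices[:, 1:]  # k nearest neighbors, excluding self
    neighbor_feats = node_feats[idx]                       # (N, k, C)
    rel_pos = centers[idx] - centers[:, None, :]           # (N, k, 3) relative positions
    # Edge features combine the neighbors' semantics and the relative positions.
    return torch.cat([neighbor_feats, rel_pos], dim=-1)    # (N, k, C + 3)

class EdgeConvBlock(nn.Module):
    """EdgeConv-style update as in DGCNN: an MLP on edge features, then max over neighbors."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, edge_feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(edge_feats).max(dim=1).values      # (N, out_dim) updated node features

# toy usage: 5 candidate instances with 32-d semantic features
feats, centers = torch.randn(5, 32), torch.randn(5, 3)
edges = knn_edge_features(feats, centers, k=3)
updated_nodes = EdgeConvBlock(32 + 3, 64)(edges)
```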
18. Method
Specific Modules:
(c) Global Localization Perception (GLP) Module.
• It uses SparseConv layers with height pooling to generate a 3 × 3 bird's-eye-view (BEV) plane;
• Combined with the language features, it predicts which grid cell the target object is located in;
• It interpolates the probabilities and generates the global perception features by merging features from the AP module.
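The sketch below illustrates the GLP idea with a dense BEV feature map standing in for SparseConv plus height pooling; the 3 × 3 grid classification conditioned on the language feature follows the slide, and everything else (dimensions, pooling, classifier head) is an assumption.

```python
import torch
import torch.nn as nn

class GlobalLocalizationPerception(nn.Module):
    """Sketch of the GLP idea: pool scene features over height into a 3x3 BEV grid and,
    together with the language feature, predict which grid cell contains the target.
    A dense BEV feature map stands in for the SparseConv + height-pooling of the paper."""
    def __init__(self, bev_dim: int = 128, lang_dim: int = 256):
        super().__init__()
        self.classifier = nn.Linear(bev_dim + lang_dim, 1)

    def forward(self, bev_feats: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # bev_feats: (bev_dim, H, W) scene features already pooled over the height axis.
        # lang_feat: (lang_dim,) pooled sentence feature.
        grid = nn.functional.adaptive_avg_pool2d(bev_feats, (3, 3))  # (bev_dim, 3, 3)
        cells = grid.flatten(1).transpose(0, 1)                      # (9, bev_dim)
        lang = lang_feat.expand(9, -1)                               # (9, lang_dim)
        logits = self.classifier(torch.cat([cells, lang], dim=-1)).squeeze(-1)
        return logits.softmax(dim=0)                                 # probability per grid cell

# toy usage
glp = GlobalLocalizationPerception()
probs = glp(torch.randn(128, 32, 32), torch.randn(256))
```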
19. Method
Specific Modules:
(d) Matching Module
• A naive version using cosine similarity;
• An enhanced version using the modular co-attention from MCAN [1].
(e) Contrastive Objective
where Q+ and Q− denote the scores of positive and negative pairs.
[1] Deep modular co-attention networks for visual question answering
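The objective itself appeared as an image on the slide and did not survive extraction. One plausible form, assuming a margin-based ranking loss over the similarity scores Q (not necessarily the paper's exact formulation), is:

```latex
% Assumed form: a margin-based ranking loss that pushes the positive score above every negative.
\mathcal{L}_{\mathrm{con}} = \sum_{\text{negatives}} \max\bigl(0,\; m - Q^{+} + Q^{-}\bigr),
\qquad m > 0 \ \text{is a margin.}
```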
20. Results
ScanRefer:
21. Results
22. Results
23. Results
Nr3D/Sr3D:
24. InstanceRefer: Cooperative Holistic Understanding for Visual
Grounding on Point Clouds through Instance Multi-level
Contextual Referring
Thanks for watching!
Zhihao Yuan 1,†, Xu Yan 1,†, Yinghong Liao 1, Ruimao Zhang 1
Sheng Wang 2, Zhen Li 1,*, and Shuguang Cui 1
1 The Chinese University of Hong Kong (Shenzhen), Shenzhen Research Institute of Big Data
2 CryoEM Center, Southern University of Science and Technology
25. X-Trans2Cap: Cross-Modal Knowledge Transfer using
Transformer for 3D Dense Captioning
Zhihao Yuan 1,†, Xu Yan 1,†, Yinghong Liao 1,
Yao Guo 2, Guanbin Li 3, Shuguang Cui 1, Zhen Li 1,*
1 The Chinese University of Hong Kong (Shenzhen), The Future Network of Intelligence Institute, Shenzhen Research Institute of Big Data
2 Shanghai Jiao Tong University, 3 Sun Yat-sen University
26. Background
Task Description (3D Dense Captioning)
Scan2Cap: Context-aware Dense Captioning in RGB-D Scans
27. Background
Limitations
• The object representations in Scan2Cap are defective since they are learned solely from sparse 3D point clouds, and thus fail to provide the strong texture and color information available from 2D images.
• It requires extra 2D input in both the training and inference phases; however, this extra 2D information is computation-intensive and usually unavailable during inference.
28. X-Trans2Cap
Motivation
• We propose a cross-modal knowledge transfer framework for the 3D dense captioning task.
• During the training phase, the teacher network exploits the auxiliary 2D modality and guides the student network, which takes only point clouds as input, through feature consistency constraints.
• A more faithful caption can then be generated using only point clouds during inference.
29. X-Trans2Cap
2D and 3D Inputs
[Figure: 3D proposals vs. 2D proposals]
30. X-Trans2Cap
2D and 3D Inputs
[Figure: target and reference objects; the 3D proposals alone form the 3D-modal inputs, while paired 3D and 2D proposals form the multi-modal inputs]
31. X-Trans2Cap
Architecture
[Figure (student branch, used in both training and inference): the 3D-modal inputs pass through encoder layers 1..L and a decoder layer to produce descriptions of the target and reference objects, supervised by a cross-entropy loss L_ce against the ground truth]
32. X-Trans2Cap
Architecture
[Figure: a teacher branch takes the multi-modal inputs and a student branch takes the 3D-modal inputs; each has encoder layers 1..L and a decoder layer producing descriptions, and both are supervised with the cross-entropy loss L_ce against the ground truth; only the student branch is used in both training and inference]
33. X-Trans2Cap
Architecture
[Figure: in addition to the two cross-entropy losses, a feature-alignment loss L_align constrains the student encoder features to stay consistent with the teacher encoder features]
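A minimal sketch of such a feature-consistency constraint is shown below, assuming an MSE-style alignment between student and teacher encoder features with the teacher detached; the paper's exact form of L_align may differ.

```python
import torch
import torch.nn as nn

def feature_alignment_loss(student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    """Sketch of a feature-consistency constraint between student and teacher encoders.
    An MSE alignment is an assumption here; detaching the teacher ensures that only the
    point-cloud student is pushed toward the 2D-assisted teacher, not the reverse."""
    return nn.functional.mse_loss(student_feats, teacher_feats.detach())

# toy usage: per-layer encoder features of shape (batch, num_proposals, dim)
student = torch.randn(4, 32, 256, requires_grad=True)
teacher = torch.randn(4, 32, 256)
l_align = feature_alignment_loss(student, teacher)
l_align.backward()
```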
34. X-Trans2Cap
Architecture
[Figure: Cross-Modal Fusion (CMF) modules at each encoder level (1..L) fuse the multi-modal teacher features with the 3D student features; the rest of the pipeline (feature alignment L_align, decoder layers, cross-entropy losses L_ce on the target and reference objects) is as before, and only the student branch is used in both training and inference]
35. X-Trans2Cap
Cross-Modal Fusion (CMF) Module
36. Experiments
3D Dense Captioning with Ground-Truth Proposals
(Nr3D and ScanRefer)
37. Experiments
3D Dense Captioning with Detection Proposals
(Nr3D and ScanRefer)
38. Experiments
Visualization
39. X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning
Thanks for watching!
Zhihao Yuan 1,†, Xu Yan 1,†, Yinghong Liao 1,
Yao Guo 2, Guanbin Li 3, Shuguang Cui 1, Zhen Li 1,*
1 The Chinese University of Hong Kong (Shenzhen), The Future Network of Intelligence Institute, Shenzhen Research Institute of Big Data
2 Shanghai Jiao Tong University, 3 Sun Yat-sen University
40. Outline
• AIGC-driven dense captioning and visual grounding for 3D indoor scenes
• AIGC-driven high-precision 3D talking-face driving and generation
• AIGC-driven colonoscopy image generation and analysis
41. Talking Faces
Task Description
[Pipeline figure: inputs (speech/text plus a face image/video) → deep network → output (talking-face video)]
Goal:
• Given text or speech as the driving signal and a face image or video providing the identity, generate a talking-face video whose lip motion stays consistent with the text or speech content;
Challenges:
• It is a cross-modal learning task, mapping from the speech/text modality to the image modality, which requires multi-modal feature extractors and cross-modal interaction learning;
• The human visual system is sensitive to the image quality of the generated video and to audio-lip synchronization, so generating high-quality talking-face videos is challenging.
42. Talking Faces
Existing Results
Temporal 3DMM approach (using point-cloud-analysis algorithms to predict facial keypoints)
• Fine-grained 3D face vertices provide explicit supervision for the speech-to-lip mapping, and long-range temporal information is taken into account, yielding stable talking-face videos;
43. Point-Cloud-Analysis-Driven High-Definition Talking Faces
Generation results
[Video frames: 3D animation, blended result, generated result]
44. Point-Cloud-Analysis-Driven High-Definition Talking Faces
Generation results in different languages
[Videos: Mandarin Chinese, German]
45. Point-Cloud-Analysis-Driven High-Definition Talking Faces
Generation results in different languages
[Videos: Cantonese, English]
46. Outline
• AIGC-driven dense captioning and visual grounding for 3D indoor scenes
• AIGC-driven high-precision 3D talking-face driving and generation
• AIGC-driven colonoscopy image generation and analysis
47. ArSDM: Colonoscopy Images Synthesis
with Adaptive Refinement Diffusion Models
48. Background
• Colonoscopy analysis, particularly automatic polyp segmentation and detection, is essential for assisting clinical diagnosis and treatment, while the scarcity of annotated data limits the effectiveness and generalization of existing models.
• The quality of data generated by GANs or other data augmentation methods is poor.
• Diffusion models have demonstrated remarkable progress in generating multiple modalities of medical data (CT, MRI, ...).
49. Overview of the Pipeline
[Pipeline figure: GT masks and original images → ArSDM → diffusion sampler → synthesized images, which are combined with the originals for downstream tasks such as segmentation and detection]
50. Pipeline
• Train a semantic diffusion model (our ArSDM).
• For each mask in the training set, sample a synthesized image, so the synthesized dataset has the same number of image-mask pairs as the original dataset.
• Combine the original diffusion training set with the synthesized dataset to train the polyp segmentation and detection models (see the sketch below).
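The data-mixing step referenced above can be sketched as follows; the tensor shapes and sizes are toy placeholders, and simply concatenating the two sets doubles the pool of image-mask pairs seen by the downstream trainer.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Toy stand-ins for the real datasets: in the paper, the 1,450 original image-mask pairs
# are paired with the same number of ArSDM-synthesized pairs.
original_pairs = TensorDataset(torch.randn(16, 3, 64, 64),
                               torch.randint(0, 2, (16, 1, 64, 64)).float())
synthetic_pairs = TensorDataset(torch.randn(16, 3, 64, 64),
                                torch.randint(0, 2, (16, 1, 64, 64)).float())

# Concatenating the original and synthesized sets doubles the downstream training pool.
combined = ConcatDataset([original_pairs, synthetic_pairs])
loader = DataLoader(combined, batch_size=8, shuffle=True)

for images, masks in loader:
    # hand (images, masks) to the polyp segmentation / detection trainer here
    break
```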
51. Model Architecture
[Architecture figure: the GT mask serves as the condition; the diffusion process corrupts the input image and a conditional U-Net estimates the noise; a re-weighting module derived from the GT mask produces a weights map for the adaptive diffusion loss L_adap; a sampled image is passed to PraNet, whose predicted mask yields the refinement loss L_ref]
52. Model Architecture
• Mask Conditioning
  • The segmentation masks are used as conditions, similar to semantic masks but with only two categories: foreground (polyp) and background (intestinal wall).
  • The conditional U-Net model is the same as in SDM (Semantic Image Synthesis via Diffusion Models, https://arxiv.org/abs/2207.00050).
• Adaptive Loss Function
  • Based on the ℓ1 loss, we define a pixel-wise weight matrix that assigns different weights according to the size ratio of the polyp over the background (see the sketch below).
  • In code, it is convenient to use the pixel values of the segmentation mask (0, 1).
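A minimal sketch of such a pixel-wise re-weighted ℓ1 loss is given below; the specific weighting rule (up-weighting polyp pixels in inverse proportion to their area ratio) is an assumption for illustration rather than the exact formula from the paper.

```python
import torch

def adaptive_l1_loss(pred_noise: torch.Tensor, true_noise: torch.Tensor, mask: torch.Tensor):
    """Sketch of an adaptive (pixel-wise re-weighted) L1 diffusion loss.
    pred_noise, true_noise: (B, C, H, W); mask: (B, 1, H, W) with 1 = polyp, 0 = background.
    The weighting uses only the binary mask values (0, 1), as noted on the slide."""
    num_pixels = mask.shape[-2] * mask.shape[-1]
    polyp_ratio = mask.sum(dim=(-2, -1), keepdim=True) / num_pixels  # (B, 1, 1, 1)
    bg_ratio = 1.0 - polyp_ratio
    # Small polyps get large weights; the (large) background gets small weights.
    weights = mask * bg_ratio + (1.0 - mask) * polyp_ratio
    return (weights * (pred_noise - true_noise).abs()).mean()

# toy usage
pred, target = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.9).float()
loss = adaptive_l1_loss(pred, target, mask)
```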
53. Model Architecture
• Mask Conditioning
• Adaptive Loss Function
• Refinement
  • A pre-trained segmentation model is used to fine-tune the diffusion model, in which the U-Net parameters are updated while the segmentation model's parameters are kept fixed (a sketch follows below).
  • For each time step, we need to sample an image, which is time-consuming.
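The refinement step can be sketched as follows; the function names and the differentiable sampling interface are assumptions, and only the diffusion U-Net receives gradients.

```python
import torch
import torch.nn as nn

def refinement_loss(diffusion_unet: nn.Module, seg_model: nn.Module,
                    sample_fn, mask: torch.Tensor) -> torch.Tensor:
    """Sketch of the refinement objective (sample_fn is a hypothetical, differentiable
    sampling routine conditioned on the GT mask). A frozen pre-trained segmentation model
    (e.g. PraNet) predicts a mask from the sampled image, and its disagreement with the
    GT mask back-propagates into the diffusion U-Net only."""
    for p in seg_model.parameters():   # the segmentation model stays fixed
        p.requires_grad_(False)

    synth_image = sample_fn(diffusion_unet, condition=mask)  # sampled, still differentiable
    pred_logits = seg_model(synth_image)                     # (B, 1, H, W) mask logits
    return nn.functional.binary_cross_entropy_with_logits(pred_logits, mask.float())
```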
54. Experimental Settings
Diffusion Training
• Training Set: Kvasir + CVC-ClinicDB (1450 image-mask pairs)
• Image Size: padded to the same height and width, then resized to 384 × 384
• Duration:
  • with refinement: around one-half NVIDIA A100 days (80 GB memory)
  • w/o refinement: around one A100 day
Diffusion Sampling
• DDIM sampler with 200 steps (the deterministic update it iterates is shown below)
• Random noise as input and the mask as the condition
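For reference, the deterministic DDIM update (η = 0) that such a sampler iterates, with the mask passed as the condition c, is the textbook form below (not code taken from the ArSDM release):

```latex
% Standard deterministic DDIM update (eta = 0), conditioned on the mask c.
\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t, c)}{\sqrt{\bar{\alpha}_t}},
\qquad
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{x}_0
        + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(x_t, t, c)
```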
55. Comparison Results
Polyp Segmentation
56. Comparison Results
Polyp Detection
57. Visualization
58. Colonoscopy Video Generation
with Diffusion Models
59. PVDM
Train the autoencoder on Sky Time-lapse, from scratch
Dataset: Sky Time-lapse (997 videos; 1,172,641 frames in total)
Training: 1 V100, 1 day
[Figure: input frames vs. reconstructions]
60. PVDM
Train the autoencoder on LDPolyp, initialized with the Sky Time-lapse weights
Dataset: LDPolyp (100 videos; 24,789 frames in total)
Training: 1 V100, 1.6 days
[Figure: input frames vs. reconstructions]
61. LVDM
Train the 2D autoencoder on LDPolyp, initialized with ImageNet-pretrained weights
Dataset: LDPolyp (100 videos; 24,789 frames in total)
Training: 1 V100, 1 day
[Figure: input frames vs. reconstructions]
62. LVDM-2
Train an unconditional diffusion model on LDPolyp (LVDM 5.1 release code)
Dataset: LDPolyp (100 videos; 24,789 frames in total)
Training: 1× 80 GB A100, 1 day
Details: attention layers added to part of the 3D U-Net; the parameter count is large, so batch_size = 2 already takes 48 GB of GPU memory
[Figure: generated samples]
63. Takeaways and Next Steps
• Further optimize multi-modal parsing and generation in 3D scenes
• Combine video diffusion to enhance talking-face results
• Use condition masks for video-diffusion generation in medical imaging scenarios