AIGC-Driven 3D Scene Understanding and Medical Image Analysis
1. AIGC-Driven 3D Scene Understanding and Medical Image Analysis
The Chinese University of Hong Kong, Shenzhen
Assistant Professor
Dr. Zhen Li
2. Speaker Introduction
Zhen Li, Assistant Professor, Assistant Dean of FNII
• Ph.D., The University of Hong Kong (advised by Prof. Yizhou Yu); visiting scholar at the University of Chicago (with Prof. Jinbo Xu)
• Assistant Dean, School of Science and Engineering / Future Network of Intelligence Institute (FNII), The Chinese University of Hong Kong, Shenzhen; Assistant Professor; Presidential Young Scholar
• Director of the Deep Bit Lab, CUHK-Shenzhen
• Group: 1 post-doc, 8 Ph.D. students, 2 master's students
Talent & Honors
• World champion in contact-map prediction at CASP12, used as a baseline for AlphaFold v1
• PLOS Computational Biology 2018 Innovation & Breakthrough Award (one awardee per year)
• CAST Young Elite Scientists Sponsorship Program, 2019
• Monthly 1st place in CAMEO protein model quality assessment (May 2022); 1st place in the 2022 SemanticKITTI segmentation challenge; 1st place in the CVPR 2023 HOI4D segmentation challenge; 1st place in the 2018 global weather-forecasting competition; 2nd place in ICCV 2022 Urban3D; among others
Research & Academics
• PI of an NSFC Young Scientists Fund project
• PI of the Shenzhen-Hong Kong Category A project "Deep-learning-assisted RNA and protein structure prediction and design of high-protein-affinity RNAs" (RMB 3 million)
• CCF-Tencent Rhino-Bird 2019 Outstanding Award; 2022 Rhino-Bird Special Program
• Participant in a MOST National Key R&D Program project
• Co-lead of an NSFC Key Program project and a Guangdong-Shenzhen Joint Fund Key project
3. Outline
• AIGC-driven dense captioning and visual grounding for 3D indoor scenes
• AIGC-driven high-precision 3D talking-face driving and generation
• AIGC-driven colonoscopy image generation and analysis
4. Case Overview
• A concise overview of the case in under 300 words (highlighting key points and what makes the case unique)
With the rapid development of generative models such as AIGC and ChatGPT, we have explored AIGC-driven 3D scene understanding and medical-scene analysis. Through a series of in-house algorithms and tools, we have studied downstream applications assisted by AIGC algorithms in depth, ranging from automatic dense captioning of 3D scenes and visual grounding in indoor scenes to 3D-vision-driven high-fidelity talking-face generation, and further extending to AIGC-assisted analysis of medical scenes. In this talk, we will detail the architecture design and engineering practice of our solutions for 3D scene captioning and grounding, 3D talking-face generation, and generated-image-assisted gastroscopy/colonoscopy image analysis. We will also share our reflections from applying AIGC to 3D scene understanding and medical image understanding, along with an outlook on the future evolution of AIGC.
5. Outline
• AIGC-driven dense captioning and visual grounding for 3D indoor scenes
• AIGC-driven high-precision 3D talking-face driving and generation
• AIGC-driven colonoscopy image generation and analysis
6. InstanceRefer: Cooperative Holistic Understanding for
Visual Grounding on Point Clouds through Instance Multi-
level Contextual Referring
Zhihao Yuan 1,†, Xu Yan 1,†, Yinghong Liao 1, Ruimao Zhang 1
Sheng Wang 2, Zhen Li 1,*, and Shuguang Cui 1
1 The Chinese University of Hong Kong (Shenzhen), Shenzhen Research Institute of Big Data
2 CryoEM Center, Southern University of Science and Technology
7. Background
Visual Grounding:
Visual grounding (VG) aims at localizing the desired objects or areas in an image or a
3D scene based on an object-related linguistic query
ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
8. Background
ScanRefer:
1. Exploiting object detection to generate proposal candidates;
2. Localizing the described object by fusing language features into the candidates.
ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
9. Background
ScanRefer:
Cons:
1. The object proposals in the large 3D scene are usually redundant;
2. The appearance and attribute information is not sufficiently captured;
3. The relations among proposals and the ones between proposals and background
are not fully studied.
• ScanRefer generates 114 possible candidates after filtering proposals by their objectness scores;
• Each proposal's feature is generated by the detection framework;
• There is no relation reasoning among the proposals.
ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
10. Method
InstanceRefer:
1. Instance-level candidate representation (a small number of candidates);
2. Multi-level contextual inference (attributes, object relations, and environment).
11. Method
InstanceRefer Architecture:
Language feature encoding (the same as ScanRefer).
[Architecture figure: the input description ("There is a gray and blue leather chair. Placed in a raw with other chairs in the side of the wall.") is mapped to GloVe word embeddings W and then encoded by a BiGRU into word features E]
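As a rough illustration of this language branch, the sketch below wires pretrained GloVe-style vectors into a bidirectional GRU in PyTorch; the embedding size, hidden size, and pooling choice are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    """Minimal sketch of the language branch: pretrained word vectors -> BiGRU."""
    def __init__(self, glove_weights: torch.Tensor, hidden_dim: int = 128):
        super().__init__()
        # glove_weights: (vocab_size, 300) matrix of pretrained GloVe vectors (frozen here).
        self.embedding = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.bigru = nn.GRU(
            input_size=glove_weights.size(1),
            hidden_size=hidden_dim,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, token_ids: torch.Tensor):
        # token_ids: (batch, seq_len) integer indices into the GloVe vocabulary.
        w = self.embedding(token_ids)   # word embeddings W: (batch, seq_len, 300)
        e, _ = self.bigru(w)            # word features E: (batch, seq_len, 2 * hidden_dim)
        sentence_feat = e.mean(dim=1)   # simple pooled sentence feature
        return e, sentence_feat

# toy usage with random stand-in "GloVe" weights
glove = torch.randn(1000, 300)
encoder = LanguageEncoder(glove)
tokens = torch.randint(0, 1000, (2, 12))
word_feats, sent_feat = encoder(tokens)
```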
12. Method
InstanceRefer Architecture:
Extracting instances through panoptic segmentation (predict instance and semantics).
[Architecture figure: the input point cloud P is processed by panoptic segmentation into semantics S and instance masks I (e.g., table, chairs), from which the individual instances are extracted; the language branch is as before]
13. Method
InstanceRefer Architecture:
Eliminating irrelevant instances by the target category (inferred from the language).
[Architecture figure: the language branch predicts the target category ("chair"); the instances extracted by panoptic segmentation are filtered by this category, leaving only the chair candidates]
14. Method
InstanceRefer Architecture:
Generating the visual feature of each candidate by multi-level referring (three novel modules are proposed).
[Architecture figure: each candidate is encoded by the Attribute Perception (AP), Relation Perception (RP), and Global Localization Perception (GLP) modules, yielding multi-level visual context features]
15. Method
InstanceRefer Architecture:
Scoring each candidate by matching language and visual features (the candidate with the largest score is regarded as the output).
[Architecture figure: attention pooling aggregates the word features; each candidate's multi-level visual context is matched against the language feature to produce a similarity score Q (e.g., 0.95 for the correct chair vs. 0.03 and 0.31 for other candidates), and the highest-scoring candidate is returned]
16. Method
Specific Modules:
(a) Attribute Perception (AP) Module.
• It constructs a four-layer Sparse Convolution (SparseConv) network as the feature extractor;
• After an average pooling, the global attribute perception feature is obtained.
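The AP idea can be sketched as follows; note that a shared point-wise MLP stands in for the actual four-layer SparseConv backbone, so the layer choices and dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AttributePerception(nn.Module):
    """Sketch of the AP idea: encode a candidate's points, then average-pool them into one
    global attribute feature. A shared point-wise MLP is used here as a stand-in for the
    four-layer SparseConv feature extractor described on the slide."""
    def __init__(self, in_dim: int = 6, feat_dim: int = 128):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, candidate_points: torch.Tensor) -> torch.Tensor:
        # candidate_points: (num_points, in_dim), e.g. xyz + rgb of one candidate instance.
        point_feats = self.point_mlp(candidate_points)  # (num_points, feat_dim)
        return point_feats.mean(dim=0)                  # global attribute feature: (feat_dim,)
```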
17. Method
Specific Modules:
(b) Relation Perception (RP) Module.
• It uses k-nearest neighbors to construct a graph, whose node features are the semantics obtained by panoptic segmentation and whose edges encode the nodes' semantics and relative positions;
• A dynamic graph convolution network (DGCNN) is exploited to update each node's features.
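A rough sketch of this graph construction and an EdgeConv-style node update is given below; the edge-feature layout (neighbor semantics plus relative positions) follows the slide, while the dimensions and aggregation details are assumptions.

```python
import torch
import torch.nn as nn

def knn_edge_features(node_feats: torch.Tensor, centers: torch.Tensor, k: int = 8):
    """Sketch of the RP graph construction.
    node_feats: (N, C) semantic features per candidate; centers: (N, 3) instance centers."""
    dists = torch.cdist(centers, centers)                  # (N, N) pairwise distances
    idx = dists.topk(k + 1, largest=False).indices[:, 1:]  # k nearest neighbors, excluding self
    neighbor_feats = node_feats[idx]                       # (N, k, C)
    rel_pos = centers[idx] - centers[:, None, :]           # (N, k, 3) relative positions
    # Edge features combine the neighbors' semantics and the relative positions.
    return torch.cat([neighbor_feats, rel_pos], dim=-1)    # (N, k, C + 3)

class EdgeConvBlock(nn.Module):
    """EdgeConv-style update as in DGCNN: an MLP on edge features, then max over neighbors."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, edge_feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(edge_feats).max(dim=1).values      # (N, out_dim) updated node features

# toy usage: 5 candidate instances with 32-d semantic features
feats, centers = torch.randn(5, 32), torch.randn(5, 3)
edges = knn_edge_features(feats, centers, k=3)
updated_nodes = EdgeConvBlock(32 + 3, 64)(edges)
```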
18. Method
Specific Modules:
(c) Global Localization Perception (GLP) Module.
• It uses SparseConv layers with height pooling to generate a 3 × 3 bird's-eye-view (BEV) plane;
• Combined with the language features, it predicts which grid cell the target object is located in;
• It interpolates the probabilities and generates the global perception features by merging features from the AP module.
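The sketch below illustrates the GLP idea with a dense BEV feature map standing in for SparseConv plus height pooling; the 3 × 3 grid classification conditioned on the language feature follows the slide, and everything else (dimensions, pooling, classifier head) is an assumption.

```python
import torch
import torch.nn as nn

class GlobalLocalizationPerception(nn.Module):
    """Sketch of the GLP idea: pool scene features over height into a 3x3 BEV grid and,
    together with the language feature, predict which grid cell contains the target.
    A dense BEV feature map stands in for the SparseConv + height-pooling of the paper."""
    def __init__(self, bev_dim: int = 128, lang_dim: int = 256):
        super().__init__()
        self.classifier = nn.Linear(bev_dim + lang_dim, 1)

    def forward(self, bev_feats: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # bev_feats: (bev_dim, H, W) scene features already pooled over the height axis.
        # lang_feat: (lang_dim,) pooled sentence feature.
        grid = nn.functional.adaptive_avg_pool2d(bev_feats, (3, 3))  # (bev_dim, 3, 3)
        cells = grid.flatten(1).transpose(0, 1)                      # (9, bev_dim)
        lang = lang_feat.expand(9, -1)                               # (9, lang_dim)
        logits = self.classifier(torch.cat([cells, lang], dim=-1)).squeeze(-1)
        return logits.softmax(dim=0)                                 # probability per grid cell

# toy usage
glp = GlobalLocalizationPerception()
probs = glp(torch.randn(128, 32, 32), torch.randn(256))
```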
19. Method
Specific Modules:
(d) Matching Module
• A naive version using cosine similarity;
• An enhanced version using the modular co-attention from MCAN [1].
(e) Contrastive Objective
where Q+ and Q− denote the scores of positive and negative pairs.
[1] Deep modular co-attention networks for visual question answering
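The objective itself appeared as an image on the slide and did not survive extraction. One plausible form, assuming a margin-based ranking loss over the similarity scores Q (not necessarily the paper's exact formulation), is:

```latex
% Assumed form: a margin-based ranking loss that pushes the positive score above every negative.
\mathcal{L}_{\mathrm{con}} = \sum_{\text{negatives}} \max\bigl(0,\; m - Q^{+} + Q^{-}\bigr),
\qquad m > 0 \ \text{is a margin.}
```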
20. Results
ScanRefer:
21. Results
22. Results
23. Results
Nr3D/Sr3D:
24. InstanceRefer: Cooperative Holistic Understanding for Visual
Grounding on Point Clouds through Instance Multi-level
Contextual Referring
Thanks for watching!
Zhihao Yuan 1,†, Xu Yan 1,†, Yinghong Liao 1, Ruimao Zhang 1
Sheng Wang 2, Zhen Li 1,*, and Shuguang Cui 1
1 The Chinese University of Hong Kong (Shenzhen), Shenzhen Research Institute of Big Data
2 CryoEM Center, Southern University of Science and Technology
25. X-Trans2Cap: Cross-Modal Knowledge Transfer using
Transformer for 3D Dense Captioning
Zhihao Yuan 1,†, Xu Yan 1,†, Yinghong Liao 1,
Yao Guo 2, Guanbin Li 3, Shuguang Cui 1, Zhen Li 1,*
1 The Chinese University of Hong Kong (Shenzhen), The Future Network of Intelligence Institute, Shenzhen Research Institute of Big Data
2 Shanghai Jiao Tong University, 3 Sun Yat-sen University
26. Background
Task Description (3D Dense Captioning)
Scan2Cap: Context-aware Dense Captioning in RGB-D Scans
27. Background
Limitations
• The object representations in Scan2Cap are defective since they are learned solely from sparse 3D point clouds, and thus fail to provide the strong texture and color information available from 2D images.
• It requires extra 2D input in both the training and inference phases; however, this extra 2D information is computation-intensive and usually unavailable during inference.
28. X-Trans2Cap
Motivation
• We propose a cross-modal knowledge transfer framework for the 3D dense captioning task.
• During the training phase, the teacher network exploits the auxiliary 2D modality and guides the student network, which takes only point clouds as input, through feature consistency constraints.
• A more faithful caption can then be generated using only point clouds during inference.
29. X-Trans2Cap
2D and 3D Inputs
[Figure: 3D proposals vs. 2D proposals]
30. X-Trans2Cap
2D and 3D Inputs
[Figure: target and reference objects; the 3D proposals alone form the 3D-modal inputs, while paired 3D and 2D proposals form the multi-modal inputs]
31. X-Trans2Cap
Architecture
[Figure (student branch, used in both training and inference): the 3D-modal inputs pass through encoder layers 1..L and a decoder layer to produce descriptions of the target and reference objects, supervised by a cross-entropy loss L_ce against the ground truth]
32. X-Trans2Cap
Architecture
[Figure: a teacher branch takes the multi-modal inputs and a student branch takes the 3D-modal inputs; each has encoder layers 1..L and a decoder layer producing descriptions, and both are supervised with the cross-entropy loss L_ce against the ground truth; only the student branch is used in both training and inference]
33. X-Trans2Cap
Architecture
[Figure: in addition to the two cross-entropy losses, a feature-alignment loss L_align constrains the student encoder features to stay consistent with the teacher encoder features]
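A minimal sketch of such a feature-consistency constraint is shown below, assuming an MSE-style alignment between student and teacher encoder features with the teacher detached; the paper's exact form of L_align may differ.

```python
import torch
import torch.nn as nn

def feature_alignment_loss(student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    """Sketch of a feature-consistency constraint between student and teacher encoders.
    An MSE alignment is an assumption here; detaching the teacher ensures that only the
    point-cloud student is pushed toward the 2D-assisted teacher, not the reverse."""
    return nn.functional.mse_loss(student_feats, teacher_feats.detach())

# toy usage: per-layer encoder features of shape (batch, num_proposals, dim)
student = torch.randn(4, 32, 256, requires_grad=True)
teacher = torch.randn(4, 32, 256)
l_align = feature_alignment_loss(student, teacher)
l_align.backward()
```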
34. X-Trans2Cap
Architecture
[Figure: Cross-Modal Fusion (CMF) modules at each encoder level (1..L) fuse the multi-modal teacher features with the 3D student features; the rest of the pipeline (feature alignment L_align, decoder layers, cross-entropy losses L_ce on the target and reference objects) is as before, and only the student branch is used in both training and inference]
35. X-Trans2Cap
Cross-Modal Fusion (CMF) Module
36. Experiments
3D Dense Captioning with Ground-Truth Proposals
(Nr3D and ScanRefer)
37. Experiments
3D Dense Captioning with Detection Proposals
(Nr3D and ScanRefer)
38. Experiments
Visualization
39. X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning
Thanks for watching!
Zhihao Yuan 1,†, Xu Yan 1,†, Yinghong Liao 1,
Yao Guo 2, Guanbin Li 3, Shuguang Cui 1, Zhen Li 1,*
1 The Chinese University of Hong Kong (Shenzhen), The Future Network of Intelligence Institute, Shenzhen Research Institute of Big Data
2 Shanghai Jiao Tong University, 3 Sun Yat-sen University
40. Outline
• AIGC-driven dense captioning and visual grounding for 3D indoor scenes
• AIGC-driven high-precision 3D talking-face driving and generation
• AIGC-driven colonoscopy image generation and analysis
41. Talking Faces
Task Description
[Pipeline figure: inputs (speech/text plus a face image/video) → deep network → output (talking-face video)]
Goal:
• Given text or speech as the driving signal and a face image or video providing the identity, generate a talking-face video whose lip motion stays consistent with the text or speech content;
Challenges:
• It is a cross-modal learning task, mapping from the speech/text modality to the image modality, which requires multi-modal feature extractors and cross-modal interaction learning;
• The human visual system is sensitive to the image quality of the generated video and to audio-lip synchronization, so generating high-quality talking-face videos is challenging.
42. Talking Faces
Existing Results
Temporal 3DMM approach (using point-cloud-analysis algorithms to predict facial keypoints)
• Fine-grained 3D face vertices provide explicit supervision for the speech-to-lip mapping, and long-range temporal information is taken into account, yielding stable talking-face videos;
43. Point-Cloud-Analysis-Driven High-Definition Talking Faces
Generation results
[Video frames: 3D animation, blended result, generated result]
44. Point-Cloud-Analysis-Driven High-Definition Talking Faces
Generation results in different languages
[Videos: Mandarin Chinese, German]
45. Point-Cloud-Analysis-Driven High-Definition Talking Faces
Generation results in different languages
[Videos: Cantonese, English]
46. Outline
• AIGC-driven dense captioning and visual grounding for 3D indoor scenes
• AIGC-driven high-precision 3D talking-face driving and generation
• AIGC-driven colonoscopy image generation and analysis
47. ArSDM: Colonoscopy Images Synthesis
with Adaptive Refinement Diffusion Models
48. Background
• Colonoscopy analysis, particularly automatic polyp segmentation and detection, is essential for assisting clinical diagnosis and treatment, while the scarcity of annotated data limits the effectiveness and generalization of existing models.
• The quality of data generated by GANs or other data augmentation methods is poor.
• Diffusion models have demonstrated remarkable progress in generating multiple modalities of medical data (CT, MRI, ...).
49. Overview of the Pipeline
[Pipeline figure: GT masks and original images → ArSDM → diffusion sampler → synthesized images, which are combined with the originals for downstream tasks such as segmentation and detection]
50. Pipeline
• Train a semantic diffusion model (our ArSDM).
• For each mask in the training set, sample a synthesized image, so the synthesized dataset has the same number of image-mask pairs as the original dataset.
• Combine the original diffusion training set with the synthesized dataset to train the polyp segmentation and detection models (see the sketch below).
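The data-mixing step referenced above can be sketched as follows; the tensor shapes and sizes are toy placeholders, and simply concatenating the two sets doubles the pool of image-mask pairs seen by the downstream trainer.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Toy stand-ins for the real datasets: in the paper, the 1,450 original image-mask pairs
# are paired with the same number of ArSDM-synthesized pairs.
original_pairs = TensorDataset(torch.randn(16, 3, 64, 64),
                               torch.randint(0, 2, (16, 1, 64, 64)).float())
synthetic_pairs = TensorDataset(torch.randn(16, 3, 64, 64),
                                torch.randint(0, 2, (16, 1, 64, 64)).float())

# Concatenating the original and synthesized sets doubles the downstream training pool.
combined = ConcatDataset([original_pairs, synthetic_pairs])
loader = DataLoader(combined, batch_size=8, shuffle=True)

for images, masks in loader:
    # hand (images, masks) to the polyp segmentation / detection trainer here
    break
```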
51. Model Architecture
[Architecture figure: the GT mask serves as the condition; the diffusion process corrupts the input image and a conditional U-Net estimates the noise; a re-weighting module derived from the GT mask produces a weights map for the adaptive diffusion loss L_adap; a sampled image is passed to PraNet, whose predicted mask yields the refinement loss L_ref]
52. Model Architecture
• Mask Conditioning
  • The segmentation masks are used as conditions, similar to semantic masks but with only two categories: foreground (polyp) and background (intestinal wall).
  • The conditional U-Net model is the same as in SDM (Semantic Image Synthesis via Diffusion Models, https://arxiv.org/abs/2207.00050).
• Adaptive Loss Function
  • Based on the ℓ1 loss, we define a pixel-wise weight matrix that assigns different weights according to the size ratio of the polyp over the background (see the sketch below).
  • In code, it is convenient to use the pixel values of the segmentation mask (0, 1).
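A minimal sketch of such a pixel-wise re-weighted ℓ1 loss is given below; the specific weighting rule (up-weighting polyp pixels in inverse proportion to their area ratio) is an assumption for illustration rather than the exact formula from the paper.

```python
import torch

def adaptive_l1_loss(pred_noise: torch.Tensor, true_noise: torch.Tensor, mask: torch.Tensor):
    """Sketch of an adaptive (pixel-wise re-weighted) L1 diffusion loss.
    pred_noise, true_noise: (B, C, H, W); mask: (B, 1, H, W) with 1 = polyp, 0 = background.
    The weighting uses only the binary mask values (0, 1), as noted on the slide."""
    num_pixels = mask.shape[-2] * mask.shape[-1]
    polyp_ratio = mask.sum(dim=(-2, -1), keepdim=True) / num_pixels  # (B, 1, 1, 1)
    bg_ratio = 1.0 - polyp_ratio
    # Small polyps get large weights; the (large) background gets small weights.
    weights = mask * bg_ratio + (1.0 - mask) * polyp_ratio
    return (weights * (pred_noise - true_noise).abs()).mean()

# toy usage
pred, target = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.9).float()
loss = adaptive_l1_loss(pred, target, mask)
```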
53. Model Architecture
• Mask Conditioning
• Adaptive Loss Function
• Refinement
  • A pre-trained segmentation model is used to fine-tune the diffusion model, in which the U-Net parameters are updated while the segmentation model's parameters are kept fixed (a sketch follows below).
  • For each time step, we need to sample an image, which is time-consuming.
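The refinement step can be sketched as follows; the function names and the differentiable sampling interface are assumptions, and only the diffusion U-Net receives gradients.

```python
import torch
import torch.nn as nn

def refinement_loss(diffusion_unet: nn.Module, seg_model: nn.Module,
                    sample_fn, mask: torch.Tensor) -> torch.Tensor:
    """Sketch of the refinement objective (sample_fn is a hypothetical, differentiable
    sampling routine conditioned on the GT mask). A frozen pre-trained segmentation model
    (e.g. PraNet) predicts a mask from the sampled image, and its disagreement with the
    GT mask back-propagates into the diffusion U-Net only."""
    for p in seg_model.parameters():   # the segmentation model stays fixed
        p.requires_grad_(False)

    synth_image = sample_fn(diffusion_unet, condition=mask)  # sampled, still differentiable
    pred_logits = seg_model(synth_image)                     # (B, 1, H, W) mask logits
    return nn.functional.binary_cross_entropy_with_logits(pred_logits, mask.float())
```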
54. Experimental Settings
Diffusion Training
• Training Set: Kvasir + CVC-ClinicDB (1450 image-mask pairs)
• Image Size: padded to the same height and width, then resized to 384 × 384
• Duration:
  • with refinement: around one-half NVIDIA A100 days (80 GB memory)
  • w/o refinement: around one A100 day
Diffusion Sampling
• DDIM sampler with 200 steps (the deterministic update it iterates is shown below)
• Random noise as input and the mask as the condition
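For reference, the deterministic DDIM update (η = 0) that such a sampler iterates, with the mask passed as the condition c, is the textbook form below (not code taken from the ArSDM release):

```latex
% Standard deterministic DDIM update (eta = 0), conditioned on the mask c.
\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t, c)}{\sqrt{\bar{\alpha}_t}},
\qquad
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{x}_0
        + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(x_t, t, c)
```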
55. Comparison Results
Polyp Segmentation
56. Comparison Results
Polyp Detection
57. Visualization
58. Colonoscopy Video Generation
with Diffusion Models
59. PVDM
Train the autoencoder on Sky Time-lapse, from scratch
Dataset: Sky Time-lapse (997 videos; 1,172,641 frames in total)
Training: 1 V100, 1 day
[Figure: input frames vs. reconstructions]
60. PVDM
Train the autoencoder on LDPolyp, initialized with the Sky Time-lapse weights
Dataset: LDPolyp (100 videos; 24,789 frames in total)
Training: 1 V100, 1.6 days
[Figure: input frames vs. reconstructions]
61. LVDM
Train the 2D autoencoder on LDPolyp, initialized with ImageNet-pretrained weights
Dataset: LDPolyp (100 videos; 24,789 frames in total)
Training: 1 V100, 1 day
[Figure: input frames vs. reconstructions]
62. LVDM-2
Train an unconditional diffusion model on LDPolyp (LVDM 5.1 release code)
Dataset: LDPolyp (100 videos; 24,789 frames in total)
Training: 1× 80 GB A100, 1 day
Details: attention layers added to part of the 3D U-Net; the parameter count is large, so batch_size = 2 already takes 48 GB of GPU memory
[Figure: generated samples]
63. Takeaways and Next Steps
• Further optimize multi-modal parsing and generation in 3D scenes
• Combine video diffusion to enhance talking-face results
• Use condition masks for video-diffusion generation in medical imaging scenarios