U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
1. U-Mind: A Unified Framework for Real-Time
Multimodal Interaction with Audiovisual Generation
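2. Intelligent Dialogue-Interactive Video Generation
Background: Existing multimodal large models can already handle mappings among modalities such as text, images, and speech, but generating vivid, consistent human motion at the same time remains largely unexplored
Goal:
Accept user input in multiple modalities (speech, text, motion, etc.) and return a reply in multiple modalities (speech, text, motion, etc.)
Further, understand user instructions and respond with the corresponding actions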
3. Intelligent Dialogue-Interactive Video Generation
Related work: SIGGRAPH 2024 Semantic Gesticulator: Semantics-Aware Co-Speech Gesture
Synthesis (~90h motion data)
Step1: generate coherent gestures
(~80h motion data)
Step2: semantic gesture retrieval
(~7h motion data)
Step3: merge gesture
4. Intelligent Dialogue-Interactive Video Generation
Related work: CVPR 2025 SOLAMI: Social Vision-Language-Action Modeling for Immersive
Interaction with 3D Autonomous Characters
Step1: Multi-task Pre-training for Modality Alignment
(46k motion-text pairs; 11k interactive motion pairs;
410k speech-text pairs; 100k speech dialogue pairs)
Step2: Instruction Tuning for Multi-turn Conversation
(5.7k multimodal interaction data)
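5. Intelligent Dialogue-Interactive Video Generation
Task: Given a user question and a human reference image, generate a dialogue reply and body motion, and output an audio-visually synchronized, high-fidelity video stream
Background: Existing motion generation models act only as "translators" and lack dialogue and reasoning abilities
Background: Existing interaction frameworks stop at instruction-following motion generation and lack intelligent understanding and video rendering capabilities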
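7. Intelligent Dialogue-Interactive Video Generation
Challenge: Achieving multimodal (text, speech, motion) alignment while retaining intelligent reasoning ability
Method: A multimodal generation framework guided by a textual chain of thought
The model is forced to first generate a chain of thought that plans the overall response, then generate the text, speech, and motion replies in sequence, unlocking complex instruction following and instruction alignment
An LLM-driven data synthesis pipeline is proposed: prosodic segmentation strengthens fine-grained alignment, and conversational rewriting confers instruction-following ability, converting existing modality-alignment data into dynamic interaction data.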
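The staged generation order (chain of thought first, then text, speech, and motion) can be illustrated with a small parser. This is a minimal sketch under assumptions: the tag names `<think>`, `<text>`, `<speech>`, `<motion>` and the function `split_staged_response` are hypothetical, the chain-of-thought sentence in the example is made up, and the slides do not specify the actual token format.

```python
import re

# Hypothetical stage tags, in the order the slide describes:
# chain of thought first, then text, speech, and motion.
STAGE_ORDER = ["think", "text", "speech", "motion"]

_SEGMENT = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def split_staged_response(raw: str) -> dict:
    """Split a tagged model output into stages and check their order.

    Raises ValueError if a stage is missing, unknown, or out of order.
    """
    segments = _SEGMENT.findall(raw)
    names = [name for name, _ in segments]
    if names != STAGE_ORDER:
        raise ValueError(f"expected stages {STAGE_ORDER}, got {names}")
    return {name: body.strip() for name, body in segments}


# Illustrative output; the speech and motion bodies are placeholders
# for whatever discrete tokens a real model would emit.
example = (
    "<think>User wants a celebration; plan a short cheer and a clap.</think>"
    "<text>Oh, nice to hear that. I'm so happy!</text>"
    "<speech>[speech tokens]</speech>"
    "<motion>[motion tokens]</motion>"
)
stages = split_staged_response(example)
print(list(stages))
```

Checking the order at parse time makes the constraint explicit: a response whose plan arrives after the reply is rejected rather than silently accepted.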
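8. Intelligent Dialogue-Interactive Video Generation
Challenge: Directly generating high-resolution video is computationally expensive and hard to run in real time
Goal: Substantially reduce generation latency while preserving high-definition quality, supporting real-time video output
Method: A two-stage latent-space super-resolution framework
Latent Upsampler: train a lightweight 3D convolutional network that performs 2× spatial upsampling in the VAE latent space; a global residual skip connection and a perceptual loss provide a high-quality structural prior, replacing blurry bilinear interpolation
OmniSR Modified Flow refinement: starting from the noised low-resolution latent, explicitly learn the vector field of the SR mapping, so that only a few denoising steps are needed for high-definition refinement, effectively suppressing artifacts while maintaining temporal consistency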
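The two stages can be sketched on a toy example. This is one plausible reading, not the authors' implementation. Everything here is an assumption made for illustration: the latent is a 2D grid of floats rather than a video latent, nearest-neighbour repetition stands in for the interpolation baseline, `residual_fn` stands in for the trained 3D convolutional network, and `oracle_velocity` stands in for the learned vector field (it peeks at a known target, which a real model cannot do).

```python
import random


def upsample2x(latent):
    """Nearest-neighbour 2x spatial upsampling of a 2D grid."""
    out = []
    for row in latent:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def upsample_with_global_skip(latent, residual_fn):
    """Stage 1: interpolated baseline plus a predicted residual (global skip)."""
    base = upsample2x(latent)
    residual = residual_fn(base)  # stand-in for the trained network
    return [[b + r for b, r in zip(b_row, r_row)]
            for b_row, r_row in zip(base, residual)]


def add_noise(latent, sigma, rng):
    """Perturb the upsampled latent so refinement starts from a noised point."""
    return [[v + sigma * rng.gauss(0.0, 1.0) for v in row] for row in latent]


def refine(start, velocity_fn, t_start, num_steps):
    """Stage 2: Euler integration of dx/dt = velocity_fn(x, t) from t_start to 1."""
    x = [row[:] for row in start]
    dt = (1.0 - t_start) / num_steps
    for step in range(num_steps):
        t = t_start + step * dt
        v = velocity_fn(x, t)
        x = [[xi + dt * vi for xi, vi in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x


# Toy usage: a 2x2 "low-resolution latent" and a known 4x4 target.
low_res = [[1.0, 2.0], [3.0, 4.0]]
target = [[v + 0.1 for v in row] for row in upsample2x(low_res)]

coarse = upsample_with_global_skip(
    low_res, residual_fn=lambda base: [[0.05] * len(r) for r in base])
noisy = add_noise(coarse, sigma=0.1, rng=random.Random(0))


def oracle_velocity(x, t):
    # Straight-line field pointing at the known target; a trained model
    # would have to predict this without access to the target.
    return [[(g - xi) / (1.0 - t) for xi, g in zip(x_row, g_row)]
            for x_row, g_row in zip(x, target)]


refined = refine(noisy, oracle_velocity, t_start=0.6, num_steps=4)
max_err = max(abs(a - b)
              for ra, rb in zip(refined, target) for a, b in zip(ra, rb))
print(f"max error after 4 steps: {max_err:.2e}")
```

Starting the integration at an intermediate time (`t_start=0.6` here, an arbitrary choice) rather than from pure noise is what keeps the step count small: the upsampled latent already carries most of the structure, so the flow only has to cover the remaining distance.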
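9. Research Content: Intelligent Interactive Human Video Generation
Data construction
Goal: Build high-quality multimodal dialogue data so that the model can understand user intent and instructions and generate reasonable replies with corresponding motions
Internet video data & audio-motion datasets: treat the existing data as the answer, and construct the question and chain of thought
Text-motion datasets: use the motion text description as the basis, and construct the Q&A and chain of thought (covering two modes: instruction following and situational dialogue)
About 300k samples constructed in total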
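The two construction routes can be sketched as follows. This is a minimal sketch under assumptions: the `DialogueSample` fields, the function names, the prompt wording, and the `llm` callable are all hypothetical, and `fake_llm` is only a placeholder so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DialogueSample:
    question: str          # synthesized user turn
    chain_of_thought: str  # synthesized plan that precedes the answer
    answer_text: str       # existing transcript or synthesized reply
    motion_id: str         # reference to the existing motion clip
    mode: str              # which construction route produced the sample


def from_speech_motion_clip(transcript: str, motion_id: str,
                            llm: Callable[[str], str]) -> DialogueSample:
    """Treat an existing clip as the answer; ask the LLM for a question and a plan."""
    question = llm(f"Write a question that this reply answers: {transcript}")
    plan = llm(f"Plan how to answer '{question}' with the reply: {transcript}")
    return DialogueSample(question, plan, transcript, motion_id, "answer_as_given")


def from_text_motion_pair(description: str, motion_id: str, mode: str,
                          llm: Callable[[str], str]) -> DialogueSample:
    """Start from a motion description; ask the LLM for the Q&A and a plan."""
    if mode not in ("instruction_following", "situational_dialogue"):
        raise ValueError(f"unknown mode: {mode}")
    question = llm(f"[{mode}] Write a user turn that leads to this motion: {description}")
    answer = llm(f"[{mode}] Write a spoken reply accompanying this motion: {description}")
    plan = llm(f"Plan the reply '{answer}' and the motion: {description}")
    return DialogueSample(question, plan, answer, motion_id, mode)


def fake_llm(prompt: str) -> str:
    # Placeholder so the sketch runs offline; a real pipeline would call an LLM here.
    return f"<generated for: {prompt[:40]}>"


sample = from_text_motion_pair("a person claps their hands", "clip_0001",
                               "instruction_following", fake_llm)
print(sample.mode, sample.motion_id)
```

In both routes the motion itself is never synthesized; only the surrounding question, reply text, and chain of thought are generated, which is how existing alignment data becomes dialogue data.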
Figure panels: internet video data; audio-motion datasets; text-motion datasets
10. Intelligent Dialogue-Interactive Video Generation
What do you think is the best job for you?
🔊: “In my opinion, I think the most ideal career for me is to be a
journalist. You know, I have dreamed to be a journalist since I was
still in middle school, and I always wanted to be a person who tells
the truth.”
What is your favorite type of movie?
🔊: “My favorite kind of movies are romantic films such as Titanic.
It's a fantastic film. It captures many young people's heart with its
amazing music and sentimentality.”
11. Intelligent Dialogue-Interactive Video Generation
Clap your hands to celebrate the good news!
🔊: “Oh, nice to hear that. I’m so happy!”
Can you pretend to play an imaginary guitar by
lifting your arms and strumming?
🔊: “I'm starting to play a guitar now, using my left arm to hold the
neck steady while I play with my left leg.”
12. Research Content: Intelligent Interactive Human Video Generation
A figure in black crouches on the park path, anchored by
a laughing child. They move in sync toward a nearby
flowerbed, their eyes fixed on a dancing butterfly.
Look! The butterfly’s flying toward the flowers,
can you catch it?
🔊: “I’ve got you, sweetheart. Just let me hold your hand a little
longer; don’t worry, we’ll catch it before it lands!”
Dawn light filters onto a deserted city street, where the
early morning silence is yet to be broken by the rush of
the day.
You’re going to be late for the race if you don’t
get moving!
🔊: “I’m warming up, just getting ready to hit the pace. Don’t
worry.”