Active Intelligence in Video Avatars via Closed-loop World Modeling

1. Active Intelligence in Video Avatars via Closed-loop World Modeling
2. Motivation • Current methods produce passive motions with limited semantic understanding. • Our Online Reasoning and Cognitive Architecture (ORCA) enables complex, multi-step task execution through OTAR (Observe-Think-Act-Reflect) closed-loop reasoning.
3. Task Definition • Long-horizon, Interactive Visual Avatar (L-IVA)
4. Online Reasoning and Cognitive Architecture • ORCA operates in an Observe-Think-Act-Reflect loop where predicted states are continuously verified against actual outcomes, triggering re-generation when mismatched.
5. Observe-Think-Act-Reflect (OTAR)
6. Hierarchical Dual-System Architecture • System 2 for strategic reasoning (Observe, Think, Reflect) • System 1 for action grounding (Act)
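The closed loop described in slides 4 to 6 can be illustrated with a toy sketch. This is not the authors' implementation: `BeliefState`, `run_otar`, `generate_clip`, `matches`, and `max_retries` are hypothetical names, the video generator is replaced by a plain callable, and outcomes are plain strings. It only shows the control flow: predict an outcome, act, compare the prediction with what was actually produced, and re-generate on a mismatch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class BeliefState:
    """Hypothetical record of which sub-goals have been completed."""
    completed: List[str] = field(default_factory=list)


def run_otar(
    subgoals: List[str],
    generate_clip: Callable[[str, int], str],
    matches: Callable[[str, str], bool],
    max_retries: int = 2,
) -> Tuple[BeliefState, List[Tuple[str, int, bool]]]:
    """Toy Observe-Think-Act-Reflect loop.

    generate_clip(command, attempt) stands in for an I2V model and returns
    an observed outcome; matches(predicted, observed) stands in for Reflect.
    """
    belief = BeliefState()
    log = []
    for goal in subgoals:
        # Observe + Think (System 2): take the next sub-goal and predict
        # the state expected after it succeeds.
        predicted = f"{goal}:done"
        for attempt in range(max_retries + 1):
            # Act (System 1): ground the sub-goal into a generator command.
            command = f"avatar performs: {goal}"
            observed = generate_clip(command, attempt)
            # Reflect (System 2): verify the prediction against the outcome.
            ok = matches(predicted, observed)
            log.append((goal, attempt, ok))
            if ok:
                belief.completed.append(goal)  # update the belief state
                break
            # Mismatch: loop again, which re-generates the clip.
    return belief, log


def flaky_generator(command: str, attempt: int) -> str:
    """Stand-in generator that fails the first attempt at 'pour water'."""
    goal = command.split(": ", 1)[1]
    if goal == "pour water" and attempt == 0:
        return f"{goal}:failed"
    return f"{goal}:done"


belief, log = run_otar(
    ["pick up cup", "pour water"],
    flaky_generator,
    lambda predicted, observed: predicted == observed,
)
print(belief.completed)
print(log)
```

In this sketch the failed first attempt at the second sub-goal is caught by the Reflect check and retried, so the belief state records a sub-goal only after it has been verified.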
7. The L-IVA Benchmark: Goal-Directed Evaluation • 5 categories • 92 synthetic and 8 real images • Most tasks need 5 sub-goals to complete
8. Evaluation Metrics • Task Success Rate (TSR): The primary metric for measuring sub-goal completion progress, defined as [formula not preserved] • Best-Worst Scaling (BWS): A human preference ranking metric used to assess the overall quality of agent performance in a robust comparative manner. • Physical Plausibility Score (PPS): A human-rated diagnostic metric that evaluates physical realism, including object permanence and spatial consistency. • Action Fidelity Score (AFS): A VLM-based diagnostic metric that measures the semantic alignment between commands and video clips.
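The slides do not say how BWS judgments are aggregated. A common counting-based convention scores each item as (times chosen best minus times chosen worst) divided by times shown, giving a value in [-1, 1]. The sketch below assumes that convention; `bws_scores`, the tuple layout, and the items "A", "B", "C" are hypothetical and the judgments are made-up illustration data, not results from the work.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Judgment = Tuple[Sequence[str], str, str]  # (items shown, best, worst)


def bws_scores(judgments: Iterable[Judgment]) -> Dict[str, float]:
    """Counting-based Best-Worst Scaling: (best - worst) / shown per item."""
    best, worst, shown = Counter(), Counter(), Counter()
    for items, best_item, worst_item in judgments:
        shown.update(items)      # every displayed item counts as shown once
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Made-up judgments over three hypothetical systems.
example = [
    (("A", "B", "C"), "A", "C"),
    (("A", "B", "C"), "A", "B"),
    (("A", "B", "C"), "B", "C"),
]
scores = bws_scores(example)
print(scores)
```

Here "A" is picked best twice and never worst, so it scores 2/3; "B" is picked best once and worst once, so it scores 0; "C" is picked worst twice, so it scores -2/3.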
9. Experiments
10. Qualitative Results: Video Comparison • Methods compared: Open-Loop, Reactive, VAGEN, ORCA • Cases shown: Kitchen, Garden
11. Ablation Study: Validating Design Choices • World Modeling: Removing Belief State tracking causes the most severe TSR degradation. The agent cannot track completed sub-goals, which leads to repetitive actions. • Closed-loop Verification: Removing Reflect significantly degrades human preference (BWS). Incorrect generations corrupt subsequent steps and visual identity. • Hierarchical Action: System 1 is essential for execution precision. It bridges the gap between high-level reasoning and the specific requirements of I2V models.
12. Conclusion & Future Work • L-IVA Benchmark: The first benchmark for evaluating goal-directed planning in stochastic generative environments. • ORCA Framework: A novel closed-loop architecture enabling active intelligence via the OTAR cycle. • SOTA Results: Significant improvements in task success rate and physical plausibility over existing baselines. • Looking Ahead: We aim to extend ORCA to multi-agent collaboration and real-time interactive scenarios, further bridging the gap between passive animation and active intelligence.
