Active Intelligence in Video Avatars via Closed-loop World Modeling


1. Active Intelligence in Video Avatars via Closed-loop World Modeling

2. Motivation • Current methods produce passive motions with limited semantic understanding. • Our Online Reasoning and Cognitive Architecture (ORCA) enables complex, multi-step task execution through OTAR (Observe-Think-Act-Reflect) closed-loop reasoning.

3. Task Definition • Long-horizon, Interactive Visual Avatar (L-IVA)

4. Online Reasoning and Cognitive Architecture • ORCA operates in an Observe-Think-Act-Reflect loop where predicted states are continuously verified against actual outcomes, triggering re-generation when mismatched.
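
The loop described on this slide can be sketched roughly as follows. This is a minimal illustration, not ORCA's implementation: the `BeliefState` class, the `otar_loop` function, the four callables, and the `max_retries` cap are all hypothetical names chosen for the example, and the toy `run_demo` stands in for a real VLM and video generator.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BeliefState:
    """Hypothetical belief state: the sub-goals believed to be complete."""
    completed: List[str] = field(default_factory=list)


def otar_loop(subgoals, observe, think, act, reflect, max_retries=3):
    """Run Observe-Think-Act-Reflect over a list of sub-goals.

    A sub-goal is marked complete only when the predicted state matches the
    actual outcome. On a mismatch the step is re-generated, up to
    max_retries attempts (the cap is an assumption of this sketch).
    """
    belief = BeliefState()
    log = []
    for goal in subgoals:
        for attempt in range(1, max_retries + 1):
            observation = observe()                                 # Observe
            command, predicted = think(observation, goal, belief)   # Think
            outcome = act(command)                                  # Act
            matched = reflect(predicted, outcome)                   # Reflect
            log.append((goal, attempt, matched))
            if matched:
                belief.completed.append(goal)
                break
            # Mismatch: fall through and re-generate this step.
    return belief, log


def run_demo():
    """Toy environment in which the first 'pour water' generation fails."""
    failures_left = {"pour water": 1}

    def observe():
        return "current frame"

    def think(observation, goal, belief):
        return "do: " + goal, goal  # (command, predicted state)

    def act(command):
        goal = command[len("do: "):]
        if failures_left.get(goal, 0) > 0:
            failures_left[goal] -= 1
            return "no visible change"
        return goal

    def reflect(predicted, outcome):
        return predicted == outcome

    return otar_loop(["pick up kettle", "pour water"],
                     observe, think, act, reflect)
```

In the toy run, the second sub-goal is attempted twice: the first generation is rejected by `reflect`, and the retry succeeds, so both sub-goals end up in `belief.completed`.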

5. Observe-Think-Act-Reflect (OTAR)

6. Hierarchical Dual-System Architecture • System 2 for strategic reasoning (Observe, Think, Reflect) • System 1 for action grounding (Act)
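
One way to picture the split between the two systems is as a hand-off from a structured intent to a generator-ready prompt. The sketch below is illustrative only: the `Intent` fields, the `system1_ground` function, and the prompt template are assumptions made for this example and are not taken from ORCA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Hypothetical output of System 2: what to do, stated abstractly."""
    verb: str
    obj: str
    target: str = ""


def system1_ground(intent: Intent, avatar: str = "the avatar") -> str:
    """Hypothetical System 1: turn an abstract intent into a concrete
    caption of the kind an image-to-video model might accept."""
    parts = [f"{avatar} {intent.verb}s the {intent.obj}"]
    if intent.target:
        parts.append(f"into the {intent.target}")
    # The trailing constraints are example wording, not a known requirement.
    return " ".join(parts) + ", static camera, consistent identity."
```

For example, `system1_ground(Intent("pour", "water", "kettle"))` yields `"the avatar pours the water into the kettle, static camera, consistent identity."`. The point of the separation is that the planner never needs to know the prompt format, and the grounding step never needs to know the overall task.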

7. The L-IVA Benchmark: Goal-Directed Evaluation • 5 categories • 92 synthetic and 8 real images • Most tasks need 5 sub-goals to complete

8. Evaluation Metrics • Task Success Rate (TSR): The primary metric for measuring sub-goal completion progress, defined as • Best-Worst Scaling (BWS): A human preference ranking metric used to assess the overall quality of agent performance in a robust comparative manner. • Physical Plausibility Score (PPS): A human-rated diagnostic metric that evaluates physical realism, including object permanence and spatial consistency. • Action Fidelity Score (AFS): A VLM-based diagnostic metric that measures the semantic alignment between commands and video clips.
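
The slide does not say how BWS judgments are aggregated. A common choice is the counting score, (times chosen best minus times chosen worst) divided by times shown, and the sketch below assumes that variant. The function name `bws_scores`, the judgment tuple layout, and the method labels "A" to "D" are made up for illustration and carry no results from the study.

```python
from collections import Counter


def bws_scores(judgments):
    """Counting-based Best-Worst Scaling score per method.

    judgments: iterable of (shown_methods, best, worst) tuples, one per
    annotator decision. Returns a dict mapping each method to a score in
    [-1, 1]. This aggregation rule is an assumption of the sketch.
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for methods, chosen_best, chosen_worst in judgments:
        shown.update(methods)
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {m: (best[m] - worst[m]) / shown[m] for m in shown}


# Toy judgments over four placeholder methods (not real data).
toy_judgments = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "D"), "A", "C"),
    (("A", "B", "C", "D"), "B", "D"),
    (("A", "B", "C", "D"), "A", "D"),
]
toy_scores = bws_scores(toy_judgments)
```

With these toy judgments, method "A" is picked best three times out of four and never worst, so it scores 0.75, while "D" is picked worst three times and scores -0.75.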

9. Experiments

10. Qualitative Results: Video Comparison • Kitchen case: Open-Loop, Reactive, VAGEN, ORCA • Garden case: Open-Loop, Reactive, VAGEN, ORCA

11. Ablation Study: Validating Design Choices • World Modeling • Removing Belief State tracking causes the most severe TSR degradation. Without it, the agent cannot track completed sub-goals, leading to repetitive actions. • Closed-loop Verification • Removing Reflect significantly degrades human preference (BWS). Incorrect generations corrupt subsequent steps and visual identity. • Hierarchical Action • System 1 is essential for execution precision. It bridges the gap between high-level reasoning and the specific requirements of I2V models.

12. Conclusion & Future Work • L-IVA Benchmark: The first benchmark for evaluating goal-directed planning in stochastic generative environments. • ORCA Framework: A novel closed-loop architecture enabling active intelligence via the OTAR cycle. • SOTA Results: Significant improvements in task success rate and physical plausibility over existing baselines. • Looking Ahead: We aim to extend ORCA to multi-agent collaboration and real-time interactive scenarios, further bridging the gap between passive animation and active intelligence.