Active Intelligence in Video Avatars via Closed-loop World Modeling
1. Active Intelligence in Video Avatars via Closed-loop World Modeling
2. Motivation
• Current methods produce passive motions with limited semantic understanding.
• Our Online Reasoning and Cognitive Architecture (ORCA) enables complex, multi-step task
execution through OTAR (Observe-Think-Act-Reflect) closed-loop reasoning.
3. Task Definition
• Long-horizon, Interactive Visual Avatar (L-IVA)
4. Online Reasoning and Cognitive Architecture
• ORCA operates in an Observe-Think-Act-Reflect loop where predicted states are continuously
verified against actual outcomes, triggering re-generation when mismatched.
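The closed loop described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the ORCA implementation: `otar_loop`, `predict`, `generate`, `max_retries`, and the string-equality check in the Reflect step are all hypothetical stand-ins for the reasoning model, the video generator, and the verification step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    subgoal: str
    attempts: int    # how many generations were needed
    verified: bool   # whether the final outcome matched the prediction


def otar_loop(
    subgoals: List[str],
    predict: Callable[[str, str], str],        # hypothetical: (state, subgoal) -> expected state
    generate: Callable[[str, str, int], str],  # hypothetical: (state, subgoal, attempt) -> actual state
    max_retries: int = 2,
) -> List[StepResult]:
    """Run each sub-goal through Observe-Think-Act-Reflect with re-generation."""
    state = "initial"
    results = []
    for subgoal in subgoals:
        observed = state                       # Observe: read the current state
        expected = predict(observed, subgoal)  # Think: predict the next state
        attempts, verified, actual = 0, False, observed
        while attempts <= max_retries and not verified:
            attempts += 1
            actual = generate(observed, subgoal, attempts)  # Act: stand-in for video generation
            verified = actual == expected                   # Reflect: predicted vs. actual
        if verified:
            state = actual                     # only verified outcomes update the state
        results.append(StepResult(subgoal, attempts, verified))
    return results


# Toy stand-ins: the generator fails once on "pour water", which forces a retry.
def toy_predict(state: str, subgoal: str) -> str:
    return f"{state}>{subgoal}"


def toy_generate(state: str, subgoal: str, attempt: int) -> str:
    if subgoal == "pour water" and attempt == 1:
        return f"{state}>???"
    return f"{state}>{subgoal}"


results = otar_loop(["pick up cup", "pour water"], toy_predict, toy_generate)
```

In this toy run the first sub-goal is verified on the first attempt, while the second needs one re-generation before the predicted and actual states agree.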
5. Observe-Think-Act-Reflect (OTAR)
6. Hierarchical Dual-System Architecture
• System 2 for strategic reasoning
Observe, Think, Reflect
• System 1 for action grounding
Act
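The split between the two systems can be written down as a simple stage-to-system mapping. The sketch below is only an illustration: `Stage`, `SYSTEM_FOR_STAGE`, and `ground_action` are hypothetical names, and the caption template inside `ground_action` is an invented placeholder for whatever grounding System 1 actually performs.

```python
from enum import Enum


class Stage(Enum):
    OBSERVE = "observe"
    THINK = "think"
    ACT = "act"
    REFLECT = "reflect"


# Assignment taken from the slide: System 2 reasons, System 1 acts.
SYSTEM_FOR_STAGE = {
    Stage.OBSERVE: "System 2",
    Stage.THINK: "System 2",
    Stage.ACT: "System 1",
    Stage.REFLECT: "System 2",
}


def ground_action(subgoal: str, subject: str = "the avatar") -> str:
    """Hypothetical System 1 stand-in: expand a terse sub-goal into a fuller caption."""
    return f"{subject} {subgoal}; keep the subject's appearance unchanged"


caption = ground_action("picks up the cup")
```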
7. The L-IVA Benchmark: Goal-Directed Evaluation
• 5 categories
• 92 synthetic and 8 real images
• Most tasks need 5 sub-goals to complete
8. Evaluation Metrics
• Task Success Rate (TSR): The primary metric for measuring sub-goal completion
progress, defined as
• Best-Worst Scaling (BWS): A human preference ranking metric used to assess the
overall quality of agent performance in a robust comparative manner.
• Physical Plausibility Score (PPS): A human-rated diagnostic metric that evaluates
physical realism, including object permanence and spatial consistency.
• Action Fidelity Score (AFS): A VLM-based diagnostic metric that measures the
semantic alignment between commands and video clips.
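Two of the metrics above lend themselves to a short numerical sketch. The TSR formula is not preserved in this text, so `task_success_rate` below assumes the simplest reading (completed sub-goals divided by total sub-goals), which may differ from the authors' definition. `bws_scores` uses the standard Best-Worst Scaling counting score, (times chosen best minus times chosen worst) divided by times shown. The function names and the placeholder method labels "A" to "D" are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def task_success_rate(completed: int, total: int) -> float:
    """Assumed form of TSR: fraction of sub-goals completed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return completed / total


def bws_scores(
    judgments: Iterable[Tuple[List[str], str, str]]
) -> Dict[str, float]:
    """Counting score per item: (#best - #worst) / #appearances.

    Each judgment is (items shown, item picked as best, item picked as worst).
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for items, picked_best, picked_worst in judgments:
        shown.update(items)
        best[picked_best] += 1
        worst[picked_worst] += 1
    return {m: (best[m] - worst[m]) / shown[m] for m in shown}


tsr = task_success_rate(completed=3, total=5)

# Two raters each compare the same four placeholder methods.
scores = bws_scores([
    (["A", "B", "C", "D"], "D", "A"),
    (["A", "B", "C", "D"], "D", "B"),
])
```

Under the counting score, a method that is always picked best scores 1.0, one that is always picked worst scores -1.0, and one that is never picked either way scores 0.0.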
9. Experiments
10. Qualitative Results: Video Comparison
[Video comparison of Open-Loop, Reactive, VAGEN, and ORCA on the kitchen case and the garden case]
11. Ablation Study: Validating Design Choices
• World Modeling
• Removing Belief State tracking causes the most
severe TSR degradation. The agent cannot track
completed sub-goals, leading to repetitive actions.
• Closed-loop verification
• Removing Reflect significantly degrades Human
Preference (BWS). Incorrect generations corrupt
subsequent steps and visual identity.
• Hierarchical Action
• System 1 is essential for execution precision. It
bridges the gap between high-level reasoning and
the specific requirements of I2V models.
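The failure mode named in the World Modeling ablation, repeating actions when completed sub-goals are not tracked, can be reproduced with a toy belief state. This is a hedged sketch only: `BeliefState`, `run`, and the `track_belief` flag are hypothetical and say nothing about how ORCA represents its belief state internally.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class BeliefState:
    plan: List[str]
    completed: Set[str] = field(default_factory=set)

    def next_subgoal(self) -> Optional[str]:
        """Return the first sub-goal in the plan not yet marked as done."""
        for subgoal in self.plan:
            if subgoal not in self.completed:
                return subgoal
        return None

    def mark_done(self, subgoal: str) -> None:
        self.completed.add(subgoal)


def run(plan: List[str], steps: int, track_belief: bool = True) -> List[str]:
    """Execute up to `steps` actions; the ablation skips the belief update."""
    belief = BeliefState(plan)
    executed = []
    for _ in range(steps):
        subgoal = belief.next_subgoal()
        if subgoal is None:
            break
        executed.append(subgoal)
        if track_belief:
            belief.mark_done(subgoal)
    return executed


plan = ["open fridge", "take milk", "pour milk"]
with_belief = run(plan, steps=3, track_belief=True)
without_belief = run(plan, steps=3, track_belief=False)
```

With tracking enabled the toy agent walks through the plan in order. Without it, the same first sub-goal is selected at every step.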
12. Conclusion & Future Work
• L-IVA Benchmark: The first benchmark for evaluating goal-directed
planning in stochastic generative environments.
• ORCA Framework: A novel closed-loop architecture enabling
active intelligence via the OTAR cycle.
• SOTA Results: Significant improvements in task success rate and
physical plausibility over existing baselines.
Looking Ahead
We aim to extend ORCA to multi-agent collaboration and real-time
interactive scenarios, further bridging the gap between passive
animation and active intelligence.