UniComp: Rethinking Video Compression Through Informational Uniqueness


1. UniComp: Rethinking Video Compression Through Informational Uniqueness CVPR Paper Presentation

2. Motivation: The Need for Uniqueness-Driven Compression
■ Quadratic Attention Cost: Dense video tokens (e.g., 32×256) cause huge compute bottlenecks.
■ Redundancy in Attention: Attention-based methods emphasize salient yet often repetitive content, missing fine-grained diversity.
■ Failure at Extreme Compression: At ≤10% retention, many models drop severely in accuracy.
■ Our Solution: UniComp preserves unique information and stays robust even at 5% retention.
Figure: (Left) UniComp answers correctly at 5% retention. (Right) Performance across retention ratios.

3. Core Insight: Attention ≠ Information
Attention-Based Selection: Traditional methods cluster tokens around highly salient objects. While highlighting primary subjects, this causes severe redundancy in focal areas and leads to missing information in peripheral but contextually critical regions.
Uniqueness-Based Selection (UniComp): Our approach prioritizes tokens carrying diverse, irreplaceable information. By minimizing redundancy, it ensures broader spatial and temporal coverage, retaining the essential details needed for complex reasoning tasks.
Figure 2: Attention-based methods miss the third cup due to redundancy, while UniComp's uniqueness-based selection captures all necessary details for correct QA.

4. Theoretical Foundation: Information Uniqueness
Optimization Objective: Video compression is formulated as minimizing the conditional entropy of the full token set given the selected tokens, aiming to preserve maximal information.
Reconstruction Error: Assuming an isotropic Gaussian distribution, minimizing conditional entropy is mathematically equivalent to minimizing the reconstruction error of discarded tokens.
Information Uniqueness: We introduce uniqueness to measure intrinsic redundancy among tokens. It quantifies how much irreplaceable information a token carries compared to others.
Theoretical Upper Bound: We derive an upper bound linking reconstruction error directly to uniqueness. Minimizing this bound naturally leads to a greedy selection strategy based on information uniqueness.
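
One possible reading of the entropy-to-reconstruction step is sketched below. It is not taken from the paper: the symbols S (selected tokens), D (discarded tokens), \hat{x}_i (a reconstruction of a discarded token from S), d (token dimension), and \sigma^2 are notational assumptions, and the discarded tokens are treated as conditionally independent for simplicity.

```latex
% Sketch under stated assumptions; not the paper's exact derivation.
% Model each discarded token as an isotropic Gaussian around its reconstruction:
x_i \mid S \sim \mathcal{N}\big(\hat{x}_i(S),\ \sigma^2 I_d\big), \qquad i \in D
% Differential entropy of a d-dimensional isotropic Gaussian:
h(x_i \mid S) = \tfrac{d}{2}\,\log\big(2\pi e\,\sigma^2\big)
% Identify \sigma^2 with the mean per-dimension squared reconstruction error:
\sigma^2 = \frac{1}{d\,|D|} \sum_{i \in D} \big\lVert x_i - \hat{x}_i(S) \big\rVert_2^2
% h is monotonically increasing in \sigma^2, so choosing S to minimize the
% conditional entropy is the same as choosing S to minimize this error.
```

Under this reading, a selection rule that keeps hard-to-reconstruct (highly unique) tokens directly lowers the error term, which is consistent with the greedy strategy named on the slide.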

5. Framework Overview: UniComp
■ (1) Frame Group Fusion (FGF): Adaptively merges temporally redundant frames based on semantic uniqueness to reduce temporal redundancy.
■ (2) Token Allocation (TA): Dynamically allocates token budgets according to global frame uniqueness, distributing computation to informative content.
■ (3) Spatial Dynamic Compression (SDC): Greedily selects and fuses tokens within each frame based on token-level uniqueness to eliminate local redundancy.
Figure 3: The UniComp framework integrates temporal fusion, global allocation, and spatial compression under a unified principle.

6. Module 1: Frame Group Fusion (FGF)
■ Objective: Adaptively merge temporally redundant frames based on semantic uniqueness to reduce temporal redundancy.
■ Global Feature Extraction: Computes a global feature for each frame via average pooling of all visual tokens.
■ Sequential Grouping: Scans frames sequentially. If a frame's uniqueness difference from the group's representative frame is below threshold U_f, it is merged.
■ Mean Pooling Fusion: Each semantic group is fused into a single representative feature via mean pooling.
Result: In stable scenes, multiple frames are merged to suppress redundancy; in segments with large semantic transitions, grouping becomes finer to retain key dynamic information.
Figure: The Frame Group Fusion (FGF) process groups and fuses temporally redundant frames.
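
The grouping rule on this slide can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. It assumes that the "uniqueness difference" is 1 minus the cosine similarity between global features, that the first frame of a group acts as its representative, that all frames have the same number of tokens, and that fusion averages tokens position-wise across the group. The function names and the default `u_f` (picked from the 0.002 to 0.010 range reported later in the deck) are hypothetical.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def global_feature(frame):
    """Average-pool a frame (a list of token vectors) into one vector."""
    dim = len(frame[0])
    return [sum(tok[d] for tok in frame) / len(frame) for d in range(dim)]


def frame_group_fusion(frames, u_f=0.01):
    """Group consecutive similar frames, then mean-pool each group.

    Returns (groups, fused): groups holds frame indices per group, and
    fused holds one token list per group.
    """
    feats = [global_feature(f) for f in frames]
    groups = [[0]]
    for t in range(1, len(frames)):
        rep = feats[groups[-1][0]]  # assumption: first frame represents the group
        if 1.0 - cos_sim(feats[t], rep) < u_f:
            groups[-1].append(t)    # redundant: merge into the current group
        else:
            groups.append([t])      # semantic change: start a new group

    fused = []
    for g in groups:
        n_tok, dim = len(frames[g[0]]), len(frames[g[0]][0])
        # assumption: position-wise mean over the frames in the group
        fused.append([[sum(frames[i][k][d] for i in g) / len(g)
                       for d in range(dim)] for k in range(n_tok)])
    return groups, fused


frames = [
    [[1.0, 0.0], [1.0, 0.0]],  # frame 0
    [[1.0, 0.0], [1.0, 0.0]],  # frame 1: identical to frame 0
    [[0.0, 1.0], [0.0, 1.0]],  # frame 2: different content
]
groups, fused = frame_group_fusion(frames)
print(groups)
```

In this toy input the first two frames fall into one group and the third starts a new one, mirroring the "stable scene versus transition" behavior described in the Result line.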

7. Module 2: Token Allocation (TA)
Objective & Principle
■ Dynamic Distribution: Allocates the total token budget across fused frames based on their global semantic uniqueness.
■ High Uniqueness Frames: Represent significant semantic changes. Allocated more tokens to preserve critical, irreplaceable details.
■ Low Uniqueness Frames: Contain redundant or static information. Allocated fewer tokens, as missing details can be inferred from adjacent frames.
■ Re-allocation Strategy: If allocated tokens exceed a frame's maximum limit, the surplus is evenly redistributed to other frames to maximize budget utilization.
Allocation Mechanism
■ (1) Global Uniqueness Calculation: Compute uniqueness score for each fused frame representation: U_t = 1 - (1 / K_f) Σ_s cos_sim(f_t, f_s)
■ (2) Variance Scaling: Normalize and scale scores to stabilize distribution across varying frame counts: U'_t = (U_t - U_mean) × √K_f
■ (3) Softmax Distribution: Determine the proportional token budget K_t out of the total limit: K_t = Softmax(U'_t) × TOKEN_max
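
The three allocation steps and the re-allocation rule can be sketched as follows. This is a hedged sketch rather than the authors' code. It assumes that the sum in the uniqueness score runs over all fused frames (including the frame itself), that K_f is the number of fused frames, that budgets are rounded down to integers, and that surplus is shared evenly among frames still below their cap. The names `allocate_tokens` and `per_frame_cap` are hypothetical.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def allocate_tokens(frame_feats, token_max, per_frame_cap):
    """Split token_max across frames in proportion to softmax of scaled uniqueness."""
    k_f = len(frame_feats)

    # Step 1: U_t = 1 - (1 / K_f) * sum_s cos_sim(f_t, f_s)
    u = [1.0 - sum(cos_sim(f_t, f_s) for f_s in frame_feats) / k_f
         for f_t in frame_feats]

    # Step 2: U'_t = (U_t - U_mean) * sqrt(K_f)
    u_mean = sum(u) / k_f
    scaled = [(u_t - u_mean) * math.sqrt(k_f) for u_t in u]

    # Step 3: K_t = softmax(U')_t * TOKEN_max (max-subtracted for stability)
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    budget = [token_max * e / total for e in exps]

    # Re-allocation: clip frames at the cap and share the surplus evenly
    # among frames that still have room (assumed interpretation).
    for _ in range(k_f):
        surplus = sum(b - per_frame_cap for b in budget if b > per_frame_cap)
        if surplus <= 1e-9:
            break
        budget = [min(b, per_frame_cap) for b in budget]
        open_idx = [i for i, b in enumerate(budget) if b < per_frame_cap]
        if not open_idx:
            break
        for i in open_idx:
            budget[i] += surplus / len(open_idx)

    # Assumption: round down so the total never exceeds token_max.
    return [int(min(b, per_frame_cap)) for b in budget]


feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # two similar frames, one distinct
print(allocate_tokens(feats, token_max=300, per_frame_cap=196))
```

With these toy features the distinct third frame receives the largest share, and lowering `per_frame_cap` below its share pushes the surplus to the other two frames.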

8. Module 3: Spatial Dynamic Compression (SDC)
Objective & Token Uniqueness
■ Objective: Remove local spatial redundancy by keeping the most informative tokens per frame.
■ Token Uniqueness: Compute uniqueness per token using the last attention layer's Keys of the ViT.
■ Why Keys? Keys compactly summarize each token and have fewer dimensions for efficient representation.
Greedy Selection & Fusion
■ Greedy Selection: Rank tokens by uniqueness and pick top representatives iteratively.
■ Neighbor Fusion: Tokens within threshold U_c are fused into the representative instead of dropped.
Efficiency Optimization: We replaced the causal loop with matrix-level parallel computation, reducing compute by nearly 20× with no performance loss.
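
A sequential sketch of greedy selection with neighbor fusion is given below for readability; the slide notes that the actual method uses a matrix-level parallel formulation instead of a loop. Several details are assumptions, not taken from the source: token uniqueness is computed as 1 minus the mean cosine similarity of a token's key to all keys in the frame, "within threshold U_c" is read as a cosine distance below `u_c`, and fusion is a plain mean of the fused tokens. The function name is hypothetical; the default `u_c=0.2` is the setting reported in the deck.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def spatial_dynamic_compression(tokens, keys, budget, u_c=0.2):
    """Keep at most `budget` fused tokens for one frame."""
    n = len(tokens)

    # Assumed token uniqueness: 1 - mean cosine similarity of key i to all keys.
    uniq = [1.0 - sum(cos_sim(keys[i], keys[j]) for j in range(n)) / n
            for i in range(n)]

    # Visit tokens from most to least unique.
    order = sorted(range(n), key=lambda i: uniq[i], reverse=True)
    assigned = [False] * n
    kept = []
    for i in order:
        if len(kept) >= budget:
            break
        if assigned[i]:
            continue  # already fused into an earlier representative
        # Representative i absorbs every unassigned neighbor within u_c.
        members = [j for j in range(n)
                   if not assigned[j]
                   and (j == i or 1.0 - cos_sim(keys[i], keys[j]) < u_c)]
        for j in members:
            assigned[j] = True
        dim = len(tokens[i])
        kept.append([sum(tokens[j][d] for j in members) / len(members)
                     for d in range(dim)])
    return kept


tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
keys = tokens  # toy example: reuse the token vectors as keys
print(spatial_dynamic_compression(tokens, keys, budget=2))
```

In the toy input, the two near-duplicate tokens are fused into one representative while the distinct token is kept on its own, so the frame shrinks from three tokens to two without discarding either region outright.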

9. Main Results: State-of-the-Art Performance
■ Consistent Superiority: UniComp consistently surpasses baselines across LongVideoBench, EgoSchema, MLVU, and VideoMME.
■ High Retention (25%): Reaches 60.78% average accuracy, outperforming FastV by 2.18 points.
■ Extreme Compression (10%): Surpasses HoliTom by 0.9 points, proving exceptional robustness under strict token limits.
Table 1: Comparison of state-of-the-art video compression methods across benchmarks with long videos.

10. Scalability: Robustness Across Extensive Contexts
Performance with a large number of input frames (constrained to 6,272 tokens)
■ UniComp maintains strong performance as frame count scales from 32 to 320 frames.
■ At 320 frames with a 6,272-token limit (10% retention), it achieves 62.45% accuracy.
■ Outperforms HoliTom by 1.02 points at this extreme compression setting.
■ Demonstrates consistent improvement over all baselines across all frame counts.
Figure: Performance comparison across different frame counts (same token limit).
Key Takeaway: UniComp exhibits exceptional scalability. Even when compressing 320 frames into a strict limit of 6,272 tokens (a 10% retention ratio), it achieves 62.45% accuracy, outperforming HoliTom by 1.02 points.
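
The relationship between the fixed 6,272-token limit and the retention ratio can be checked with a few lines of arithmetic. The per-frame token count of 196 is not stated on the slide; it is inferred here from 6,272 / (0.10 × 320) and should be treated as an assumption. The intermediate frame counts (64, 128) are illustrative only.

```python
def effective_retention(num_frames, tokens_per_frame, token_budget):
    """Fraction of the uncompressed visual tokens that fits in the budget."""
    total = num_frames * tokens_per_frame
    return min(1.0, token_budget / total)


TOKEN_BUDGET = 6272      # from the slide
TOKENS_PER_FRAME = 196   # assumption: inferred as 6272 / (0.10 * 320)

for frames in (32, 64, 128, 320):
    r = effective_retention(frames, TOKENS_PER_FRAME, TOKEN_BUDGET)
    print(f"{frames:>3} frames: {r:.1%} retention")
```

Under that assumption, 32 frames fit in the budget uncompressed, and each tenfold increase in frames under the same limit cuts the retention ratio tenfold, which is why the 320-frame setting corresponds to 10% retention.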

11. Ablation Study: Internal Module Design
■ Objective: Validate FGF, TA, and SDC under 32-frame, 20% retention.
■ Component Synergy: The three modules work jointly; removing one drops performance.
■ Uniform vs. Adaptive: Uniform allocation degrades results versus TA.
■ Fusion vs. Dropping: Skipping neighbor fusion in SDC reduces accuracy.
Key Takeaway: Temporal fusion, global allocation, and spatial compression together best preserve information.
Table: Ablation study of FGF, TA, and SDC across benchmarks.

12. Ablation Study: Efficiency & Hyper-parameters
Efficiency (TTFT)
■ UniComp achieves up to 4.15× faster Time-To-First-Token on 320 frames compared to full-token inference.
■ Most overhead comes from the SDC module, but overall processing is significantly accelerated.
Hyper-parameters
■ U_f (FGF threshold): Performance is highly stable across a wide range (0.002 to 0.010).
■ U_c (SDC threshold): Current setting of 0.2 achieves excellent performance, adjustable per benchmark.
■ Only 2 hyper-parameters needed, making deployment straightforward.
Figure 5: TTFT breakdown showing 4.15× speedup.
Figure 6: Hyper-parameter stability across benchmarks.

13. Qualitative Analysis: Frequent Scene Switches
■ Adaptive Allocation: In videos with frequent scene transitions, UniComp dynamically allocates more tokens to frames containing significant semantic changes.
■ Detail Preservation: It successfully preserves critical spatial details (e.g., foreground objects, human faces) while aggressively compressing redundant static backgrounds.

14. Qualitative Analysis: Static Scene
■ Aggressive Compression: In static or low-information scenes, UniComp performs aggressive compression to save the token budget for more dynamic segments.
■ Redundancy Elimination: It effectively fuses redundant background tokens while retaining the minimal necessary information to understand the scene context.

15. Conclusion & Future Work
Core Contributions
■ Information Uniqueness: Shifted paradigm to uniqueness-based diversity for video compression.
■ Unified Framework: Integrated Frame Group Fusion (FGF), Token Allocation (TA), and Spatial Dynamic Compression (SDC).
■ SOTA Performance: Achieved state-of-the-art results across benchmarks, robust at extreme compression ratios (≤10%).
■ High Efficiency: Delivered up to 4.15× speedup in Time-To-First-Token (TTFT) without complex tuning.
Code and Models are Open Sourced!