UniComp: Rethinking Video Compression Through Informational Uniqueness
1. UniComp: Rethinking Video Compression Through Informational Uniqueness
CVPR Paper Presentation
2. Motivation: The Need for Uniqueness-Driven Compression
■ Quadratic Attention Cost: Self-attention cost grows
quadratically with sequence length, so dense video
tokens (e.g., 32×256) create a major compute bottleneck.
■ Redundancy in Attention: Attention-based methods
emphasize salient yet often repetitive content, missing
fine-grained diversity.
■ Failure at Extreme Compression: At ≤10% retention,
the accuracy of many models drops severely.
■ Our Solution: UniComp preserves unique information
and stays robust even at 5% retention.
Left: UniComp answers correctly at 5% retention. Right: Performance across retention ratios.
3. Core Insight: Attention ≠ Information
Attention-Based Selection
Traditional methods cluster tokens around highly salient
objects. While highlighting primary subjects, this causes
severe redundancy in focal areas and leads to missing
information in peripheral but contextually critical regions.
Uniqueness-Based Selection (UniComp)
Our approach prioritizes tokens carrying diverse,
irreplaceable information. By minimizing redundancy, it
ensures broader spatial and temporal coverage, retaining
the essential details needed for complex reasoning tasks.
Figure 2: Attention-based methods miss the third cup due to redundancy, while UniComp's
uniqueness-based selection captures all necessary details for correct QA.
4. Theoretical Foundation: Information Uniqueness
Optimization Objective
Video compression is formulated as minimizing the conditional
entropy between the selected tokens and the full token set, aiming to
preserve maximal information.
Reconstruction Error
Assuming an isotropic Gaussian distribution, minimizing conditional
entropy is mathematically equivalent to minimizing the reconstruction
error of discarded tokens.
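One way to read the equivalence stated above is sketched below. The notation is illustrative and not taken from the slides: X is the full token set, X_S the selected tokens, \hat{X}(X_S) a reconstruction of X from X_S, and d the total dimensionality of X. The sketch assumes the reconstruction residual is isotropic Gaussian with variance \sigma_S^2 per dimension.

```latex
% Objective: choose the subset S that leaves the least uncertainty about X.
\min_{S} \; H(X \mid X_S)

% Assumption: X \mid X_S \sim \mathcal{N}\big(\hat{X}(X_S),\, \sigma_S^2 I_d\big).
H(X \mid X_S) = \frac{d}{2} \log\!\left( 2\pi e \, \sigma_S^2 \right),
\qquad
\mathbb{E}\,\big\lVert X - \hat{X}(X_S) \big\rVert^2 = d \, \sigma_S^2

% Substituting \sigma_S^2 shows the entropy is an increasing function of the error:
H(X \mid X_S) = \frac{d}{2} \log\!\left( \frac{2\pi e}{d} \,
  \mathbb{E}\,\big\lVert X - \hat{X}(X_S) \big\rVert^2 \right)
```

Because the logarithm is monotonic, the subset that minimizes the expected reconstruction error also minimizes the conditional entropy under this assumption. Selected tokens reconstruct themselves exactly, so the error comes only from the discarded tokens.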
Information Uniqueness
We introduce uniqueness to measure intrinsic redundancy among
tokens. It quantifies how much irreplaceable information a token
carries compared to others.
Theoretical Upper Bound
We derive an upper bound linking reconstruction error directly to
uniqueness. Minimizing this bound naturally leads to a greedy
selection strategy based on information uniqueness.
5. Framework Overview: UniComp
1. Frame Group Fusion (FGF)
Adaptively merges temporally redundant frames based on semantic uniqueness to reduce temporal redundancy.
2. Token Allocation (TA)
Dynamically allocates token budgets according to global frame uniqueness, distributing computation to informative content.
3. Spatial Dynamic Compression (SDC)
Greedily selects and fuses tokens within each frame based on token-level uniqueness to eliminate local redundancy.
Figure 3: The UniComp framework integrates temporal fusion, global allocation, and spatial compression under a unified principle.
6. Module 1: Frame Group Fusion (FGF)
■ Objective: Adaptively merge temporally redundant frames
based on semantic uniqueness to reduce temporal
redundancy.
■ Global Feature Extraction: Computes a global feature for
each frame via average pooling of all visual tokens.
■ Sequential Grouping: Scans frames sequentially. If a frame's
uniqueness difference from the group's representative
frame is below threshold U_f, it is merged.
■ Mean Pooling Fusion: Each semantic group is fused into a
single representative feature via mean pooling.
Result: In stable scenes, multiple frames are merged to suppress
redundancy; in segments with large semantic transitions,
grouping becomes finer to retain key dynamic information.
Figure: The Frame Group Fusion (FGF) process groups and fuses temporally redundant
frames.
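The grouping steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the "uniqueness difference" is one minus the cosine similarity of the average-pooled frame features, and that the first frame of each group serves as its representative. The function name `frame_group_fusion` and the argument name `u_f` are hypothetical.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def frame_group_fusion(frames, u_f):
    """Group consecutive frames and fuse each group by mean pooling.

    frames: array of shape (T, N, D), i.e. T frames of N tokens each.
    u_f:    grouping threshold (assumed to act on 1 - cosine similarity).
    Returns (fused, groups): fused has shape (G, N, D); groups lists frame indices.
    """
    global_feats = frames.mean(axis=1)  # (T, D): average pooling over tokens
    groups = [[0]]
    for t in range(1, len(frames)):
        # Assumption: the first frame of the current group is its representative.
        rep = global_feats[groups[-1][0]]
        if 1.0 - cosine(global_feats[t], rep) < u_f:
            groups[-1].append(t)   # redundant frame: merge into the current group
        else:
            groups.append([t])     # semantic change: start a new group
    fused = np.stack([frames[g].mean(axis=0) for g in groups])
    return fused, groups
```

With a small threshold, near-identical consecutive frames collapse into one group, while a scene change opens a new group, which matches the behavior described in the Result note.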
7. Module 2: Token Allocation (TA)
Objective & Principle
■ Dynamic Distribution: Allocates the total token budget across fused frames based on their global semantic uniqueness.
■ High Uniqueness Frames: Represent significant semantic changes. Allocated more tokens to preserve critical, irreplaceable details.
■ Low Uniqueness Frames: Contain redundant or static information. Allocated fewer tokens, as missing details can be inferred from adjacent frames.
■ Re-allocation Strategy: If allocated tokens exceed a frame's maximum limit, the surplus is evenly redistributed to other frames to maximize budget utilization.
Allocation Mechanism
1. Global Uniqueness Calculation
Compute uniqueness score for each fused frame representation:
$U_t = 1 - \frac{1}{K_f} \sum_{s} \mathrm{cos\_sim}(f_t, f_s)$
2. Variance Scaling
Normalize and scale scores to stabilize distribution across varying frame counts:
$U'_t = (U_t - U_{\mathrm{mean}}) \times \sqrt{K_f}$
3. Softmax Distribution
Determine the proportional token budget $K_t$ out of the total limit:
$K_t = \mathrm{Softmax}(U'_t) \times \mathrm{TOKEN}_{\max}$
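The three allocation steps (uniqueness as one minus the mean cosine similarity, mean-centering scaled by the square root of K_f, then a softmax over the total token limit) and the re-allocation rule can be sketched as follows. This is a sketch under stated assumptions rather than the authors' code: the sum is taken over all fused frames including the frame itself, budgets are rounded down to integers, and the names `allocate_tokens` and `per_frame_cap` are hypothetical.

```python
import numpy as np

def allocate_tokens(frame_feats, token_max, per_frame_cap):
    """Split a total token budget across fused frames by global uniqueness.

    frame_feats:   (K_f, D) array, one global feature per fused frame.
    token_max:     total token limit.
    per_frame_cap: maximum tokens a single frame may receive (hypothetical name).
    """
    k_f = len(frame_feats)
    normed = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T                    # pairwise cosine similarities

    # Step 1: global uniqueness (assumption: the sum includes the frame itself).
    u = 1.0 - sim.sum(axis=1) / k_f
    # Step 2: variance scaling.
    u_scaled = (u - u.mean()) * np.sqrt(k_f)
    # Step 3: softmax over frames, times the total limit.
    w = np.exp(u_scaled - u_scaled.max())      # shift for numerical stability
    budget = w / w.sum() * token_max

    # Re-allocation: clip frames at the cap and spread the surplus evenly
    # over frames that have not been capped yet.
    capped = np.zeros(k_f, dtype=bool)
    for _ in range(k_f):
        over = budget > per_frame_cap
        if not over.any():
            break
        surplus = (budget[over] - per_frame_cap).sum()
        budget[over] = per_frame_cap
        capped |= over
        free = ~capped
        if not free.any():
            break                              # every frame is full; surplus is unused
        budget[free] += surplus / free.sum()

    # Assumption: round down so the total never exceeds token_max.
    return np.floor(budget).astype(int)
```

Rounding down can leave a few tokens unassigned; how the original method handles the remainder is not stated on the slide.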
8. Module 3: Spatial Dynamic Compression (SDC)
Objective & Token Uniqueness
■ Objective: Remove local spatial redundancy by keeping the most informative tokens per frame.
■ Token Uniqueness: Compute uniqueness per token using the last attention layer's Keys of the ViT.
■ Why Keys? Keys compactly summarize each token and have fewer dimensions for efficient representation.
Greedy Selection & Fusion
■ Greedy Selection: Rank tokens by uniqueness and pick top representatives iteratively.
■ Neighbor Fusion: Tokens within threshold U_c are fused into the representative instead of dropped.
Efficiency Optimization: We replaced the causal loop with matrix-level parallel computation, reducing compute by nearly 20× with no performance loss.
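A plain iterative version of greedy selection with neighbor fusion might look like the sketch below. It is written as a simple loop for readability and does not reproduce the matrix-level parallel optimization. Several details are assumptions: token uniqueness is taken as one minus the mean cosine similarity of a token's key to all keys in the frame, "within threshold U_c" is read as cosine distance below `u_c`, fusion is a mean over the grouped tokens, and tokens left over once the budget is reached are dropped. The function name is hypothetical.

```python
import numpy as np

def spatial_dynamic_compression(tokens, keys, budget, u_c):
    """Greedily keep up to `budget` representative tokens for one frame.

    tokens: (N, D) visual tokens of the frame.
    keys:   (N, Dk) attention keys used to measure similarity.
    budget: number of tokens this frame may keep.
    u_c:    fusion threshold (assumed to act on 1 - cosine similarity of keys).
    """
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    sim = k @ k.T                              # (N, N) cosine similarities
    uniqueness = 1.0 - sim.mean(axis=1)        # assumption: mean over all tokens
    order = np.argsort(-uniqueness)            # most unique token first

    assigned = np.zeros(len(tokens), dtype=bool)
    kept = []
    for i in order:
        if assigned[i]:
            continue                           # already fused into a representative
        if len(kept) == budget:
            break                              # budget used up; the rest is dropped
        # Neighbors: unassigned tokens whose keys are close to token i.
        neighbors = (~assigned) & ((1.0 - sim[i]) < u_c)
        neighbors[i] = True
        kept.append(tokens[neighbors].mean(axis=0))  # fuse instead of dropping
        assigned |= neighbors
    return np.stack(kept)
```

Each kept row is either a single highly unique token or the mean of a representative and its near-duplicates, so redundant regions shrink to one token while distinctive tokens survive unchanged.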
9. Main Results: State-of-the-Art Performance
■ Consistent Superiority: UniComp consistently surpasses baselines across LongVideoBench, EgoSchema, MLVU, and
VideoMME.
■ High Retention (25%): Reaches 60.78% average accuracy, outperforming FastV by 2.18 points.
■ Extreme Compression (10%): Surpasses HoliTom by 0.9 points, proving exceptional robustness under strict token limits.
Table 1: Comparison of state-of-the-art video compression methods across benchmarks with long videos.
10. Scalability: Robustness Across Extensive Contexts
Performance with a large number of input frames
(constrained to 6,272 tokens)
■ UniComp maintains strong performance as frame count
scales from 32 to 320 frames.
■ At 320 frames with 6,272 token limit (10% retention):
achieves 62.45% accuracy.
■ Outperforms HoliTom by 1.02 points at this extreme
compression setting.
■ Demonstrates consistent improvement over all baselines
across all frame counts.
Figure: Performance comparison across different frame counts (same token limit).
Key Takeaway: UniComp exhibits exceptional scalability. Even
when compressing 320 frames into a strict limit of 6,272 tokens
(a 10% retention ratio), it achieves 62.45% accuracy,
outperforming HoliTom by 1.02 points.
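The figures in this setting can be cross-checked with a little arithmetic. The per-frame token count and the 32-frame equivalence below are inferred from the slide's numbers, not stated on it.

```python
frames = 320          # input frames in the largest setting
token_limit = 6272    # fixed token budget after compression
retention = 0.10      # stated retention ratio

# Total visual tokens before compression implied by the retention ratio.
full_tokens = round(token_limit / retention)

# Inferred tokens per frame before compression.
tokens_per_frame = full_tokens // frames

# Average tokens kept per frame after compression.
kept_per_frame = token_limit / frames

# How many uncompressed frames the same budget would hold.
uncompressed_frames_equiv = token_limit // tokens_per_frame

print(full_tokens, tokens_per_frame, kept_per_frame, uncompressed_frames_equiv)
```

Under these numbers the uncompressed input is 62,720 tokens at 196 tokens per frame, so the 6,272-token limit equals the full token count of 32 uncompressed frames. That is consistent with the deck's 32-frame baseline setting.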
11. Ablation Study: Internal Module Design
■ Objective: Validate FGF, TA, and SDC under 32-frame,
20% retention.
■ Component Synergy: The three modules work jointly;
removing one drops performance.
■ Uniform vs. Adaptive: Uniform allocation degrades
results versus TA.
■ Fusion vs. Dropping: Skipping neighbor fusion in SDC
reduces accuracy.
KEY TAKEAWAY
Temporal fusion, global allocation, and spatial compression
together best preserve information.
Table: Ablation study of FGF, TA, and SDC across benchmarks.
12. Ablation Study: Efficiency & Hyper-parameters
Efficiency (TTFT)
■ UniComp achieves up to 4.15× faster Time-To-First-Token on 320 frames compared to full-token inference.
■ Most overhead comes from the SDC module, but overall processing is significantly accelerated.
Hyper-parameters
■ U_f (FGF threshold): Performance is highly stable across a wide range (0.002 to 0.010).
■ U_c (SDC threshold): Current setting of 0.2 achieves excellent performance, adjustable per benchmark.
■ Only 2 hyper-parameters needed, making deployment straightforward.
Figure 5: TTFT breakdown showing 4.15× speedup.
Figure 6: Hyper-parameter stability across benchmarks.
13. Qualitative Analysis: Frequent Scene Switches
■ Adaptive Allocation: In videos with frequent
scene transitions, UniComp dynamically
allocates more tokens to frames containing
significant semantic changes.
■ Detail Preservation: It successfully preserves
critical spatial details (e.g., foreground objects,
human faces) while aggressively compressing
redundant static backgrounds.
14. Qualitative Analysis: Static Scene
■ Aggressive Compression: In static or low-information scenes, UniComp performs aggressive compression to save the
token budget for more dynamic segments.
■ Redundancy Elimination: It effectively fuses redundant background tokens while retaining the minimal necessary
information to understand the scene context.
15. Conclusion & Future Work
Core Contributions
■ Information Uniqueness: Shifted paradigm to uniqueness-
based diversity for video compression.
■ Unified Framework: Integrated Frame Group Fusion (FGF),
Token Allocation (TA), and Spatial Dynamic Compression
(SDC).
■ SOTA Performance: Achieved state-of-the-art results
across benchmarks, robust at extreme compression ratios
(≤10%).
■ High Efficiency: Delivered up to 4.15× speedup in Time-To-
First-Token (TTFT) without complex tuning.
Code and Models are Open Sourced!