Infinity: A New Route for Visual Autoregressive Generation

1. Infinity: A New Route for Visual Autoregressive Generation. CVPR 2025 Oral. Speaker: Jian Han (韩剑)
2. Contents: 01 Autoregressive Models and Scaling Law 02 Visual Autoregression vs. Diffusion Models 03 Infinity: A New Route for Visual Autoregressive Generation 04 Analysis and Reflections
4. 01 Autoregressive Models and Scaling Law
5. AutoRegressive Models: Autoregressive Sequence Modeling and Sequence Representation (credit: Autoregressive Models in Vision: A Survey)
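For reference, autoregressive sequence modeling factorizes the joint distribution over a token sequence into a chain of next-token conditionals (standard formulation, not taken from the slide):

```latex
% Autoregressive factorization of a token sequence x = (x_1, ..., x_T):
% each token is predicted from all preceding tokens.
p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, x_2, \ldots, x_{t-1})
```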
6. Scaling Law [2020] Scaling Laws for Neural Language Models
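For reference, the cited paper reports that test loss follows a power law in model size N, dataset size D, and training compute C (general form below; the constants and exponents are fit empirically):

```latex
% Empirical power-law scaling of language-model loss (Kaplan et al., 2020).
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```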
7. AutoRegressive Models
[2020] iGPT: Generative Pretraining from Pixels
[2017] VQ-VAE: tokenize images into discrete token indices
[2021] VQGAN: image tokenizer + autoregressive transformer
[2022] Parti: scaling up the autoregressive transformer
Challenges:
➢ Autoregressive models perform significantly worse than SOTA diffusion models in high-resolution image synthesis.
➢ Autoregressive models have not demonstrated the same scaling-law properties as LLMs have in text generation.
➢ Due to raster-order prediction, autoregressive models suffer from very slow prediction speeds.
➢ The raster-scan order is not the most natural "order" for images, as it loses the global information crucial for visual modeling.
[1] Mark Chen et al., Generative Pretraining From Pixels, ICML 2020.
[2] Aaron van den Oord et al., Neural Discrete Representation Learning, 2017.
[3] Patrick Esser et al., Taming Transformers for High-Resolution Image Synthesis, CVPR 2021.
[4] Jiahui Yu et al., Scaling Autoregressive Models for Content-Rich Text-to-Image Generation, arXiv 2022.
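For context, a minimal sketch of the index-wise nearest-neighbour quantization used by VQ-VAE/VQGAN-style tokenizers (illustrative PyTorch, not the original implementations; shapes and names are assumptions):

```python
import torch

def vq_quantize(z, codebook):
    """Index-wise quantization: map each feature vector to its nearest codebook entry.

    z:        (N, d) continuous encoder features
    codebook: (K, d) learned embedding table (K = vocabulary size)
    returns:  quantized features (N, d) and discrete token indices (N,)
    """
    dist = torch.cdist(z, codebook)      # (N, K) pairwise distances
    idx = dist.argmin(dim=1)             # discrete token index per feature
    z_q = codebook[idx]                  # nearest codebook vectors
    # Straight-through estimator: gradients pass to z as if quantization were identity.
    z_q = z + (z_q - z).detach()
    return z_q, idx
```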
8. 02 Visual Autoregression vs. Diffusion Models
9. Visual Autoregressive Models: next-scale prediction vs. next-token prediction
• When humans perceive images or engage in painting, they often start with a holistic overview before delving into finer details.
• This coarse-to-fine approach, grasping the overall context before refining local details, is very natural.
10. Visual Autoregressive Models
• Stage 1: train a multi-scale image tokenizer that tokenizes images into discrete token indices (see the sketch after this slide).
• Stage 2: train a GPT-style transformer (autoregressive in scale space) on the VQ tokens with teacher forcing.
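A rough sketch of the multi-scale residual tokenization behind next-scale prediction (my paraphrase of the Stage 1 recipe, not the released code; `quantize` stands in for any VQ/BSQ step and the scale list is illustrative):

```python
import torch.nn.functional as F

def multi_scale_tokenize(feat, scales, quantize):
    """Tokenize a feature map coarse-to-fine: at each scale, quantize the
    downsampled residual, then subtract its upsampled reconstruction.

    feat:    (B, C, H, W) continuous feature map from the encoder
    scales:  spatial resolutions from coarse to fine, e.g. [1, 2, 3, ..., H]
    returns: a list of discrete token maps, one per scale
    """
    tokens, residual = [], feat
    for s in scales:
        down = F.interpolate(residual, size=(s, s), mode="area")
        q, idx = quantize(down)                                  # discrete tokens at this scale
        up = F.interpolate(q, size=feat.shape[-2:], mode="bilinear")
        residual = residual - up                                 # what remains for finer scales
        tokens.append(idx)
    return tokens
```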
11. Visual Autoregressive Models: ImageNet 256×256 and 512×512 conditional generation
• SOTA performance on the ImageNet benchmark.
• Very fast inference speed.
• VAR significantly advances traditional AR capabilities.
• Better performance than DiT (Sora's base model).
To our knowledge, this is the first time autoregressive models have outperformed diffusion transformers!
12. Diffusion Models * Credit: [2020] Denoising Diffusion Probabilistic Models
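For reference, the standard DDPM forward (noising) process from the credited paper:

```latex
% DDPM forward process with noise schedule \beta_t,
% \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s.
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right), \qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)
```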
13. Diffusion Models * Credit: [2021] High-Resolution Image Synthesis with Latent Diffusion Models
14. Diffusion Models * Credit: [2022] Scalable Diffusion Models with Transformers
15. VAR vs. Diffusion
➢ VAR's method of "noising" is more intuitive and interpretable (blurry to clear, or low frequency to high frequency).
➢ VAR's learning process is potentially more efficient than diffusion (1/7 of DiT's training epochs).
➢ Similar to LLMs, VAR trains all timesteps simultaneously in a single forward pass, whereas diffusion trains only one timestep t at a time (see the sketch after this list).
➢ Both VAR and diffusion allow for "corrections", enabling the model to rectify errors from past timesteps, a capability not available in AR models.
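A schematic contrast of the two training loops (illustrative pseudocode under assumed shapes, not either codebase): a VAR step supervises every scale in one teacher-forced forward pass, while a diffusion step supervises a single sampled timestep per image.

```python
import torch
import torch.nn.functional as F

def var_training_step(transformer, scale_tokens):
    """VAR/LLM-style teacher forcing: one forward pass yields a prediction
    (and a loss term) for every scale simultaneously."""
    inputs, targets = scale_tokens[:-1], scale_tokens[1:]   # shift by one scale
    logits = transformer(inputs)                            # causal attention over scales
    return sum(F.cross_entropy(l, t) for l, t in zip(logits, targets))

def diffusion_training_step(denoiser, x0, alphas_bar):
    """Diffusion training: each step supervises one randomly sampled timestep t."""
    t = torch.randint(0, len(alphas_bar), (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].view(-1, 1, 1, 1)                     # \bar{alpha}_t per sample
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise            # DDPM forward process
    return F.mse_loss(denoiser(x_t, t), noise)
```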
16. 03 Infinity: A New Route for Visual Autoregressive Generation
17. Extend VAR to T2I Generation: from class-conditional image generation (C2I) to text-to-image generation (T2I)
18. Challenges. Poor discrete reconstruction: an index-wise discrete tokenizer with a limited vocabulary size suffers significant quantization errors (figure: ground truth vs. VAR reconstruction).
19. Challenges. Cumulative errors: train-test discrepancies from teacher-forcing training, inherent to LLMs, amplify cumulative errors in visual details.
20. Challenges High-resolution: VAR has not yet been validated for generating high-resolution (1024×1024) realistic images in complex text-to-image tasks.
21. Challenges. These limitations lead us to ask the following questions:
➢ Can discrete image tokenizers achieve reconstruction performance comparable to state-of-the-art continuous VAEs, especially in preserving high-frequency texture details?
➢ Can visual autoregressive generation overcome the error accumulation caused by teacher forcing and maintain strong robustness in long-sequence generation?
➢ Can discrete visual autoregressive modeling generate high-resolution, instruction-compliant images in complex text-to-image tasks, comparable to state-of-the-art diffusion models (e.g., FLUX Dev, SD3)?
22. Infinity. Infinity redefines VAR under a bitwise modeling framework. Bitwise modeling: ➢ Bitwise Tokenizer ➢ Infinite-Vocabulary Classifier ➢ Bitwise Self-Correction
23. Bitwise Tokenizer Index-wise quantization -> Bitwise quantization * Credit: [2024] Image and Video Tokenization with Binary Spherical Quantization
24. Bitwise Tokenizer. An example of VQ vs. BSQ (the channel dimension is omitted for simplicity). In both schemes, the feature map F is downsampled, quantized, and upsampled to give a first reconstruction R1; the residual F - R1 is then downsampled, quantized, and upsampled to give R2, and so on. Vector Quantization snaps each downsampled feature to its nearest-neighbour codebook entry, whereas BSQ quantizes each dimension to +1 or -1.
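A minimal sketch of the binary spherical quantization step as I read the credited BSQ paper (illustrative; the projection down to d dimensions is assumed to have happened upstream):

```python
import torch
import torch.nn.functional as F

def bsq_quantize(z):
    """Binary Spherical Quantization: normalize features onto the unit
    hypersphere, then keep only the sign of each dimension.

    z:       (N, d) continuous features (after a linear projection to d dims)
    returns: quantized unit vectors (N, d) and d-bit codes (N, d) in {0, 1}
    """
    u = F.normalize(z, dim=-1)          # project onto the unit sphere
    d = u.shape[-1]
    bits = torch.sign(u)                # each dimension -> +1 or -1
    u_q = bits / d ** 0.5               # quantized point stays (approximately) unit norm
    # Straight-through estimator so the encoder still receives gradients.
    u_q = u + (u_q - u).detach()
    return u_q, (bits > 0).long()
```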
25. Bitwise Token
➢ Infinite-Vocabulary Classifier: predicts d bits instead of predicting 2^d indices (see the sketch below).
➢ Robust to slight perturbations and errors.
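A sketch of why predicting bits shrinks the output head (sizes and names here are illustrative, not the released implementation): with d = 32 bits, an index-wise head would need a 2^32-way softmax, while a bitwise head needs only d binary logits.

```python
import torch.nn as nn

d_bits, hidden = 32, 2048              # illustrative sizes

# Index-wise head: one logit per codebook entry -> 2**32 output units (intractable).
# index_head = nn.Linear(hidden, 2 ** d_bits)

# Infinite-Vocabulary Classifier: d independent binary logits, one per bit.
ivc_head = nn.Linear(hidden, d_bits)   # each logit predicts whether its bit is +1 or -1
```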
26. Bitwise Tokenizer. By scaling the visual tokenizer's vocabulary, our tokenizer matches and even surpasses the continuous VAE of SD on ImageNet rFID.
27. Bitwise Tokenizer
28. Bitwise Self-Correction. Bitwise Self-Correction mitigates the train-test discrepancy:
➢ Randomly flip bits to imitate bitwise prediction errors.
➢ Re-quantize the residual features to auto-correct the earlier errors (a rough sketch follows below).
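A rough sketch of the self-correction idea as described on the slide (illustrative, not the authors' code; `quantize` and `flip_prob` are assumptions): during training-time tokenization, some bits are flipped at random to mimic prediction errors, and the residual passed to finer scales is recomputed from the corrupted reconstruction so that later scales learn to compensate.

```python
import torch
import torch.nn.functional as F

def tokenize_with_self_correction(feat, scales, quantize, flip_prob=0.1):
    """Multi-scale tokenization with Bitwise Self-Correction.

    After quantizing each scale, a fraction of bits is randomly flipped to
    imitate test-time prediction errors; the residual handed to the next
    (finer) scale is recomputed from the corrupted reconstruction, so the
    remaining scales learn to correct earlier mistakes.
    """
    targets, residual = [], feat
    for s in scales:
        down = F.interpolate(residual, size=(s, s), mode="area")
        q, bits = quantize(down)                        # clean bits remain the training target
        targets.append(bits)
        flip = (torch.rand_like(q) < flip_prob).float()
        q_noisy = q * (1 - 2 * flip)                    # flip the sign of selected bit channels
        up = F.interpolate(q_noisy, size=feat.shape[-2:], mode="bilinear")
        residual = residual - up                        # residual reflects the corrupted path
    return targets
```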
29. Bitwise Self-Correction. Qualitatively, substantial advantages are observed after applying Bitwise Self-Correction.
30. Bitwise Self-Correction. Using Bitwise Self-Correction generates better results (right).
31. Scaling Vocabulary Size benefits Generation
32. Scaling Transformer
33. DPO
34. DPO
35. Great Speed Advantage
36. State-of-the-art Generation Results
37. State-of-the-art Generation Results
38. State-of-the-art Generation Results
39. T2I Arena
40. 04 Analysis and Reflections
41. Infinity
➢ An autoregressive model with bitwise modeling, which significantly improves the scaling and visual-detail representation capabilities of discrete generative models.
➢ Demonstrates the potential of scaling tokenizers and transformers by achieving near-continuous tokenizer performance.
➢ Enables a discrete autoregressive text-to-image model to achieve exceptionally strong prompt adherence and superior image generation quality.
43. THANKS. Explore the limits of AI applications.
