Infinity: A New Route for Visual Autoregressive Generation
1. Infinity: A New Route for Visual Autoregressive Generation
CVPR 2025 Oral
Speaker: Jian Han (韩剑)
2. Contents
01 Autoregressive Models and Scaling Laws
02 Visual Autoregressive Models vs. Diffusion Models
03 Infinity: A New Route for Visual Autoregressive Generation
04 Analysis and Reflections
3.
4. 01
Autoregressive Models and Scaling Laws
5. AutoRegressive Models
Autoregressive Sequence Modeling
Credit: Autoregressive Models in Vision: A Survey
Sequence Representation
6. Scaling Law
[2020] Scaling Laws for Neural Language Models
7. AutoRegressive Models
Challenges:
[2020] iGPT: Generative Pretraining from Pixels
[2017] VQVAE: Tokenizes images into discrete token indices
➢ Autoregressive models perform significantly worse than state-of-the-art diffusion
models in high-resolution image synthesis.
➢ Autoregressive models in vision have not demonstrated the same scaling-law
properties that LLMs exhibit in text generation.
➢ Due to raster-order prediction, autoregressive models suffer from very
slow prediction speeds.
➢ The raster-scan order is not the most natural "order" for images, as it
loses the global information crucial for visual modeling.
[2021] VQGAN: Image tokenizer +
AutoRegressive transformer
[1] Mark Chen et al., "Generative Pretraining From Pixels," ICML 2020.
[2] Aaron van den Oord et al., "Neural Discrete Representation Learning," NeurIPS 2017.
[3] Patrick Esser et al., "Taming Transformers for High-Resolution Image Synthesis," CVPR 2021.
[4] Jiahui Yu et al., "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation," arXiv 2022.
[2022] Parti: Scaling up the AutoRegressive transformer
8. 02
Visual Autoregressive Models vs. Diffusion Models
9. Visual Autoregressive Models
Next-scale prediction vs. next-token prediction
• When humans perceive images or engage in painting, they often start with a holistic overview before delving into finer
details.
• This approach of going from coarse to fine, grasping the overall context before refining local details, is very natural.
10. Visual Autoregressive Models
• Stage 1: Train a multi-scale image tokenizer that quantizes images into discrete token indices.
• Stage 2: Train GPT-style transformers (autoregressive in scale space) on the VQ tokens with teacher forcing, as sketched below.
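A minimal sketch of the multi-scale residual tokenization behind next-scale prediction, assuming a PyTorch-style API; `multi_scale_tokenize` and the `quantize` callback are hypothetical names for illustration, not the paper's actual code:

```python
import torch.nn.functional as F

def multi_scale_tokenize(feature_map, scales, quantize):
    """Coarse-to-fine residual tokenization (illustrative sketch).

    feature_map: (B, C, H, W) continuous latents from the image encoder.
    scales:      increasing spatial sizes, e.g. [1, 2, 3, ..., 16].
    quantize:    maps a feature map to (quantized features, token indices).
    """
    residual = feature_map
    token_maps = []
    for s in scales:
        # Each scale only has to explain what previous scales have not.
        down = F.interpolate(residual, size=(s, s), mode="area")
        quantized, tokens = quantize(down)
        token_maps.append(tokens)
        # Upsample the quantized scale back and subtract it from the residual.
        up = F.interpolate(quantized, size=feature_map.shape[-2:], mode="bilinear")
        residual = residual - up
    # Stage 2 trains the transformer to predict token_maps scale by scale.
    return token_maps
```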
11. Visual Autoregressive Models
ImageNet 512×512 conditional generation
• State-of-the-art performance on the ImageNet benchmark.
• Very fast inference speed.
• VAR significantly advances traditional AR capabilities.
• Better performance than DiT (the base architecture of Sora).
ImageNet 256×256 conditional generation
To our knowledge, this is the first time autoregressive models have outperformed diffusion transformers!
12. Diffusion Models
* Credit: [2020] Denoising Diffusion Probabilistic Models
13. Diffusion Models
* Credit: [2021] High-Resolution Image Synthesis with Latent Diffusion Models
14. Diffusion Models
* Credit: [2022] Scalable Diffusion Models with Transformers
15. VAR vs. Diffusion
➢ VAR's counterpart to the noising process is more intuitive and interpretable (blurry to clear, or low frequency to high frequency).
➢ VAR's learning process is potentially more efficient than diffusion's (about 1/7 of DiT's training epochs).
➢ Similar to LLMs, VAR trains all timesteps simultaneously in a single forward pass, whereas diffusion trains only one timestep t at a time (see the sketch after this list).
➢ Both VAR and diffusion allow for "corrections," enabling the model to rectify errors from past timesteps, a capability not available in next-token AR models.
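A rough sketch of the difference in training loops described in the third bullet, assuming PyTorch; `model`, `add_noise`, and `scale_tokens` are placeholders for illustration rather than either paper's actual code:

```python
import torch
import torch.nn.functional as F

def diffusion_step_loss(model, x0, add_noise, num_timesteps=1000):
    """Diffusion training: each step supervises a single random timestep t."""
    t = torch.randint(0, num_timesteps, (x0.shape[0],), device=x0.device)
    noisy, noise = add_noise(x0, t)              # forward-noising helper (assumed)
    return F.mse_loss(model(noisy, t), noise)

def var_step_loss(model, scale_tokens):
    """VAR training: like an LLM, one forward pass supervises every scale
    (every "timestep") at once via teacher forcing."""
    logits_per_scale = model(scale_tokens)       # one logits tensor per scale
    return sum(F.cross_entropy(logits, gt)
               for logits, gt in zip(logits_per_scale, scale_tokens))
```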
16. 03
Infinity: A New Route for Visual Autoregressive Generation
17. Extend VAR to T2I Generation
Class-conditional image generation (C2I)
Text-to-image generation (T2I)
18. Challenges
Poor discrete reconstruction: the index-wise discrete tokenizer, with its limited vocabulary size, suffers from significant quantization errors.
Ground Truth
VAR Reconstruction
19. Challenges
Cumulative errors: train-test discrepancies from teacher-forcing training, inherent to LLMs, amplify cumulative errors in visual details.
20. Challenges
High-resolution: VAR has not yet been validated for generating high-resolution (1024×1024)
realistic images in complex text-to-image tasks.
21. Challenges
These limitations lead us to ask the following questions:
➢ Can discrete image tokenizers achieve reconstruction performance comparable to
state-of-the-art continuous VAEs, especially in preserving high-frequency texture
details?
➢ Can visual autoregressive generation overcome the accumulation issues caused by
teacher forcing and maintain strong robustness in long-sequence generation?
➢ Can discrete visual autoregressive modeling generate high-resolution, instruction-
compliant images in complex text-to-image tasks comparable to state-of-the-art
diffusion models (e.g., FLUX Dev, SD3)?
22. Infinity
Infinity redefines VAR under a bitwise modeling framework
Bitwise modeling:
➢ Bitwise Tokenizer
➢ Infinite-Vocabulary Classifier
➢ Bitwise Self-Correction
23. Bitwise Tokenizer
Index-wise quantization -> Bitwise quantization
* Credit: [2024] Image and Video Tokenization with Binary Spherical Quantization
24. Bitwise Tokenizer
An example of VQ vs. BSQ (channel dimension omitted for simplicity). Figure: at each scale, the feature F is downsampled ("down") and quantized, via nearest-neighbour (NN) codebook lookup in Vector Quantize (VQ) or by taking the sign (+1 or -1) of each dimension in BSQ, then upsampled ("up") to give R1; the residual F - R1 is processed the same way at the next scale to give R2, and so on. A code sketch of the two quantizers follows.
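A hedged sketch of the two quantizers in the figure, assuming PyTorch; the function names and shapes are illustrative, not the tokenizer's actual implementation:

```python
import torch
import torch.nn.functional as F

def vq_quantize(x, codebook):
    """Index-wise VQ: nearest-neighbour (NN) lookup in a K-entry codebook.
    x: (N, C) features, codebook: (K, C)."""
    dists = torch.cdist(x, codebook)              # (N, K) pairwise distances
    idx = dists.argmin(dim=1)                     # one index per token
    return codebook[idx], idx

def bsq_quantize(x):
    """Bitwise BSQ: project to the unit sphere and keep only the sign of each
    dimension, i.e. d independent bits (+1 or -1) per token."""
    x = F.normalize(x, dim=-1)                    # binary *spherical* quantization
    bits = torch.sign(x)
    bits[bits == 0] = 1.0                         # break ties deterministically
    return F.normalize(bits, dim=-1), (bits > 0)  # quantized vector + d bits
```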
25. Bitwise Token
➢ Infinite-Vocabulary Classifier: predicts d bits instead of predicting 2^d indices (see the sketch below)
➢ Robust to slight perturbations and errors
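A minimal sketch of the idea behind the Infinite-Vocabulary Classifier, assuming a PyTorch head; the widths below are illustrative assumptions, not the model's actual configuration:

```python
import torch.nn as nn

d = 32          # bits per token; an index-wise head would need a 2**32-way softmax
hidden = 2048   # transformer width (illustrative)

# Index-wise head (infeasible for large d): one logit per codebook entry.
# index_head = nn.Linear(hidden, 2 ** d)

# Bitwise head: d independent binary classifiers, so the output layer grows
# with d rather than with 2^d.
bit_head = nn.Linear(hidden, d)
bit_loss = nn.BCEWithLogitsLoss()   # target: the d ground-truth bits of each token
```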
26. Bitwise Tokenizer
By scaling the visual tokenizer's vocabulary, our tokenizer improves steadily and even surpasses the continuous VAE of SD on ImageNet rFID.
27. Bitwise Tokenizer
28. Bitwise Self-Correction
Bitwise Self-Correction mitigates the train-test discrepancy (a sketch follows this list):
➢ Randomly flip bits to imitate bitwise prediction errors
➢ Re-quantize the residual features to auto-correct the previous errors
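A hedged sketch of how the two bullets above could look in training code, assuming PyTorch; `decode_bits` and the flip probability are hypothetical stand-ins, not the paper's exact implementation:

```python
import torch

def bitwise_self_correction(gt_bits, residual, decode_bits, flip_prob=0.1):
    """One scale of Bitwise Self-Correction (illustrative sketch).

    gt_bits:     ground-truth bits for the current scale (bool tensor).
    residual:    continuous residual feature this scale should explain.
    decode_bits: helper mapping bits back to feature space (assumed to exist).
    """
    # 1) Randomly flip a fraction of bits to imitate the bitwise prediction
    #    errors the model will make at inference time.
    flip = torch.rand(gt_bits.shape, device=gt_bits.device) < flip_prob
    corrupted_bits = torch.where(flip, ~gt_bits, gt_bits)

    # 2) Re-quantize: compute the next scale's residual from the *corrupted*
    #    bits, so later scales learn to auto-correct the injected errors.
    next_residual = residual - decode_bits(corrupted_bits)
    return corrupted_bits, next_residual
```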
29. Bitwise Self-Correction
Qualitatively, substantial advantages are observed after applying Bitwise Self-Correction.
30. Bitwise Self-Correction
Using Bitwise Self-Correction generates better results (right)
31. Scaling Vocabulary Size benefits Generation
32. Scaling Transformer
33. DPO
34. DPO
35. Great Speed Advantage
36. State-of-the-art Generation Results
37. State-of-the-art Generation Results
38. State-of-the-art Generation Results
39. T2I Arena
40. 04
Analysis and Reflections
41. Infinity
➢ An autoregressive model with Bitwise Modeling, which significantly improves the
scaling and visual detail representation capabilities of discrete generative models.
➢ Demonstrates the potential of scaling tokenizers and transformers by achieving near-continuous tokenizer performance.
➢ Enables a discrete autoregressive text-to-image model to achieve exceptionally strong
prompt adherence and superior image generation quality.
42.
43. THANKS
Explore the limits of AI applications