Infinity: A New Route for Visual Autoregressive Generation
1. Infinity: A New Route for Visual Autoregressive Generation
CVPR 2025 Oral
Speaker: Jian Han (韩剑)
2. Contents
01 Autoregressive Models and Scaling Laws
02 Visual Autoregressive Models vs. Diffusion Models
03 Infinity: A New Route for Visual Autoregressive Generation
04 Analysis and Reflections
3.
4. 01
Autoregressive Models and Scaling Laws
5. AutoRegressive Models
Autoregressive Sequence Modeling
Credit: Autoregressive Models in Vision: A Survey
Sequence Representation
6. Scaling Law
[2020] Scaling Laws for Neural Language Models
7. AutoRegressive Models
Challenges:
[2020] iGPT: Generative Pretraining from Pixels
[2017] VQVAE: Tokenizes images into discrete token indices
➢ Autoregressive models perform significantly worse than state-of-the-art diffusion
models in high-resolution image synthesis.
➢ Autoregressive models in vision have not demonstrated the same scaling-law
properties that LLMs exhibit in text generation.
➢ Due to raster-order prediction, autoregressive models suffer from very
slow prediction speeds.
➢ The raster-scan order is not the most natural "order" for images, as it
loses the global information crucial for visual modeling.
[2021] VQGAN: Image tokenizer +
AutoRegressive transformer
[1] Mark Chen et al., "Generative Pretraining From Pixels," ICML 2020.
[2] Aaron van den Oord et al., "Neural Discrete Representation Learning," NeurIPS 2017.
[3] Patrick Esser et al., "Taming Transformers for High-Resolution Image Synthesis," CVPR 2021.
[4] Jiahui Yu et al., "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation," arXiv 2022.
[2022] Parti: Scaling up the AutoRegressive transformer
8. 02
Visual Autoregressive Models vs. Diffusion Models
9. Visual Autoregressive Models
Next-scale prediction vs. next-token prediction
• When humans perceive images or engage in painting, they often start with a holistic overview before delving into finer
details.
• This approach of going from coarse to fine, grasping the overall context before refining local details, is very natural.
10. Visual Autoregressive Models
• Stage 1: Train a multi-scale image tokenizer that quantizes images into discrete token indices.
• Stage 2: Train GPT-style transformers (autoregressive in scale space) on the VQ tokens with teacher forcing, as sketched below.
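A minimal sketch of the multi-scale residual tokenization behind next-scale prediction, assuming a PyTorch-style API; `multi_scale_tokenize` and the `quantize` callback are hypothetical names for illustration, not the paper's actual code:

```python
import torch.nn.functional as F

def multi_scale_tokenize(feature_map, scales, quantize):
    """Coarse-to-fine residual tokenization (illustrative sketch).

    feature_map: (B, C, H, W) continuous latents from the image encoder.
    scales:      increasing spatial sizes, e.g. [1, 2, 3, ..., 16].
    quantize:    maps a feature map to (quantized features, token indices).
    """
    residual = feature_map
    token_maps = []
    for s in scales:
        # Each scale only has to explain what previous scales have not.
        down = F.interpolate(residual, size=(s, s), mode="area")
        quantized, tokens = quantize(down)
        token_maps.append(tokens)
        # Upsample the quantized scale back and subtract it from the residual.
        up = F.interpolate(quantized, size=feature_map.shape[-2:], mode="bilinear")
        residual = residual - up
    # Stage 2 trains the transformer to predict token_maps scale by scale.
    return token_maps
```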
11. Visual Autoregressive Models
ImageNet 512×512 conditional generation
• State-of-the-art performance on the ImageNet benchmark.
• Very fast inference speed.
• VAR significantly advances traditional AR capabilities.
• Better performance than DiT (the base architecture of Sora).
ImageNet 256×256 conditional generation
To our knowledge, this is the first time autoregressive models have outperformed diffusion transformers!
12. Diffusion Models
* Credit: [2020] Denoising Diffusion Probabilistic Models
13. Diffusion Models
* Credit: [2021] High-Resolution Image Synthesis with Latent Diffusion Models
14. Diffusion Models
* Credit: [2022] Scalable Diffusion Models with Transformers
15. VAR vs. Diffusion
➢ VAR's counterpart to the noising process is more intuitive and interpretable (blurry to clear, or low frequency to high frequency).
➢ VAR's learning process is potentially more efficient than diffusion's (about 1/7 of DiT's training epochs).
➢ Similar to LLMs, VAR trains all timesteps simultaneously in a single forward pass, whereas diffusion trains only one timestep t at a time (see the sketch after this list).
➢ Both VAR and diffusion allow for "corrections," enabling the model to rectify errors from past timesteps, a capability not available in next-token AR models.
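A rough sketch of the difference in training loops described in the third bullet, assuming PyTorch; `model`, `add_noise`, and `scale_tokens` are placeholders for illustration rather than either paper's actual code:

```python
import torch
import torch.nn.functional as F

def diffusion_step_loss(model, x0, add_noise, num_timesteps=1000):
    """Diffusion training: each step supervises a single random timestep t."""
    t = torch.randint(0, num_timesteps, (x0.shape[0],), device=x0.device)
    noisy, noise = add_noise(x0, t)              # forward-noising helper (assumed)
    return F.mse_loss(model(noisy, t), noise)

def var_step_loss(model, scale_tokens):
    """VAR training: like an LLM, one forward pass supervises every scale
    (every "timestep") at once via teacher forcing."""
    logits_per_scale = model(scale_tokens)       # one logits tensor per scale
    return sum(F.cross_entropy(logits, gt)
               for logits, gt in zip(logits_per_scale, scale_tokens))
```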
16. 03
Infinity: A New Route for Visual Autoregressive Generation
17. Extend VAR to T2I Generation
Class-conditional image generation (C2I)
Text-to-image generation (T2I)
18. Challenges
Poor discrete reconstruction: the index-wise discrete tokenizer, with its limited vocabulary size, suffers from significant quantization errors.
Ground Truth
VAR Reconstruction
19. Challenges
Cumulative errors: train-test discrepancies from teacher-forcing training, inherent to LLMs, amplify cumulative errors in visual details.
20. Challenges
High-resolution: VAR has not yet been validated for generating high-resolution (1024×1024)
realistic images in complex text-to-image tasks.
21. Challenges
These limitations lead us to ask the following questions:
➢ Can discrete image tokenizers achieve reconstruction performance comparable to
state-of-the-art continuous VAEs, especially in preserving high-frequency texture
details?
➢ Can visual autoregressive generation overcome the accumulation issues caused by
teacher forcing and maintain strong robustness in long-sequence generation?
➢ Can discrete visual autoregressive modeling generate high-resolution, instruction-
compliant images in complex text-to-image tasks comparable to state-of-the-art
diffusion models (e.g., FLUX Dev, SD3)?
22. Infinity
Infinity redefines VAR under a bitwise modeling framework
Bitwise modeling:
➢ Bitwise Tokenizer
➢ Infinite-Vocabulary Classifier
➢ Bitwise Self-Correction
23. Bitwise Tokenizer
Index-wise quantization -> Bitwise quantization
* Credit: [2024] Image and Video Tokenization with Binary Spherical Quantization
24. Bitwise Tokenizer
An example of VQ vs. BSQ (channel dimension omitted for simplicity). Figure: at each scale, the feature F is downsampled ("down") and quantized, via nearest-neighbour (NN) codebook lookup in Vector Quantize (VQ) or by taking the sign (+1 or -1) of each dimension in BSQ, then upsampled ("up") to give R1; the residual F - R1 is processed the same way at the next scale to give R2, and so on. A code sketch of the two quantizers follows.
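A hedged sketch of the two quantizers in the figure, assuming PyTorch; the function names and shapes are illustrative, not the tokenizer's actual implementation:

```python
import torch
import torch.nn.functional as F

def vq_quantize(x, codebook):
    """Index-wise VQ: nearest-neighbour (NN) lookup in a K-entry codebook.
    x: (N, C) features, codebook: (K, C)."""
    dists = torch.cdist(x, codebook)              # (N, K) pairwise distances
    idx = dists.argmin(dim=1)                     # one index per token
    return codebook[idx], idx

def bsq_quantize(x):
    """Bitwise BSQ: project to the unit sphere and keep only the sign of each
    dimension, i.e. d independent bits (+1 or -1) per token."""
    x = F.normalize(x, dim=-1)                    # binary *spherical* quantization
    bits = torch.sign(x)
    bits[bits == 0] = 1.0                         # break ties deterministically
    return F.normalize(bits, dim=-1), (bits > 0)  # quantized vector + d bits
```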
25. Bitwise Token
➢ Infinite-Vocabulary Classifier: predicts d bits instead of predicting 2^d indices (see the sketch below)
➢ Robust to slight perturbations and errors
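A minimal sketch of the idea behind the Infinite-Vocabulary Classifier, assuming a PyTorch head; the widths below are illustrative assumptions, not the model's actual configuration:

```python
import torch.nn as nn

d = 32          # bits per token; an index-wise head would need a 2**32-way softmax
hidden = 2048   # transformer width (illustrative)

# Index-wise head (infeasible for large d): one logit per codebook entry.
# index_head = nn.Linear(hidden, 2 ** d)

# Bitwise head: d independent binary classifiers, so the output layer grows
# with d rather than with 2^d.
bit_head = nn.Linear(hidden, d)
bit_loss = nn.BCEWithLogitsLoss()   # target: the d ground-truth bits of each token
```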
26. Bitwise Tokenizer
By scaling the visual tokenizer's vocabulary, our tokenizer improves steadily and even surpasses the continuous VAE of SD on ImageNet rFID.
27. Bitwise Tokenizer
28. Bitwise Self-Correction
Bitwise Self-Correction mitigates the train-test discrepancy (a sketch follows this list):
➢ Randomly flip bits to imitate bitwise prediction errors
➢ Re-quantize the residual features to auto-correct the previous errors
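A hedged sketch of how the two bullets above could look in training code, assuming PyTorch; `decode_bits` and the flip probability are hypothetical stand-ins, not the paper's exact implementation:

```python
import torch

def bitwise_self_correction(gt_bits, residual, decode_bits, flip_prob=0.1):
    """One scale of Bitwise Self-Correction (illustrative sketch).

    gt_bits:     ground-truth bits for the current scale (bool tensor).
    residual:    continuous residual feature this scale should explain.
    decode_bits: helper mapping bits back to feature space (assumed to exist).
    """
    # 1) Randomly flip a fraction of bits to imitate the bitwise prediction
    #    errors the model will make at inference time.
    flip = torch.rand(gt_bits.shape, device=gt_bits.device) < flip_prob
    corrupted_bits = torch.where(flip, ~gt_bits, gt_bits)

    # 2) Re-quantize: compute the next scale's residual from the *corrupted*
    #    bits, so later scales learn to auto-correct the injected errors.
    next_residual = residual - decode_bits(corrupted_bits)
    return corrupted_bits, next_residual
```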
29. Bitwise Self-Correction
Qualitatively, substantial advantages are observed after applying Bitwise Self-Correction.
30. Bitwise Self-Correction
Using Bitwise Self-Correction generates better results (right)
31. Scaling Vocabulary Size benefits Generation
32. Scaling Transformer
33. DPO
34. DPO
35. Great Speed Advantage
36. State-of-the-art Generation Results
37. State-of-the-art Generation Results
38. State-of-the-art Generation Results
39. T2I Arena
40. 04
Analysis and Reflections
41. Infinity
➢ An autoregressive model with Bitwise Modeling, which significantly improves the
scaling and visual detail representation capabilities of discrete generative models.
➢ Demonstrates the potential of scaling tokenizers and transformers by achieving near-continuous tokenizer performance.
➢ Enables a discrete autoregressive text-to-image model to achieve exceptionally strong
prompt adherence and superior image generation quality.
42.
43. THANKS
Explore the limits of AI applications