Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition
Shengming Yin1 Zekai Zhang2 Zecheng Tang2 Kaiyuan Gao2
Xiao Xu2 Kun Yan2 Jiahao Li2 Yilei Chen2 Yuxiang Chen2
Heung-Yeung Shum3 Lionel M. Ni1 Jingren Zhou2 Junyang Lin2 Chenfei Wu2 *
1 HKUST(GZ)  2 Alibaba  3 HKUST
Figure 1. Qwen-Image-Layered is capable of decomposing an input image into multiple semantically disentangled RGBA layers, thereby enabling inherent editability, where each layer can be independently manipulated without affecting other content. (Editing operations shown: Recolor, Replace, Revise, Remove, Resize, Reposition.)
Abstract
Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling inherent editability, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing. Our code and models are released on https://github.com/QwenLM/Qwen-Image-Layered
* Corresponding author.
Figure 2. Visualization of Image-to-Multi-RGBA (I2L) on open-domain images. The leftmost column in each group shows the input image. Qwen-Image-Layered is capable of decomposing diverse images into high-quality, semantically disentangled layers, where each layer can be independently manipulated without affecting other content.
Figure 3. Visualization of Image-to-Multi-RGBA (I2L) on images containing text. The leftmost column in each group shows the input image. Qwen-Image-Layered is capable of accurately decomposing both text and objects into semantically disentangled layers.
1. Introduction
Recent advances in visual generative models have enabled
impressive image synthesis capabilities [5, 10–12, 24, 31,
34, 41, 42]. However, in the context of image editing,
achieving precise modifications while preserving the struc-
ture and semantics of unedited regions remains a significant
challenge. This issue typically appears as semantic drift
(e.g. unintended changes to a person’s identity) and geo-
metric misalignment (e.g. shifts in object position or scale).
Existing editing approaches fail to fundamentally ad-
dress this problem. Global editing methods [4, 9, 21, 26, 39,
43, 48], which resample the entire image in the latent space
of generative models, are inherently limited by the stochas-
tic nature of probabilistic generation and thus cannot ensure
consistency in unedited regions. Meanwhile, mask-guided
local editing methods [8, 29, 35] restrict modification within
user-specified masks. However, in complex scenes, espe-
cially those involving occlusion or soft boundaries, the ac-
tual editing region is often ambiguous, thus failing to fun-
damentally solve the consistency problem.
Rather than tackling this issue purely through model de-
sign or data engineering, we argue that the core challenge
lies in the representation of images themselves. Traditional
raster images are flat and entangled: all visual content is
fused into a single canvas, with semantics and geometry
tightly coupled. Consequently, any edit inevitably prop-
agates through this entangled pixel space, leading to the
aforementioned inconsistencies.
To overcome this fundamental limitation, we advocate
for a naturally disentangled image representation. Specif-
ically, we propose representing an image as a stack of se-
mantically decomposed RGBA layers, as illustrated in the
upper part of Fig. 1. This layered structure enables inherent
editability with built-in consistency: edits are applied exclu-
sively to the target layer, physically isolating them from the
rest of the content, and thereby eliminating semantic drift
and geometric misalignment. Moreover, such a layer-wise
representation naturally supports high-fidelity elementary
operations, such as resizing, repositioning, and recoloring,
as demonstrated in the lower part of Fig. 1.
Based on this insight, we introduce Qwen-Image-
Layered, an end-to-end diffusion model that directly de-
composes a single RGB image into multiple semantically
disentangled RGBA layers. Once decomposed, each layer
can be independently manipulated while leaving all other
content exactly unchanged—enabling truly consistent im-
age editing. To support variable-length decomposition, our
image decomposer is built upon three key designs: (1)
an RGBA-VAE that establishes a shared latent space for
both RGB and RGBA images; (2) a VLD-MMDiT (Vari-
able Layers Decomposition MMDiT) architecture that en-
ables training with a variable number of layers; and (3)
a Multi-stage Training strategy that progressively adapts a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer image data, we develop a data pipeline to filter and annotate multilayer images from real-world Photoshop documents (PSD).
We summarize our contributions as follows:
• We propose Qwen-Image-Layered, an end-to-end diffu-
sion model that decomposes an image into multiple high-
quality, semantically disentangled RGBA layers, thereby
enabling inherently consistent image editing.
• We design the image decomposer from three aspects: 1) an RGBA-VAE to provide a shared latent space for RGB and RGBA images; 2) a VLD-MMDiT architecture to facilitate decomposition with a variable number of layers; and 3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer.
• We develop a data processing pipeline to extract and an-
notate multilayer images from Photoshop documents, ad-
dressing the lack of high-quality multilayer images.
• Extensive experiments demonstrate that Qwen-Image-
Layered not only outperforms existing methods in de-
composition quality but also unlocks new possibilities for
consistent, layer-based image editing and synthesis.
2. Related Work
2.1. Image Editing
Image editing has made significant progress in recent years
and can be broadly categorized into two paradigms: global
editing and mask-guided local editing. Global editing meth-
ods [4, 9, 21, 26, 39, 42, 43, 48] regenerate the entire im-
age to achieve holistic modifications, such as expression
editing and style transfer. Among these, Qwen-Image-
Edit [42] leverages two distinct yet complementary feature
representations—semantic features from Qwen-VL [3] and
reconstructive features from VAE [19]—to enhance consis-
tency. However, due to the inherent stochasticity of gener-
ative models, these approaches cannot ensure consistency
in unedited regions. In contrast, mask-guided local editing
methods [8, 29, 35] constrain modifications within a spec-
ified mask to preserve global consistency. DiffEdit [8], for
instance, first automatically generates a mask to identify re-
gions requiring modification and then edits the target area.
Although intuitive, these approaches struggle with occlu-
sions and soft boundaries, making it difficult to precisely
identify the actual editing region and thus failing to funda-
mentally resolve the consistency issue. Unlike these works,
we propose decomposing the image into semantically dis-
entangled RGBA layers, where each layer can be inde-
pendently modified while keeping the others unchanged,
thereby fundamentally ensuring consistency across edits.
2.2. Image Decomposition
Numerous studies have attempted to decompose images
into layers. Early approaches addressed this problem by
performing segmentation in color space [2, 20, 37]. Sub-
sequent work has focused on object-level decomposition
in natural scenes [28, 30, 47]. Among these, PCNet [47]
learns to recover fractional object masks and contents in
a self-supervised manner. More recent research has ex-
plored decomposing images into multiple RGBA layers [7,
18, 36, 38, 45]. One class of these methods leverages seg-
mentation [33] or matting [23] to extract foreground ob-
jects, followed by image inpainting [46] to reconstruct the
background. For instance, LayerD [36] iteratively extracts
the topmost unoccluded foreground layer and completes
the background. Accordion [7] proposes using Vision-
Language Models [25] to guide this decomposition process.
Another category of work introduces mask-guided, object-
centric image decomposition [18, 45], which decomposes
an image into foreground and background layers based on
a provided mask. These methods generally require segmentation to provide an initial mask. However, segmentation
often struggles with complex spatial layouts and the pres-
ence of multiple semi-transparent layers, resulting in low-
quality layers. Moreover, multilayer decomposition typi-
cally requires recursive inference, leading to error propaga-
tion. Consequently, existing methods fail to produce com-
plete, high-fidelity RGBA layers suitable for editing. In
contrast to the aforementioned approaches, Qwen-Image-
Layered employs an end-to-end framework to decompose
input images directly into multiple high-quality RGBA lay-
ers, thereby enhancing decomposition quality and enabling
consistency-preserving image editing.
2.3. Multilayer Image Synthesis
Multilayer image synthesis has also garnered sustained attention [6, 15–18, 32, 49, 50]. As a pioneer in layered image generation, Text2Layer [50] first trains a two-layer image autoencoder [19] and subsequently trains a diffusion model [13] on the latent representations, enabling the creation of two-layer images. LayerDiffusion [49] introduces latent transparency into the VAE and employs two different LoRAs [14] with shared attention to generate the foreground and background. Through carefully designed inter-layer and intra-layer attention mechanisms, LayerDiff [17] is able to synthesize semantically consistent multilayer images. To achieve controllable multilayer image generation, ART [32] proposes an anonymous region layout to explicitly control the layout. LayeringDiff [18] first generates a raster image using existing text-to-image models and then decomposes it into foreground and background based on a mask. Qwen-Image-Layered is capable of decomposing AI-generated raster images into multiple RGBA layers, thus enabling multilayer image generation.
3. Method
We propose an end-to-end layering approach that directly
decomposes an input RGB image $I \in \mathbb{R}^{H \times W \times 3}$ into $N$ RGBA layers $L \in \mathbb{R}^{N \times H \times W \times 4}$, where each layer $L_i$ comprises a color component $RGB_i$ and an alpha matte $\alpha_i$, i.e. $L_i = [RGB_i; \alpha_i]$. The original image can be reconstructed by sequential alpha blending as follows:
$$C_0 = 0, \qquad C_i = \alpha_i \cdot RGB_i + (1 - \alpha_i) \cdot C_{i-1}, \quad i = 1, \dots, N,$$
where $C_i$ denotes the composite of the first $i$ layers, and the final composite satisfies $I = C_N$. Building upon Qwen-Image [42], we develop Qwen-Image-Layered from the following three aspects:
• 1) In contrast to previous decomposers [45] that employ separate VAEs, we propose an RGBA-VAE that encodes both RGB and RGBA images. This approach narrows the latent distribution gap between the input RGB image and the output RGBA layers.
• 2) Unlike prior methods that decompose images into foreground and background [18, 45], we propose a VLD-MMDiT (Variable Layers Decomposition MMDiT), which supports decomposition into a variable number of layers and is compatible with multi-task training.
• 3) To adapt a pretrained image generation model into a multilayer image decomposer, we design a multi-stage, multi-task training scheme that progressively evolves from simpler tasks to more complex ones.
3.1. RGBA-VAE
Variational Autoencoders (VAEs) [19] are commonly em-
ployed in diffusion models [34] to reduce the dimensional-
ity of the latent space, thereby improving both training and
sampling efficiency. In previous work, LayeringDiff [18]
utilized an RGB VAE to first generate the foreground layer
and subsequently applied an additional module to obtain
transparency. LayerDecomp [45] adopted separate VAEs
for the input RGB image and the output RGBA layers, re-
sulting in a distribution gap between the input and output
representations. To address these limitations, we propose
the RGBA-VAE, a four-channel VAE designed to process both
RGB and RGBA images.
Inspired by AlphaVAE [40], we extend the first convolu-
tion layer of the Qwen-Image VAE encoder E and the last
convolution layer of the decoder D from three to four chan-
nels. To enable reconstruction of both RGB and RGBA im-
ages, we train it using both types of images. For RGB im-
ages, the alpha channel is set to 1. To maintain RGB recon-
struction performance during initialization, we employ the
following initialization strategy. Let $W_E^0 \in \mathbb{R}^{D_0 \times 4 \times k \times k \times k}$ and $b_E^0 \in \mathbb{R}^{D_0}$ denote the weight and bias of the first convolution layer in the encoder, and $W_D^l \in \mathbb{R}^{4 \times D_l \times k \times k \times k}$ and $b_D^l \in \mathbb{R}^4$ denote those of the last convolution layer in the decoder, where $k$ is the kernel size.
Figure 4. Overview of Qwen-Image-Layered. Left: Illustration of our proposed VLD-MMDiT (Variable Layers Decomposition MMDiT),
where the input RGB image and the target RGBA layers are both encoded by our proposed RGBA-VAE. During attention computation,
these two sequences are concatenated along the sequence dimension, thereby enhancing inter-layer and intra-layer interactions. Right:
Illustration of Layer3D RoPE, where a new layer dimension is introduced to support a variable number of layers.
We copy the parameters from the pretrained RGB VAE into the first three channels and set the newly initialized parameters as
$$W_E^0[:, 3, :, :, :] = 0, \qquad W_D^l[3, :, :, :, :] = 0, \qquad b_D^l[3] = 1.$$
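As an illustration of this initialization, here is a hedged PyTorch sketch of extending the boundary convolutions of a pretrained RGB VAE to four channels. The use of plain `nn.Conv3d` modules with biases, and the helper name, are assumptions for illustration and do not reflect the actual Qwen-Image VAE implementation.

```python
import torch
import torch.nn as nn

def expand_rgb_vae_to_rgba(enc_conv3: nn.Conv3d, dec_conv3: nn.Conv3d):
    """Build 4-channel replacements for the first encoder conv (in: 3 -> 4) and the
    last decoder conv (out: 3 -> 4) of a pretrained RGB VAE, following the
    zero/one initialization described above."""
    enc_conv4 = nn.Conv3d(4, enc_conv3.out_channels, enc_conv3.kernel_size,
                          stride=enc_conv3.stride, padding=enc_conv3.padding)
    dec_conv4 = nn.Conv3d(dec_conv3.in_channels, 4, dec_conv3.kernel_size,
                          stride=dec_conv3.stride, padding=dec_conv3.padding)
    with torch.no_grad():
        # Copy pretrained RGB weights/biases into the first three channels.
        enc_conv4.weight[:, :3].copy_(enc_conv3.weight)
        enc_conv4.bias.copy_(enc_conv3.bias)
        dec_conv4.weight[:3].copy_(dec_conv3.weight)
        dec_conv4.bias[:3].copy_(dec_conv3.bias)
        # New alpha channel: W_E[:, 3] = 0, W_D[3] = 0, b_D[3] = 1, so an RGB
        # image with alpha = 1 reconstructs exactly as in the RGB VAE.
        enc_conv4.weight[:, 3].zero_()
        dec_conv4.weight[3].zero_()
        dec_conv4.bias[3].fill_(1.0)
    return enc_conv4, dec_conv4
```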
For the training objective, we use a combination of re-
construction loss, perceptual loss, and regularization loss.
After training, both the input RGB image and the output
RGBA layers are encoded into a shared latent space, where
each RGBA layer is encoded independently. Notably, these
layers exhibit no cross-layer redundancy; consequently, no
compression is applied along the layer dimension.
3.2. Variable Layers Decomposition MMDiT
Previous studies [7, 18, 36, 45] typically decompose images into background and foreground, requiring recursive inference to perform multilayer decomposition. Instead, Qwen-Image-Layered proposes VLD-MMDiT (Variable Layers Decomposition MMDiT) to facilitate the decomposition of a variable number of layers.
Qwen-Image-Layered takes an RGB image $I \in \mathbb{R}^{H \times W \times 3}$ as input and decomposes it into multiple RGBA layers $L \in \mathbb{R}^{N \times H \times W \times 4}$. Following Qwen-Image, we adopt the Flow Matching training objective. Formally, let $x_0 \in \mathbb{R}^{N \times h \times w \times c}$ denote the latent representation of the target RGBA layers $L$, i.e., $x_0 = E(L)$. We then sample noise $x_1$ from a standard multivariate normal distribution and a timestep $t \in [0, 1]$ from a logit-normal distribution. According to Rectified Flow [27], the intermediate state $x_t$ and velocity $v_t$ at timestep $t$ are defined as
$$x_t = t x_0 + (1 - t) x_1, \qquad v_t = \frac{dx_t}{dt} = x_0 - x_1.$$
For the input RGB image $I$, we also use the RGBA-VAE to encode it as a latent representation $z_I \in \mathbb{R}^{h \times w \times c}$. Following Qwen-Image, the text prompt is encoded into a text condition $h$ with an MLLM; in practice, we can use Qwen2.5-VL [3] to automatically generate the caption for the input image. The model is then trained to predict the target velocity, with the loss function defined as the mean squared error between the predicted velocity $v_\theta(x_t, t, z_I, h)$ and the ground truth $v_t$:
$$\mathcal{L} = \mathbb{E}_{(x_0, x_1, t, z_I, h) \sim D} \left\| v_\theta(x_t, t, z_I, h) - v_t \right\|^2,$$
where $D$ denotes the training dataset.
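The training objective above can be sketched as follows. Here `model` stands in for VLD-MMDiT, its call signature is an assumption, and sampling the logit-normal timestep as the sigmoid of a Gaussian is an illustrative choice consistent with the description.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, z_img, h_text):
    """One Flow Matching training step (sketch): x0 are clean layer latents,
    z_img is the RGBA-VAE latent of the input image, h_text the text condition."""
    x1 = torch.randn_like(x0)                        # noise sample
    # timestep from a logit-normal distribution, one per batch element
    t = torch.sigmoid(torch.randn(x0.shape[0], device=x0.device))
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))         # broadcast over latent dims
    xt = t_ * x0 + (1.0 - t_) * x1                   # intermediate state x_t
    vt = x0 - x1                                     # target velocity dx_t/dt
    v_pred = model(xt, t, z_img, h_text)             # assumed predictor signature
    return F.mse_loss(v_pred, vt)
```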
Previous studies [16, 17] have achieved multilayer image generation through carefully designed inter-layer and intra-layer attention mechanisms. In contrast, we employ Multi-Modal attention [10] to directly model these relationships, as shown in the left part of Fig. 4. Specifically,
we apply 2× patchification to the noise-free input image zI
and the intermediate state xt along the height and width di-
mensions. In each VLD-MMDiT block, two separate sets of
parameters are used to process textual h and visual informa-
tion zI , xt respectively. During attention computation, we
concatenate these three sequences, thereby directly model-
ing both intra-layer and inter-layer interactions.
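To make the joint attention concrete, a minimal sketch is given below: text, condition-image, and layer tokens are concatenated along the sequence dimension before a single attention call. The tensor shapes and the use of `nn.MultiheadAttention` (created with `batch_first=True`) are illustrative assumptions rather than the actual VLD-MMDiT block.

```python
import torch
import torch.nn as nn

def joint_attention(attn: nn.MultiheadAttention,
                    text_tokens: torch.Tensor,    # (B, T_txt, D)
                    image_tokens: torch.Tensor,   # (B, T_img, D)  condition image z_I
                    layer_tokens: torch.Tensor):  # (B, N * T_img, D)  all noisy layers x_t
    """Sketch of the joint attention inside a VLD-MMDiT block: every token attends
    to every other token, covering intra-layer and inter-layer interactions.
    Assumes `attn` was constructed with batch_first=True."""
    seq = torch.cat([text_tokens, image_tokens, layer_tokens], dim=1)
    out, _ = attn(seq, seq, seq, need_weights=False)
    # Split back so each stream can pass through its own feed-forward parameters.
    t, i = text_tokens.shape[1], image_tokens.shape[1]
    return out[:, :t], out[:, t:t + i], out[:, t + i:]
```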
As shown in the right part of Fig. 4, we propose a
Layer3D RoPE within each VLD-MMDiT block to enable
the decomposition of a variable number of layers, while
supporting various tasks. Our design is inspired by the
MSRoPE from Qwen-Image [42], where the positional en-
coding in each layer is shifted towards the center. To ac-
commodate a variable number of layers, we introduce an
additional layer dimension. For the intermediate state xt ,
the layer index starts from 0, and increases accordingly. For
conditional image input zI , we assign a layer index of -1,
ensuring a clear distinction from any positive layer indices
used in other tasks, e.g. text-to-multilayer image generation.
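A sketch of how such layer-aware position indices might be constructed is given below; the exact centering and frequency assignment of MSRoPE are not specified here, so those details are assumptions for illustration.

```python
import torch

def layer3d_positions(num_layers: int, h: int, w: int, cond_image: bool = True):
    """Sketch of Layer3D RoPE position ids: each token gets a (layer, row, col) triple.
    Rows/cols are centered (MSRoPE-style); target layers use indices 0..N-1 and the
    conditional input image uses layer index -1."""
    rows = torch.arange(h) - h // 2          # centered height positions
    cols = torch.arange(w) - w // 2          # centered width positions
    grid = torch.stack(torch.meshgrid(rows, cols, indexing="ij"), dim=-1)  # (h, w, 2)

    ids = []
    layer_indices = ([-1] if cond_image else []) + list(range(num_layers))
    for layer in layer_indices:
        layer_col = torch.full((h, w, 1), layer)
        ids.append(torch.cat([layer_col, grid], dim=-1).reshape(-1, 3))
    return torch.cat(ids, dim=0)             # (num_tokens, 3)
```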
Figure 5. Statistics of the processed multilayer image dataset. (a) Distribution of layer counts before and after merging. (b) Category distribution in the final dataset.
3.3. Multi-stage Training
Directly finetuning a pretrained image generation model to
perform image decomposition poses significant challenges,
as it not only requires adapting to a new VAE but also in-
volves learning new tasks. To address this issue, we propose
a multi-stage, multi-task training scheme that progressively
evolves from simpler tasks to more complex ones.
Stage 1: From Text-to-RGB to Text-to-RGBA. We be-
gin by adapting MMDiT to the latent space of RGBA VAE.
At this stage, we replace the original VAE and train the
model jointly on both text-to-RGB and text-to-RGBA gen-
eration tasks. This enables the model to generate not only
standard raster images (RGB) but also images with trans-
parency (RGBA).
Stage 2: From Text-to-RGBA to Text-to-Multi-
RGBA. Initially, the image generator is capable of produc-
ing only a single image. To support multilayer generation
and adapt to the newly initialized layer dimension, we in-
troduce a text-to-multiple-RGBA generation task. Follow-
ing ART [32], the model is trained to jointly predict both
the final composite image and its corresponding transpar-
ent layers, thereby facilitating information propagation be-
tween the composite image and its layers. We refer to this
model as Qwen-Image-Layered-T2L.
Stage 3: From Text-to-Multi-RGBA to Image-to-
Multi-RGBA. Up to this point, all tasks have been con-
ditioned exclusively on textual prompts. In this stage, we
introduce an additional image input, as detailed in Sec. 3.2,
extending the model’s capability to decompose a given
RGB image into multiple RGBA layers. We refer to this
model as Qwen-Image-Layered-I2L.
4. Experiment
4.1. Data Collection and Annotation
Due to the scarcity of high-quality multilayer images, previous studies [17, 18, 36, 50] have largely relied on either synthetic data [38] or simple graphic design datasets (e.g., Crello [44]), which typically lack complex layouts or semi-transparent layers. To bridge this gap, we developed a data pipeline to filter and annotate multilayer images derived from real-world PSD (Photoshop Document) files.
We began by collecting a large corpus of PSD files and extracting all layers using psd-tools, an open-source Python library for parsing Adobe Photoshop documents. To ensure data quality, we filtered out layers containing anomalous elements, such as blurred faces. To improve decomposition performance, we removed non-contributing layers that do not influence the final composite image. Furthermore, given that some PSD files contain hundreds of layers, which increases model complexity, we merged spatially non-overlapping layers to reduce the total layer count. As shown in Fig. 5a, this operation substantially reduces the number of layers. Finally, we employed Qwen2.5-VL [3] to generate text descriptions for the composite images, enabling Text-to-Multi-RGBA generation.
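The layer-extraction step can be sketched with psd-tools roughly as follows; the filtering and merging heuristics described above are omitted, and the exact calls used in our pipeline are not specified in the paper, so this is an assumption-level illustration.

```python
from PIL import Image
from psd_tools import PSDImage

def extract_layers(psd_path: str):
    """Sketch: open a PSD file with psd-tools and rasterize each visible layer
    to an RGBA image positioned on the full canvas."""
    psd = PSDImage.open(psd_path)
    composite = psd.composite()              # flattened reference image (PIL)
    layers = []
    for layer in psd.descendants():
        if layer.is_group() or not layer.is_visible():
            continue
        tile = layer.composite()             # rasterize the layer (PIL image or None)
        if tile is None:
            continue
        full = Image.new("RGBA", (psd.width, psd.height), (0, 0, 0, 0))
        full.paste(tile.convert("RGBA"), (layer.left, layer.top))
        layers.append(full)
    return composite, layers
```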
4.2. Implementation Details
Building upon Qwen-Image [42], we developed Qwen-
Image-Layered. The model was trained using the Adam
optimizer [1] with a learning rate of $1 \times 10^{-5}$. For Text-
to-RGB and Text-to-RGBA generation, training was per-
formed on an internal dataset. For both Text-to-Multi-
RGBA and Image-to-Multi-RGBA generation, the model
was optimized on our proposed multilayer image dataset,
with the maximum number of layers set to 20. The training
process was conducted in three stages, comprising 500K,
400K, and 400K optimization steps, respectively.
4.3. Quantitative Results
4.3.1. Image Decomposition
To quantitatively evaluate image decomposition, we adopt
the evaluation protocol introduced by LayerD [36]. This
protocol aligns layer sequences of varying lengths using
order-aware Dynamic Time Warping and allows for the
merging of adjacent layers to account for inherent ambi-
guities in decomposition (i.e., a single image may have
multiple plausible decompositions). Quantitative results on
the Crello dataset [44] are reported in Tab. 1. Following LayerD [36], we report two metrics: RGB L1 (the L1 distance of the RGB channels weighted by the ground-truth alpha) and Alpha soft IoU (the soft IoU between predicted and ground-truth alpha channels). Due to a significant distribution gap between the Crello dataset and our proposed multilayer dataset, such as differences in the number of layers and the presence of semi-transparent layers, we finetune our model on the Crello training set. As shown in Tab. 1, our method achieves the highest decomposition accuracy, notably achieving a significantly higher Alpha soft IoU score, underscoring its superior ability to generate high-fidelity alpha channels.
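A simplified sketch of the two metrics is given below; the official LayerD protocol additionally performs order-aware alignment and layer merging, which are omitted here, and the exact normalization is an assumption.

```python
import numpy as np

def rgb_l1(pred_rgb, gt_rgb, gt_alpha, eps=1e-6):
    """Alpha-weighted RGB L1 (sketch): per-pixel L1 distance weighted by the
    ground-truth alpha; normalization may differ from the official protocol."""
    w = gt_alpha[..., None]                                   # (H, W, 1)
    return float((w * np.abs(pred_rgb - gt_rgb)).sum() / (w.sum() * 3 + eps))

def alpha_soft_iou(pred_alpha, gt_alpha, eps=1e-6):
    """Soft IoU between predicted and ground-truth alpha mattes in [0, 1]."""
    inter = np.minimum(pred_alpha, gt_alpha).sum()
    union = np.maximum(pred_alpha, gt_alpha).sum()
    return float(inter / (union + eps))
```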
Figure 6. Qualitative comparison of Image-to-Multi-RGBA (I2L). The leftmost column shows the input image; the subsequent columns present the decomposed layers. Notably, LayerD [36] exhibits inpainting artifacts (Output Layer 1) and inaccurate segmentation (Output Layers 2 and 3), while our method produces high-quality, semantically disentangled layers, suitable for inherently consistent image editing.
Table 1. Quantitative comparison of Image-to-Multi-RGBA (I2L) on the Crello dataset [44]. RGB L1: L1 distance between RGB channels weighted by the ground-truth alpha. Alpha soft IoU: soft IoU between predicted and ground-truth alpha channels. Columns correspond to the maximum number of allowed layer merges (0 to 5).

RGB L1↓ (merges 0 / 1 / 2 / 3 / 4 / 5):
VLM Base + Hi-SAM [7]: 0.1197 / 0.1029 / 0.0892 / 0.0807 / 0.0755 / 0.0726
Yolo Base + Hi-SAM: 0.0962 / 0.0833 / 0.0710 / 0.0630 / 0.0592 / 0.0579
LayerD [36]: 0.0709 / 0.0541 / 0.0457 / 0.0419 / 0.0403 / 0.0396
Qwen-Image-Layered-I2L: 0.0594 / 0.0490 / 0.0393 / 0.0377 / 0.0364 / 0.0363

Alpha soft IoU↑ (merges 0 / 1 / 2 / 3 / 4 / 5):
VLM Base + Hi-SAM [7]: 0.5596 / 0.6302 / 0.6860 / 0.7222 / 0.7465 / 0.7589
Yolo Base + Hi-SAM: 0.5697 / 0.6537 / 0.7169 / 0.7567 / 0.7811 / 0.7897
LayerD [36]: 0.7520 / 0.8111 / 0.8435 / 0.8564 / 0.8622 / 0.8650
Qwen-Image-Layered-I2L: 0.8705 / 0.8863 / 0.9105 / 0.9121 / 0.9156 / 0.9160
(Editing prompts shown in the figure: "Change to a diagonal composition"; "Move the pink words 'Skate boarding' to the front of the girl, make the girl look bigger"; "Move the man to the right, keep the background unchanged"; "Make the man face to the right and look shorter, keep the background unchanged".)
Figure 7. Qualitative comparison of image editing. The leftmost column is the input image; prompts are listed above each row. Qwen-Image-Edit-2509 [42] struggles with resizing and repositioning, tasks inherently supported by Qwen-Image-Layered. Meanwhile, Qwen-Image-Edit-2509 introduces pixel-level shifts (last row), while Qwen-Image-Layered can ensure consistency by editing specific layers.
Table 2. Ablation study on the Crello dataset [44]. L: Layer3D RoPE, R: RGBA-VAE, M: Multi-stage Training. Columns correspond to the maximum number of allowed layer merges (0 to 5).

RGB L1↓ (merges 0 / 1 / 2 / 3 / 4 / 5):
Qwen-Image-Layered-I2L w/o L, R, M (× × ×): 0.2809 / 0.2567 / 0.2467 / 0.2449 / 0.2439 / 0.2435
Qwen-Image-Layered-I2L w/o R, M (✓ × ×): 0.1894 / 0.1430 / 0.1255 / 0.1173 / 0.1138 / 0.1126
Qwen-Image-Layered-I2L w/o M (✓ ✓ ×): 0.1649 / 0.1178 / 0.1048 / 0.0992 / 0.0966 / 0.0959
Qwen-Image-Layered-I2L (✓ ✓ ✓): 0.0594 / 0.0490 / 0.0393 / 0.0377 / 0.0364 / 0.0363

Alpha soft IoU↑ (merges 0 / 1 / 2 / 3 / 4 / 5):
Qwen-Image-Layered-I2L w/o L, R, M (× × ×): 0.3725 / 0.4540 / 0.5281 / 0.5746 / 0.5957 / 0.6031
Qwen-Image-Layered-I2L w/o R, M (✓ × ×): 0.5844 / 0.6927 / 0.7576 / 0.7847 / 0.7954 / 0.7984
Qwen-Image-Layered-I2L w/o M (✓ ✓ ×): 0.6504 / 0.7583 / 0.8074 / 0.8243 / 0.8310 / 0.8331
Qwen-Image-Layered-I2L (✓ ✓ ✓): 0.8705 / 0.8863 / 0.9105 / 0.9121 / 0.9156 / 0.9160
Table 3. Quantitative comparison of RGBA image reconstruction on the AIM-500 dataset [22].

Model / Base Model / PSNR↑ / SSIM↑ / rFID↓ / LPIPS↓:
LayerDiffuse [49] / SDXL / 32.0879 / 0.9436 / 17.7023 / 0.0418
AlphaVAE [40] / SDXL / 35.7446 / 0.9576 / 10.9178 / 0.0495
AlphaVAE [40] / FLUX / 36.9439 / 0.9737 / 11.7884 / 0.0283
RGBA-VAE / Qwen-Image / 38.8252 / 0.9802 / 5.3132 / 0.0123
4.3.2. Ablation Study
We conducted an ablation study on the Crello dataset [44] to
validate the effectiveness of our proposed method. The re-
sults are presented in Tab. 2. For settings without multi-
stage training, we initialize the model directly from pre-
trained text-to-image weights. For experiments without
RGBA-VAE, we employ the original RGB VAE to encode the input RGB image while retaining the RGBA-VAE for the output RGBA layers. For variants without Layer3D RoPE, we replace it with standard 2D RoPE for positional encoding. All ablation experiments follow the same evaluation protocol as described in Sec. 4.3.1. As shown in the third and fourth rows, multi-stage training effectively improves decomposition quality. Comparing the second and third rows, the superior performance in the third row indicates that the RGBA-VAE effectively eliminates the distribution gap, thereby improving overall performance. Furthermore, the comparison between the first and second rows illustrates the necessity of Layer3D RoPE: without it, the model cannot distinguish between different layers and thus fails to decompose images into multiple meaningful layers.
Figure 8. Qualitative comparison of Text-to-Multi-RGBA (T2L). The rightmost column shows the composite image. The second row
directly generates layers from text (Qwen-Image-Layered-T2L); the third row first generates a raster image (Qwen-Image-T2I) then de-
composes it into layers (Qwen-Image-Layered-I2L). ART [32] fails to follow the prompt, while Qwen-Image-Layered-T2L produces
semantically coherent layers, and Qwen-Image-T2I + Qwen-Image-Layered-I2L further improves visual aesthetics.
4.3.3. RGBA Image Reconstruction
Following AlphaVAE [40], we quantitatively evaluate
RGBA image reconstruction by blending the reconstructed
images over a solid-color background. Quantitative results
on the AIM-500 dataset [22] are presented in Tab. 3, where we compare our proposed RGBA-VAE against LayerDiffuse [49] and AlphaVAE [40] in terms of PSNR, SSIM, rFID, and LPIPS. As shown in Tab. 3, the RGBA-VAE achieves the best results across all four metrics, demonstrating its
outstanding reconstruction capability.
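A minimal sketch of this evaluation protocol, compositing over a solid background before computing PSNR, is shown below; the background color is an assumption, and SSIM, rFID, and LPIPS require their respective libraries and are omitted.

```python
import numpy as np

def composite_over_background(rgba, bg_color):
    """Blend an RGBA image (float, values in [0, 1]) over a solid-color background."""
    rgb, alpha = rgba[..., :3], rgba[..., 3:4]
    return alpha * rgb + (1.0 - alpha) * np.asarray(bg_color, dtype=rgba.dtype)

def psnr(a, b, eps=1e-12):
    """PSNR between two images with values in [0, 1]."""
    mse = np.mean((a - b) ** 2)
    return float(10.0 * np.log10(1.0 / (mse + eps)))

# Sketch of the protocol: composite both the reconstructed and the ground-truth
# RGBA image over the same solid background, then compare the composites, e.g.
# score = psnr(composite_over_background(recon, (1, 1, 1)),
#              composite_over_background(gt,    (1, 1, 1)))
```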
4.4. Qualitative Results
4.4.1. Image Decomposition
We present a qualitative comparison of image decompo-
sition with LayerD [36] in Fig. 6. Notably, LayerD pro-
duces low-quality decomposition layers due to inaccurate
segmentation (layers 2 and 3) and inpainting artifacts (layer
1), rendering its results unsuitable for editing. In contrast,
our model performs image decomposition in an end-to-
end manner without relying on external modules, yielding
more coherent and semantically plausible decompositions,
thereby facilitating inherently consistent image editing.
4.4.2. Image Editing
In Fig. 7, we present a qualitative comparison with Qwen-
Image-Edit-2509 [42]. For Qwen-Image-Layered, we first
decompose the input image into multiple semantically dis-
entangled RGBA layers and then apply simple manual ed-
its. As illustrated, Qwen-Image-Edit-2509 struggles to fol-
low instructions involving layout modifications, resizing,
or repositioning. In contrast, Qwen-Image-Layered inher-
ently supports these elementary operations with high fi-
delity. Moreover, Qwen-Image-Edit-2509 introduces noticeable pixel-level shifts, as shown in the bottom row. By contrast, the layered representation enables precise editing of individual layers while leaving others exactly untouched, thereby achieving consistency-preserving editing.
4.4.3. Multilayer Image Synthesis
In Fig. 8, we present a qualitative comparison of Text-to-Multi-RGBA generation. In the second row, we directly employ Qwen-Image-Layered-T2L for text-conditioned multilayer image synthesis. Alternatively, we first generate a raster image from text using Qwen-Image-T2I [42] and then decompose it into multiple layers using Qwen-Image-Layered-I2L. As illustrated, ART [32] struggles to generate semantically coherent multilayer images (e.g. missing bats and cat). In contrast, Qwen-Image-Layered-T2L produces semantically coherent multilayer compositions. Moreover, the pipeline combining Qwen-Image-T2I and Qwen-Image-Layered-I2L further leverages the knowledge embedded in the text-to-image generator, enhancing both semantic alignment and visual aesthetics.
5. Conclusion
In this paper, we introduce Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers. By representing images as a stack of layers, our approach enables inherent editability: each layer can be independently manipulated while leaving all other content exactly unchanged, thereby fundamentally ensuring consistency across edits. Extensive experiments demonstrate that our method significantly outperforms existing approaches in decomposition quality and establishes a new paradigm for consistency-preserving image editing.
References
[14] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-
Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al.
Lora: Low-rank adaptation of large language models. ICLR,
1(2):3, 2022. 5
[15] Dingbang Huang, Wenbo Li, Yifei Zhao, Xinyu Pan, Yan-
hong Zeng, and Bo Dai. Psdiffusion: Harmonized multi-
layer image generation via layout and appearance alignment.
arXiv preprint arXiv:2505.11468, 2025. 5
[16] Junjia Huang, Pengxiang Yan, Jinhang Cai, Jiyang Liu,
Zhao Wang, Yitong Wang, Xinglong Wu, and Guanbin Li.
Dreamlayer: Simultaneous multi-layer generation via diffu-
sion mode. arXiv preprint arXiv:2503.12838, 2025. 6
[17] Runhui Huang, Kaixin Cai, Jianhua Han, Xiaodan Liang,
Renjing Pei, Guansong Lu, Songcen Xu, Wei Zhang, and
Hang Xu. Layerdiff: Exploring text-guided multi-layered
composable image synthesis via layer-collaborative diffu-
sion model. In European Conference on Computer Vision,
pages 144–160. Springer, 2024. 5, 6, 7
[18] Kyoungkook Kang, Gyujin Sim, Geonung Kim, Donguk
Kim, Seungho Nam, and Sunghyun Cho. Layeringdiff: Lay-
ered image synthesis via generation, then disassembly with
generative knowledge. arXiv preprint arXiv:2501.01197,
2025. 5, 6, 7
[19] Diederik P Kingma and Max Welling. Auto-encoding varia-
tional bayes. arXiv preprint arXiv:1312.6114, 2013. 4, 5
[20] Yuki Koyama and Masataka Goto. Decomposing images into
layers with advanced color blending. In Computer Graphics
Forum, pages 397–407. Wiley Online Library, 2018. 5
[21] Black Forest Labs, Stephen Batifol, Andreas Blattmann,
Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dock-
horn, Jack English, Zion English, Patrick Esser, et al. Flux.
1 kontext: Flow matching for in-context image generation
and editing in latent space. arXiv preprint arXiv:2506.15742,
2025. 4
[22] Jizhizi Li, Jing Zhang, and Dacheng Tao. Deep automatic
natural image matting. arXiv preprint arXiv:2107.07235,
2021. 9, 10
[23] Jiachen Li, Jitesh Jain, and Humphrey Shi. Matting anything.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 1775–1785, 2024. 5
[24] Jian Liang, Chenfei Wu, Xiaowei Hu, Zhe Gan, Jianfeng
Wang, Lijuan Wang, Zicheng Liu, Yuejian Fang, and Nan
Duan. Nuwa-infinity: Autoregressive over autoregressive
generation for infinite visual synthesis. Advances in Neural
Information Processing Systems, 35:15420–15432, 2022. 4
[25] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee.
Visual instruction tuning. Advances in neural information
processing systems, 36:34892–34916, 2023. 5
[26] Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang,
Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chun-
rui Han, et al. Step1x-edit: A practical framework for general
image editing. arXiv preprint arXiv:2504.17761, 2025. 4
[27] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow
straight and fast: Learning to generate and transfer data with
rectified flow. arXiv preprint arXiv:2209.03003, 2022. 6
[28] Zhengzhe Liu, Qing Liu, Chirui Chang, Jianming Zhang,
Daniil Pakhomov, Haitian Zheng, Zhe Lin, Daniel Cohen-Or, and Chi-Wing Fu. Object-level scene deocclusion. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. 5
[1] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 7
[2] Yağiz Aksoy, Tunç Ozan Aydin, Aljoša Smolić, and Marc
Pollefeys. Unmixing-based soft color segmentation for im-
age manipulation. ACM Transactions on Graphics (TOG),
36(2):1–19, 2017. 5
[3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin
Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun
Tang, et al. Qwen2.5-VL technical report. arXiv preprint
arXiv:2502.13923, 2025. 4, 6, 7
[4] Tim Brooks, Aleksander Holynski, and Alexei A Efros. In-
structpix2pix: Learning to follow image editing instructions.
In Proceedings of the IEEE/CVF conference on computer vi-
sion and pattern recognition, pages 18392–18402, 2023. 4
[5] Qi Cai, Yehao Li, Yingwei Pan, Ting Yao, and Tao Mei.
Hidream-i1: An open-source high-efficient image genera-
tive foundation model. In Proceedings of the 33rd ACM In-
ternational Conference on Multimedia, pages 13636–13639,
2025. 4
[6] Junwen Chen, Heyang Jiang, Yanbin Wang, Keming Wu,
Ji Li, Chao Zhang, Keiji Yanai, Dong Chen, and Yuhui
Yuan. Prismlayers: Open data for high-quality multi-
layer transparent image generative models. arXiv preprint
arXiv:2505.22523, 2025. 5
[7] Jingye Chen, Zhaowen Wang, Nanxuan Zhao, Li Zhang, Di-
fan Liu, Jimei Yang, and Qifeng Chen. Rethinking layered
graphic design generation with a top-down approach. In
Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 16861–16870, 2025. 5, 6, 8
[8] Guillaume Couairon, Jakob Verbeek, Holger Schwenk,
and Matthieu Cord. Diffedit: Diffusion-based seman-
tic image editing with mask guidance. arXiv preprint
arXiv:2210.11427, 2022. 4
[9] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou,
Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie,
Ziang Song, et al. Emerging properties in unified multimodal
pretraining. arXiv preprint arXiv:2505.14683, 2025. 4
[10] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim
Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik
Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti-
fied flow transformers for high-resolution image synthesis.
In Forty-first international conference on machine learning,
2024. 4, 6
[11] Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao
Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang
Liu, et al. Seedream 3.0 technical report. arXiv preprint
arXiv:2504.11346, 2025.
[12] Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen
Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun
Shi, et al. Seedream 2.0: A native chinese-english bilin-
gual image generation foundation model. arXiv preprint
arXiv:2503.07703, 2025. 4
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif-
fusion probabilistic models. Advances in neural information
processing systems, 33:6840–6851, 2020. 5
[40] Zile Wang, Hao Yu, Jiabo Zhan, and Chun Yuan. Alphavae:
Unified end-to-end rgba image reconstruction and genera-
tion with alpha-aware representation learning. arXiv preprint
arXiv:2507.09308, 2025. 5, 9, 10
[41] Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang,
Daxin Jiang, and Nan Duan. Nüwa: Visual synthesis pre-
training for neural visual world creation. In European con-
ference on computer vision, pages 720–736. Springer, 2022.
4
[42] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan
Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei
Chen, et al. Qwen-image technical report. arXiv preprint
arXiv:2508.02324, 2025. 4, 5, 7, 9, 10
[43] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin
Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie
Zhou, et al. Omnigen2: Exploration to advanced multimodal
generation. arXiv preprint arXiv:2506.18871, 2025. 4
[44] Kota Yamaguchi. Canvasvae: Learning to generate vector
graphic documents. In Proceedings of the IEEE/CVF Inter-
national Conference on Computer Vision, pages 5481–5489,
2021. 7, 8, 9
[45] Jinrui Yang, Qing Liu, Yijun Li, Soo Ye Kim, Daniil Pakho-
mov, Mengwei Ren, Jianming Zhang, Zhe Lin, Cihang Xie,
and Yuyin Zhou. Generative image layer decomposition with
visual effects. In Proceedings of the Computer Vision and
Pattern Recognition Conference, pages 7643–7653, 2025. 5,
6
[46] Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin
Jin, Wenjun Zeng, and Zhibo Chen. Inpaint anything:
Segment anything meets image inpainting. arXiv preprint
arXiv:2304.06790, 2023. 5
[47] Xiaohang Zhan, Xingang Pan, Bo Dai, Ziwei Liu, Dahua
Lin, and Chen Change Loy. Self-supervised scene de-
occlusion. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 3784–3792,
2020. 5
[48] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su.
Magicbrush: A manually annotated dataset for instruction-
guided image editing. Advances in Neural Information Pro-
cessing Systems, 36:31428–31449, 2023. 4
[49] Lvmin Zhang and Maneesh Agrawala. Transparent image
layer diffusion using latent transparency. arXiv preprint
arXiv:2402.17113, 2024. 5, 9, 10
[50] Xinyang Zhang, Wentian Zhao, Xin Lu, and Jeff Chien.
Text2layer: Layered image generation using latent diffusion
model. arXiv preprint arXiv:2307.09781, 2023. 5, 7
[29] Qi Mao, Lan Chen, Yuchao Gu, Zhen Fang, and Mike Zheng
Shou. Mag-edit: Localized image editing in complex scenar-
ios via mask-based attention-adjusted guidance. In Proceed-
ings of the 32nd ACM International Conference on Multime-
dia, pages 6842–6850, 2024. 4
[30] Tom Monnier, Elliot Vincent, Jean Ponce, and Mathieu
Aubry. Unsupervised layered image decomposition into ob-
ject prototypes. In Proceedings of the IEEE/CVF inter-
national conference on computer vision, pages 8640–8650,
2021. 5
[31] Dustin Podell, Zion English, Kyle Lacey, Andreas
Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and
Robin Rombach. Sdxl: Improving latent diffusion mod-
els for high-resolution image synthesis. arXiv preprint
arXiv:2307.01952, 2023. 4
[32] Yifan Pu, Yiming Zhao, Zhicong Tang, Ruihong Yin, Haox-
ing Ye, Yuhui Yuan, Dong Chen, Jianmin Bao, Sirui Zhang,
Yanbin Wang, et al. Art: Anonymous region transformer for
variable multi-layer transparent image generation. In Pro-
ceedings of the Computer Vision and Pattern Recognition
Conference, pages 7952–7962, 2025. 5, 7, 10
[33] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang
Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman
Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2:
Segment anything in images and videos. arXiv preprint
arXiv:2408.00714, 2024. 5
[34] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models. In Proceedings of
the IEEE/CVF conference on computer vision and pattern
recognition, pages 10684–10695, 2022. 4, 5
[35] Enis Simsar, Alessio Tonioni, Yongqin Xian, Thomas Hof-
mann, and Federico Tombari. Lime: localized image edit-
ing via attention regularization in diffusion models. In 2025
IEEE/CVF Winter Conference on Applications of Computer
Vision (WACV), pages 222–231. IEEE, 2025. 4
[36] Tomoyuki Suzuki, Kang-Jun Liu, Naoto Inoue, and Kota Ya-
maguchi. Layerd: Decomposing raster graphic designs into
layers. In Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, pages 17783–17792, 2025. 5,
6, 7, 8, 10
[37] Jianchao Tan, Jyh-Ming Lien, and Yotam Gingold. Decom-
posing digital paintings into layers via rgb-space geometry.
arXiv preprint arXiv:1509.03335, 2015. 5
[38] Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei
Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio
Iacobacci, and Sarah Parisot. Mulan: A multi layer anno-
tated dataset for controllable text-to-image generation. In
Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition, pages 22413–22422, 2024. 5,
7
[39] Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai,
Xin Xia, Xuefeng Xiao, Weilin Huang, and Jianchao Yang.
Seededit 3.0: Fast and high-quality generative image editing.
arXiv preprint arXiv:2506.05083, 2025. 4