OmniPSD: Layered PSD Generation with Diffusion Transformer
arXiv:2512.09247v1 [cs.CV] 10 Dec 2025
Cheng Liu1,* Yiren Song1,* Haofan Wang2 Mike Zheng Shou1†
1 National University of Singapore   2 Lovart AI
[Figure 1 contents: an example Text-to-PSD prompt ("I want to generate a minimalist eco poster: deep teal background, a cloud-ringed Earth centered, a light blue wave on top, with a few plants and bubbles as accents."), an Image-to-PSD instruction ("Extract the text, all foreground elements and background from this poster."), the generated hierarchical caption/prompt ({"poster": "Features a stylized representation of Earth surrounded by clouds, ...", "foreground": "consists of a circular depiction of Earth, prominently placed ...", "midground": "includes a wavy, light blue pattern that represents the ocean ...", "background": "a solid deep teal color that fills the entire poster. It serves ..."}), and the final text-rendering step.]
Figure 1. OmniPSD is a Diffusion-Transformer framework that generates layered PSD files with transparent alpha channels. Our system
supports both Text-to-PSD multi-layer synthesis and Image-to-PSD reconstruction, producing editable layers that preserve structure, trans-
parency, and semantic consistency.
Abstract
Recent advances in diffusion models have greatly improved
image generation and editing, yet generating or recon-
structing layered PSD files with transparent alpha chan-
nels remains highly challenging. We propose OmniPSD,
a unified diffusion framework built upon the Flux ecosys-
tem that enables both text-to-PSD generation and image-
to-PSD decomposition through in-context learning. For
text-to-PSD generation, OmniPSD arranges multiple tar-
get layers spatially into a single canvas and learns their
compositional relationships through spatial attention, pro-
ducing semantically coherent and hierarchically structured
layers. For image-to-PSD decomposition, it performs iter-
ative in-context editing—progressively extracting and eras-
ing textual and foreground components—to reconstruct ed-
itable PSD layers from a single flattened image. An RGBA-
VAE is employed as an auxiliary representation module to
preserve transparency without affecting structure learning.
Extensive experiments on our new RGBA-layered dataset
demonstrate that OmniPSD achieves high-fidelity genera-
tion, structural consistency, and transparency awareness,
offering a new paradigm for layered design generation and
decomposition with diffusion transformers. Project page: https://showlab.github.io/OmniPSD/.
* Equal contribution.
† Corresponding author.
1. Introduction
Layered design formats such as Photoshop (PSD) files
are essential in modern digital content creation, enabling
structured editing, compositional reasoning, and flexible
element-level manipulation. However, most generative
models today can only output flat raster images [20, 46, 52],
lacking the layer-wise structure and transparency informa-
tion that are crucial for professional design workflows.
To bridge this gap, we introduce OmniPSD, a unified
diffusion-based framework that supports both text-to-PSD
generation and image-to-PSD decomposition under a sin-
gle architecture. It enables bidirectional transformation be-
tween textual or visual inputs and fully editable, multi-layer
PSD graphics.
At the core of OmniPSD lies a pre-trained RGBA-
VAE, designed to encode and decode transparent images
into a latent space that preserves alpha-channel information
[24, 48, 74]. This RGBA-VAE serves as a shared founda-
tion across both sub-tasks. On top of it, we leverage the
Flux ecosystem, which consists of two complementary dif-
fusion transformer models: Flux-dev [69], a text-to-image
generator for creative synthesis, and Flux-Kontext [3], an
image editing model for in-context refinement and recon-
struction. By integrating these components, OmniPSD pro-
vides a unified, transparency-aware solution for both gener-
ation and decomposition.
(1) Text-to-PSD Generation. Given a textual description,
OmniPSD generates a layered PSD representation directly
from text. Instead of producing a single flat image, we spa-
tially arrange multiple semantic layers (e.g., background,
foreground, text, and effects) into a 2 × 2 grid, and generate
them simultaneously through the Flux-dev backbone. Each
generated layer is then decoded by the shared RGBA-VAE
to recover transparency and alpha information, producing
semantically coherent, editable, and compositional layers.
(2) Image-to-PSD Decomposition.
For reverse-
engineering real or synthetic posters into editable PSDs,
OmniPSD extends the Flux-Kontext model by replac-
ing its standard VAE with our pre-trained RGBA-VAE,
enabling transparency-aware reasoning in image editing.
The decomposition process is iterative: we first extract
text layers through in-context editing, then erase them
to reconstruct the clean background, and finally segment
and refine multiple foreground layers. All decomposed
outputs are in RGBA format, ensuring accurate transparent
boundaries and realistic compositional relationships.
By unifying generation and decomposition within a sin-
gle diffusion-transformer architecture, OmniPSD demon-
strates that both creative synthesis and structural reconstruc-
tion can be achieved under a transparency-aware, in-context
learning framework.
Our main contributions are summarized as follows:
• We present OmniPSD, a unified diffusion-based frame-
work that supports both text-to-PSD generation and
image-to-PSD decomposition within the same architec-
ture, bridging creative generation and analytical recon-
struction.
• We pre-train a transparency-preserving RGBA-VAE and
integrate it with Flux-dev and Flux-Kontext through in-
context learning, achieving high-fidelity image genera-
tion and reconstruction with accurate alpha-channel rep-
resentation.
• We construct a large-scale dataset with detailed RGBA
layer annotations and establish a new benchmark for ed-
itable PSD generation and decomposition. Extensive ex-
periments demonstrate the effectiveness and superiority
of our proposed approach.
2. Related Works
2.1. Diffusion Models
Diffusion probabilistic models have rapidly become the
dominant paradigm for high-fidelity image synthesis,
largely replacing GANs due to their stable training, strong
mode coverage, and ability to model complex data distri-
butions via reversed noising processes [17, 20, 55]. They
now underpin a broad range of visual tasks, including
text-to-image generation [58, 78–80], image editing [16,
26, 29, 59], and video synthesis [4, 21, 40–43, 52, 56].
To better support design and editing applications, subse-
quent work augments diffusion models with grounded con-
ditions and spatial controls—for example, grounded text-
to-image generation, conditional control branches, image-
prompt adapters, cross-attention-based prompt editing, self-
guided sampling, inpainting modules, as well as instance-
level and multi-subject layout control [6, 13, 18, 35, 62,
63, 72, 73, 76]—thereby improving layout consistency and
local editability. Early work predominantly relied on U-
Net-based denoisers in pixel or latent space, as popular-
ized by latent diffusion models such as Stable Diffusion
and SDXL [47, 52]. More recently, Transformer-based de-
noisers have become the de facto backbone, with Diffusion
Transformers (DiT) driving models like Stable Diffusion 3,
FLUX, HunyuanDiT, and PixArt, leveraging global atten-
tion and scalability to improve visual fidelity and prompt
alignment [8, 14, 36, 46, 69]. In parallel, flow-matching
and ODE-based formulations recast diffusion as learning
continuous deterministic flows between distributions, en-
abling more efficient sampling and deterministic trajecto-
ries [37, 38].
2.2. Layer-wise Image Generation
Layered image representations are fundamental to graph-
ics and design, as they enable element-wise editing, com-
positional reasoning, and asset reuse [32]. Early work
mainly decomposes a single image into depth layers, al-
pha mattes, or semantic regions under simplified fore-
ground–background assumptions [7, 33, 53, 68], which
helps matting and segmentation but falls short of the rich,
editable layer structures used in professional tools. With
diffusion models, newer methods explicitly target layered
generation. Some methods still rely on post-hoc detec-
tion, segmentation, and matting from a flat RGB output,
or adopt a two-stage “generate-then-decompose” pipeline,
where a composite RGB image is first synthesized and then
separated into foreground/background layers or RGBA in-
stances [15, 30, 39, 65, 77]. Such designs often accumulate
errors between stages and offer limited control over global
layout and inter-layer relationships.
More recent approaches generate multi-layer content di-
rectly in a diffusion framework. LayerDiff, LayerDif-
fuse, and LayerFusion explore layer-collaborative or multi-
branch architectures to jointly synthesize background and
multiple foreground RGBA layers while modeling oc-
clusion relationships [11, 25, 75]. ART and Layer-
Tracer further introduce region-based transformers and
vector-graphic decoders for variable multi-layer layouts and
object-level controllability [48, 57]. In parallel, multi-layer
datasets such as MuLAn provide high-quality RGBA annotations and occlusion labels to support controllable multi-layer generation and editing [60]. Recent works like PSDiffusion explicitly harmonize layout and appearance across foreground and background layers [24], and our method follows this line while additionally targeting PSD-style layer structures and workflows tailored for poster and graphic design.
[Figure 2 contents: pipeline diagram with panels (a) RGBA VAE Training, (b) Image Edit Training, (c) Image Generate Training, (d) Image-to-PSD Inference, and (e) Text-to-PSD Inference; the legend marks learnable modules, text tokens, noised latent tokens, and condition tokens.]
Figure 2. OmniPSD overview. A unified Diffusion-Transformer with a shared RGBA-VAE enables both text-to-PSD layered generation (left) and image-to-PSD decomposition (right). Text-to-PSD leverages spatial in-context learning with hierarchical captions, while Image-to-PSD performs iterative flow-guided foreground extraction and background restoration. Our method produces fully editable PSD layers with transparent alpha channels.
Orthogonal to transparent-layer modeling, another line
of work in automatic graphic design and poster generation
emphasizes layout- and template-level generation. COLE
and OpenCOLE propose hierarchical pipelines that decom-
pose graphic design into planning, layer-wise rendering,
and iterative editing [27, 28]. Graphist formulates hierar-
chical layout generation for multi-layer posters with a large
multimodal model that outputs structured JSON layouts for
design elements [10]. Visual Layout Composer introduces
an image–vector dual diffusion model that jointly gener-
ates raster backgrounds and vector elements for design lay-
outs [54]. MarkupDM and Desigen treat graphic docu-
ments as multimodal markup or controllable design tem-
plates, enabling completion and controllable template gen-
eration from partial specifications [31, 66]. PosterLLaVa
further leverages multimodal large language models to gen-
erate poster layouts and editable SVG designs from natural-
language instructions [71]. These systems focus on high-
level layout synthesis but typically output flattened renders
or coarse vector structures, whereas our approach targets
PSD-style RGBA layers with explicit alpha channels, mak-
ing the resulting assets directly editable and composable in
professional design tools.
2.3. RGBA Image Generation
Generating transparent or layered RGBA content is crucial
for compositing and design, yet has long been underex-
plored compared to standard RGB image synthesis. Tra-
ditional workflows typically rely on first generating opaque
RGB images and then applying separate matting, segmen-
tation, or alpha-estimation networks [7, 23, 33, 34, 53, 68],
which often leads to inconsistent boundaries, halo artifacts,
and limited control over transparency. Recent diffusion-
based methods begin to treat transparency as a first-class
signal. One representative line augments latent diffusion
models with “latent transparency”, learning an additional
latent offset that encodes alpha information while largely
preserving the original RGB latent manifold, so that exist-
ing text-to-image backbones can natively produce transpar-
ent sprites or multiple transparent layers without retraining
from scratch [75]. Building on this idea, RGBA-aware gen-
erators produce isolated transparent instances or sticker-like
assets that can be flexibly composed for graphic design and
poster layouts [15, 49].
Complementary work focuses on the representation side,
proposing unified RGBA autoencoders that extend pre-
trained RGB VAEs with dedicated alpha channels and in-
troducing benchmarks that adapt standard RGB metrics to
four-channel images via alpha compositing, thereby stan-
dardizing evaluation for RGBA reconstruction and gener-
ation [64]. Building on these ideas, multi-layer generation
systems increasingly adopt autoencoders that jointly encode
and decode stacked RGBA layers and couple them with dif-
fusion transformers that explicitly model transparency and
inter-layer effects [9, 11, 24, 48, 65, 70, 74], often trained
or evaluated on matting-centric multi-layer datasets such as
MAGICK and MuLAn [5, 60], yielding more accurate alpha
boundaries, coherent occlusions, and realistic soft shadows
in complex layered scenes.
3. Method
In this section, we first introduce the unified OmniPSD ar-
chitecture in Section 3.1. Next, Section 3.2 presents the
RGBA-VAE module, which enables alpha-aware latent rep-
resentation shared across both pathways. Then, Section 3.3
discusses the Image-to-PSD process based on iterative in-
context editing and structural decomposition. After that,
Section 3.4 describes the Text-to-PSD process, where lay-
ered compositions are generated via spatial in-context learn-
ing and cross-layer attention. Finally, Section 3.5 intro-
duces the Layered Poster Dataset.
3.1. Overall Architecture
We propose OmniPSD, a unified diffusion-based framework designed to reconstruct and generate layered PSD structures from either raster images or textual prompts. The framework is built upon the Flux model family [3, 69], combining Flux-Dev for text-to-image generation and Flux-Kontext for image editing within an in-context learning paradigm. At its core, a shared RGBA-VAE provides an alpha-aware latent space, enabling consistent representation of transparency and compositional hierarchy across both generation and decomposition tasks.
Specifically, the Image-to-PSD branch iteratively decomposes a given poster into text, foreground, and background layers through LoRA-based editing under the Flux-Kontext backbone, ensuring accurate structural separation with preserved alpha channels. In contrast, the Text-to-PSD branch arranges layers spatially within a single generation canvas, where the model learns inter-layer relations via spatial attention under the Flux-Dev backbone. Together, these two pathways form a cohesive framework capable of bidirectional conversion between design images and editable PSD layers, supported by our large-scale Layered Poster Dataset for training and evaluation.
[Figure 3 contents: (a) Image-to-PSD Reconstruction Dataset: paired samples for foreground extraction, text erasure, and foreground erasure. (b) Text-to-PSD Generation Dataset: a 2 × 2 grid annotated with a hierarchical caption, e.g., {"poster": "The poster features a professional setting with a focus on healthcare. It includes a laptop, medical tools, and abstract ...", "foreground": "The foreground content is positioned in the top-right corner of the poster. It features a close-up of a person's hands ...", "midground": "The midground content is located in the bottom-left corner of the poster. It includes a stethoscope and a clipboard, ...", "background": "The background content consists of a soft beige color with abstract shapes in light blue and peach tones. ..."}.]
Figure 3. OmniPSD's layered dataset. Image-to-PSD is trained on paired samples, while Text-to-PSD uses a 2 × 2 grid that presents the full poster and its decomposed layers for in-context learning.
3.2. RGBA-VAE
To accurately represent transparency and compositional relationships in layered design elements, we adopt and extend the AlphaVAE [64], a unified variational autoencoder for RGBA image modeling. While AlphaVAE provides a strong foundation for alpha-aware reconstruction, its pre-training on limited natural transparency data causes severe degradation when applied to design scenarios such as semi-transparent text, shadow overlays, and soft blending effects. To address this, we retrain the model on our curated dataset of real-world design samples, enabling stable reconstruction of both alpha and color layers. We refer to this retrained version as RGBA-VAE.
Following the formulation in the original AlphaVAE paper, our training objective jointly optimizes pixel fidelity, patch-level consistency, perceptual alignment, and latent regularization as:
L = \lambda_{\mathrm{pix}} \, \mathbb{E}\big[\|\hat{I} - I\|_1\big] + \lambda_{\mathrm{patch}} \, \mathbb{E}\big[\|\phi(\hat{I}) - \phi(I)\|_1\big] + \lambda_{\mathrm{perc}} \, \mathbb{E}\big[\|\psi(\hat{I}) - \psi(I)\|_2^2\big] + \lambda_{\mathrm{KL}} \big(\mathrm{KL}(q(z_{\mathrm{RGB}} \mid \cdot)\,\|\,p) + \mathrm{KL}(q(z_{A} \mid \cdot)\,\|\,p)\big),   (1)
where I and Î denote the ground-truth and reconstructed images, respectively. ϕ(·) represents a patch-level feature extractor enforcing local structure consistency, and ψ(·) denotes a perceptual encoder (e.g., VGG) that maintains semantic fidelity. z_RGB and z_A correspond to the latent variables for color and alpha channels, and p is the Gaussian prior. The coefficients λ_pix, λ_patch, λ_perc, and λ_KL balance pixel accuracy, local consistency, perceptual alignment, and latent regularization, respectively.
This retraining procedure effectively bridges the gap be-
tween natural transparency modeling and design-layered
imagery. The resulting RGBA-VAE thus provides a shared
latent space for both our text-to-PSD and image-to-PSD
modules, enabling high-fidelity, alpha-preserving decom-
position and generation.
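To make the composite objective in Eq. (1) concrete, the following sketch assembles it in PyTorch. The patch extractor phi, the perceptual encoder psi, and the loss weights are placeholders standing in for our RGBA-VAE components rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def rgba_vae_loss(I, I_hat, phi, psi, mu_rgb, logvar_rgb, mu_a, logvar_a,
                  w_pix=1.0, w_patch=1.0, w_perc=0.1, w_kl=1e-6):
    """Composite RGBA-VAE objective from Eq. (1): pixel L1, patch-feature L1,
    perceptual L2, and KL terms for the RGB and alpha latents.
    `phi` and `psi` are stand-ins for a patch-level and a perceptual encoder."""
    # Pixel-level fidelity on all four channels (RGB + alpha).
    l_pix = F.l1_loss(I_hat, I)
    # Patch-level structural consistency.
    l_patch = F.l1_loss(phi(I_hat), phi(I))
    # Perceptual alignment (e.g., VGG features) on the RGB part.
    l_perc = F.mse_loss(psi(I_hat[:, :3]), psi(I[:, :3]))
    # KL divergence of each diagonal-Gaussian posterior against N(0, I).
    kl = lambda mu, logvar: -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    l_kl = kl(mu_rgb, logvar_rgb) + kl(mu_a, logvar_a)
    return w_pix * l_pix + w_patch * l_patch + w_perc * l_perc + w_kl * l_kl
```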
3.3. Image-to-PSD Reconstruction
We formulate the Image-to-PSD reconstruction task as a
multi-step, iterative image-editing process, analogous to
how professional designers manually decompose visual el-
ements into layers in Photoshop. Instead of predicting all
layers in a single pass, we progressively extract text and
foreground objects, while recovering occluded background
content. Each step outputs an RGBA PNG layer with accu-
rate transparency. This iterative design ensures pixel-level
fidelity, precise alpha recovery, and structural composabil-
ity for final PSD reconstruction.
Concretely, we train two expert models: one special-
ized for foreground extraction and another for foreground
removal and background restoration. After each extraction,
the background-restoration model reconstructs clean back-
ground content, enabling the system to reveal deeper visual
layers over iterations. Through this alternating “extract-
foreground → erase-foreground” process, a flattened input
image is gradually decomposed into a stack of text, fore-
ground, and background layers suitable for PSD editing.
This pipeline is built on the Flux Kontext diffusion back-
bone with task-specific LoRA adapters. The decomposi-
tion process is formulated as a conditional flow-matching
problem, where the flattened image is treated as a condi-
tioning input and the model learns a deterministic flow field
that maps noisy latent states toward their target decomposed
layer representations.
Formulation. Let I_0 ∈ R^{H×W×4} denote the flattened input poster image, and y ∈ {foreground, background} denote the target layer type. We define latent variables z_0 = E_α(I_0) and z_1 = E_α(I_y), where E_α is the RGBA-VAE encoder. Flux models the continuous transformation between z_0 and z_1 as a flow field v_θ(z_t, t | z_0) governed by an ODE:
\frac{dz_t}{dt} = v_\theta(z_t, t \mid z_0), \quad t \in [0, 1],   (2)
where z_t = (1 − t) z_0 + t z_1 represents intermediate latent states.
The training objective follows the standard Flow Matching loss [37, 69]:
\mathcal{L}_{\mathrm{flow}} = \mathbb{E}_{t \sim \mathcal{U}(0,1),\,(z_0, z_1)} \big\| v_\theta(z_t, t \mid z_0) - (z_1 - z_0) \big\|_2^2,   (3)
which enforces the learned flow field to align with the true displacement between input and target latents. This formulation avoids stochastic noise injection, leading to faster convergence and deterministic inference.
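The conditional flow-matching objective in Eq. (3) amounts to a regression of the predicted velocity onto the latent displacement z_1 − z_0. A minimal single-step sketch is shown below; velocity_model is a stand-in for the Kontext DiT with its LoRA adapter, and its calling convention is assumed for illustration.

```python
import torch

def flow_matching_step(velocity_model, z0, z1):
    """One conditional flow-matching step (Eqs. 2-3):
    interpolate z_t = (1 - t) z0 + t z1 and regress the velocity onto (z1 - z0).
    z0: latent of the flattened input, z1: latent of the target layer."""
    B = z0.shape[0]
    # Sample one timestep per example, broadcastable over the latent dimensions.
    t = torch.rand(B, *([1] * (z0.dim() - 1)), device=z0.device)
    z_t = (1.0 - t) * z0 + t * z1              # intermediate latent state
    v_pred = velocity_model(z_t, t, cond=z0)   # v_theta(z_t, t | z0); interface assumed
    target = z1 - z0                           # true displacement field
    return torch.mean((v_pred - target) ** 2)  # L_flow
```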
Foreground Extraction Model. Given I_0, the model detects salient regions and generates RGBA layers for each foreground instance. Each LoRA adapter is trained on triplets (I_0, m, I_fg), where m denotes a binary or bounding-box mask, and I_fg is the corresponding RGBA foreground target. Both conditional and target images are encoded into latent sequences:
z_{\mathrm{cond}} = E_\alpha(I_0), \qquad z_{\mathrm{target}} = E_\alpha(I_{\mathrm{fg}}),   (4)
then concatenated into a unified token sequence:
Z = [\,z_{\mathrm{cond}}\,;\, z_{\mathrm{target}}\,].   (5)
The transformer backbone applies Multi-Modal Attention (MMA) [45] with bidirectional context:
Z' = \mathrm{MMA}(Z) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V,   (6)
capturing pixel-level and semantic correlations between input and decomposed regions.
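At the token level, Eqs. (4)-(6) correspond to concatenating the condition and target latent sequences and attending over them jointly with no causal mask. The sketch below illustrates this; the projection layers are illustrative placeholders rather than the Flux internals.

```python
import torch
import torch.nn.functional as F

def joint_attention(z_cond, z_target, w_q, w_k, w_v):
    """Bidirectional attention over the concatenated sequence Z = [z_cond; z_target].
    z_cond, z_target: (B, N, C) latent token sequences; w_q/w_k/w_v: linear projections."""
    Z = torch.cat([z_cond, z_target], dim=1)   # Eq. (5): unified token sequence
    Q, K, V = w_q(Z), w_k(Z), w_v(Z)           # shared projections over all tokens
    # Eq. (6): softmax(QK^T / sqrt(d)) V, with no causal mask (bidirectional context).
    return F.scaled_dot_product_attention(Q, K, V)
```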
Foreground Erasure Model. After extraction, we employ an erasure module trained to reconstruct occlusion-free backgrounds I_bg given the same condition I_0 and mask m. At each iteration k, the model removes the current foreground, restores the occluded background I_bg^{(k)}, and stores the removed content I_fg^{(k)} as an independent RGBA layer:
\{ I_{\mathrm{fg}}^{(1)}, \ldots, I_{\mathrm{fg}}^{(K)}, I_{\mathrm{bg}} \} \rightarrow \text{PSD Stack}.   (7)
All LoRA modules share the same latent flow space of Flux Kontext, ensuring modular composability across text removal, object extraction, and background inpainting subtasks.
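The alternating extract-and-erase procedure that yields the stack in Eq. (7) can be summarized by the loop below. extract_foreground and erase_foreground stand in for the two LoRA-adapted expert models, and the stopping criterion is a simplification (a fixed iteration budget with an emptiness check).

```python
def decompose_to_psd_stack(flattened_image, extract_foreground, erase_foreground, max_iters=4):
    """Iteratively peel RGBA layers off a flattened poster (Eq. 7).
    Each pass extracts the current topmost foreground as an RGBA layer,
    then erases it so the next pass sees the content underneath."""
    layers, current = [], flattened_image
    for _ in range(max_iters):
        fg_rgba = extract_foreground(current)   # expert 1: RGBA layer of the topmost foreground
        if fg_rgba is None:                     # nothing left to extract
            break
        current = erase_foreground(current)     # expert 2: occlusion-free background restoration
        layers.append(fg_rgba)
    layers.append(current)                      # whatever remains is the background layer
    return layers                               # topmost foreground first, background last
```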
Editable Text Layer Recovery. To transform rasterized
text regions into editable design layers, we reconstruct
vector-text through a unified OCR–font-recovery–rendering
pipeline. We detect and recognize textual content from
pixel-level inputs using a transformer-based OCR mod-
ule, implemented via the open-source PaddleOCR toolkit
[2], which provides state-of-the-art scene and document-
text recognition with multilingual and layout-aware sup-
port. The recognized text regions are then associated
with the most plausible typeface from a curated font bank
through semantic font embedding retrieval, achieved using
the lightweight font classify system [1], which enables effi-
cient deep-learning-based font matching across large-scale
font libraries. The recovered text content together with
its inferred font attributes is subsequently re-rendered as
resolution-independent vector layers, yielding editable PSD
text objects that faithfully preserve the original typography
and layout structure.
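A simplified version of this text-recovery pipeline is sketched below, assuming the PaddleOCR 2.x interface and result format, a hypothetical match_font retrieval function for the font bank, and Pillow as a stand-in for the vector-text renderer.

```python
from paddleocr import PaddleOCR          # open-source OCR toolkit [2]
from PIL import Image, ImageDraw, ImageFont

def recover_text_layers(image_path, match_font, canvas_size):
    """Rasterized-text -> editable-text sketch: detect and recognize text with OCR,
    pick a typeface via the (hypothetical) `match_font` retrieval function,
    and re-render each string onto a transparent RGBA canvas."""
    ocr = PaddleOCR(lang="en")                         # assumed PaddleOCR 2.x interface
    results = ocr.ocr(image_path)[0] or []             # [(box, (text, confidence)), ...]
    text_layer = Image.new("RGBA", canvas_size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(text_layer)
    records = []
    for box, (text, conf) in results:
        x0, y0 = box[0]                                # top-left corner of the detected box
        height = int(abs(box[3][1] - box[0][1]))       # rough glyph height from the box
        font_path = match_font(image_path, box, text)  # semantic font retrieval (stand-in)
        font = ImageFont.truetype(font_path, max(height, 8))
        draw.text((x0, y0), text, font=font, fill=(0, 0, 0, 255))
        records.append({"text": text, "box": box, "font": font_path, "conf": conf})
    return text_layer, records                         # RGBA text layer + editable metadata
```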
3.4. Text-to-PSD Generation
While Image-to-PSD is highly effective at decomposing an
existing image into layered RGBA components, in many
real scenarios no reference image is available. Instead, users
may wish to generate a fully layered PSD file directly from
textual descriptions. To meet this need, we introduce the
Text-to-PSD model, which leverages hierarchical textual
prompts, cross-modal feature alignment, and an in-context
layer reasoning mechanism.
In-Context Layer Reasoning via a 2×2 Grid. Our key idea is to enable different layers to "see" each other without modifying the backbone or introducing explicit cross-layer attention modules [61]. We arrange four images—the full poster I_full, foreground I_fg, middle-ground I_mid, and background I_bg—into a 2×2 grid:
G = \begin{bmatrix} I_{\mathrm{full}} & I_{\mathrm{fg}} \\ I_{\mathrm{mid}} & I_{\mathrm{bg}} \end{bmatrix}.
This grid serves as an in-context visual canvas, enabling
the model’s native spatial attention to implicitly learn layer
relationships such as layout consistency, occlusion order-
ing, color harmony, and transparency boundaries. During
inference, the model generates all PSD layers jointly in a
single pass.
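Operationally, building and unpacking G is a simple paste-and-crop, as in the sketch below. The quadrant order follows the formulation above, the tile size matches our 1024×1024 training canvas, and the helper names are ours.

```python
from PIL import Image

def make_grid(full, fg, mid, bg, tile=512):
    """Assemble the 2x2 in-context canvas G: [[full, fg], [mid, bg]]."""
    grid = Image.new("RGBA", (2 * tile, 2 * tile), (0, 0, 0, 0))
    for img, (x, y) in zip((full, fg, mid, bg),
                           [(0, 0), (tile, 0), (0, tile), (tile, tile)]):
        grid.paste(img.resize((tile, tile)).convert("RGBA"), (x, y))
    return grid

def split_grid(grid, tile=512):
    """Crop a generated 2x2 canvas back into the four PSD layers."""
    boxes = [(0, 0), (tile, 0), (0, tile), (tile, tile)]
    full, fg, mid, bg = (grid.crop((x, y, x + tile, y + tile)) for x, y in boxes)
    return {"poster": full, "foreground": fg, "midground": mid, "background": bg}
```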
Hierarchical Text Prompts. To provide structured semantic grounding, we annotate each sample with a JSON record that assigns a dedicated description to the full poster and each semantic layer, e.g., {"poster": "...", "foreground": "...", "midground": "...", "background": "..."}. Here, poster captures the global scene, while the remaining fields describe the corresponding layers.
Grid Spatial In-Context Learning. The 2×2 grid G is en-
coded by the RGBA-VAE and processed by the DiT back-
bone in a single forward pass. Spatial self-attention over
this grid lets layer tokens attend to the full-poster tokens, so
the model learns cross-layer correspondences and composi-
tional relationships without any extra cross-layer modules.
Training Objective. We retain the standard flow-matching
objective of the diffusion transformer and introduce no ad-
ditional losses, allowing the model to learn layered seman-
tics purely from the hierarchical prompts and the in-context
2 × 2 grid formulation.
3.5. Dataset Construction
To support training and evaluation, we construct the Lay-
ered Poster Dataset, comprising over 200,000 real PSD
files collected from online design repositories. These files
are manually authored by professional designers and con-
tain rich semantic groupings, font layers, shape groups,
and effect overlays. We perform automated parsing to ex-
tract group-level and layer-level metadata, then apply post-
filtering to retain only PSDs with valid RGBA structure.
Each sample is annotated into structured groups—text, fore-
ground, background—with each layer saved as an RGBA
PNG and associated with editable metadata (e.g., bounding
box, visibility, stacking order).
To further support training across different subtasks, we
organize the data with task-specific structures. For the Text-
to-PSD generation task, we intentionally remove all text
layers during dataset construction, since text should be ren-
dered last rather than generated. This preserves authentic
typography, font fidelity, and editability. The data is ar-
ranged in a four-panel grid: the top-left contains the full
poster, while the remaining three panels provide semantic
decomposition—top-right: foreground layer 1, bottom-left:
foreground layer 2, and bottom-right: background layer.
This format encourages the model to learn how text con-
ditions map to layered design structures.
For the Image-to-PSD task, we adopt a triplet data strat-
egy that mirrors the iterative layer editing process at infer-
ence time. Each triplet consists of (i) an input image, (ii)
the extracted foreground content, and (iii) the correspond-
ing background after foreground removal. This setup simu-
lates the step-by-step editing workflow used in practical de-
sign software—first isolating editable regions, then erasing
them from the scene—enabling the model to learn realistic
PSD-style layer decomposition and reconstruction.
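For reference, the two task-specific sample layouts described above can be written as plain records like the ones below; the field names are illustrative, not a released schema.

```python
# Illustrative sample records (field names are ours, not a released schema).
text_to_psd_sample = {
    "grid": "grid_2x2.png",        # top-left: full poster; top-right: foreground layer 1;
                                   # bottom-left: foreground layer 2; bottom-right: background
    "caption": {
        "poster": "...",           # global scene description
        "foreground": "...",
        "midground": "...",
        "background": "...",
    },
}

image_to_psd_triplet = {
    "input": "poster_flat.png",        # (i) flattened input image
    "foreground": "layer_fg.png",      # (ii) extracted RGBA foreground content
    "background": "poster_no_fg.png",  # (iii) background after foreground removal
}
```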
4. Experiments
4.1. Experiment Details
During the Text-to-PSD training,
we employed the Flux 1.0 dev model [69] built upon the
pretrained Diffusion Transformer (DiT) architecture. The
training resolution was set to 1024×1024 with a 2×2 grid
layout. We adopted the LoRA fine-tuning strategy [22] with
a LoRA rank of 128, a batch size of 8, a learning rate of
0.001, and 30,000 fine-tuning steps.
For the Image-to-PSD model training, we fine-tuned
LoRA adapters on the Flux Kontext backbone [3] at a res-
olution of 1024×1024. Specifically, we separately trained
two types of modules—foreground extraction (for text and
non-text elements) and foreground erasure (for text and
non-text elements)—each for 30,000 steps. For tasks that
require transparency channels (e.g., Text-to-PSD, text ex-
traction, and object extraction), we used the RGBA-VAE as
the variational autoencoder. For other tasks without trans-
parency needs, we used the original VAE backbone.
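The training settings above are collected into the illustrative configuration sketch below; the key names are ours and do not correspond to a specific training script.

```python
# Illustrative training configurations mirroring the settings reported above.
TEXT_TO_PSD_CFG = {
    "backbone": "Flux 1.0 dev (DiT)",
    "resolution": (1024, 1024),     # 2x2 grid layout, i.e., 512x512 per quadrant
    "lora_rank": 128,
    "batch_size": 8,
    "learning_rate": 1e-3,
    "steps": 30_000,
    "vae": "RGBA-VAE",              # transparency-aware latent space
}

IMAGE_TO_PSD_CFG = {
    "backbone": "Flux Kontext",
    "resolution": (1024, 1024),
    "adapters": ["foreground_extraction", "foreground_erasure"],  # text and non-text variants
    "steps_per_adapter": 30_000,
    "vae": {"extraction": "RGBA-VAE", "erasure": "original VAE"},
}
```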
Baseline Methods. For the Text-to-PSD task, we bench-
mark against LayerDiffuse [74] and GPT-Image-1 [44], the
most relevant publicly available layered poster generation
systems. For the Image-to-PSD task, to the best of our
knowledge, this is the first work enabling editable PSD re-
construction from a single flattened image, and thus no prior
method exists for direct comparison. Instead, we evaluate sev-
eral commercial systems capable of producing RGBA lay-
ers [44], as well as a non-RGBA baseline [3, 12] where fore-
grounds are generated on a white canvas and transparency
masks are derived using SAM2 segmentation [51], repre-
senting a proxy solution without alpha-aware modeling.
Metrics. We evaluate OmniPSD using four metrics. FID
[19] is computed on each generated layer and composite
output to measure visual realism. For the Text-to-PSD
task, we report layer-wise CLIP Score [50] to assess se-
mantic alignment between each generated layer and its tex-
tual prompt. For the Image-to-PSD task, we compute re-
construction MSE by re-compositing predicted layers into
a flattened image and measuring pixel error against the in-
put. Together, these metrics capture realism, semantic fidelity, structural coherence, and reconstruction accuracy.
[Figure 4 contents: (a) Image-to-PSD Reconstruction examples; (b) Text-to-PSD Generation examples with their hierarchical captions (poster / foreground / midground / background fields).]
Figure 4. Generation results of OmniPSD. (a) Image-to-PSD reconstruction decomposes an input poster into editable text layers, multiple foreground layers, and a clean background layer. (b) Text-to-PSD synthesis uses hierarchical captions to generate background and foreground layers, followed by rendering the corresponding editable text layers.
To evaluate cross-layer structure and layout coherence, we
employ GPT-4 [44] as a vision-language judge, scoring spa-
tial arrangement and design consistency. The detailed GPT-
4 score metrics are provided in the supplementary materials
A.
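The reconstruction MSE above is obtained by alpha-compositing the predicted RGBA layers back into a flat image and comparing it against the input. A minimal sketch of that computation, using the standard "over" operator with layers ordered background first, is given below.

```python
import numpy as np

def composite_layers(layers_rgba):
    """Alpha-composite RGBA layers (float arrays in [0, 1], background first)
    with the standard 'over' operator, returning a flat RGB image."""
    out_rgb = np.zeros_like(layers_rgba[0][..., :3])
    for layer in layers_rgba:
        rgb, a = layer[..., :3], layer[..., 3:4]
        out_rgb = rgb * a + out_rgb * (1.0 - a)
    return out_rgb

def reconstruction_mse(pred_layers, input_rgb):
    """Pixel MSE between the re-composited prediction and the flattened input."""
    return float(np.mean((composite_layers(pred_layers) - input_rgb) ** 2))
```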
Benchmarks. For the Text-to-PSD task, we prepare a test
set of 500 layer-aware prompts (two foreground, one back-
ground, and one global layout description), all derived from
real PSD files to ensure realistic evaluation. For the Image-
to-PSD task, we curate 500 real PSD files as the test set,
which are flattened into single images for evaluating PSD
reconstruction quality.
User Study. We conducted a user study with 18 participants
to evaluate the usability and perceptual quality of the layers
generated by OmniPSD. The detailed study procedures and
results are provided in the supplementary materials B.
[Figure 5 contents: (a) Text-to-PSD comparative results (poster, foreground, and background columns) for LayerDiffuse SDXL, GPT-Image-1, and OmniPSD; (b) Image-to-PSD comparative results (original poster, reconstructed poster, text extraction, text erasure, foreground extraction, foreground erasure) for Kontext & Segmentation, Nano-Banana & Segmentation, GPT-Image-1, and OmniPSD.]
Figure 5. Compare with baselines on text-to-PSD and image-to-PSD. OmniPSD matches the visual quality of leading diffusion and vision-
language models while uniquely supporting multi-layer PSD generation with transparent alpha channels. Compared to existing layered
synthesis baselines, it achieves clearly superior visual fidelity and more coherent, logically structured layers.
Table 1. Image-to-PSD generation results across methods. Lower is better for MSE; higher is better for PSNR, SSIM, CLIP-I (CLIP image score), and GPT-4-score. Bold numbers indicate the best performance for each metric.

Method              MSE ↓     PSNR ↑  SSIM ↑  CLIP-I ↑  GPT-4-score ↑
Kontext [3]         1.10e-1   9.59    0.653   0.692     0.64
Nano-Banana [12]    2.06e-2   16.9    0.816   0.916     0.86
GPT-Image-1 [44]    2.48e-2   16.1    0.761   0.837     0.84
OmniPSD (ours)      1.14e-3   24.0    0.952   0.959     0.92
Table 2. Text-to-PSD generation results across methods. Lower is better for FID; higher is better for CLIP and GPT-4 scores. Bold numbers indicate the best performance for each metric.

Method                    FID ↓   CLIP Score ↑  GPT-4 Score ↑
LayerDiffuse SDXL [74]    89.35   24.78         0.66
GPT-Image-1 [44]          53.21   35.59         0.84
OmniPSD (ours)            30.43   37.64         0.90
Table 3. Image-to-PSD evaluation. Lower is better for FID and MSE; higher is better for PSNR and GPT-4 scores. We evaluate two sub-tasks—foreground extraction and foreground erasure—as well as the full reconstruction pipeline.

Task                    FID ↓   MSE ↓     PSNR ↑  GPT-4 Score ↑
Text Extraction         11.42   1.34e-3   26.86   0.86
Text Erasure            19.38   1.15e-3   26.37   0.94
Foreground Extraction   33.35   2.26e-3   19.27   0.84
Foreground Erasure      27.14   2.13e-3   29.41   0.92
Full Image-to-PSD       24.71   1.14e-3   23.98   0.90
4.2. Comparison and Evaluation
Qualitative Evaluation. Figures 4 and 5 show the qualitative comparison. For text-to-PSD, LayerDiffuse-SDXL
produces plausible foregrounds and layouts but unstable,
artifact-prone backgrounds, while GPT-Image-1, despite
strong visual quality, often loses or alters background el-
ements, harming global consistency. OmniPSD, by con-
trast, yields high-quality foreground and background lay-
ers with coherent overall posters. For image-to-PSD, base-
lines do not output true RGBA layers and thus cannot pro-
vide checkerboard visualizations. OmniPSD accurately per-
forms text extraction, foreground extraction/removal, and
background reconstruction, whereas other methods struggle
to recover text and maintain consistency between extracted
and erased regions, limiting their usability for PSD-style
editing.
Quantitative Evaluation. This section presents quantita-
tive analysis results. Tables 1 and 2 summarize the com-
parison results. Table 3 further reports the performance
of each component in the image-to-PSD pipeline. Com-
pared with strong baselines, OmniPSD achieves visual gen-
eration quality on par with state-of-the-art large diffusion
and vision-language models. More importantly, our method
uniquely supports multi-layer PSD generation with trans-
parent alpha channels, a capability that existing approaches
are far from achieving. Relative to prior layered synthe-
sis systems, OmniPSD also demonstrates significant ad-
vantages in visual fidelity, semantic coherence, and logical
layer structure, producing clean, editable layers that better
reflect real design workflows.
4.3. Ablation Study
In this section, we present a detailed ablation study. We
first compare our RGBA-VAE with other VAEs capable of
encoding and decoding alpha channels. As shown in Table 4
and 6, models trained primarily on natural images perform
Table 4. RGBA reconstruction results. Lower is better for MSE and LPIPS; higher is better for PSNR and SSIM. Bold numbers indicate the best performance for each metric.

Method                  MSE ↓     PSNR ↑  SSIM ↑  LPIPS ↓
LayerDiffuse VAE [74]   2.54e-1   8.06    0.289   0.473
Red-VAE [67]            2.52e-1   8.53    0.300   0.451
Alpha-VAE [64]          4.15e-3   26.9    0.739   0.120
RGBA-VAE (ours)         9.82e-4   32.5    0.945   0.0348
Table 5. Ablation study results on the Text-to-PSD task.

Method                      FID ↓   CLIP Score ↑  GPT-4 Score ↑
w/o layer-specific prompt   38.56   34.31         0.78
OmniPSD full                30.43   37.64         0.90
poorly in the design-poster setting, exhibiting inconsistent
reconstruction, noticeable artifacts, and blurred text. Table
5 further highlights the importance of structured, layer-wise
prompts in the text-to-PSD task: when using naive prompts,
the generation quality degrades significantly.
[Figure 6 contents: RGBA reconstruction comparison panels (Ground Truth, LayerDiffuse VAE, Red-VAE, Alpha-VAE, RGBA-VAE).]
Figure 6. OmniPSD's RGBA-VAE compared with existing VAE methods that are compatible with image alpha channels.
5. Conclusion
In this paper, we present OmniPSD, a unified framework for
layered and transparency-aware PSD generation from either
textual prompts or a single raster image. Built upon a Diffusion Transformer back-
bone, OmniPSD decomposes complex poster-style images
into structured RGBA layers through an iterative, in-context
editing process. Our framework integrates an RGBA-
VAE for alpha-preserving representation and multiple task-
specific Kontext-LoRA modules for text, object, and back-
ground reconstruction. We further construct a large-scale,
professionally annotated layered dataset to support train-
ing and evaluation. Extensive experiments demonstrate
that OmniPSD achieves superior structural fidelity, trans-
parency modeling, and semantic consistency, establishing a
new paradigm for design-aware image decomposition and
editable PSD reconstruction.
References
[1] Storia AI. font-classify: Lightweight deep-learning-based
font recognition. https://github.com/Storia-
AI/font-classify, 2024. Accessed: 2025-03-10. 5
[2] PaddlePaddle Authors.
Paddleocr: Open-source ocr
toolkit. https://github.com/PaddlePaddle/
PaddleOCR, 2023. Accessed: 2025-03-10. 5
[3] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, pages arXiv–2506, 2025. 1, 4, 6, 8
[4] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel
Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi,
Zion English, Vikram Voleti, Adam Letts, et al. Stable video
diffusion: Scaling latent video diffusion models to large
datasets. arXiv preprint arXiv:2311.15127, 2023. 2
[5] Ryan D. Burgert, Brian L. Price, Jason Kuen, Yijun Li, and
Michael S. Ryoo. Magick: A large-scale captioned dataset
from matting generated images using chroma keying. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, 2024. 4
[6] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xi-
aohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mu-
tual self-attention control for consistent image synthesis and
editing. In Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, 2023. 2
[7] Guowei Chen, Yi Liu, Jian Wang, Juncai Peng, Yuying Hao,
Lutao Chu, Shiyu Tang, Zewu Wu, Zeyu Chen, Zhiliang Yu,
Yuning Du, Qingqing Dang, Xiaoguang Hu, and Dianhai Yu.
Pp-matting: High-accuracy natural image matting. arXiv
preprint arXiv:2204.09433, 2022. 2, 3
[8] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze
Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo,
Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of dif-
fusion transformer for photorealistic text-to-image synthesis.
arXiv preprint arXiv:2310.00426, 2023. 2
[9] Xuewei Chen, Zhimin Chen, and Yiren Song. Transanimate:
Taming layer diffusion to generate rgba video. arXiv preprint
arXiv:2503.17934, 2025. 4
[10] Yutao Cheng, Zhao Zhang, Maoke Yang, Hui Nie, Chunyuan
Li, Xinglong Wu, and Jie Shao. Graphic design with large
multimodal model. In Proceedings of the AAAI Conference
on Artificial Intelligence, 2025. 3
[11] Yusuf Dalva, Yijun Li, Qing Liu, Nanxuan Zhao, Jianming
Zhang, Zhe Lin, and Pinar Yanardag. Layerfusion: Harmo-
nized multi-layer text-to-image generation with generative
priors. arXiv preprint arXiv:2412.04460, 2024. 2, 4
[12] Google DeepMind. Nano-banana (gemini 2.5 flash im-
age): Google deepmind’s image generation and editing
model. https://aistudio.google.com/models/
gemini-2-5-flash-image, 2025. Accessed: 2025-
11-14. 6, 8
[13] Dave Epstein, Allan Jabri, Ben Poole, Alexei A. Efros, and
Aleksander Holynski. Diffusion self-guidance for control-
lable image generation. Advances in Neural Information
Processing Systems, 36, 2023. 2
[14] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim
Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik
Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim
Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yan-
nik Marek, and Robin Rombach. Scaling rectified flow trans-
formers for high-resolution image synthesis. arXiv preprint
arXiv:2403.03206, 2024. 2
[15] Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin
Yang, Shifeng Zhang, and Sarah Parisot. Generating com-
positional scenes via text-to-image rgba instance generation.
Advances in Neural Information Processing Systems, 2024.
2, 3
[16] Yan Gong, Yiren Song, Yicheng Li, Chenglin Li, and Yin
Zhang. Relationadapter: Learning and transferring vi-
sual relation with diffusion transformers. arXiv preprint
arXiv:2506.02528, 2025. 2
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial networks. Commu-
nications of the ACM, 63(11):139–144, 2020. 2
[18] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman,
Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image
editing with cross-attention control. In Proceedings of the In-
ternational Conference on Learning Representations, 2023.
2
[19] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,
Bernhard Nessler, and Sepp Hochreiter. Gans trained by a
two time-scale update rule converge to a local nash equilib-
rium. In Advances in Neural Information Processing Sys-
tems, 2017. 6
[20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif-
fusion probabilistic models. Advances in neural information
processing systems, 33:6840–6851, 2020. 1, 2
[21] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William
Chan, Mohammad Norouzi, and David J Fleet. Video dif-
fusion models. Advances in neural information processing
systems, 35:8633–8646, 2022. 2
[22] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-
Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al.
Lora: Low-rank adaptation of large language models. ICLR,
1(2):3, 2022. 6
[23] Xiaobin Hu, Xu Peng, Donghao Luo, Xiaozhong Ji, Jin-
long Peng, Zhengkai Jiang, Jiangning Zhang, Taisong Jin,
Chengjie Wang, and Rongrong Ji. Diffumatting: Synthe-
sizing arbitrary objects with matting-level annotation. arXiv
preprint arXiv:2403.06168, 2024. 3
[24] Dingbang Huang, Wenbo Li, Yifei Zhao, Xinyu Pan, Yan-
hong Zeng, and Bo Dai. Psdiffusion: Harmonized multi-
layer image generation via layout and appearance alignment.
arXiv preprint arXiv:2505.11468, 2025. 1, 3, 4
[25] Runhui Huang, Kaixin Cai, Jianhua Han, Xiaodan Liang,
Wei Zhang, Songcen Xu, and Hang Xu. Layerdiff: Ex-
ploring text-guided multi-layered composable image synthe-
sis via layer-collaborative diffusion model. arXiv preprint
arXiv:2403.12036, 2024. 2
[26] Shijie Huang, Yiren Song, Yuxuan Zhang, Hailong Guo,
Xueyin Wang, and Jiaming Liu. Arteditor: Learning cus-
tomized instructional image editor from few-shot examples.
In Proceedings of the IEEE/CVF International Conference
on Computer Vision, pages 17651–17662, 2025. 2
[27] Naoto Inoue, Kento Masui, Wataru Shimoda, and Kota Yam-
aguchi. Opencole: Towards reproducible automatic graphic
design generation. arXiv preprint arXiv:2406.08232, 2024.
3
[28] Peidong Jia, Chenxuan Li, Zeyu Liu, Yichao Shen, Xingru
Chen, Yuhui Yuan, Yinglin Zheng, Dong Chen, Ji Li, Xi-
aodong Xie, Shanghang Zhang, and Baining Guo. Cole: A
hierarchical generation framework for graphic design. arXiv
preprint arXiv:2311.16974, 2023. 3
[29] Yuxin Jiang, Yuchao Gu, Yiren Song, Ivor Tsang, and
Mike Zheng Shou. Personalized vision via visual in-context
learning. arXiv preprint arXiv:2509.25172, 2025. 2
[30] Kyoungkook Kang, Gyujin Sim, Geonung Kim, Donguk
Kim, Seungho Nam, and Sunghyun Cho. Layeringdiff: Lay-
ered image synthesis via generation, then disassembly with
generative knowledge. arXiv preprint arXiv:2501.01197,
2025. 2
[31] Kotaro Kikuchi, Ukyo Honda, Naoto Inoue, Mayu Otani,
Edgar Simo-Serra, and Kota Yamaguchi.
Multimodal
markup document models for graphic design completion. In
Proceedings of the ACM International Conference on Multi-
media, 2025. 3
[32] Wei-Cheng Lee, Chih-Peng Chang, Wen-Hsiao Peng, and
Hsueh-Ming Hang. A hybrid layered image compressor with
deep-learning technique. In IEEE International Workshop on
Multimedia Signal Processing (MMSP), 2020. 2
[33] Jizhizi Li, Jing Zhang, Stephen J. Maybank, and Dacheng
Tao. Bridging composite and real: Towards end-to-end deep
image matting. International Journal of Computer Vision,
130(2):246–266, 2022. 2, 3
[34] Jiachen Li, Jitesh Jain, and Humphrey Shi. Matting anything.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition Workshops, pages 1775–
1785, 2024. 3
[35] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian-
wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee.
Gligen: Open-set grounded text-to-image generation. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, 2023. 2
[36] Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong,
Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu,
Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jia-
hao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang
Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan
Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng
Fang, Weiyan Wang, Jinbao Xue, Yangyu Tao, Jianchen Zhu,
Kai Liu, Sihuan Lin, Yifu Sun, Yun Li, Dongdong Wang,
Mingtao Chen, Zhichao Hu, Xiao Xiao, Yan Chen, Yuhong
Liu, Wei Liu, Di Wang, Yong Yang, Jie Jiang, and Qinglin
Lu. Hunyuan-dit: A powerful multi-resolution diffusion
transformer with fine-grained chinese understanding. arXiv
preprint arXiv:2405.08748, 2024. 2
[37] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil-
ian Nickel, and Matt Le. Flow matching for generative mod-
eling. arXiv preprint arXiv:2210.02747, 2022. 2, 5
[38] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow
straight and fast: Learning to generate and transfer data with
rectified flow. arXiv preprint arXiv:2209.03003, 2022. 2
[39] Yu-Lun Liu, Wei-Sheng Lai, Ming-Hsuan Yang, Yung-Yu
Chuang, and Jia-Bin Huang. Learning to see through ob-
structions with layered decomposition. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2020. 2
[40] Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran
Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-
guided text-to-video generation using pose-free videos. In
Proceedings of the AAAI Conference on Artificial Intelli-
gence, pages 4117–4125, 2024. 2
[41] Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing
He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung
Shum, Wei Liu, et al. Follow-your-emoji: Fine-controllable
and expressive freestyle portrait animation. In SIGGRAPH
Asia 2024 Conference Papers, pages 1–12, 2024.
[42] Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Leqi
Shen, Chenyang Qi, Jixuan Ying, Chengfei Cai, Zhifeng Li,
Heung-Yeung Shum, et al. Follow-your-click: Open-domain
regional image animation via motion prompts. In Proceed-
ings of the AAAI Conference on Artificial Intelligence, pages
6018–6026, 2025.
[43] Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng,
Xinhua Zhang, Zhifeng Li, Sirui Han, Chenyang Qi, and
Qifeng Chen. Follow-your-motion: Video motion transfer
via efficient spatial-temporal decoupled finetuning. arXiv
preprint arXiv:2506.05207, 2025. 2
[44] OpenAI.
Gpt-4 technical report.
arXiv preprint
arXiv:2303.08774, 2023. 6, 7, 8
[45] Zexu Pan, Zhaojie Luo, Jichen Yang, and Haizhou Li. Multi-
modal attention for speech emotion recognition. arXiv
preprint arXiv:2009.04107, 2020. 5
[46] William Peebles and Saining Xie. Scalable diffusion models
with transformers. In Proceedings of the IEEE/CVF inter-
national conference on computer vision, pages 4195–4205,
2023. 1, 2
[47] Dustin Podell, Zion English, Kyle Lacey, Andreas
Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and
Robin Rombach. Sdxl: Improving latent diffusion mod-
els for high-resolution image synthesis. arXiv preprint
arXiv:2307.01952, 2023. 2
[48] Yifan Pu, Yiming Zhao, Zhicong Tang, Ruihong Yin, Haox-
ing Ye, Yuhui Yuan, Dong Chen, Jianmin Bao, Sirui Zhang,
Yanbin Wang, et al. Art: Anonymous region transformer for
variable multi-layer transparent image generation. In Pro-
ceedings of the Computer Vision and Pattern Recognition
Conference, pages 7952–7962, 2025. 1, 2, 4
[49] Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, and Rita
Cucchiara. Alfie: Democratising rgba image generation with
no $$$. arXiv preprint arXiv:2408.14826, 2024. 3
[50] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, et al. Learning transfer-
able visual models from natural language supervision. In
International Conference on Machine Learning, 2021. 6
[51] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang
Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman
Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt-
ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-
Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feicht-
enhofer. Sam 2: Segment anything in images and videos.
arXiv preprint arXiv:2408.00714, 2024. 6
[52] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models. In Proceedings of
the IEEE/CVF conference on computer vision and pattern
recognition, pages 10684–10695, 2022. 1, 2
[53] Soumyadip Sengupta, Vivek Jayaram, Brian Curless,
Steven M Seitz, and Ira Kemelmacher-Shlizerman. Back-
ground matting: The world is your green screen. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 2291–2300, 2020. 2, 3
[54] Mohammad Amin Shabani, Zhaowen Wang, Difan Liu,
Nanxuan Zhao, Jimei Yang, and Yasutaka Furukawa. Vi-
sual layout composer: Image-vector dual diffusion model for
design layout generation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
2024. 3
[55] Jiaming Song, Chenlin Meng, and Stefano Ermon.
Denoising diffusion implicit models.
arXiv preprint
arXiv:2010.02502, 2020. 2
[56] Yiren Song, Shijie Huang, Chen Yao, Xiaojun Ye, Hai Ci,
Jiaming Liu, Yuxuan Zhang, and Mike Zheng Shou. Proces-
spainter: Learn painting process from sequence data. arXiv
preprint arXiv:2406.06062, 2024. 2
[57] Yiren Song, Danze Chen, and Mike Zheng Shou. Layer-
tracer: Cognitive-aligned layered svg synthesis via diffusion
transformer. arXiv preprint arXiv:2502.01105, 2025. 2
[58] Yiren Song, Cheng Liu, and Mike Zheng Shou. Makeany-
thing:
Harnessing diffusion transformers for multi-
domain procedural sequence generation. arXiv preprint
arXiv:2502.01572, 2025. 2
[59] Yiren Song, Cheng Liu, and Mike Zheng Shou. Omniconsis-
tency: Learning style-agnostic consistency from paired styl-
ization data. arXiv preprint arXiv:2505.18445, 2025. 2
[60] Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei
Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio
Iacobacci, and Sarah Parisot. Mulan: A multi layer anno-
tated dataset for controllable text-to-image generation. In
Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition, 2024. 3, 4
[61] Cong Wan, Xiangyang Luo, Zijian Cai, Yiren Song, Yunlong
Zhao, Yifan Bai, Yuhang He, and Yihong Gong. Grid: Visual
layout generation. arXiv preprint arXiv:2412.10718, 2024.
5
[62] Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Ro-
hit Girdhar, and Ishan Misra. Instancediffusion: Instance-
level control for image generation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2024. 2
[63] Xierui Wang, Siming Fu, Qihan Huang, Wanggui He, and
Hao Jiang. Ms-diffusion: Multi-subject zero-shot image per-
sonalization with layout guidance. In Proceedings of the In-
ternational Conference on Learning Representations, 2025.
2
[64] Zile Wang, Hao Yu, Jiabo Zhan, and Chun Yuan. Alphavae:
Unified end-to-end rgba image reconstruction and genera-
tion with alpha-aware representation learning. arXiv preprint
arXiv:2507.09308, 2025. 4, 9
[65] Zitong Wang, Hang Zhao, Qianyu Zhou, Xuequan Lu, Xi-
angtai Li, and Yiren Song. Diffdecompose: Layer-wise de-
composition of alpha-composited images via diffusion trans-
formers. arXiv preprint arXiv:2505.21541, 2025. 2, 4
[66] Haohan Weng, Danqing Huang, Yu Qiao, Zheng Hu, Chin-
Yew Lin, Tong Zhang, and C. L. Philip Chen. Desigen: A
pipeline for controllable design template generation. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, 2024. 3
[67] Qiang Xiang and Shuang Sun. Layerdiffuse-flux: Flux version implementation of layerdiffusion. https://github.com/FireRedTeam/LayerDiffuse-Flux, 2025. Code repository, accessed 2025-11-13. 9
[68] Ning Xu, Brian Price, Scott Cohen, and Thomas Huang.
Deep image matting. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 2970–
2979, 2017. 2, 3
[69] Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim,
Xing Mei, Xiaohui Shen, and Liang-Chieh Chen. 1.58-bit
flux. arXiv preprint arXiv:2412.18653, 2024. 1, 2, 4, 5, 6
[70] Jinrui Yang, Qing Liu, Yijun Li, Soo Ye Kim, Daniil Pakho-
mov, Mengwei Ren, Jianming Zhang, Zhe Lin, Cihang Xie,
and Yuyin Zhou. Generative image layer decomposition with
visual effects. arXiv preprint arXiv:2411.17864, 2024. 4
[71] Tao Yang, Yingmin Luo, Zhongang Qi, Yang Wu, Ying Shan,
and Chang Wen Chen. Posterllava: Constructing a uni-
fied multi-modal layout generator with llm. arXiv preprint
arXiv:2406.02884, 2024. 3
[72] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-
adapter: Text compatible image prompt adapter for text-to-
image diffusion models. arXiv preprint arXiv:2308.06721,
2023. 2
[73] Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin
Jin, Wenjun Zeng, and Zhibo Chen. Inpaint anything:
Segment anything meets image inpainting. arXiv preprint
arXiv:2304.06790, 2023. 2
[74] Lvmin Zhang and Richard Zhang. Transparent image
layer diffusion using latent transparency. arXiv preprint
arXiv:2402.17113, 2024. 1, 4, 6, 8, 9
[75] Lvmin Zhang and Richard Zhang. Transparent image
layer diffusion using latent transparency. arXiv preprint
arXiv:2402.17113, 2024. 2, 3
[76] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding
conditional control to text-to-image diffusion models. In
Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 3836–3847, 2023. 2
[77] Xinyang Zhang, Wentian Zhao, Xin Lu, and Jeff Chien.
Text2layer: Layered image generation using latent diffusion
model. arXiv preprint arXiv:2307.09781, 2023. 2
[78] Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng
Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al.
Ssr-encoder: Encoding selective subject representation for
subject-driven generation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
pages 8069–8078, 2024. 2
[79] Yuxuan Zhang, Lifu Wei, Qing Zhang, Yiren Song, Jiaming
Liu, Huaxia Li, Xu Tang, Yao Hu, and Haibo Zhao. Stable-
makeup: When real-world makeup transfer meets diffusion
model. arXiv preprint arXiv:2403.07764, 2024.
[80] Yuxuan Zhang, Qing Zhang, Yiren Song, Jichao Zhang, Hao
Tang, and Jiaming Liu. Stable-hair: Real-world hair transfer
via diffusion model. In Proceedings of the AAAI Conference
on Artificial Intelligence, pages 10348–10356, 2025. 2
Supplementary
In the supplementary material, we provide additional de-
tails on the GPT-4-based automatic evaluation protocol, de-
scribe the design and results of our user study in both text-
to-PSD and image-to-PSD settings, present more qualita-
tive examples of OmniPSD’s layered poster generation and
reconstruction, and showcase the interactive user interface
and typical editing workflows supported by our system.
the full composited poster, the text layer, the foreground
layer(s), and the background layer, and assign a single in-
teger score in {1,2,3,4,5} based on visual consistency be-
tween layers, plausibility of occlusion and depth, and read-
ability and layout of the final composed poster.”
Images: [Upload the layered poster results]
Evaluation: The assistant scores each method from 1 to 5
and returns the result in JSON format.
A. GPT-4 EvaluationB. User Study
In this section, we provide additional details about the au-
tomatic GPT-4-based evaluation protocol used in our exper-
iments. We describe how candidate layered posters from
different methods are jointly presented to GPT-4, the dis-
crete 1–5 scoring rubric, the JSON output format, and how
we aggregate and normalize these scores to obtain the final
quantitative metric reported in the main paper.
Implementation details of the GPT-4 evaluation. We
adopt GPT-4 as an automatic visual judge to assess the qual-
ity of layered posters produced by different methods. For
each input (either a text description or an image), we collect
all candidate outputs from the compared methods and sub-
mit them together in a single query. GPT-4 then gives each
method an independent score, which allows a fair, side-by-
side comparison under exactly the same context.
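For concreteness, the listing below is a minimal sketch of how such a single side-by-side query could be assembled with the OpenAI Python SDK. The model name, file paths, and candidate labels are illustrative assumptions and do not correspond to the exact implementation used in our experiments.

import base64
import json
from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()

RUBRIC = (
    "You will evaluate layered poster results produced by multiple methods "
    "under the same input. For each method, inspect the full composited poster, "
    "the text layer, the foreground layer(s), and the background layer, and "
    "assign a single integer score in {1,2,3,4,5}. Return the scores as JSON, "
    'e.g. {"Method1": 4, "Method2": 5}.'
)

def to_data_url(path):
    # Encode a candidate poster image as a base64 data URL.
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def judge_case(candidates):
    # `candidates` maps a method name to the path of its composed-poster preview.
    # All candidates for one test case are submitted together in a single query.
    content = [{"type": "text", "text": RUBRIC}]
    for name, path in candidates.items():
        content.append({"type": "text", "text": f"Candidate: {name}"})
        content.append({"type": "image_url", "image_url": {"url": to_data_url(path)}})
    resp = client.chat.completions.create(
        model="gpt-4o",  # model name is an assumption
        messages=[{"role": "user", "content": content}],
    )
    # The judge is instructed to reply with a JSON object of integer scores.
    return json.loads(resp.choices[0].message.content)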
The assistant evaluates the layered poster results of dif-
ferent methods according to a 1–5 scale:
• 1 = very poor (severely unreasonable, chaotic structure,
and strong visual inconsistency),
• 2 = poor,
• 3 = fair / acceptable but with clear flaws,
• 4 = good,
• 5 = very good (clear structure, reasonable layer relation-
ships, and visually coherent as a whole).
Scores are output in JSON format, for example:

{
    "Method1": 4,
    "Method2": 5,
    "Method3": 4
}
For each method in a given query, GPT-4 assigns one
integer score within this range based on the overall visual
quality of the layered poster, including the consistency be-
tween layers, the plausibility of occlusion and depth, and
the readability and layout of the final composed poster. The
reported GPT-4 score in the main paper is obtained by av-
eraging these integer scores over all test cases and then lin-
early normalizing the result to [0, 1].
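The aggregation step can be made explicit with a short sketch, assuming the linear normalization maps a mean score s in [1, 5] to (s - 1) / 4; the example values are hypothetical.

from statistics import mean

def normalized_gpt4_score(case_scores, method):
    # `case_scores` holds one {method: integer score} dict per test case,
    # e.g. the JSON objects returned by the judge above.
    raw = [scores[method] for scores in case_scores]
    # Average over all test cases, then map [1, 5] linearly onto [0, 1].
    return (mean(raw) - 1) / 4

cases = [
    {"Method1": 4, "Method2": 5},
    {"Method1": 3, "Method2": 5},
    {"Method1": 4, "Method2": 4},
]
print(normalized_gpt4_score(cases, "Method2"))  # (14/3 - 1) / 4 ≈ 0.917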
Example of task prompt and evaluation. Prompt: “You will evaluate layered poster results produced by multiple methods under the same input. For each method, inspect the full composited poster, the text layer, the foreground layer(s), and the background layer, and assign a single integer score in {1,2,3,4,5} based on visual consistency between layers, plausibility of occlusion and depth, and readability and layout of the final composed poster.”
Images: [Upload the layered poster results]
Evaluation: The assistant scores each method from 1 to 5 and returns the result in JSON format.

B. User Study
This subsection gives additional information about the user study conducted to evaluate OmniPSD in both text-to-PSD and image-to-PSD settings. We describe the participant pool, the evaluation criteria, and the 5-point Likert rating protocol, and we summarize how subjective feedback from designers and students supports the quantitative improvements reported in the main paper.
Text-to-PSD. In the text-to-PSD setting, participants compared OmniPSD with LayerDiffuse-SDXL and GPT-Image-1 on 50 text prompts. For each generated layered poster, they rated two criteria on a 5-point Likert scale: (1) layering reasonableness (whether foreground, background, and text are separated in a semantically meaningful way with plausible occlusion and depth), and (2) overall preference (the overall visual appeal and usability of the final composed poster, including readability and layout). As summarized in Tab. 6, OmniPSD achieves the highest mean scores on both criteria, clearly outperforming the baselines.

Table 6. User study results for the text-to-PSD setting.

Metric                      LayerDiffuse-SDXL   GPT-Image-1   OmniPSD
Layering reasonableness     3.33                3.89          4.39
Overall preference          3.39                3.78          4.50
Image-to-PSD. In the image-to-PSD setting, participants
evaluated 50 poster images decomposed by OmniPSD and
three baselines (Kontext, Nano-Banana, and GPT-Image-1). For each decomposed result, they
rated three criteria on a 5-point Likert scale: (1) reconstruc-
tion consistency (how well the recomposed poster from the
predicted layers matches the original input image in content
and structure), (2) layering reasonableness (whether the re-
covered layers form a clean and plausible decomposition
with correct occlusion and depth), and (3) overall prefer-
ence (the perceived quality and practical usability of the lay-
ered result as a design asset). Tab. 7 shows that OmniPSD
again obtains the highest mean scores on all three criteria,
with consistent gains over the baselines.
Table 7. User study results for the image-to-PSD setting.

Metric                       Kontext   Nano-Banana   GPT-Image-1   OmniPSD
Reconstruction consistency   3.05      4.06          4.11          4.56
Layering reasonableness      3.44      4.16          4.22          4.61
Overall preference           3.39      4.33          4.28          4.72

Across both settings, designers particularly praised OmniPSD for its “clear layer separation” and “realistic transparency,” which enable direct reuse in professional editing workflows. These results confirm that OmniPSD provides superior structural consistency and practical value for real-world design generation and reconstruction.
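For reference, the per-method, per-criterion means in Tab. 6 and Tab. 7 can be computed from the individual 5-point Likert ratings with a simple aggregation; the sketch below assumes ratings are stored as (participant, method, criterion, score) records, and the example values are hypothetical.

from collections import defaultdict

def likert_means(ratings):
    # Group 5-point Likert scores by (method, criterion) and average them.
    buckets = defaultdict(list)
    for _participant, method, criterion, score in ratings:
        buckets[(method, criterion)].append(score)
    return {key: sum(scores) / len(scores) for key, scores in buckets.items()}

ratings = [
    ("p1", "OmniPSD", "layering reasonableness", 5),
    ("p2", "OmniPSD", "layering reasonableness", 4),
]
print(likert_means(ratings))  # {('OmniPSD', 'layering reasonableness'): 4.5}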
C. More Results
In this subsection, we present additional qualitative results
of OmniPSD in both image-to-PSD reconstruction and text-
to-PSD synthesis. These visual examples cover diverse lay-
outs and contents, illustrating the clarity of the recovered
layers, the realism of transparency and occlusion, and the
overall visual quality of the final composed posters.
Image-to-PSD reconstruction. Figure 7 shows more ex-
amples where OmniPSD decomposes input poster images
into layered PSD files and then recomposes them. The re-
constructions exhibit high fidelity to the original designs
while preserving clean layer boundaries that are convenient
for downstream editing.
Text-to-PSD synthesis. Figure 8 presents additional Om-
niPSD results in the text-to-PSD setting. Given only tex-
tual descriptions, our method synthesizes layered posters
with coherent foreground elements, legible text, and visu-
ally consistent backgrounds, demonstrating its versatility as
a generative design tool.
D. User Interface
In this subsection, we present the interactive user interface
of OmniPSD and demonstrate typical editing workflows
on a representative poster example. Starting from a user-
uploaded image, OmniPSD automatically infers a layered
representation that separates text, foreground objects, and
background regions into editable components. Through in-
tuitive point-and-click operations, users can modify textual
content, remove or replace the background, and delete or
adjust individual graphical elements while preserving the
overall layout and visual coherence of the design. This in-
terface illustrates how OmniPSD couples high-quality layer
decomposition with practical, user-friendly tools for real-
world poster editing and creation.
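As an illustration of how the edited layers are recomposed into a preview, the sketch below alpha-composites RGBA layer images back to front with Pillow. The layer file names are hypothetical and the listing is not the interface's actual implementation.

from PIL import Image  # Pillow

def compose_layers(layer_paths):
    # Alpha-composite RGBA layers in back-to-front order
    # (background -> midground -> foreground -> text).
    layers = [Image.open(p).convert("RGBA") for p in layer_paths]
    canvas = Image.new("RGBA", layers[0].size, (0, 0, 0, 0))
    for layer in layers:
        if layer.size != canvas.size:
            layer = layer.resize(canvas.size)
        canvas = Image.alpha_composite(canvas, layer)
    return canvas

poster = compose_layers([
    "background.png", "midground.png", "foreground.png", "text.png",
])
poster.save("recomposed_poster.png")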
Figure 7. More generation results of OmniPSD image-to-PSD reconstruction.
Figure 8. More generation results of OmniPSD text-to-PSD synthesis. Each example is shown together with its hierarchical prompt, which separately describes the complete poster, the foreground, the midground, and the background.
Figure 9. User interface and functional demonstration of OmniPSD: (a) user inference, (b) edit text, (c) remove background, (d) remove text, (e) remove element. Given a user-uploaded poster image, OmniPSD enables the addition, removal, and editing of textual and graphical elements.