Seedream 4.0: Toward Next-generation Multimodal Image Generation
ByteDance Seed
Abstract
We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system
that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a
single framework. We develop a highly efficient diffusion transformer paired with a powerful VAE that considerably reduces the number of image tokens. This allows our model to be trained efficiently and enables it to rapidly generate native high-resolution images (e.g., 1K-4K). Seedream
4.0 is pretrained on billions of text–image pairs spanning diverse taxonomies and knowledge-
centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled
with optimized strategies, ensures stable and large-scale training, with strong generalization. By
incorporating a carefully fine-tuned VLM, we perform multimodal post-training that jointly covers both T2I and image-editing tasks. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.4 seconds for generating a 2K image (without an LLM/VLM as the PE model). Comprehensive evaluations reveal that Seedream 4.0 achieves state-of-the-art results
on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal
capabilities in complex tasks, including precise image editing and in-context reasoning, and also supports multi-image reference inputs and multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creative and professional applications. Seedream 4.0 is now accessible on Volcano Engine α .
Official Page: https://seed.bytedance.com/seedream4_0
α Model ID: Doubao-Seedream-4.0
Figure 1 Overall evaluation. Left: Text-to-Image results; Right: Image-Editing results. The Elo scores are obtained
from the Artificial Analysis Arena. Seedream 4.0 ranks first on both the T2I and image-editing leaderboards, as of September 18, 2025.
Figure 2 Seedream 4.0 visualization.
Contents

1 Introduction
2 Data, Model Training and Acceleration
  2.1 Model Pre-training
  2.2 Model Post-training
  2.3 Model Acceleration
3 Model Performance
  3.1 Comprehensive Human Evaluation
    3.1.1 Text-to-Image
    3.1.2 Single-Image Editing
    3.1.3 Multi-Image Editing
  3.2 Automatic Evaluation with DreamEval
  3.3 Inspire Creativity via Seedream 4.0
    3.3.1 Precise Editing
    3.3.2 Flexible Reference
    3.3.3 Visual Signal Controllable Generation
    3.3.4 In-Context Reasoning Generation
    3.3.5 Multi-Image Reference Generation
    3.3.6 Multi-Image Output
    3.3.7 Advanced Text Rendering
    3.3.8 Adaptive Aspect Ratio and 4K Generation
4 Conclusion
A Contributions and Acknowledgments
  A.1 Core Contributors
  A.2 Contributors
1 Introduction
Diffusion models have ushered in a new era in generative AI, enabling the synthesis of images with remarkable
fidelity and diversity. Building on recent advances in diffusion transformers (DiTs), state-of-the-art open-source
and commercial systems have emerged, such as Stable Diffusion [18], FLUX series [7, 8], Seedream models
[3, 4, 21], GPT-4o image generation [15], and Gemini 2.5 Flash [5]. However, as the demand for higher image quality, greater controllability, and stronger multimodal capabilities (e.g., text-to-image (T2I) synthesis and image editing) increases, current models often face a critical scalability bottleneck.
In this paper, we introduce Seedream 4.0, a powerful multimodal generative model engineered for scalability
and efficiency. We develop an efficient and scalable DiT backbone, which substantially increases the model
capacity while reducing the training and inference FLOPs considerably. To further enhance model efficiency,
especially for high-resolution image generation, we have developed an efficient Variational Autoencoder (VAE)
with a high compression ratio, significantly reducing the number of image tokens in latent space. This
architectural design (including our DiT and VAE) makes our model highly efficient, easily scalable, and
hardware-friendly in both training and inference. Our training strategy is meticulously designed to unlock
the full potential of our architecture, achieving more than 10× inference acceleration compared to Seedream
3.0 [3], while having significantly better performance. This allows the model to be trained effectively on
billions of text–image pairs at native image resolutions ranging from 1K to 4K, covering a wide range of taxonomies and knowledge-centric concepts.
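As a back-of-the-envelope illustration of why a high-compression VAE matters at native 1K-4K resolutions, the sketch below counts the image tokens a DiT processes for a given spatial downsampling factor and patch size; the factors shown are illustrative assumptions rather than the exact Seedream 4.0 configuration.

```python
# Rough token-count arithmetic for a latent diffusion transformer.
# The downsampling factors and patch size are illustrative assumptions only.

def num_image_tokens(height: int, width: int, vae_downsample: int, patch_size: int) -> int:
    """Tokens seen by the DiT: (H / f / p) * (W / f / p) for VAE factor f and patch size p."""
    return (height // vae_downsample // patch_size) * (width // vae_downsample // patch_size)

for side in (1024, 2048, 4096):  # native 1K, 2K, and 4K square images
    baseline = num_image_tokens(side, side, vae_downsample=8, patch_size=2)     # common SD-style setup
    compressed = num_image_tokens(side, side, vae_downsample=16, patch_size=2)  # hypothetical higher compression
    print(f"{side}px: {baseline} -> {compressed} tokens ({baseline // compressed}x fewer)")
```

Because attention cost grows quadratically with sequence length, even a modest reduction in token count yields a much larger saving in attention compute, which is what makes training and sampling at native 4K tractable.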
In the post-training stage, we incorporate a VLM with a strong understanding of multimodal inputs into our system, as designed in SeedEdit [21], endowing the model with strong multimodal generation ability. We pioneer a joint post-training scheme that integrates both T2I generation and image editing through a causal diffusion design in the DiT framework. Our post-training stage starts with Continuing Training (CT) to broaden the
model’s foundational knowledge and multi-task proficiency. This is followed by Supervised Fine-Tuning (SFT),
which instills specific artistic qualities. Subsequently, we implement Reinforcement Learning from
Human Feedback (RLHF) to meticulously align the model’s outputs with nuanced human preferences. Then
a Prompt Engineering (PE) module is developed to unlock the full potential of the model across a diverse
spectrum of user inputs. To achieve ultra-fast inference on our DiT, we propose a holistic acceleration system
centered on an adversarial learning framework. This core algorithmic advance is synergistically combined with
hardware-aware quantization and speculative decoding, culminating in a system that delivers a second-level
generation experience without quality degradation.
In summary, Seedream 4.0 presents the following advantages:
• Efficient and Scalable Architecture. Our model combines a carefully designed DiT architecture with a powerful VAE that has a high compression ratio. This results in a highly efficient architecture that achieves more than 10× training and inference acceleration (measured in compute FLOPs) compared to Seedream 3.0 [3], while obtaining significantly better performance. The efficient architecture of Seedream 4.0 allows for strong scalability in terms of model capacity, task coverage, and multimodal generalization.
• Strong Multimodal Generation. We extend the SeedEdit 3.0 [20] architecture for multimodal generation, and perform multimodal joint post-training of T2I and image-editing tasks on our pre-trained DiT model. This equips Seedream 4.0 with strong multimodal capabilities that allow for single- or multi-image inputs and outputs.
• Professional Creation Scenarios. Beyond artistic imagery, Seedream 4.0 exhibits a strong capability to generate structured, professional, and knowledge-based content, such as charts, formulas, and design
materials, bridging the gap between creative generation and practical industry applications.
• Ultra-fast Inference Speed. With an efficient architecture design, we further optimize our framework aggressively to achieve extreme inference acceleration. This allows our model to perform ultra-fast image
generation and editing at high resolutions (e.g., 2K or 4K), greatly enhancing user interaction experience
and production efficiency.
Seedream 4.0 has been successfully integrated into multiple platforms as of September 2025, including Doubao (https://www.doubao.com/chat/create-image) and Jimeng (https://jimeng.jianying.com/ai-tool/image/generate). It is also accessible on Volcano Engine. We believe Seedream 4.0 will become a practical tool for improving productivity in all aspects of work and daily life.
2 Data, Model Training and Acceleration

2.1 Model Pre-training
Data. In Seedream 3.0, we introduced a dual-axis collaborative data sampling framework that jointly optimizes
pre-training data along two dimensions: visual morphology and semantic distribution. However, we observed
two limitations when applying a purely top-down resampling strategy:
• It disproportionately favors natural images in the overall distribution.
• It underrepresents fine-grained, knowledge-centric concepts (e.g., instructional content and mathematical
expressions).
To address these issues, we redesigned the pipeline specifically for knowledge-related data, including instructional images and formulae.
In our pipeline, knowledge data are categorized into natural and synthetic subsets. For natural images, we
collect high-quality figures from PDF documents that span in-house textbooks, research articles, and novels.
We first deploy a low-quality classifier to filter out undesirable samples (e.g., blurred images or cluttered, noisy backgrounds). Next, we train a difficulty-rating classifier with three levels (easy, medium, and hard) and annotate all images accordingly. Images of extreme difficulty are down-sampled during pre-training. For
synthetic data, we used both OCR output and LaTeX source code (when available) to generate diverse formula
images that vary in structure (layout, symbol density) and resolution. This synthesis strategy broadens the
coverage of fine-grained concepts and mitigates the biases introduced by top-down resampling.
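As a schematic of the difficulty-aware down-sampling step (not the production pipeline), the sketch below assumes the classifier emits one of the three labels above and drops only the hardest samples probabilistically; the keep rates are placeholders.

```python
import random

# Placeholder keep-probabilities per difficulty level; the production rates are not published.
KEEP_PROB = {"easy": 1.0, "medium": 1.0, "hard": 0.3}

def downsample_by_difficulty(samples, rng=random.random):
    """Keep easy/medium samples and probabilistically drop extremely difficult ones."""
    kept = []
    for sample in samples:
        level = sample["difficulty"]  # label assigned by the difficulty-rating classifier
        if rng() < KEEP_PROB.get(level, 1.0):
            kept.append(sample)
    return kept

corpus = [{"id": 1, "difficulty": "easy"}, {"id": 2, "difficulty": "hard"}, {"id": 3, "difficulty": "medium"}]
print([s["id"] for s in downsample_by_difficulty(corpus)])
```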
Beyond knowledge-related data, we introduce several module-level upgrades compared to our previous version.
(1) we train a text-quality classifier to detect low-quality text in the original captions; (2) we combine semantic
and low-level visual embeddings in the deduplication pipeline to boost the deduplication results, balancing
fine-grained distribution; (3) we refine the captioning model for finer-grained visual descriptions; and (4) we
adopt a stronger cross-modal embedding for image–text alignment, substantially improving our multimodal
retrieval engine.
Training Strategies. Similar to Seedream 3.0 [3], we adopt multi-stage training to improve training efficiency.
In the first stage, we train our DiT at an average resolution of 512² (with different aspect ratios). In the second stage, we fine-tune our model at higher resolutions spanning from 1024² to 4096². Thanks to the
efficient design of our model, training on these higher resolutions up to 4K is still effective.
Training Infrastructures. To enable efficient large-scale pre-training of the DiT model, we design a highly
optimized training system that emphasizes hardware efficiency, scalability, and robustness. The key components
are summarized as follows.
Parallelism and Memory Optimization. We employ Hybrid Sharded Data Parallelism (HSDP) to efficiently
shard weights and support large-scale training without resorting to tensor or expert parallelism. Memory
usage is optimized through timely release of hidden states, activation offloading, and enhanced FSDP support,
enabling training of large models within available GPU resources.
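The following is a generic PyTorch sketch of hybrid-sharded data parallelism with mixed precision, included only to make the HSDP setup concrete; the stand-in model and configuration values are illustrative rather than our production settings.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, CPUOffload, MixedPrecision

def build_dit() -> nn.Module:
    # Stand-in for the actual DiT backbone, which is not public.
    layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=8)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# HYBRID_SHARD shards parameters within a node and replicates across nodes,
# keeping all-gather traffic on fast intra-node links.
model = FSDP(
    build_dit(),
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    cpu_offload=CPUOffload(offload_params=False),   # offloading can be enabled to trade speed for memory
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.float32),
    device_id=torch.cuda.current_device(),
)
```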
Kernel and Workload Optimization. Performance-critical operations are accelerated by combining torch.compile
with handcrafted CUDA kernels and operator fusion, reducing redundant memory access. To address workload
imbalance from variable sequence lengths, we introduce a global greedy sample allocation strategy with
asynchronous pipelines, achieving more balanced per-GPU utilization.
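Below is a minimal sketch of one standard greedy scheme (longest samples first, assigned to the least-loaded worker) that captures the idea of balancing variable-length sequences across GPUs; sequence length stands in for the true per-sample cost, and the real allocator differs in detail.

```python
import heapq

def greedy_allocate(sample_lengths, num_gpus):
    """Assign each sample to the currently least-loaded GPU, longest samples first.

    Returns one list of sample indices per GPU; sequence length is used as a
    proxy for the true per-sample compute cost.
    """
    heap = [(0, gpu) for gpu in range(num_gpus)]  # (accumulated load, gpu id)
    heapq.heapify(heap)
    buckets = [[] for _ in range(num_gpus)]
    for idx in sorted(range(len(sample_lengths)), key=lambda i: -sample_lengths[i]):
        load, gpu = heapq.heappop(heap)
        buckets[gpu].append(idx)
        heapq.heappush(heap, (load + sample_lengths[idx], gpu))
    return buckets

print(greedy_allocate([4096, 1024, 2048, 512, 3072, 256], num_gpus=2))
```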
Fault Tolerance. Multi-level fault tolerance is built into the system, including periodic checkpointing of model,
optimizer, and dataloader states, pre-launch health checks to exclude faulty nodes, and reduced initialization
overhead. These measures ensure stability and sustained throughput during long-term distributed training.
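As a simplified illustration of the checkpointing side of this design, the sketch below periodically persists model, optimizer, and dataloader state so training can resume after a failure; the sharded checkpoint format and health checks of the real system are not reproduced.

```python
import os
import torch

def maybe_checkpoint(step, model, optimizer, dataloader_state, ckpt_dir="ckpts", every=1000):
    """Save everything needed to resume training, every `every` steps (interval is illustrative)."""
    if step % every != 0:
        return
    os.makedirs(ckpt_dir, exist_ok=True)
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "dataloader": dataloader_state,  # e.g., shard cursor and RNG state
        },
        os.path.join(ckpt_dir, f"step_{step:08d}.pt"),
    )
```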
2.2 Model Post-training
We perform an intensive post-training to enhance the multimodal capabilities of our model, including T2I,
single image editing, and multi-image reference and output. Specifically, we perform joint training of multiple
tasks through multi-stage post-training, comprising continuing training (CT), supervised fine-tuning (SFT), and human-feedback alignment (RLHF) [11, 23–25]; we also develop a carefully fine-tuned prompt engineering (PE) model. Performance improves consistently and significantly at each substage, resulting in stronger results than models trained on individual tasks. In particular, the CT stage mainly enhances the instruction-following ability for image editing, while the SFT stage further improves the consistency between the reference and edited images considerably.
We construct a large amount of editing data that is used in the CT and SFT stages. Each data sample
typically has a reference image and a target image, with an editing instruction. Image captions are produced
for both reference and target images. We designed three types of captions with different levels of detail, which
function as a form of data augmentation during training. In addition, we encourage the use of consistent
terminology in captions to describe similarities between reference and target images.
We trained an end-to-end vision-language model (VLM) as our PE model, based on Seed1.5-VL [6]. This VLM processes user input, including a text prompt, a single reference image, or multiple images, and generates the corresponding outputs (for example, the captions of the reference image and the target or predicted image, as in SeedEdit 3.0 [20]), which are then used as input to the DiT model. The functions
of the PE model also include task routing, prompt rewriting (with auto-thinking), and optimal aspect ratio
estimation. To balance latency and performance, our model dynamically adjusts its thinking budgets based on
task complexity, inspired by AdaCoT [12]. This integrated approach enables Seedream 4.0 to better address
user intentions, perform complex reasoning, and generate a series of images from a single request.
2.3 Model Acceleration
Efficient, High-Quality Synthesis. Our acceleration framework integrates principles from Hyper-SD [17],
RayFlow [19], APT [10], and ADM [13] to accelerate Diffusion Transformers (DiTs). Our approach establishes
an innovative paradigm where each sample follows an optimized, adaptive trajectory, rather than a shared
path to a Gaussian prior. This customization minimizes trajectory overlap and reduces instability. To learn
these paths effectively, we employ an adversarial matching framework that replaces fixed divergence metrics,
circumventing mode collapse and significantly improving generation stability and sample diversity. This is
achieved through a two-stage process, starting with a robust Adversarial Distillation post-training (ADP) stage
that uses a hybrid discriminator to ensure a stable initialization. Following this, an Adversarial Distribution
Matching (ADM) framework employs a learnable, diffusion-based discriminator for fine-tuning, enabling a
more fine-grained matching of complex distributions. Our unified pipeline enables highly efficient few-step
sampling, drastically reducing the Number of Function Evaluations (NFE) while achieving results that match
or surpass baselines requiring dozens of steps across key dimensions like aesthetic quality, text-image alignment,
and structural fidelity, effectively balancing quality, efficiency, and diversity.
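To make the NFE reduction concrete, the following is a generic few-step Euler-style sampling loop for a flow-matching model, in which the distilled network is evaluated only a handful of times instead of dozens; it is a schematic of few-step sampling in general, not the ADP/ADM training procedure, and the toy velocity model is a placeholder.

```python
import torch

@torch.no_grad()
def few_step_sample(velocity_model, latent_shape, num_steps=4, device="cpu"):
    """Generic few-step sampler; each loop iteration is one function evaluation (NFE)."""
    x = torch.randn(latent_shape, device=device)               # start from the Gaussian prior
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        v = velocity_model(x, t_cur.expand(latent_shape[0]))   # distilled model predicts the update direction
        x = x + (t_next - t_cur) * v                           # single Euler step toward the data distribution
    return x                                                   # total NFE == num_steps (e.g., 4)

toy_model = lambda x, t: -x                                    # placeholder standing in for a distilled DiT
latents = few_step_sample(toy_model, (1, 4, 64, 64), num_steps=4)
```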
Quantization. To further boost inference performance without quality loss, we employ a hardware-aware
framework combining quantization and sparsity. Our approach uses an adaptive 4/8-bit hybrid quantization,
which involves offline smoothing to handle outliers, a search-based optimization to find the best granularity
and scaling for sensitive layers, and post-training quantization (PTQ) to finalize parameters. We co-design this
with efficient, hardware-specific operators for various bit widths and granularities to maximize performance.
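A minimal sketch of the offline-smoothing idea (per-channel rebalancing of quantization difficulty between activations and weights) followed by plain symmetric post-training quantization is shown below; the bit widths, smoothing exponent, and calibration statistics are illustrative, and the hardware-specific operators and granularity search are not shown.

```python
import torch

def smooth(weight, act_absmax, alpha=0.5):
    """Rescale per input channel so activation outliers become easier to quantize."""
    w_absmax = weight.abs().amax(dim=0).clamp(min=1e-5)                   # per-input-channel weight range
    scale = act_absmax.clamp(min=1e-5) ** alpha / w_absmax ** (1 - alpha)
    return weight * scale, 1.0 / scale                                    # 1/scale is folded into the activations

def quantize_symmetric(t, bits=8):
    """Plain symmetric per-tensor PTQ: round to signed integers and dequantize."""
    qmax = 2 ** (bits - 1) - 1
    step = t.abs().max().clamp(min=1e-8) / qmax
    return (t / step).round().clamp(-qmax, qmax) * step

weight = torch.randn(1024, 1024)                 # (out_features, in_features)
act_absmax = torch.rand(1024) * 10               # per-channel calibration statistics (illustrative)
smoothed_w, act_scale = smooth(weight, act_absmax)
w_q8 = quantize_symmetric(smoothed_w, bits=8)    # sensitive layers stay at 8-bit; others could use bits=4
```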
Speculative Decoding for PE. Our method builds upon the foundational work of Hyper-Bagel [14] to address
the inherent uncertainty in speculative decoding that arises from stochastic token sampling. Our solution
conditions feature prediction on both the preceding feature sequence and a token sequence advanced by one
timestep. This provides a deterministic target that resolves sampling ambiguity and significantly enhances
prediction accuracy. We further improve this process by incorporating a loss function on Key-Value (KV)
caches to enable efficient reuse during inference and an auxiliary cross-entropy loss on logits to refine the
draft model.
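As a generic illustration of the draft-and-verify pattern underlying speculative decoding (a simplification of the method described above), the sketch below lets a small draft model propose k tokens greedily and keeps the longest prefix the target model agrees with; the toy logits and greedy acceptance rule are assumptions for illustration.

```python
import torch

@torch.no_grad()
def speculative_step(target_logits_fn, draft_logits_fn, prefix, k=4):
    """Draft k tokens with the cheap model, then accept the prefix the target model confirms."""
    draft = prefix
    for _ in range(k):                                         # draft model proposes k tokens, one at a time
        next_tok = draft_logits_fn(draft)[-1].argmax().view(1)
        draft = torch.cat([draft, next_tok])
    # Target model scores all drafted positions in a single parallel pass.
    target_pred = target_logits_fn(draft)[len(prefix) - 1:-1].argmax(dim=-1)
    proposed = draft[len(prefix):]
    accepted = int((proposed == target_pred).long().cumprod(dim=0).sum())  # length of agreeing prefix
    return torch.cat([prefix, proposed[:accepted], target_pred[accepted:accepted + 1]])

table = torch.randn(100, 8)                                    # toy vocabulary embeddings
toy_logits = lambda ids: table[ids] @ table.T                  # same toy model used as draft and target
print(speculative_step(toy_logits, toy_logits, prefix=torch.tensor([1, 2, 3]), k=4))
```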
Figure 3 Results from the Artificial Analysis Arena¹: Seedream 4.0 leads in both the Text-to-Image and Image Editing tracks. Left: Artificial Analysis Text-to-Image leaderboard; Right: Artificial Analysis Image Editing leaderboard.
3 Model Performance
In this section, we present a comprehensive evaluation of Seedream 4.0. First, we report overall Elo scores from a public platform, the Artificial Analysis Arena [1], as shown in Figure 3. By maintaining
a real-time competitive arena, Artificial Analysis Arena continuously incorporates newly released models and
provides dynamic leaderboards. The participants cover the Seedream series (Seedream 3.0 [3], Seedream 4.0,
and SeedEdit 3.0 [20]) alongside other leading models, including GPT-Image-1 [16], Gemini-2.5 Flash Image [5]
(abbreviated as Gemini-2.5), and open-source models such as Qwen-Image [22] and FLUX-Kontext [8]. The
results indicate that Seedream 4.0 ranks first in both the single-image editing and text-to-image tracks.
To further explore the fine-grained capabilities of Seedream 4.0, we provide both human evaluation and
automated benchmarking results. Across various tasks and dimensions, Seedream 4.0 consistently delivers
top-tier performance. Finally, we highlight its excellent multimodal image generation capabilities. By flexibly
combining input and output modalities, Seedream 4.0 can support a wide range of creative applications. We
also present illustrative examples of these functionalities, while noting that further possibilities remain to be
explored through user interaction.
3.1 Comprehensive Human Evaluation
To benchmark the performance of Seedream 4.0 against other top-tier image generation models, we constructed
a comprehensive multimodal benchmark, MagicBench 4.0. The benchmark covers three major task categories:
text-to-image (T2I) generation, single-image editing, and multi-image editing. The three tracks consist of 325
prompts, 300 prompts, and 100 prompts, respectively; and each prompt is provided in both Chinese and
English. In the following sections, we present a detailed analysis of various models in these tasks.
3.1.1 Text-to-Image
As a fundamental capability of image generation models, T2I generation has always been a key focus of
the Seedream series. In addition to conventional dimensions such as prompt alignment, structural stability,
and visual aesthetics, we evaluate the model’s performance in two additional aspects: dense text rendering
and content understanding. The latter is particularly relevant to prompts that require advanced in-context
reasoning or specialized domain knowledge. As shown in Figure 1, Seedream 4.0 demonstrates significant
improvements in all evaluation dimensions compared to its predecessor. In particular, it substantially
outperforms competing models in visual aesthetics. Figure 4 presents several challenging cases of T2I by
comparing Seedream 4.0 with GPT-Image-1 and Gemini-2.5. These samples illustrate the model’s capability
in ensuring strong instruction adherence, precise text rendering, and stable visual quality. Across these comparisons, Seedream 4.0 stands out for its superior visual impact, including a dynamic sense of motion, natural lighting, and coherent color composition.

¹ Data as of 17:00 (Beijing Time), September 18, 2025. https://artificialanalysis.ai/text-to-image/arena?tab=Leaderboard

Figure 4 Text-to-Image comparisons. Within each group, images are arranged from left to right as Seedream 4.0, GPT-Image-1, and Gemini-2.5. Prompt 1: Draw an infographic showing the causes of inflation. Prompt 2: A studio-style portrait of an elderly man: he ..., with his left eye tightly closed in a witty and amusing manner. Prompt 3 (originally in Chinese): A first-person selfie of a dragon-li tabby cat with a tiger chasing behind it; surrealism, photorealistic, dusk background. Prompt 4: A handsome male model whose left eye pupil is blue wears a torn suit with a pink rose in his left hand; there is only a single dewdrop clearly visible on the edge of the rose petal, as if it were about to fall from the petal to the ground in the next second.
3.1.2 Single-Image Editing
Figure 5 Single-Image Editing comparisons. Within each group, images are arranged from left to right as the original image, Seedream 4.0, GPT-Image-1, and Gemini-2.5. Prompt 1 (originally in Chinese): Change the two dogs in the image to LEGO style. Prompt 2: Translate the text of the product image into English. Prompt 3: Please modify the photo's perspective to make it look like it was taken directly facing the stairs. Prompt 4 (originally in Chinese): Add a giraffe at the position of the red box in the image.
Seedream 4.0 integrates both editing and generation capabilities in a unified pipeline, enabling them to enhance each other and delivering performance that exceeds the previous version, SeedEdit 3.0. A
crucial challenge in image editing lies in the trade-off between instruction following and consistency, which
is also what our evaluation focuses on. In addition, we consider structural integrity and text-editing
performance. As shown in Figure 1, the results reveal distinct patterns in all leading models. GPT-Image-1
achieves the highest accuracy in instruction following, but ranks lowest in consistency, which is a limitation
that has been widely noted. In contrast, Gemini-2.5 excels at preservation, but shows limited capability in
instruction following, particularly for tasks such as style transfer and viewpoint transformation, as shown in
Figure 5; it also struggles with text editing, especially in Chinese. Seedream 4.0, by comparison, demonstrates
more balanced performance in all dimensions. It supports a wide range of editing tasks, maintains strong
consistency, and thus achieves a higher level of practical usability.
3.1.3 Multi-Image Editing
Figure 6 Multi-image editing comparison. Left: overall evaluation results. Right: two examples, with dashed boxes
indicating input references and outputs shown left-to-right as Seedream 4.0, GPT-Image-1, and Gemini-2.5. In the
rightmost example, the top-right image is from Seedream 4.0. Prompt 1: Referring to the color chart in picture two, transform the room and all the furniture into a dopamine style. Prompt 2: Place the items in image one on the top shelf of the cabinet. Place the items in image two on the middle shelf. Place the items in image three on the bottom shelf of the cabinet.
Multi-image editing goes beyond the simple combination of multiple images; it requires models to perform
rich in-context understanding of objects across different inputs. We compare the performance of Seedream
4.0 against GPT-Image-1 and Gemini-2.5 using an overall metric (GSB) with three objective dimensions: instruction alignment, consistency, and structure. As illustrated in Figure 6, the results mirror those of
single-image editing: GPT-Image-1 shows strength in instruction responsiveness but weak consistency, while
Gemini-2.5 excels in preservation, but falls short in responsiveness. In contrast, Seedream 4.0 performs at
the highest levels in all dimensions, outperforming the other two models by almost 20% in the GSB metric.
A crucial consideration in multi-image editing is structural integrity. We observed that as the number of
reference images increases, the outputs of other models tend to suffer from structural degradation. However,
Seedream 4.0 maintains more stable and coherent structures, demonstrating robust performance even when
provided with more than ten reference images.
3.2 Automatic Evaluation with DreamEval
Figure 7 Automatic Evaluation with DreamEval. Left: Text-to-Image; Right: Single-Image Editing.
Automatic evaluation is an essential component of model evaluation as it enables large-scale testing and
provides more stable and rapid feedback compared to purely human assessments. We introduce DreamEval, a
comprehensive multimodal benchmark comprising four generation scenarios and containing 128 sub-tasks with 1,600 prompts. The scoring process is broken down into fine-grained visual question answering for each prompt, making evaluation more interpretable and deterministic. In particular, DreamEval also incorporates tiered difficulty levels that separately assess basic generation skills, advanced generation abilities, and higher-order understanding and reasoning capacity.

Figure 8 Examples of precise editing.
The overall results for instruction following are shown in Figure 1 and Figure 6, where the results are well aligned with human evaluations. Detailed performance across three levels of difficulty in the T2I and single-image editing tasks is shown in Figure 7. Several observations can be obtained: (1) Seedream 4.0
and GPT-4o outperform other models in instruction adherence, although Seedream 4.0 exhibits greater
variability: it has a slightly lower average score, but its "best-of-4" results are better, suggesting that users
can obtain better outputs through sampling. (2) Seedream 4.0 performs well at the Easy and Medium levels,
demonstrating strong generative responsiveness; but its performance drops at the Hard level, especially in
single-image editing. This highlights the need for improvement in multimodal understanding and reasoning, which we plan to address by scaling our models with related data.
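The fine-grained visual-question-answering scoring described above can be made concrete with the following sketch, in which each prompt carries a checklist of yes/no questions answered by a VLM judge; the function names and simple averaging are assumptions for illustration, not the actual DreamEval implementation.

```python
from typing import Callable, Dict, List

def score_prompt(image, questions: List[str], vlm_judge: Callable[[object, str], bool]) -> float:
    """Fraction of checklist questions the judge answers 'yes' for one generated image."""
    answers = [vlm_judge(image, question) for question in questions]
    return sum(answers) / max(len(answers), 1)

def score_benchmark(results: List[Dict], vlm_judge) -> float:
    """Average per-prompt scores; each result pairs a generated image with its question checklist."""
    return sum(score_prompt(r["image"], r["questions"], vlm_judge) for r in results) / len(results)

# Toy usage with a stub judge that ignores the image.
stub_judge = lambda image, question: "red" in question
demo = [{"image": None, "questions": ["Is the car red?", "Is there exactly one car?"]}]
print(score_benchmark(demo, stub_judge))
```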
3.3 Inspire Creativity via Seedream 4.0
We present several illustrative use cases of Seedream 4.0 in this section. These examples highlight only part of
its potential, as further creative applications will emerge through user exploration.
3.3.1 Precise Editing
Image editing has long been a critical challenge for generative models, with the main difficulty lying in achieving
the desired modifications while preserving the majority of the original visual characteristics. Seedream 4.0
enables high-quality image editing solely from prompt-based input. It demonstrates strong instruction-
following capability, delivering precise modifications while largely preserving the integrity of the surrounding visual content. As illustrated in Figure 8, beyond canonical tasks such as addition, deletion, modification, and replacement, Seedream 4.0 shows remarkable performance in a variety of practical editing scenarios. For instance, in background replacement, it seamlessly integrates the foreground with other elements, and in portrait retouching, it delivers results that exhibit photographic realism.

Figure 9 Examples of reference generation.
3.3.2 Flexible Reference
Unlike image editing, reference-based generation presents a more challenging trade-off between preservation
and creativity. This difficulty arises from the inherently ambiguous definition of what should be preserved.
It may be a person’s ID or IP, a particular artistic style, or even an abstract concept. Consequently, the
range of possible applications is broader and more diverse. As illustrated in Figure 9, Seedream 4.0 supports
seamless transformations across 2D/3D domains with varying viewpoints. It enables derivative designs such
as dolls, clothes, or memes from a single reference image. In addition, benefiting from the strong consistency
provided by Seedream 4.0, it can be effectively applied to identity-sensitive scenarios, including generating portrait
photographs in different styles or creating characters for film.
3.3.3 Visual Signal Controllable Generation
Visual signals, such as Canny edges, sketches, inpainting masks, or depth maps, have long been a crucial
component of controllable generation. They enable a transfer of information that is often difficult to describe
through language, such as human poses or precise spatial layouts, thereby allowing for more accurate and
targeted generation. Traditionally, this capability has been developed using multiple specialized models such
as ControlNet [9, 26]. In contrast, Seedream 4.0 natively integrates these functionalities within a single model.
Beyond supporting the common forms of visual guidance, it also accommodates creative inputs through simple
strokes or sketches, and even enables new multi-image compositions driven by visual signals. Illustrative
examples are provided in Figure 10.
Figure 10 Examples of visual signal controllable generation.
3.3.4 In-Context Reasoning Generation
With the increasing intelligence of multimodal models, a new paradigm for in-context reasoning generation
has emerged. Traditional image generation aims primarily at producing outputs that strictly follow the
instructions given. In contrast, reasoning-based generation requires the model to go a step further: it must
extract implicit contextual cues and infer plausible outcomes before generating the image. This process may
involve expanding the original prompt and interpreting reference images. As illustrated in Figure 11, Seedream
4.0 demonstrates reasoning capabilities across various in-context understanding tasks, including interpreting
physical and temporal constraints of the real world, as well as imagining three-dimensional space. Additionally, Seedream 4.0 can perform tasks such as puzzle solving, crossword filling, and comic continuation, all
while faithfully preserving the visual style and fine-grained details of the given input.
3.3.5 Multi-Image Reference Generation
Benefiting from the richer information provided by multiple images, multi-image reference enables more
imaginative and versatile applications. Beyond conventional tasks such as virtual try-on or image collage,
it supports flexible composition of multiple characters or objects, as well as abstract style transfer. Unlike
text conditioning, which requires explicitly specifying attributes or styles, multi-image editing compels the
model to autonomously extract salient features from reference images and transfer them to the target. As
illustrated in Figure 12, Seedream 4.0 can handle reference-based editing with more than ten input images,
while maintaining high fidelity in transferring abstract styles such as origami or Baroque aesthetics. Moreover,
it effectively manages relative object scales and produces meaningful multi-object compositions, such as
assembling mechanical parts, demonstrating a robust understanding of physical-world structures.
3.3.6 Multi-Image Output
Single-image generation is insufficient for many creative scenarios that require coherent multi-image outputs.
Leveraging strong capabilities in global planning and in-context consistency, Seedream 4.0 supports the
generation of image sequences that remain both character-consistent and stylistically aligned. As illustrated in
Figure 13, this enables sequential image generation based on given characters, which is particularly beneficial for storyboarding and comic creation. Seedream 4.0 can also produce sets of images with a consistent visual identity, a feature highly valuable for IP-based product design and the creation of emojis.

Figure 11 Examples of reasoning generation.
3.3.7 Advanced Text Rendering
Seedream 4.0 introduces enhanced text-rendering capabilities that go beyond mere demonstration to serve
practical applications. With intelligent understanding and extension as well as high-precision dense text
rendering capabilities, it supports various complex text and graphic generation tasks, including designing
layouts for user interfaces, posters, or schematics, as well as generating knowledge-intensive visualizations such
as mathematical formulas, chemical equations, or statistical charts, as shown in Figure 14. Such capabilities
make it feasible for the model to directly produce educational materials, technical manuals, or marketing
content. In addition, Seedream 4.0 enables precise text-aware editing, including content replacement, layout
adjustment, and font modification, thereby extending its rendering capacity to practical workflows and offering
support for work-related scenarios.
3.3.8 Adaptive Aspect Ratio and 4K Generation
Traditional generation models typically require a specified resolution, and selecting an unsuitable aspect ratio
can lead to suboptimal composition and layout. Seedream 4.0 introduces an adaptive aspect ratio mechanism
(while still supporting user-specified size), enabling the model to automatically adjust the canvas according to
either the semantic requirements or the reference objects’ shapes. As illustrated in Figure 15, it allows the
generation of images with more aesthetically pleasing and contextually appropriate compositions. Moreover,
Seedream 4.0 further extends its supported resolution up to 4K. This advancement goes beyond research
prototypes, delivering image quality suitable for commercial applications.
Figure 12 Examples of multi-image composition.

Figure 13 Examples of multi-image output generation.

Figure 14 Examples of advanced text rendering.

4 Conclusion

In this report, we present Seedream 4.0, an advanced multimodal image generation framework that integrates an efficient and scalable diffusion transformer with a high-compression VAE, achieving more than ten times the acceleration of the previous Seedream 3.0 model while delivering superior performance in all aspects.
By performing joint post-training on T2I and image editing tasks, Seedream 4.0 provides strong multimodal
generation capabilities that support diverse inputs and outputs. It demonstrates broad potential for creative
exploration, including precise image editing, reference-based generation, multi-image composition, and multi-
image output. In particular, the designed model architecture is highly efficient and stable, which allows us to
scale it effectively, with considerable performance improvement achieved (in our ongoing work). Furthermore,
with advanced inference acceleration technologies, Seedream 4.0 enables ultrafast image generation and editing
at high resolutions. In addition, it has a strong ability to support complicated content generation for professional scenarios, such as knowledge-centric generation, which is difficult for previous models to perform. With its integration into platforms such as Doubao and Jimeng/Dreamina [2], Seedream 4.0 shows
great potential to become a powerful productivity tool in content creation, for professional and everyday
applications.
Figure 15 Examples of adaptive ratio and 4K generation (panels labeled 2.0-1k, 3.0-2k, and 4.0-4k).
References
[1] Artificial Analysis. Text-to-Image Arena. https://artificialanalysis.ai/text-to-image/arena?tab=Leaderboard, 2025.
[2] Dreamina. https://dreamina.capcut.com/, 2025.
[3] Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao,
Liyang Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025.
[4] Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun
Shi, et al. Seedream 2.0: A native chinese-english bilingual image generation foundation model. arXiv preprint
arXiv:2503.07703, 2025.
[5] Google. Gemini 2.5 Flash Image. https://deepmind.google/models/gemini/image/, 2025.
[6] Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang,
Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025.
[7] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023.
[8] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne,
Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li,
Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith.
FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space, 2025. URL https://arxiv.org/abs/2506.15742.
[9] Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, and Chen Chen. Controlnet++:
Improving conditional controls with efficient consistency feedback. In European Conference on Computer Vision,
pages 129–147. Springer, 2025.
[10] Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training
for one-step video generation. arXiv preprint arXiv:2501.08316, 2025.
[11] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli
Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025.
[12] Chenwei Lou, Zewei Sun, Xinnian Liang, Meng Qu, Wei Shen, Wenqi Wang, Yuntao Li, Qingping Yang, and
Shuangzhi Wu. Adacot: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning. arXiv
preprint arXiv:2505.11896, 2025.
[13] Yanzuo Lu, Yuxi Ren, Xin Xia, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Andy J Ma, Xiaohua Xie, and
Jian-Huang Lai. Adversarial distribution matching for diffusion distillation towards efficient image and video
synthesis. arXiv preprint arXiv:2507.18569, 2025.
[14] Yanzuo Lu, Xin Xia, Manlin Zhang, Huafeng Kuang, Jianbin Zheng, Yuxi Ren, and Xuefeng Xiao. Hyper-Bagel: A unified acceleration framework for multimodal understanding and generation, 2025. URL https://arxiv.org/abs/2509.18824.
[15] OpenAI. GPT-4o image generation. https://openai.com/index/introducing-4o-image-generation/, 2025.
[16] OpenAI: Aaron Hurst, Adam Lerer, et al. GPT-4o system card, 2024. URL https://arxiv.org/abs/2410.21276.
[17] Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-sd:
Trajectory segmented consistency model for efficient image synthesis. Advances in Neural Information Processing
Systems, 37:117340–117362, 2025.
[18] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
[19] Huiyang Shao, Xin Xia, Yuhong Yang, Yuxi Ren, Xing Wang, and Xuefeng Xiao. Rayflow: Instance-aware
diffusion acceleration via adaptive flow trajectories. arXiv preprint arXiv:2503.07699, 2025.
[20] Yichun Shi, Peng Wang, and Weilin Huang. Seededit: Align image re-generation to image editing, 2024. URL
https://arxiv.org/abs/2411.06686.
[21] Yichun Shi, Peng Wang, and Weilin Huang. Seededit: Align image re-generation to image editing. arXiv preprint
arXiv:2411.06686, 2024.
[22] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao
Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng,
Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng,
Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang
Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, and Zenan Liu. Qwen-image technical report, 2025. URL
https://arxiv.org/abs/2508.02324.
[23] Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al.
Rewarddance: Reward scaling in visual generation. arXiv preprint arXiv:2509.08826, 2025.
[24] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward:
Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information
Processing Systems, 36, 2024.
[25] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo,
Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025.
[26] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models,
2023. URL https://arxiv.org/abs/2302.05543.
Appendix
A Contributions and Acknowledgments
All contributors of Seedream are listed in alphabetical order by their last names.
A.1 Core Contributors
Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang,
Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao,
Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian,
Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin
Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng
Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu
A.2 Contributors
Haoshen Chen, Kaixi Chen, Tiantian Cheng, Fei Ding, Xiaojing Dong, Xin Dong, Yiming Fan, Yongde Ge,
Shucheng Guo, Bibo He, Jiaao He, Zhuo Jiang, Lurui Jin, Hongwei Kou, Bo Li, Changchun Li, Hao Li, Huixia
Li, Jiashi Li, Yameng Li, Ying Li, Yiying Li, Zijie Li, Heng Lin, Zhijie Lin, Gaohong Liu, Mingcong Liu, Shu
Liu, Zuxi Liu, Zhangfan Lu, Xiaonan Nie, Shuang Ouyang, Ronggui Peng, Keer Qin, Xudong Sun, Yang Tai,
Rupeng Tian, Lei Wang, Sen Wang, Xuanda Wang, Yinuo Wang, Shaojin Wu, Xiaohu Wu, Wenpeng Xiao,
Yihang Yang, Yao Yao, Linxiao Yuan, Dingyun Zhang, Kai Zhang, Manlin Zhang, Xinlei Zhang, Yanling
Zhang, Yun Zhang, Zixuan Zhang, Fengxuan Zhao, Hao Zheng, Jianbin Zheng