Introduction
Over the past few years, image generation has seen remarkable progress. Diffusion and flow-matching models can generate high-resolution images, produce sharp photorealism and stable structure, render dense text, encode broad world knowledge, and follow user prompts in precise detail. These improvements have been driven by several interacting factors, including scalable transformer architectures, improved captioning and text encoders, better latent representations, and pipelined post-training techniques. Yet as the field has optimized for reliability on these capabilities, many systems have converged toward a narrow set of default aesthetics. This convergence makes them effective production tools but less effective engines for creative exploration, where users often need to search across styles, moods, compositions, and visual directions rather than receive a single polished default.
To address these limitations, we present Krea 2, a series of foundation models focused on creative exploration. Krea 2’s models are built on the belief that image generation should be an exploratory medium: expressive enough to span many aesthetics, and controllable enough for creators to navigate them.
We built a large-scale data infrastructure and distributed training framework from scratch to curate a comprehensive pretraining dataset with broad world knowledge and style coverage.
Using this infrastructure, we train expressive models through a multi-stage pipeline spanning pretraining, midtraining, supervised finetuning (SFT), preference optimization, and reinforcement learning (RL), with each stage designed to progressively refine the model’s output distribution. We develop a simple yet performant diffusion transformer (DiT) architecture through thorough ablations. Our model incorporates several components that accelerate convergence, including iREPA, improved VAEs, and Qwen3-VL. We also integrate several architectural improvements, including grouped-query attention (GQA), sigmoid-gated attention, lightweight timestep modulation, and multilayer feature aggregation for text-encoder features, which together improve training stability and efficiency.
A strong base model is only useful if users can reliably reach the parts of its distribution they care about. In training, the model learns from rich, carefully constructed captions that describe images with dense visual detail. In practice, user inputs are often shorter, more ambiguous, and shaped by many different habits of expression. Some users describe a scene in natural language; others gesture toward a mood, a style, or a reference image. This creates a gap between the model’s learned conditioning space and the way creative intent is expressed at inference time.
To reduce this gap, we build two systems that make Krea 2 more exploratory and steerable from both text and image inputs: a prompt expander and a style-reference system. The prompt expander maps simple or underspecified user prompts into richer visual directions without overwriting the user’s intent. It is trained through a two-stage SFT and RL pipeline on top of open-source LLMs, where the objective is not only to improve image quality, but also to encourage creative variation and controllable exploration. Complementing this textual interface, the style-reference system lets users express visual intent through images when words are insufficient. It allows users to inject the style or mood of one or more reference images with minimal content leakage, while providing fine-grained control over style strength and weighted style mixing.
Together, these components define Krea 2 as a foundation model for exploratory generation. Instead of optimizing only for a single polished default, Krea 2 is designed to expose a broad visual space and give users practical ways to move through it, using both text- and image-based control. Krea 2 is among the top 10 models on the Artificial Analysis leaderboard for text-to-image, and ranks second among models from independent labs. Krea 2 serves as a comprehensive baseline and enables a creative generative experience while maintaining competitive performance.
Data
Data Curation Principles
Before detailing our data pipeline, it is important to establish what constitutes a good data mix for our purpose. A good mix does not consist solely of “high quality” images. Diversity and broad domain coverage are essential given our objective of building an expressive, stylistically diverse model. We argue that conventional model-based filtering, which uses aesthetic-score and image-quality-assessment (IQA) models, introduces implicit biases. For example, such methods may classify a blurry image as low quality, even though motion blur or softness can be a deliberate artistic choice.
Furthermore, we argue that as long as a caption accurately describes its image, even an undesirable image may be helpful in downstream use cases: because the model precisely understands the undesired behavior, such samples can later be used to steer generations away from that distribution.
For these reasons, we build the pretraining dataset by filtering out only:
Duplicated samples and over-represented concepts.
Samples for which VLMs consistently fail to capture important aspects of the image.
Samples that induce undesired biases and artifacts.
Samples with high visual complexity that is too difficult to model reliably at low resolution.
AI-generated samples.
These conditions shape a pretraining dataset with broad coverage while avoiding poor text-to-image alignment and artifacts.
Importantly, we use no AI-generated images in our pretraining mix. Synthetic data and distillation can be an effective shortcut for acquiring model capabilities. However, we find that even a small proportion of AI-generated images introduces biases into the model’s output distribution, as synthetic images tend to be easier to learn, which effectively imposes an upper bound on model quality. We therefore designed in-house classifiers to filter such images out.
Captioning
We employ a multi-stage approach to produce captions. First, we run an OCR model on each target image to extract any visible text. In the second stage, we provide both the OCR results and any available metadata (camera settings, known entities, and so on) to the captioning model, which produces an enriched caption that incorporates world knowledge alongside the extracted text.
General captioning pipeline
Once a context-rich, long-form natural-language caption is obtained, we use a cheaper LLM to reformat it into a variety of lengths and formats, exposing the model to a range of prompt styles. Empirically, we find that training on long prompts provides dense supervision, yielding faster convergence and lower training loss. For many downstream and applied use cases, however, performance on short and medium-length prompts remains important. We therefore train predominantly on long captions while ensuring the model is exposed to short and medium-length prompts throughout training.
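As a minimal sketch of the length mix described above, the snippet below samples one caption variant per training example, favoring long captions while keeping short and medium ones in circulation. The 70/15/15 weights and the names `CAPTION_WEIGHTS` and `pick_caption` are illustrative assumptions, not values or code from the Krea 2 pipeline.

```python
import random

# Hypothetical sampling weights: predominantly long captions, with
# short and medium variants kept in the mix throughout training.
CAPTION_WEIGHTS = {"long": 0.70, "medium": 0.15, "short": 0.15}

def pick_caption(variants, rng):
    """Pick one caption variant for a training sample.

    `variants` maps a length bucket ("long", "medium", "short") to text.
    Buckets missing from `variants` are skipped; `rng.choices` works with
    unnormalized weights, so the remaining ones need no renormalization.
    """
    buckets = [b for b in CAPTION_WEIGHTS if b in variants]
    weights = [CAPTION_WEIGHTS[b] for b in buckets]
    bucket = rng.choices(buckets, weights=weights, k=1)[0]
    return bucket, variants[bucket]

rng = random.Random(0)
variants = {
    "long": "A weathered fishing boat moored at a foggy wooden pier at dawn, "
            "soft diffuse light, muted blue-gray palette, shallow depth of field",
    "medium": "A fishing boat at a foggy pier at dawn",
    "short": "foggy pier",
}
counts = {"long": 0, "medium": 0, "short": 0}
for _ in range(10_000):
    bucket, _text = pick_caption(variants, rng)
    counts[bucket] += 1
```

In this sketch the mix is fixed; a real pipeline could equally vary the weights by training stage.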
Our overall training pipeline and data stages
Pretraining Data
Pretraining data spans 256px, 512px, and 1024px resolution stages. Progressively scaling the resolution forms a curriculum-learning strategy: we dedicate the majority of FLOPs to the low-resolution stages to build core model capabilities efficiently, then equip the model with high-fidelity generation capabilities as the training resolution increases.
Low-resolution pretraining is the stage at which basic text-image alignment and structure are learned. At this stage the dataset is on the order of billions of images, so we rely heavily on inexpensive CPU-based filters to remove low-quality images. These range from simple broken-file, resolution, and aspect-ratio filters that remove unqualified images, to Laplacian filters that remove images with extreme textures and noise patterns.
As an example, one issue we encountered while pretraining Krea 2 was a tendency for the model to generate flat-color backgrounds and border artifacts. To mitigate this, we used RGB entropy, white/black pixel ratios, custom heuristics, and in-house classifiers to filter out samples that induced this behavior.
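The two simplest signals named above, RGB entropy and the white/black pixel ratio, can be sketched with NumPy as follows. The function names and every threshold here are hypothetical and would need tuning on real data; the custom heuristics and in-house classifiers are not shown.

```python
import numpy as np

def rgb_entropy(image):
    """Mean Shannon entropy (in bits) of the per-channel 8-bit histograms."""
    entropies = []
    for c in range(3):
        values = image[..., c].ravel().astype(np.int64)
        hist = np.bincount(values, minlength=256).astype(np.float64)
        p = hist / hist.sum()
        p = p[p > 0]
        entropies.append(float(-(p * np.log2(p)).sum()))
    return sum(entropies) / 3.0

def white_black_ratio(image, low=8, high=247):
    """Fraction of pixels that are near-black or near-white in all channels."""
    near_black = (image <= low).all(axis=-1)
    near_white = (image >= high).all(axis=-1)
    return float((near_black | near_white).mean())

def looks_flat(image, min_entropy=3.0, max_wb_ratio=0.6):
    # Hypothetical thresholds: low entropy or mostly white/black pixels.
    return rgb_entropy(image) < min_entropy or white_black_ratio(image) > max_wb_ratio

rng = np.random.default_rng(0)
noisy = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # textured image
flat = np.full((64, 64, 3), 255, dtype=np.uint8)                # blank white canvas
```

Here `looks_flat(flat)` flags the blank canvas while `looks_flat(noisy)` keeps the textured image. Both checks run on CPU with no model in the loop.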
When building an in-house classifier, one effective strategy was to use a large VLM with a task-specific system prompt for the filtering task (for example, detecting a specific pattern or artifact) to produce a pseudo-labeled dataset, and then train a small DINOv3- or SigLIP-2-based classifier to run the filter at scale. Any filtering model that requires GPU compute at the low-resolution stage is kept under 1B parameters for efficiency.
For deduplication at the low-resolution stages, we primarily use inexpensive hash-based methods, combining md5, phash, and colorhash to remove duplicate images with minimal compute. We find that the default 8x8 phash does not account for color and has a high false-positive rate; we therefore combine a 12x12 phash with colorhash for more robust deduplication.
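To illustrate why a structure-only hash benefits from a color signal, the sketch below implements a DCT-based perceptual hash at 12x12 and pairs it with a coarse color signature. It is written from scratch in NumPy under stated assumptions: `phash_bits` follows the usual pHash recipe, while `color_bits` is a simplified stand-in for colorhash rather than the algorithm of any particular library, and all names are hypothetical.

```python
import hashlib
import numpy as np

def _dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m

def _block_mean(x, size):
    """Downscale to (size, size, ...) by averaging equal blocks."""
    h, w = x.shape[0] // size, x.shape[1] // size
    x = x[: h * size, : w * size]
    return x.reshape(size, h, size, w, *x.shape[2:]).mean(axis=(1, 3))

def phash_bits(rgb, hash_size=12, highfreq_factor=4):
    """Perceptual hash of the luminance: hash_size**2 bits, color-blind."""
    gray = rgb.astype(np.float64).mean(axis=-1)
    n = hash_size * highfreq_factor
    small = _block_mean(gray, n)
    d = _dct_matrix(n)
    low = (d @ small @ d.T)[:hash_size, :hash_size]  # low-frequency corner
    return tuple((low > np.median(low)).ravel().tolist())

def color_bits(rgb, bins=4):
    """Coarse color signature: quantized mean color of a 2x2 grid of cells."""
    cells = _block_mean(rgb.astype(np.float64), 2)
    return tuple((cells // (256 // bins)).astype(int).ravel().tolist())

def dedup_key(rgb):
    # Two images collide only if both structure and color agree.
    return phash_bits(rgb), color_bits(rgb)

rng = np.random.default_rng(0)
base = rng.integers(0, 60, size=(12, 12, 3))
base[..., 0] += 120                                    # strong red tint
img = base.repeat(8, axis=0).repeat(8, axis=1).astype(np.uint8)
near_dup = (img.astype(int) + 2).astype(np.uint8)      # slightly brighter copy
recolored = img[..., ::-1].copy()                      # same structure, blue tint

md5_equal = hashlib.md5(img.tobytes()).digest() == hashlib.md5(near_dup.tobytes()).digest()
```

In this toy case md5 treats the brightened copy as a different file, the perceptual hash alone cannot distinguish the recolored image from the original, and the combined key separates the recolored image while still matching the near-duplicate.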
As we scale the training resolution, we introduce image-quality and aesthetic filters. Importantly, these quality scores are used only to drop images of extremely poor quality, not to oversample images on the basis of their scores. We additionally use an image-complexity score and text density (from OCR results) to exclude images whose text and content cannot be meaningfully represented at low resolution. We adjust the quality, complexity, and text-density thresholds as training progresses.
Beyond conventional quality filters, we also train a sparse autoencoder (SAE) on SigLIP-2 embeddings computed over a sample of our pretraining corpus. After training the SAE, we use a VLM to annotate each SAE feature based on its top-k activating samples. These annotated features form an unsupervised tagging system in which we extract the predominant SAE features from each image. This tagging system was useful for filtering clear visual artifacts without training an explicit classifier.
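The tagging step can be sketched as follows, assuming a top-k sparse autoencoder (the sparsity mechanism is not specified above, so top-k is an assumption). The weights are random stand-ins, the feature names would in practice come from the VLM annotation pass, and `flagged` is a hypothetical artifact feature.

```python
import numpy as np

def sae_encode(x, w_enc, b_enc, k):
    """Sparse encoder: ReLU activations with all but the k largest zeroed."""
    acts = np.maximum(x @ w_enc + b_enc, 0.0)
    drop = np.argsort(acts, axis=-1)[:, :-k]   # indices of all but the top k
    np.put_along_axis(acts, drop, 0.0, axis=-1)
    return acts

def sae_decode(z, w_dec, b_dec):
    """Linear decoder used for the reconstruction loss during SAE training."""
    return z @ w_dec + b_dec

def tag_images(z, feature_names, min_activation=0.0):
    """Per image, the names of its active SAE features, strongest first."""
    tags = []
    for row in z:
        order = np.argsort(-row)
        tags.append([feature_names[i] for i in order if row[i] > min_activation])
    return tags

rng = np.random.default_rng(0)
d_embed, n_features, k = 16, 64, 4
w_enc = rng.normal(size=(d_embed, n_features)) / np.sqrt(d_embed)
w_dec = rng.normal(size=(n_features, d_embed)) / np.sqrt(n_features)
b_enc, b_dec = np.zeros(n_features), np.zeros(d_embed)
feature_names = [f"feature_{i}" for i in range(n_features)]  # VLM-written in practice

embeddings = rng.normal(size=(8, d_embed))   # stand-in for SigLIP-2 embeddings
z = sae_encode(embeddings, w_enc, b_enc, k)
recon = sae_decode(z, w_dec, b_dec)
tags = tag_images(z, feature_names)

flagged = {"feature_3"}                      # hypothetical artifact feature
keep = [i for i, t in enumerate(tags) if not flagged & set(t)]
```

Filtering then reduces to a set lookup on the tags, which is what makes this usable without training a dedicated classifier per artifact.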
Midtraining Data
Unlike the pretraining stages, midtraining explicitly selects specific image sources known to offer good stylistic coverage and high-quality images for particular visual domains. Whereas pretraining is a bottom-up process that begins from a general pool, midtraining data is curated top-down: the domains and sources are chosen first. Midtraining is a crucial stage that smoothly bridges the general pretraining distribution and the high-quality SFT distribution. To improve the quality of the distribution, we introduce semantic clustering and use retrieval-based strategies to ensure world-knowledge coverage.
Building on the approach in Automatic Data Curation for Self-Supervised Learning, we use FAISS to perform hierarchical k-means clustering, which we then sample so as to retain long-tail visual concepts without wasting compute over-sampling head concepts. After computing the hierarchical clusters, we have a VLM examine the images nearest each cluster centroid in order to name and, where appropriate, flag the cluster. Following human review of the flagged clusters, we dropped several that were low quality or problematic. We remove further redundant data through semantic deduplication, computing the SigLIP similarity between images within each remaining leaf cluster.
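The sampling idea can be illustrated with a single-level simplification: given leaf-cluster assignments, draw items round-robin across clusters so that a large head cluster cannot crowd out small long-tail clusters. This is a sketch under that assumption, not the hierarchical procedure itself, and `balanced_sample` is a hypothetical name.

```python
import random
from collections import defaultdict

def balanced_sample(cluster_of, budget, rng):
    """Sample `budget` items while spreading picks evenly over clusters.

    `cluster_of` maps item id -> leaf-cluster id. Each round takes one
    random item from every non-empty cluster until the budget is met.
    """
    members = defaultdict(list)
    for item, cluster in cluster_of.items():
        members[cluster].append(item)
    pools = list(members.values())
    for pool in pools:
        rng.shuffle(pool)
    picked = []
    while len(picked) < budget and pools:
        for pool in pools:
            if len(picked) == budget:
                break
            picked.append(pool.pop())
        pools = [p for p in pools if p]      # drop exhausted clusters
    return picked

# One head cluster with 1000 items and ten tail clusters with 5 items each.
cluster_of = {f"head_{i}": "head" for i in range(1000)}
for c in range(10):
    for i in range(5):
        cluster_of[f"tail_{c}_{i}"] = f"tail_{c}"

picked = balanced_sample(cluster_of, budget=100, rng=random.Random(0))
n_head = sum(1 for p in picked if p.startswith("head_"))
n_tail = len(picked) - n_head
```

With a budget of 100, uniform sampling would return about 95 head items; here all 50 tail items survive and the head cluster fills the remaining 50 slots.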
An important capability of image generation models is faithfully representing known entities that users may reference simply by name. Some entities, such as sports players or actors, can fall into semantic clusters containing many other entities, which risks their being dropped under straightforward hierarchical sampling. To address this, we ran PageRank over English Wikipedia using Danker and retained the top 90% of articles by rank. We then filtered out all articles describing unrepresentable subjects based on their Wikidata metadata, and for the remaining ~5 million concepts we performed a full-text search across all captions in our dataset to assess coverage. When sampling, we prioritized images whose captions referenced rare concepts. Finally, we repeated this coverage analysis on the resulting sample to confirm that no concepts present in the initial dataset had been dropped entirely.
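A toy version of the coverage check and rarity-aware prioritization might look like the following. The linear regex scan stands in for a real full-text index, the scoring rule (one over the count of the rarest concept in a caption) is an assumption, and the concept list is illustrative.

```python
import re
from collections import Counter

def concept_coverage(captions, concepts):
    """Count how many captions mention each concept (whole-word, case-insensitive)."""
    patterns = {
        c: re.compile(r"\b" + re.escape(c) + r"\b", re.IGNORECASE) for c in concepts
    }
    counts = Counter({c: 0 for c in concepts})
    hits = []
    for caption in captions:
        found = [c for c, p in patterns.items() if p.search(caption)]
        counts.update(found)
        hits.append(found)
    return counts, hits

def rarity_scores(counts, hits):
    """Score each caption by its rarest concept: 1 / count, or 0 if it has none."""
    return [max((1.0 / counts[c] for c in found), default=0.0) for found in hits]

captions = [
    "The Eiffel Tower at sunset",
    "Tourists below the Eiffel Tower",
    "Mount Fuji reflected in a lake",
    "A bowl of fruit",
]
concepts = ["Eiffel Tower", "Mount Fuji", "Golden Gate Bridge"]

counts, hits = concept_coverage(captions, concepts)
scores = rarity_scores(counts, hits)
uncovered = [c for c in concepts if counts[c] == 0]
```

Running the same `concept_coverage` call on the sampled subset and comparing its `uncovered` list with the one from the full pool gives the final confirmation step described above.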
Supervised Finetuning Data
For the supervised finetuning (SFT) stage, we use a small, hand-curated dataset focused on individual visual domains. We find that, once a sufficient volume is reached, the quality of the dataset matters far more than its scale.
Architecture
For our architectural ablations, we found it useful to classify each ablation’s objective into one of the following categories:
Stability: Does it make training more stable? Does it reduce loss and gradient spikes?
Performance: Does it make the model converge faster? If so, does the trend hold over an extended horizon and at higher resolution?
Efficiency: Does it reduce parameter count, FLOPs, memory, or communication requirements without compromising model quality?
Simplicity: Can we make the model simpler without affecting the other categories?
It is worth noting that many of our architectural decisions are guided by their adoption in the LLM space. Choosing an architecture that is well established in the LLM ecosystem allows us to take advantage of existing kernels and optimizations, even for diffusion models.
With these objectives in mind, we begin from the following baseline.
Transformer block
We begin by replacing the GeLU MLP with SwiGLU layers at a 4x expansion factor, which have become a de facto module in LLM architectures. Introducing SwiGLU led to consistent performance gains, so we adopted it across all subsequent ablations.
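For reference, a SwiGLU feed-forward layer can be written in a few lines of NumPy. This is the standard formulation with random weights and a hypothetical width; bias terms and any width rescaling are omitted.

```python
import numpy as np

def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (silu(x @ w_gate) * (x @ w_up)) @ w_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, expansion = 64, 4
d_hidden = d_model * expansion

w_gate = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
w_up = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
w_down = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)

x = rng.normal(size=(10, d_model))           # 10 tokens
y = swiglu_mlp(x, w_gate, w_up, w_down)

swiglu_params = w_gate.size + w_up.size + w_down.size   # 3 * d_model * d_hidden
gelu_params = 2 * d_model * d_hidden                    # up and down only
```

At the same expansion factor the gated variant carries three weight matrices instead of two, so it has 1.5x the MLP parameters of a GeLU MLP.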
Having revised the MLP design, we considered GQA, MLA, and gated sigmoid attention as alternatives to the multi-head attention baseline. We find that GQA introduces minimal degradation while offering improved computational efficiency. We also explored MLA and observed slight gains over GQA, but did not adopt it, as it introduced additional computational overhead. We used MLA with up/down projection for KV compression and without decoupled RoPE, since diffusion is purely prefill and does not use a KV cache at inference.
On top of GQA, we add gated sigmoid attention, following Gated Attention for Large Language Models. Gated sigmoid attention adds very little compute and parameter overhead. While it did not yield significant performance gains, it produced more stable training dynamics, as reflected in the loss and gradient-norm curves throughout training.
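The combination can be sketched as below: query heads are grouped so that each group shares one key/value head, and the attention output is multiplied elementwise by a sigmoid gate computed from the layer input. The gate placement (after attention, before the output projection) is an assumption based on the cited gated-attention formulation, and QK normalization and RoPE are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_gqa(x, wq, wk, wv, wg, wo, n_heads, n_kv_heads):
    """Grouped-query attention with an elementwise sigmoid output gate.

    x: (seq, d_model). Each group of n_heads // n_kv_heads query heads
    shares one key/value head. No causal mask is applied.
    """
    seq, d_model = x.shape
    d_head = d_model // n_heads
    group = n_heads // n_kv_heads
    q = (x @ wq).reshape(seq, n_heads, d_head).transpose(1, 0, 2)      # (H, S, D)
    k = (x @ wk).reshape(seq, n_kv_heads, d_head).transpose(1, 0, 2)   # (Hkv, S, D)
    v = (x @ wv).reshape(seq, n_kv_heads, d_head).transpose(1, 0, 2)
    k = np.repeat(k, group, axis=0)          # share each KV head within its group
    v = np.repeat(v, group, axis=0)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))         # (H, S, S)
    out = (attn @ v).transpose(1, 0, 2).reshape(seq, d_model)
    gate = 1.0 / (1.0 + np.exp(-(x @ wg)))   # sigmoid gate from the layer input
    return (out * gate) @ wo

rng = np.random.default_rng(0)
seq, d_model, n_heads, n_kv_heads = 6, 32, 8, 2
d_kv = (d_model // n_heads) * n_kv_heads

def init(rows, cols):
    return rng.normal(size=(rows, cols)) / np.sqrt(rows)

wq, wg, wo = init(d_model, d_model), init(d_model, d_model), init(d_model, d_model)
wk, wv = init(d_model, d_kv), init(d_model, d_kv)
x = rng.normal(size=(seq, d_model))
y = gated_gqa(x, wq, wk, wv, wg, wo, n_heads, n_kv_heads)
```

With 8 query heads and 2 KV heads, the key and value projections are a quarter the size of the query projection, which is where the efficiency gain comes from; the gate adds one extra `d_model x d_model` projection.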
We also ablate the modality-stream design:
Single-stream design: a standard transformer block in which the attention and MLP weights are shared between text and image tokens.
Dual-stream design: joint attention with separate attention and MLP weights for text and image tokens.
Hybrid-stream design: a mix of the two, using dual-stream blocks for the first third of the network and single-stream blocks for the remaining two-thirds.
We did not observe significant performance differences among the three designs, although the hybrid-stream design slightly outperformed the other two. For the sake of simplicity, we use single-stream blocks in our final architecture.
Timestep conditioning
Many MMDiTs use a per-block MLP to produce scale, shift, and gate factors. These MLP blocks can account for 20–30% of the total parameter count, which we consider excessive for injecting a scalar condition. We therefore replace the per-block MLP with a per-block tunable bias term. This change allows us to allocate more parameters to the attention and MLP layers without sacrificing model performance.
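A back-of-the-envelope count shows where a share of this size can come from. The sketch assumes a single-stream block with full multi-head attention projections, a SwiGLU MLP at 4x expansion, and six modulation vectors per block (scale, shift, and gate for both the attention and MLP branches). The width of 4096 is hypothetical, and biases are ignored.

```python
def block_params(d_model, mlp_expansion=4, adaln_outputs=6):
    """Rough per-block weight counts for a DiT block (biases ignored)."""
    attention = 4 * d_model * d_model                  # Q, K, V, output projections
    mlp = 3 * d_model * (mlp_expansion * d_model)      # SwiGLU: gate, up, down
    adaln_mlp = d_model * (adaln_outputs * d_model)    # per-block modulation MLP
    adaln_bias = adaln_outputs * d_model               # per-block tunable bias
    return attention, mlp, adaln_mlp, adaln_bias

attention, mlp, adaln_mlp, adaln_bias = block_params(d_model=4096)

# Share of block parameters spent on timestep modulation in each design.
share_with_mlp = adaln_mlp / (attention + mlp + adaln_mlp)
share_with_bias = adaln_bias / (attention + mlp + adaln_bias)
```

Under these assumptions the modulation MLP takes 6 of every 22 weight units, about 27% of the block, while the bias variant takes well under 0.01%. With GQA the attention term shrinks, so the MLP-based share would be slightly higher still.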
Beyond AdaLN modulation, we explored two alternatives: (1) removing timestep conditioning entirely, and (2) in-context timestep conditioning via timestep tokens. In our low-resolution pretraining runs, removing timestep information entirely consistently underperformed the AdaLN baseline. For in-context conditioning, we create time embeddings using sinusoidal embeddings, concatenate them into a unified text + image + time sequence, and remove the AdaLN layers entirely. At 256px pretraining, 4–16 timestep tokens were sufficient to replace AdaLN. At 512px and 1024px, however, in-context conditioning performed poorly relative to the AdaLN baseline. We attempted to mitigate this by increasing the number of timestep tokens, but observed diminishing returns and could not achieve competitive performance at higher resolutions.
Positional encoding
We implemented several RoPE schemes for our ablations. We use 3D axial RoPE, with head dimensions dedicated to frame, height, and width. For text tokens, we set the RoPE indices to zero. At low resolution, we did not observe significant gains from switching to Golden Gate RoPE, MRoPE, normalized RoPE, or partial RoPE. For partial RoPE, we rotate only the first half of the head dimension and leave the remainder unrotated. As expected, partial RoPE produced better zero-shot inference results when scaling the model from 256px to 512px and did not suffer from the common duplication artifacts. Despite this initial resolution generalization, partial RoPE ultimately performed worse than the baseline RoPE setting as high-resolution training continued.
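Partial RoPE as described, rotating only the first half of the head dimension, can be sketched in one dimension as follows. The 3D axial split across frame, height, and width is left out for brevity, and the half-split pairing and base of 10000 are conventional choices assumed here.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotate feature pairs of x (seq, dim) by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.asarray(positions, dtype=np.float64)[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def partial_rope(x, positions, rotary_fraction=0.5):
    """Apply RoPE to the first `rotary_fraction` of the head dim only."""
    rot_dim = int(x.shape[-1] * rotary_fraction)   # must be even
    rotated = rope_rotate(x[..., :rot_dim], positions)
    return np.concatenate([rotated, x[..., rot_dim:]], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                  # 5 tokens, head dim 8
image_tokens = partial_rope(x, np.arange(5))
text_tokens = partial_rope(x, np.zeros(5))   # index zero for text tokens
```

Two properties are visible here: the unrotated half of each vector passes through untouched, and tokens with index zero (the text-token convention above) receive an identity rotation.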
Autoencoder
Recent work suggests that the latent-space design of the autoencoder can significantly accelerate the training of image generation models. We start from the FLUX.1-dev autoencoder as a baseline and benchmark it against the Qwen Image VAE, DC-AE, FLUX 2 VAE, and our internal autoencoder. We initially tested the DC-AE series, as it offers up to 32x spatial compression, which can substantially benefit both training and inference efficiency. However, we found that DC-AE imposes a hard upper limit on the diffusion model’s ability to resolve fine detail, owing to its reconstruction error.
By contrast, the Qwen Image VAE and FLUX 2 VAE offer a latent space with significantly faster convergence across our pretraining ablations while maintaining excellent reconstruction quality. We therefore initially used the Qwen Image autoencoder to scale our early models and later adopted the FLUX 2 VAE for our larger models. We also briefly explored training an internal autoencoder using DINOv3 for semantic alignment together with a light diffusion loss, following an approach similar to REPA-E. We validated that it performs competitively with the Qwen Image autoencoder, but owing to time constraints we opted for the Qwen Image and FLUX 2 VAEs, which have been validated at scale.
Residual design
We use standard residual connections as our default. We briefly experimented with Laurel, which improves the expressivity of the residual connection by adding a low-rank bottleneck branch, but observed no noticeable improvement. For future models, we intend to explore alternatives such as NOBLE, delta attention residuals, and mHC to improve the residual design of diffusion transformers.
Normalization
RMSNorm has become a standard component of LLM architectures but has not been fully integrated into recent diffusion transformer architectures. Starting from a LayerNorm baseline, we replaced all normalization layers with RMSNorm and observed very little quality degradation. We therefore use RMSNorm as the default normalization module (for example, for prenorm and QKNorm). We use the zero-centered RMSNorm and apply weight decay to its learnable parameters. We also experimented with more efficient variants such as Derf, but found non-negligible quality degradation.
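The zero-centered variant is commonly implemented by storing the learnable scale as an offset from one, so that the effective gain is `1 + weight`. The sketch below assumes that parameterization, under which weight decay pulls the gain toward 1 rather than toward 0.

```python
import numpy as np

def zero_centered_rmsnorm(x, weight, eps=1e-6):
    """RMSNorm whose learnable scale is stored as an offset from 1.

    The effective gain is (1 + weight), so weight = 0 is the identity
    gain and decaying `weight` shrinks the gain toward 1, not toward 0.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * (1.0 + weight)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16)) * 5.0
weight = np.zeros(16)                        # zero-initialized, i.e. unit gain

y = zero_centered_rmsnorm(x, weight)
y_rms = np.sqrt(np.mean(y * y, axis=-1))     # close to 1 for every row
```

Unlike LayerNorm, no mean is subtracted and no bias is added, which is what makes the layer cheaper.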
Text encoder
We used T5-XXL as our baseline text encoder. From the outset, we deliberately chose to keep the architecture simple and use a single text encoder. Notably, we find that T5-XXL remains a very competitive text encoder relative to T5Gemma, umT5, Qwen 2.5 VL, and Qwen 3 VL. Ultimately, we use Qwen 3 VL as our final text encoder, as a VLM offers a richer input space (text and image) and stronger multilingual generalization.
Furthermore, inspired by Unifusion, rather than taking the last layer of the VLM features, we introduce a shallow attention layer that aggregates hidden features across layers. This design allows the model to dynamically select coarse-to-fine text representations. The last-layer features of an autoregressive LLM are suboptimal for our purpose, as they are optimized for next-token prediction rather than image generation. Alongside this layerwise feature aggregation, we add lightweight bidirectional transformer layers across the token axis to reduce the autoregressive bias in the representation space.
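One possible reading of the layerwise aggregation is attention pooling over the layer axis, with a learned query scoring each layer's feature for every token. The sketch below follows that reading only; the actual layer design, the bidirectional refinement layers, and all names here are assumptions.

```python
import numpy as np

def aggregate_layers(hidden, query, w_key):
    """Attention-pool per-layer text features into one feature per token.

    hidden: (n_layers, n_tokens, d) hidden states from the text encoder.
    query:  (d_k,) learned query; w_key: (d, d_k) learned key projection.
    Returns pooled features (n_tokens, d) and weights (n_layers, n_tokens).
    """
    keys = hidden @ w_key                                  # (L, T, d_k)
    scores = keys @ query / np.sqrt(query.shape[0])        # (L, T)
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True) # softmax over layers
    pooled = (weights[..., None] * hidden).sum(axis=0)     # (T, d)
    return pooled, weights

rng = np.random.default_rng(0)
n_layers, n_tokens, d, d_k = 6, 5, 16, 8
hidden = rng.normal(size=(n_layers, n_tokens, d))
w_key = rng.normal(size=(d, d_k)) / np.sqrt(d)
query = rng.normal(size=d_k)

pooled, weights = aggregate_layers(hidden, query, w_key)
uniform, _ = aggregate_layers(hidden, np.zeros(d_k), w_key)  # zero query
```

Because the weights are computed per token, different tokens can draw on different depths of the encoder. A zero query reduces the operation to a plain mean over layers.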
Optimization
We use AdamW as our primary optimizer throughout the pipeline. We initially saw mixed results applying Muon to the MMDiT architecture. By default, we use the Muon implementation from Dion and the RMS-matched setting from Moonlight to transfer AdamW hyperparameters.
In our exploration, Muon converged faster than AdamW in the initial steps but underperformed it over longer horizons. We also encountered a number of stability issues with Muon, including frequent loss and gradient-norm spikes throughout training. We found it crucial to exclude the first and last linear layers of the MMDiT from the Muon parameters; this is consistent with the LLM literature, where embedding and LM-head parameters are excluded from Muon. After excluding these layers and adding Nesterov momentum, Muon consistently outperformed the AdamW baseline at both low and high resolution. We did not adopt Muon for our most recent pretraining run owing to time constraints, but given these strong results we plan to adopt it in our next pretraining cycle.
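For readers unfamiliar with Muon, the core update for one 2D weight matrix can be sketched as below: Nesterov momentum, approximate orthogonalization by a quintic Newton-Schulz iteration, then a scale factor of `0.2 * sqrt(max(rows, cols))` so that the update RMS is comparable to AdamW's. The coefficients are those of the public reference Muon implementation and the scale is the Moonlight RMS-matching rule; this is not the Dion implementation itself, and learning rate and weight decay are omitted.

```python
import numpy as np

def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + eps)        # Frobenius normalization
    transposed = x.shape[0] > x.shape[1]
    if transposed:                           # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_update(grad, momentum, beta=0.95, nesterov=True):
    """One Muon direction for a 2D weight, with RMS-matched scaling."""
    momentum = beta * momentum + grad
    direction = grad + beta * momentum if nesterov else momentum
    ortho = newton_schulz(direction)
    scale = 0.2 * np.sqrt(max(grad.shape))
    return scale * ortho, momentum

rng = np.random.default_rng(0)
g = rng.normal(size=(16, 32))
update, buf = muon_update(g, np.zeros_like(g))

singular_values = np.linalg.svd(newton_schulz(g), compute_uv=False)
update_rms = float(np.sqrt(np.mean(update ** 2)))
```

The iteration pushes all singular values toward a band around 1 rather than exactly to 1. In a full optimizer, only hidden 2D weights would go through `muon_update`, with the first and last linear layers left to AdamW as described above.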
Training
Our training pipeline follows a multi-stage structure inspired by modern LLM training pipelines.
Pretraining
Pretraining establishes the model’s basic capabilities, including text-image alignment, text rendering, stylistic coverage, and structural consistency. We progressively scale the resolution from 256px to 512px to 1024px. For our final model, we train with the standard rectified-flow loss under v-parameterization. To accelerate the early stages, we use iREPA for the first epoch of the 256px stage and then remove it, which encourages the MMDiT to learn its own representations while substantially speeding up initial convergence. We also explored alternative acceleration strategies such as TREAD, but saw little benefit.
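The training objective can be sketched as follows, reading "v-parameterization" as velocity prediction on the straight path between data and noise. The time convention (t = 0 is data, t = 1 is noise) is an assumption, `model` is any callable taking `(x_t, t)`, and the iREPA auxiliary term is not shown.

```python
import numpy as np

def rectified_flow_loss(model, x0, t, noise):
    """MSE between predicted and target velocity on the path x0 -> noise."""
    t_b = t.reshape(-1, *([1] * (x0.ndim - 1)))   # broadcast t over latent dims
    x_t = (1.0 - t_b) * x0 + t_b * noise          # linear interpolation
    target_v = noise - x0                         # constant velocity of the path
    pred_v = model(x_t, t)
    return float(np.mean((pred_v - target_v) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 3, 8, 8))                # stand-in for clean latents
noise = rng.normal(size=x0.shape)
t = rng.uniform(size=4)

def oracle(x_t, t):
    return noise - x0                             # knows the true velocity

def zero_model(x_t, t):
    return np.zeros_like(x_t)

loss_oracle = rectified_flow_loss(oracle, x0, t, noise)
loss_zero = rectified_flow_loss(zero_model, x0, t, noise)
```

The oracle reaches zero loss, while the zero predictor pays roughly the variance of `noise - x0`.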
During the 256px and 512px stages, we use 8-bit training and observe 15–20% gains in training speed over a bf16 baseline, with minimal degradation in training loss and evaluation metrics. At 256px we use 8-bit training with tensorwise scaling, and at 512px we use finer-grained rowwise scaling. From 1024px onward, and through the final RL stage, we use standard bf16 training.
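The difference between tensorwise and rowwise scaling can be illustrated with a simulated quantization round-trip. The sketch uses a symmetric integer grid with 127 levels as a stand-in, since the 8-bit format is not specified above; only the scaling granularity is the point.

```python
import numpy as np

def fake_quant(x, scale, qmax=127):
    """Symmetric 8-bit round-trip: scale, round, clip, rescale."""
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def tensorwise(x, qmax=127):
    """One scale for the whole tensor (assumes x is not all zeros)."""
    return fake_quant(x, np.abs(x).max() / qmax, qmax)

def rowwise(x, qmax=127):
    """One scale per row (assumes no all-zero rows)."""
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    return fake_quant(x, scale, qmax)

rng = np.random.default_rng(0)
# Two rows whose magnitudes differ by 100x, as can happen with outlier rows.
x = rng.normal(size=(2, 1024)) * np.array([[1.0], [0.01]])

err_tensor = float(np.abs(tensorwise(x) - x).mean())
err_row = float(np.abs(rowwise(x) - x).mean())
```

With a single shared scale the small-magnitude row is mostly rounded to zero, whereas per-row scales preserve it, at the cost of storing and applying one scale per row.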
Another important aspect of high-resolution pretraining is adapting the resolution-dependent timeshift schedule. We use a shifted logit-normal sampling schedule for both training and inference, and gradually increase the shift as resolution increases. Following the FLUX 2 VAE blog post, we sweep for the optimal training timeshift at each resolution. We sweep the shift only for training and keep the inference shift schedule constant, as certain autoencoders are less sensitive to the inference timeshift.
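A shifted logit-normal sampler can be sketched as below, using the widely used shift map `t' = s * t / (1 + (s - 1) * t)`. The convention that larger `t` means more noise, and the shift values per resolution, are assumptions for illustration rather than the swept values.

```python
import math
import random

def sample_timestep(rng, shift=1.0, mean=0.0, std=1.0):
    """Draw t in (0, 1) from a logit-normal, then apply a timeshift."""
    u = rng.gauss(mean, std)
    t = 1.0 / (1.0 + math.exp(-u))                 # logit-normal sample
    return shift * t / (1.0 + (shift - 1.0) * t)   # shift > 1 favors noisier t

# Hypothetical shifts: larger at higher resolution.
SHIFT_BY_RESOLUTION = {256: 1.0, 512: 2.0, 1024: 3.0}

rng = random.Random(0)
samples = {
    res: [sample_timestep(rng, shift=s) for _ in range(20_000)]
    for res, s in SHIFT_BY_RESOLUTION.items()
}
means = {res: sum(ts) / len(ts) for res, ts in samples.items()}
```

With shift 1 the samples center on 0.5; increasing the shift moves probability mass toward high-noise timesteps, which is the usual motivation for raising it with resolution.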
During pretraining, we use a warmup-stable-decay learning-rate schedule and apply PMA following Model Merging in Pre-training of Large Language Models. We validate that PMA achieves performance comparable to EMA while avoiding its significant memory overhead. We do not observe significant differences between merging methods, although tuning the number of merged checkpoints and the merge interval can yield slight gains on downstream metrics.
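Both ingredients are simple to state in code. The sketch below assumes linear warmup and linear decay for the warmup-stable-decay schedule, and uniform averaging of saved checkpoints for the merge; the decay shape, step counts, and merge weights are all assumptions.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-stable-decay schedule with linear warmup and linear decay."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr                                # stable plateau
    frac = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * frac

def average_checkpoints(checkpoints):
    """Uniform average of parameter dicts of the form {name: [floats]}."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }

schedule = [wsd_lr(s, 1000, 1e-3, 100, 200) for s in range(1001)]

ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 4.0], "b": [1.0]}
merged = average_checkpoints([ckpt_a, ckpt_b])
```

Unlike EMA, which keeps a second copy of the weights in memory at every step, this merge runs offline on checkpoints that are saved anyway.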
Midtraining
Midtraining has become common in the LLM literature, and we incorporate an analogous stage into our pipeline. Its focus is to warm up the model’s distribution before the supervised finetuning (SFT) stage. We find that midtraining is typically the last point in the pipeline at which we can equip the model with downstream capabilities such as high-fidelity, high-resolution generation, strong domain coverage, and text rendering.
Supervised finetuning (SFT)
In the supervised finetuning (SFT) stage, we curate a small, dedicated set of highly aesthetic images. The objective is to further bias the model toward aesthetically desirable directions. We find this stage particularly helpful for improving overall checkpoint quality and for addressing the high-saturation and texture issues that are prevalent in earlier checkpoints.
After training domain-specific SFT checkpoints, we use model merging to produce a generalist SFT checkpoint. Model merging yields diminishing returns toward the later stages of the pipeline, as the directions of improvement begin to conflict across checkpoints.
Preference optimization (PO)
Preference optimization (PO) is the first stage of our post-training stack and consists of a two-stage pipeline. In the first stage, we run a large-scale synthetic preference-pair generation pipeline for initial refinement, using a strategy similar to delta learning; we ensure that the majority of pairs include at least one on-policy sample. The second stage is a calibration stage that uses only human annotations. These annotations are collected entirely in house, by people familiar with the specific strengths, weaknesses, and quirks of the model.
A common phenomenon during PO is policy divergence. At a high level, preference-optimization methods such as DPO encourage the model to increase the margin between its likelihood of generating a preferred sample and that of generating a dispreferred one, relative to the reference model. In practice, across different preference-dataset mixtures, we observe that the model achieves this objective by decreasing the likelihood of generating both samples, but at different rates. This would be desirable if both the winning and losing samples were of lower quality than the current model distribution, but that assumption does not always hold, depending on how the preference set was curated. Moreover, this divergence drifts the model away from the general pretraining distribution, which manifests as high-frequency artifacts in the later stages of training. To mitigate this, we designed a variant of DPO, which we call STPO, that adds an auxiliary loss and a modification to the original DPO formulation in order to reduce this divergence.
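The divergence is easy to reproduce numerically with the standard DPO loss. In the sketch below the log-likelihood values are made up for illustration; for diffusion models these quantities are typically replaced by denoising-error surrogates. STPO itself is not sketched here, since its formulation is not given above.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return math.log1p(math.exp(-beta * margin))   # equals -log(sigmoid(beta * margin))

ref_w, ref_l = -10.0, -10.0

# Policy at initialization: identical to the reference model.
loss_ref = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# Intended behavior: preferred sample up, dispreferred sample down.
loss_intended = dpo_loss(-9.0, -11.0, ref_w, ref_l)

# Divergent behavior: both likelihoods fall, the dispreferred one faster.
loss_divergent = dpo_loss(-12.0, -20.0, ref_w, ref_l)
```

The divergent policy achieves the lowest loss of the three even though it is less likely than the reference to generate the preferred sample, which is why the loss alone cannot detect this failure mode.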
Reinforcement learning (RL)
Reinforcement learning (RL) is the final stage of the training pipeline. We use a multi-reward GRPO-style method with several reward models: (1) a general aesthetic model, (2) a prompt-following reward, (3) a text-rendering reward, and (4) an artifact and structure reward. The general aesthetic model is obtained by finetuning an open-source VLM on the preference data collected during the PO stage. We carefully design the reward structure and tune the data mixture to prevent artifacts introduced by reward hacking.
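In GRPO-style training, each sample's advantage is its reward standardized within the group of samples generated for the same prompt. The sketch below combines the four rewards by a weighted sum before standardizing; the combination rule, the equal weights, and the reward values are assumptions for illustration.

```python
import numpy as np

def group_advantages(rewards, weights, eps=1e-6):
    """Group-relative advantages for samples of ONE prompt.

    rewards: (group_size, n_rewards); weights: (n_rewards,).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    total = rewards @ np.asarray(weights, dtype=np.float64)
    return (total - total.mean()) / (total.std() + eps)

# Columns: aesthetic, prompt following, text rendering, artifact/structure.
rewards = [
    [0.8, 0.9, 1.0, 1.0],   # strong on every axis
    [0.9, 0.4, 1.0, 0.2],   # attractive but malformed and off-prompt
    [0.5, 0.7, 0.0, 1.0],   # clean structure, failed text rendering
    [0.6, 0.6, 1.0, 0.9],
]
weights = [0.25, 0.25, 0.25, 0.25]          # hypothetical equal weighting

adv = group_advantages(rewards, weights)
flat = group_advantages([[0.5] * 4] * 4, weights)   # identical samples
```

A group whose samples all score the same yields zero advantages and therefore no gradient signal.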
Unlike general aesthetic rewards, which are inherently subjective, prompt following and text rendering provide more concrete signals because they can be checked against the user’s stated intent. The challenge is that this intent varies widely across prompts. To handle this, we use a prompt-specific rubric reward inspired by rubric-based evaluation in LLM training. Instead of asking a judge model for a single holistic score, we decompose each prompt into verifiable requirements and evaluate the generated image against them. This gives the RL stage a more structured signal for alignment with user intent, making the model better at satisfying fine-grained prompt constraints without reducing prompt following to generic image quality.
We also found that optimizing only for aesthetics and prompt following can lead to reward hacking. The model may learn to produce images that appear plausible at first glance while containing structural artifacts such as extra fingers, malformed limbs, or distorted text. These failures are visually obvious to humans but are often missed by general-purpose VLM judges. To address this, we train a dedicated artifact reward model that detects these structural errors and discourages the RL stage from improving benchmark-facing signals at the expense of visual correctness.
During the RL stage, we find that success depends not only on the quality of the reward models, but also on how efficiently training compute is allocated across prompts. Reward models define the direction of improvement, while the prompt pool determines where the model receives useful learning signal. We therefore curate a broad pool of prompts spanning diverse styles, concepts, settings, and subjects, then continuously analyze the reward statistics of generated groups to identify which prompts are most informative. Prompts that are already too easy, consistently too hard, or produce little variance across samples contribute limited signal and are deprioritized or removed. In practice, effective RL requires treating prompt selection as a resource-allocation problem, where the training process should spend more compute on examples where the model can still learn, and less on examples that provide saturated or noisy feedback.
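A simple version of this triage can be written directly from group reward statistics. The sketch assumes rewards normalized to [0, 1]; the thresholds and the function name are hypothetical.

```python
import statistics

def prompt_is_informative(group_rewards, low=0.1, high=0.9, min_std=0.05):
    """Keep prompts that are neither saturated nor hopeless and still show spread."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return low < mean < high and std >= min_std

prompt_pool = {
    "too_easy": [0.95, 0.97, 0.96, 0.98],   # already solved
    "too_hard": [0.02, 0.01, 0.03, 0.00],   # no sample succeeds
    "no_spread": [0.50, 0.50, 0.50, 0.50],  # zero variance, zero advantage
    "useful": [0.20, 0.80, 0.50, 0.60],     # some samples succeed, some fail
}
selected = [name for name, r in prompt_pool.items() if prompt_is_informative(r)]
```

Only the prompt with mixed outcomes survives. Re-evaluating the pool periodically lets prompts move in and out as the policy improves.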
Another practical consideration in diffusion RL is how to handle classifier-free guidance (CFG). Both rollout generation and training can be performed with or without CFG, and different choices create different trade-offs between alignment, stability, and efficiency. After ablations, we found it important to keep the rollout and training distributions aligned while avoiding unnecessary computational overhead. We therefore train the whole RL stage without CFG. This setting quickly improves the conditional model distribution, bringing no-CFG samples much closer to guided samples early in training. At inference time, CFG can still be enabled as an additional control knob, further improving quality when desired.
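For reference, classifier-free guidance combines a conditional and an unconditional prediction as shown below. The sketch uses the standard extrapolation formula with random arrays standing in for model outputs.

```python
import numpy as np

def guided_prediction(cond, uncond, scale):
    """Classifier-free guidance: extrapolate from uncond toward cond."""
    return uncond + scale * (cond - uncond)

rng = np.random.default_rng(0)
cond = rng.normal(size=(4, 8))      # stand-in for the conditional prediction
uncond = rng.normal(size=(4, 8))    # stand-in for the unconditional prediction

no_cfg = guided_prediction(cond, uncond, scale=1.0)   # equals cond
guided = guided_prediction(cond, uncond, scale=4.0)   # pushed past cond
```

At scale 1 the result is exactly the conditional prediction, so the unconditional forward pass can be skipped entirely. Running without CFG therefore roughly halves the model evaluations per denoising step for both rollouts and training.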
Timestep distillation
After the RL stage, we include an optional timestep-distillation stage in which we apply guidance distillation and timestep distillation simultaneously. We considered several distillation techniques, including DMD, DMD2, Decoupled DMD, piFlow, and APT, but adopted Trajectory Distribution Matching (TDM) for the following reasons. We sought a technique that was simple to tune, with minimal hyperparameters, which ruled out GAN-based methods and piFlow (the latter requires adapting the model into a multi-timestep prediction model). We chose TDM because it provides a fast, data-free method with flexible multistep distillation.
DMD distills the teacher by matching the distributions of real and generated samples over the clean-image distribution. Accordingly, standard DMD uses a few-step student to predict a clean image and then renoises the prediction to train the student (see figure above). Unlike DMD, which matches only the clean-image distribution, TDM applies DMD across timesteps, effectively performing distribution matching at the trajectory level rather than at the sample level. Since our goal was a flexible multistep student, we found TDM to be the most suitable method for our use case.
Prompt Expansion
Dense prompts reliably produce better image-generation results, but users rarely write prompts that resemble the rich captions used during training. We frame this as a distribution-mapping problem: the image model is best conditioned on detailed captions that lie close to its training distribution, while real user prompts are often short, conversational, and underspecified. We therefore develop a prompt expander that interprets user intent and maps an input prompt into a richer, model-friendly caption.