Commit 0fff459

Fix ErnieImagePipeline pre-computed prompt_embeds + num_images_per_prompt shape mismatch (#13532)
Fix ErnieImagePipeline pre-computed prompt_embeds + num_images_per_prompt

When a user passes pre-computed `prompt_embeds` (or `negative_prompt_embeds`) alongside `num_images_per_prompt > 1`, `ErnieImagePipeline.__call__` did not replicate the provided embeddings: the embeds list kept its original length (one per prompt) while the latents were allocated with `total_batch_size = batch_size * num_images_per_prompt`:

```python
text_hiddens = prompt_embeds                      # length = batch_size (NOT replicated)
...
latents = randn_tensor((total_batch_size, ...))   # batch * N in shape
```

In the denoise loop, `text_bth.shape[0]` then mismatches `latent_model_input.shape[0]`, so the transformer call:

```python
pred = self.transformer(
    hidden_states=latent_model_input,  # (batch*N*2, ...) under CFG
    text_bth=text_bth,                 # (batch*2, ...)
    ...
)
```

fails with a shape mismatch inside the attention block. The standard "pre-compute embeds once, generate N variants" usage pattern is broken.

`encode_prompt` already performs this replication internally (`for _ in range(num_images_per_prompt): text_hiddens.append(hidden)` at lines 158-160), so the non-embed path is unaffected; this only impacts callers of the documented `prompt_embeds` / `negative_prompt_embeds` arguments.

Mirror the replication logic in the pre-embed branches so both paths yield a `text_hiddens` list of length `batch_size * num_images_per_prompt`.
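The replication the fix mirrors can be shown in isolation. A minimal sketch, assuming plain Python strings stand in for embedding tensors (the helper name `replicate_embeds` is hypothetical; the list comprehension is the one from the patch):

```python
def replicate_embeds(prompt_embeds, num_images_per_prompt):
    # Mirror encode_prompt's internal loop: each prompt's embedding is
    # repeated num_images_per_prompt times, in prompt-major order
    # (all copies of prompt 0, then all copies of prompt 1, ...).
    return [h for h in prompt_embeds for _ in range(num_images_per_prompt)]

# Two prompts, three variants each -> list of length 2 * 3 = 6.
embeds = ["embed_a", "embed_b"]
replicated = replicate_embeds(embeds, 3)
print(replicated)  # ['embed_a', 'embed_a', 'embed_a', 'embed_b', 'embed_b', 'embed_b']
```

The prompt-major ordering matters: it matches what `encode_prompt` produces, so downstream code that pairs `text_hiddens[i]` with `latents[i]` behaves identically on both paths.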
1 parent 2173c55 commit 0fff459

1 file changed

Lines changed: 2 additions & 2 deletions

File tree

src/diffusers/pipelines/ernie_image/pipeline_ernie_image.py

```diff
@@ -286,14 +286,14 @@ def __call__(

         # [Phase 2] Text encoding
         if prompt_embeds is not None:
-            text_hiddens = prompt_embeds
+            text_hiddens = [h for h in prompt_embeds for _ in range(num_images_per_prompt)]
         else:
             text_hiddens = self.encode_prompt(prompt, device, num_images_per_prompt)

         # CFG with negative prompt
         if self.do_classifier_free_guidance:
             if negative_prompt_embeds is not None:
-                uncond_text_hiddens = negative_prompt_embeds
+                uncond_text_hiddens = [h for h in negative_prompt_embeds for _ in range(num_images_per_prompt)]
             else:
                 uncond_text_hiddens = self.encode_prompt(negative_prompt, device, num_images_per_prompt)
```
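The pre-fix failure mode can be sketched with list lengths standing in for the batch dimension (all variable names here are illustrative, following the commit message's shapes):

```python
batch_size = 2
num_images_per_prompt = 4
total_batch_size = batch_size * num_images_per_prompt  # latents are allocated with this

prompt_embeds = [f"embed_{i}" for i in range(batch_size)]

# Before the fix: the pre-computed embeds passed through unchanged,
# so the text batch (2) never matched the latent batch (8).
text_hiddens_old = prompt_embeds
assert len(text_hiddens_old) != total_batch_size  # 2 vs 8 -> shape error in attention

# After the fix: replicated to match the latents.
text_hiddens_new = [h for h in prompt_embeds for _ in range(num_images_per_prompt)]
assert len(text_hiddens_new) == total_batch_size  # 8 == 8
```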
