[wan2.2] follow-up #12024
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hey, with both of them being accessed in one loop, would enable_model_cpu_offload still work? Like, when it hits the transformer_2, could it offload transformer? Cheers,
@JoeGaffney interesting question, I was trying to debug a similar issue. For wan2.2 i2v the execution path differs from the offload sequence: for me it causes the text_encoder to stay on GPU after encoding the initial image. Changing the sequence is also problematic, since it currently supports only a single position per component. I know this deserves its own issue; I'm still collecting examples. Coming back to your comment, I think it's a valid argument that the behavior depends on how the offload sequence is defined.
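For reference, a simplified sketch of the mechanism diffusers builds `enable_model_cpu_offload` on (accelerate's `cpu_offload_with_hook`). The component order below is an assumption for illustration, and `pipe` is assumed to be the loaded Wan pipeline; the real order comes from the pipeline's offload sequence:

```python
# Simplified sketch: each component is moved to the GPU when its forward runs,
# and the hook chained from the *previous* component offloads that one back to
# CPU. So when transformer_2 first executes, transformer gets offloaded.
from accelerate import cpu_offload_with_hook

hook = None
for model in (pipe.text_encoder, pipe.transformer, pipe.transformer_2, pipe.vae):
    _, hook = cpu_offload_with_hook(model, execution_device="cuda", prev_module_hook=hook)
```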
Hey @okaris Also, I think referencing something like this in the example:

```python
image_processor = ModularPipeline.from_pretrained("YiYiXu/WanImageProcessor14B", trust_remote_code=True)
image = image_processor(
    image="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG",
    output="processed_image"
)
```

...feels a bit fragile. It relies on remote, opaque behavior. I get that it's convenient, but it's not ideal for production integration or debugging. It would be great to also provide a minimal example that shows how to prepare inputs manually. Cheers,
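Something along these lines, perhaps: a minimal sketch of manual input preparation. This is an assumption of what the remote processor does (resize the image so both dimensions are multiples of 16, keeping the aspect ratio and a target pixel budget; the `mod_value` of 16 and `max_area` are assumed values for the 14B model):

```python
from diffusers.utils import load_image

image = load_image(
    "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG"
)

# Target pixel budget and rounding unit (assumed: VAE spatial scale factor
# times the transformer patch size gives 16 for the 14B model).
max_area = 480 * 832
mod_value = 16

aspect_ratio = image.height / image.width
height = round((max_area * aspect_ratio) ** 0.5) // mod_value * mod_value
width = round((max_area / aspect_ratio) ** 0.5) // mod_value * mod_value
image = image.resize((width, height))
```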
Some have mentioned that during their testing the first stage is relatively useless. Has anyone on the team actually done enough testing on WAN 2.2 to determine which approach works best?
So yes, text_encoder will stay on GPU after encoding the initial image, but it will get offloaded when the next model in the sequence is loaded. vae would stay on GPU until it's used again, including the time when the transformers are loaded, but it's relatively small, so it doesn't make that much difference. We could force-offload vae if it's needed, though.
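A force-offload could look something like this (a hypothetical workaround, assuming `pipe` is the Wan i2v pipeline with model offloading enabled):

```python
import torch

# Push the VAE back to CPU right after the initial image is encoded, so it
# doesn't hold GPU memory while the transformers run.
pipe.vae.to("cpu")
torch.cuda.empty_cache()
```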
* support lightx2v lora in wan
* add docs
* reviewer feedback
* empty
- To use only the high-noise stage (`transformer`), set `boundary_ratio` to 0.
- To use only the low-noise stage (`transformer_2`), set `boundary_ratio` to 1.
Two-Stage Denoising Loop

[Diagram: side-by-side comparison of `boundary_ratio = 0.9` (90%, two-stage) and `boundary_ratio = 1.0` (100%, single stage).]

Stage Breakdown

| `boundary_ratio` | Timestep condition | Active model | Guidance |
| --- | --- | --- | --- |
| 0.9 | t >= 900 | `transformer` | `guidance_scale` |
| 0.9 | t < 900 | `transformer_2` | `guidance_scale_2` |
| 1.0 | t >= 1000 (never true) | — | — |
| 1.0 | t < 1000 (always true) | `transformer_2` | `guidance_scale_2` |
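The per-timestep selection the table describes can be sketched as a small function (illustrative only; the names mirror the pipeline arguments, not the actual implementation inside the denoising loop):

```python
def select_expert(t: float, boundary_ratio: float, num_train_timesteps: int = 1000):
    """Return which transformer and guidance scale apply at timestep t."""
    boundary = boundary_ratio * num_train_timesteps  # e.g. 0.9 * 1000 = 900
    if t >= boundary:
        return "transformer", "guidance_scale"        # high-noise expert
    return "transformer_2", "guidance_scale_2"        # low-noise expert

# Reproduces the table: with boundary_ratio=1.0 the high-noise branch never fires.
assert select_expert(950, 0.9) == ("transformer", "guidance_scale")
assert select_expert(899, 0.9) == ("transformer_2", "guidance_scale_2")
assert select_expert(999, 1.0) == ("transformer_2", "guidance_scale_2")
```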