
[Stable Diffusion] Wrap the UNet under a for-loop in the ONNX graph #668


Draft · wants to merge 1 commit into main

Conversation

@michaelbenayoun (Member) commented Jan 4, 2023

What does this PR do?

This PR adds an ONNX graph transformation that wraps the UNet ONNX graph in a for-loop, so that the iteration over denoising steps during generation is performed directly inside the ONNX graph.

Sample code:

  1. First export the UNet model:
optimum-cli export onnx --model hf-internal-testing/tiny-stable-diffusion-torch --task stable-diffusion stable_diffusion

The tiny, non-representative model hf-internal-testing/tiny-stable-diffusion-torch was used to iterate quickly during development.
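As a quick sanity check (standard ONNX Runtime usage, not part of the PR), you can inspect the exported UNet's per-step signature, which is what the transformation has to rewire:

from onnxruntime import InferenceSession

sess = InferenceSession("stable_diffusion/unet/model.onnx", providers=["CPUExecutionProvider"])
print([(inp.name, inp.shape) for inp in sess.get_inputs()])
# Expect sample, timestep and encoder_hidden_states: sample becomes the
# loop-carried value, timestep is derived from the loop counter, and
# encoder_hidden_states stays a loop-invariant input.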

  2. Then perform the transformation:
import numpy as np
import onnx
from onnxruntime import InferenceSession
from optimum.onnx.graph_transformations import embedd_loop_in_unet

path_unet = "stable_diffusion/unet/model.onnx"
path_unet_with_loop = "unet_with_loop.onnx"

unet_model = onnx.load(path_unet)
unet_with_loop = embedd_loop_in_unet(unet_model)
onnx.save(unet_with_loop, path_unet_with_loop)

unet_sess = InferenceSession(path_unet, providers=["CPUExecutionProvider"])
unet_with_loop_sess = InferenceSession(path_unet_with_loop, providers=["CPUExecutionProvider"])


batch_size = 16
initial_sample = np.random.uniform(size=(batch_size, 4, 32, 32)).astype(np.float32)
hidden_states = np.random.uniform(size=(batch_size, 25, 32)).astype(np.float32)

# Baseline: iterate over the denoising steps in Python, calling the
# original UNet once per step and feeding its output back as the next input.
def outside_loop(iterations):
    sample = initial_sample
    for i in range(iterations):
        timestep = np.array([i] * batch_size, dtype=np.int64)
        out = unet_sess.run(
            ["out_sample"],
            {
                "sample": sample,
                "timestep": timestep,
                "encoder_hidden_states": hidden_states
            }
        )
        sample = out[0]
    return sample

# Transformed model: the iteration runs inside the ONNX graph, so a single
# run() call performs all the steps. The Loop trip count must be an int64 scalar.
def embedded_loop(iterations):
    num_iterations = np.array(iterations, dtype=np.int64)
    outputs = unet_with_loop_sess.run(
        ["sample"],
        {
            "num_iterations": num_iterations,
            "initial_sample": initial_sample,
            "encoder_hidden_states": hidden_states
        }
    )
    return outputs[0]

# Both paths should produce numerically close results.
outputs = outside_loop(10)
outputs_with_loop = embedded_loop(10)

print(np.max(np.abs(outputs - outputs_with_loop)))  # max absolute difference
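
For readers unfamiliar with ONNX control flow, here is a minimal, self-contained sketch of the underlying technique, not the PR's actual implementation: an ONNX Loop node whose body graph stands in for a single UNet step, with the sample as the loop-carried value. All names (loop_body, x_in, ...) are illustrative.

import numpy as np
import onnx
from onnx import TensorProto, helper
from onnxruntime import InferenceSession

# Body graph: the stand-in for one denoising step (here just x -> 0.9 * x).
# A Loop body takes (iteration_num, condition, carried...) and returns
# (condition, carried...).
iter_num = helper.make_tensor_value_info("iter_num", TensorProto.INT64, [])
cond_in = helper.make_tensor_value_info("cond_in", TensorProto.BOOL, [])
x_in = helper.make_tensor_value_info("x_in", TensorProto.FLOAT, [None])
cond_out = helper.make_tensor_value_info("cond_out", TensorProto.BOOL, [])
x_out = helper.make_tensor_value_info("x_out", TensorProto.FLOAT, [None])

scale = helper.make_tensor("scale", TensorProto.FLOAT, [], [0.9])
body = helper.make_graph(
    nodes=[
        helper.make_node("Identity", ["cond_in"], ["cond_out"]),
        helper.make_node("Constant", [], ["scale_c"], value=scale),
        helper.make_node("Mul", ["x_in", "scale_c"], ["x_out"]),
    ],
    name="loop_body",
    inputs=[iter_num, cond_in, x_in],
    outputs=[cond_out, x_out],
)

# Outer graph: a Loop node driven by an int64 trip count, with x as the
# loop-carried value (the role initial_sample/out_sample plays for the UNet).
trip = helper.make_tensor_value_info("num_iterations", TensorProto.INT64, [])
x0 = helper.make_tensor_value_info("initial_x", TensorProto.FLOAT, [None])
y = helper.make_tensor_value_info("x_final", TensorProto.FLOAT, [None])
true_t = helper.make_tensor("true_t", TensorProto.BOOL, [], [True])
graph = helper.make_graph(
    nodes=[
        helper.make_node("Constant", [], ["cond_init"], value=true_t),
        helper.make_node("Loop", ["num_iterations", "cond_init", "initial_x"], ["x_final"], body=body),
    ],
    name="looped_model",
    inputs=[trip, x0],
    outputs=[y],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
onnx.checker.check_model(model)

sess = InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
out = sess.run(["x_final"], {"num_iterations": np.array(10, dtype=np.int64), "initial_x": np.ones(4, dtype=np.float32)})
print(out[0])  # ones * 0.9 ** 10

In the real transformation the body graph is the UNet graph itself, with out_sample wired back as the next sample, which is why the transformed model only takes num_iterations, initial_sample and encoder_hidden_states as inputs.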

TODO

  • Outputs differ when running with CUDAExecutionProvider.
  • ONNX Runtime removes a lot of initializers from the transformed graph; the reason is unclear.
  • Benchmark: on CPU and GPU with the tiny model there is no speedup, so this needs to be tried on real-size models.
  • Support for schedulers: the transformed model currently iterates linearly over the steps, but the timesteps are usually computed by a scheduler; this needs to be handled (see the snippet below).
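
To illustrate the last point (a hypothetical snippet, assuming the diffusers library is installed): schedulers do not visit timesteps linearly, so the loop counter cannot be used as the timestep directly.

from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(10)
print(scheduler.timesteps)
# e.g. tensor([900, 800, ..., 100, 0]): the timesteps are descending and
# spaced out, so the embedded loop would have to index into a precomputed
# timestep tensor, and eventually express the scheduler's step() update
# inside the graph as well.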

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@echarlaix added the onnx (Related to the ONNX export) label Jun 4, 2025