
[Stable Diffusion] Wrap the UNet under a for-loop in the ONNX graph #668


Draft · wants to merge 1 commit into main

Conversation

@michaelbenayoun (Member) commented Jan 4, 2023

What does this PR do?

This PR adds an ONNX graph transformation that wraps the UNet ONNX graph in a for-loop, so that the iteration over denoising steps during generation is performed directly inside the ONNX graph.

Sample code:

  1. First export the UNet model:
optimum-cli export onnx --model hf-internal-testing/tiny-stable-diffusion-torch --task stable-diffusion stable_diffusion

The tiny, non-representative model hf-internal-testing/tiny-stable-diffusion-torch was used to iterate quickly during development.
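As a quick sanity check (standard ONNX Runtime usage, not part of the PR), you can inspect the exported UNet's per-step signature, which is what the transformation has to rewire:

from onnxruntime import InferenceSession

sess = InferenceSession("stable_diffusion/unet/model.onnx", providers=["CPUExecutionProvider"])
print([(inp.name, inp.shape) for inp in sess.get_inputs()])
# Expect sample, timestep and encoder_hidden_states: sample becomes the
# loop-carried value, timestep is derived from the loop counter, and
# encoder_hidden_states stays a loop-invariant input.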

  2. Then perform the transformation:
import numpy as np
import onnx
from onnxruntime import InferenceSession
from optimum.onnx.graph_transformations import embedd_loop_in_unet

path_unet = "stable_diffusion/unet/model.onnx"
path_unet_with_loop = "unet_with_loop.onnx"

unet_model = onnx.load(path_unet)
unet_with_loop = embedd_loop_in_unet(unet_model)
onnx.save(unet_with_loop, path_unet_with_loop)

unet_sess = InferenceSession(path_unet, providers=["CPUExecutionProvider"])
unet_with_loop_sess = InferenceSession(path_unet_with_loop, providers=["CPUExecutionProvider"])


batch_size = 16
initial_sample = np.random.uniform(size=(batch_size, 4, 32, 32)).astype(np.float32)
hidden_states = np.random.uniform(size=(batch_size, 25, 32)).astype(np.float32)

# Baseline: iterate over the denoising steps in Python, calling the
# original UNet once per step and feeding its output back as the next input.
def outside_loop(iterations):
    sample = initial_sample
    for i in range(iterations):
        timestep = np.array([i] * batch_size, dtype=np.int64)
        out = unet_sess.run(
            ["out_sample"],
            {
                "sample": sample,
                "timestep": timestep,
                "encoder_hidden_states": hidden_states
            }
        )
        sample = out[0]
    return sample

# Transformed model: the iteration runs inside the ONNX graph, so a single
# run() call performs all the steps. The Loop trip count must be an int64 scalar.
def embedded_loop(iterations):
    num_iterations = np.array(iterations, dtype=np.int64)
    outputs = unet_with_loop_sess.run(
        ["sample"],
        {
            "num_iterations": num_iterations,
            "initial_sample": initial_sample,
            "encoder_hidden_states": hidden_states
        }
    )
    return outputs[0]

# Both paths should produce numerically close results.
outputs = outside_loop(10)
outputs_with_loop = embedded_loop(10)

print(np.max(np.abs(outputs - outputs_with_loop)))  # max absolute difference
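
For readers unfamiliar with ONNX control flow, here is a minimal, self-contained sketch of the underlying technique, not the PR's actual implementation: an ONNX Loop node whose body graph stands in for a single UNet step, with the sample as the loop-carried value. All names (loop_body, x_in, ...) are illustrative.

import numpy as np
import onnx
from onnx import TensorProto, helper
from onnxruntime import InferenceSession

# Body graph: the stand-in for one denoising step (here just x -> 0.9 * x).
# A Loop body takes (iteration_num, condition, carried...) and returns
# (condition, carried...).
iter_num = helper.make_tensor_value_info("iter_num", TensorProto.INT64, [])
cond_in = helper.make_tensor_value_info("cond_in", TensorProto.BOOL, [])
x_in = helper.make_tensor_value_info("x_in", TensorProto.FLOAT, [None])
cond_out = helper.make_tensor_value_info("cond_out", TensorProto.BOOL, [])
x_out = helper.make_tensor_value_info("x_out", TensorProto.FLOAT, [None])

scale = helper.make_tensor("scale", TensorProto.FLOAT, [], [0.9])
body = helper.make_graph(
    nodes=[
        helper.make_node("Identity", ["cond_in"], ["cond_out"]),
        helper.make_node("Constant", [], ["scale_c"], value=scale),
        helper.make_node("Mul", ["x_in", "scale_c"], ["x_out"]),
    ],
    name="loop_body",
    inputs=[iter_num, cond_in, x_in],
    outputs=[cond_out, x_out],
)

# Outer graph: a Loop node driven by an int64 trip count, with x as the
# loop-carried value (the role initial_sample/out_sample plays for the UNet).
trip = helper.make_tensor_value_info("num_iterations", TensorProto.INT64, [])
x0 = helper.make_tensor_value_info("initial_x", TensorProto.FLOAT, [None])
y = helper.make_tensor_value_info("x_final", TensorProto.FLOAT, [None])
true_t = helper.make_tensor("true_t", TensorProto.BOOL, [], [True])
graph = helper.make_graph(
    nodes=[
        helper.make_node("Constant", [], ["cond_init"], value=true_t),
        helper.make_node("Loop", ["num_iterations", "cond_init", "initial_x"], ["x_final"], body=body),
    ],
    name="looped_model",
    inputs=[trip, x0],
    outputs=[y],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
onnx.checker.check_model(model)

sess = InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
out = sess.run(["x_final"], {"num_iterations": np.array(10, dtype=np.int64), "initial_x": np.ones(4, dtype=np.float32)})
print(out[0])  # ones * 0.9 ** 10

In the real transformation the body graph is the UNet graph itself, with out_sample wired back as the next sample, which is why the transformed model only takes num_iterations, initial_sample and encoder_hidden_states as inputs.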

TODO

  • Outputs differ when running with CUDAExecutionProvider.
  • ONNX Runtime removes a lot of initializers from the transformed graph; the reason is unclear.
  • Benchmark: on CPU and GPU with the tiny model there is no speedup, so this needs to be tried on real-size models.
  • Support for schedulers: the transformed model currently iterates linearly over the steps, but the timesteps are usually computed by a scheduler; this needs to be handled (see the snippet below).
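
To illustrate the last point (a hypothetical snippet, assuming the diffusers library is installed): schedulers do not visit timesteps linearly, so the loop counter cannot be used as the timestep directly.

from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(10)
print(scheduler.timesteps)
# e.g. tensor([900, 800, ..., 100, 0]): the timesteps are descending and
# spaced out, so the embedded loop would have to index into a precomputed
# timestep tensor, and eventually express the scheduler's step() update
# inside the graph as well.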

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@echarlaix added the onnx (Related to the ONNX export) label Jun 4, 2025