stable-diffusion-2-depth training problems #11925

Henry-Bi · 2025-07-15T02:46:36Z


Hi everyone, I have a technical question regarding the fine-tuning process of the stable-diffusion-2-depth model, and I'd love to hear your insights.
The model card states:
"This stable-diffusion-2-depth model is resumed from stable-diffusion-2-base (512-base-ema.ckpt) and finetuned for 200k steps. Added an extra input channel to process the (relative) depth prediction produced by MiDaS (dpt_hybrid) which is used as an additional conditioning."
This implies that the number of input channels of the self.conv_in layer was increased from 4 (latent only) to 5 (latent + depth map). How can one "resume" fine-tuning from a checkpoint when the model's architecture has been modified?


asomoza · 2025-07-15T20:18:54Z

asomoza
Jul 15, 2025
Maintainer

Hi, the only people who can truly resume the training are the ones who ran it originally, since they have both the weights and the optimizer state. Just to clarify, what you're going to do is a finetune of a finetune.

Since the model is already supported by diffusers, you just need to adapt the training script you want to use to SD2. Sadly, that model architecture wasn't popular or widely used, so I don't think we have any training scripts for it.

When you load the model with that repo id, it will already be configured to accept the depth map; you just need to adapt the training script to handle the depth maps during training.
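On the original question of how a 4-channel checkpoint can seed a 5-channel model: one common approach (not confirmed as what was done for stable-diffusion-2-depth) is to build a wider input convolution, copy the pretrained weights into the first 4 input channels, and zero-initialize the weights of the new channel. The model then starts out behaving exactly like the base model and learns to use the depth channel during fine-tuning. Below is a minimal sketch in plain PyTorch; the helper name `expand_conv_in`, the stand-in layer, and the tensor shapes are assumptions for illustration only.

```python
import torch
from torch import nn


def expand_conv_in(conv: nn.Conv2d, extra_channels: int = 1) -> nn.Conv2d:
    """Return a copy of `conv` that accepts `extra_channels` more input channels.

    Pretrained weights are copied over; weights for the new channels start at zero.
    (Hypothetical helper, not part of diffusers.)
    """
    new_conv = nn.Conv2d(
        conv.in_channels + extra_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        # Copy the pretrained kernels into the original input-channel slots.
        new_conv.weight[:, : conv.in_channels].copy_(conv.weight)
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv


# Stand-in for a pretrained UNet input layer (4 latent channels in).
old_conv_in = nn.Conv2d(4, 320, kernel_size=3, padding=1)
new_conv_in = expand_conv_in(old_conv_in, extra_channels=1)

latents = torch.randn(1, 4, 64, 64)
depth = torch.randn(1, 1, 64, 64)  # placeholder for a resized depth map

with torch.no_grad():
    before = old_conv_in(latents)
    after = new_conv_in(torch.cat([latents, depth], dim=1))

# The zero-initialized depth channel contributes nothing at step 0.
same_output = torch.allclose(before, after, atol=1e-4)
```

Because the widened layer reproduces the base model's output at initialization, only fresh optimizer state is needed, which is why this is a finetune rather than a true resume.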

5 replies

Henry-Bi Jul 16, 2025
Author

Thanks for your reply, @asomoza , and for the clarification! So, the fine-tuning likely focused heavily on adapting the self.conv_in layer for depth-aware image generation, while leveraging the immense capability of the pre-existing layers. Even so, what I find incredible is that they achieved such excellent spatial consistency and quality in just 200k fine-tuning steps.

asomoza Jul 16, 2025
Maintainer

Yeah, personally I prefer ControlNets because they let you switch the base model, so they're more practical. I haven't tested the SD2 one, so I don't know how good it is, but the good ControlNets for SD and SDXL were trained with 1M-3M images, so there's definitely a difference in the number of training steps.

Henry-Bi Jul 16, 2025
Author

Thanks again! Our team is focused on super-resolution tasks. For SD2-based models, using a ControlNet architecture works quite well for us. We can achieve good results because, in my opinion, the U-Net architecture is inherently well-suited for image-guided generation. However, we're finding the MM-DiT architecture in SD3 to be more complex for this kind of deterministic enhancement. My current perspective is that SD2's U-Net is more strongly biased towards an image-to-image pipeline, whereas MM-DiT seems to weigh text and image conditioning more symmetrically. Because of this, when we try to apply it to a deterministic task like super-resolution, it's difficult to get consistent enhancement results. Do you have any advice or suggestions about this topic?

asomoza Jul 16, 2025
Maintainer

I think you're on the right track. I don't really know if MM-DiT models are good for super-resolution tasks. I don't normally use them because they're really slow and resource hungry; upscaling with them is even slower and requires a very high-end GPU, and right now the benefits aren't great enough to justify it.

What I do know is that the current SOTA model for super resolution is called SUPIR and it's based on SDXL, so a U-Net. There is also a model called Stable Cascade that worked really well with a small latent in the first stage and upscaled it with a second U-Net in the second stage.

The other solution right now is to train or use a Tile ControlNet and do a tiled img2img over an image upscaled with an ESRGAN model. I mostly do this with SDXL, but it also worked well with SD 1.5. I haven't really tested this with MM-DiT models, but there's an upscaler ControlNet for Flux (this is just a Tile ControlNet that they named differently, though).
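For the tiled img2img approach, the upscaled image is processed in overlapping tiles so that each pass fits in memory and seams can be blended afterwards. A rough sketch of only the tile-coordinate step is below; the function name `tile_boxes` and the tile and overlap sizes are assumptions, and running img2img per tile and blending the results are not shown.

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (left, top, right, bottom) boxes that cover the image with overlapping tiles."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")

    def starts(size):
        # One tile is enough when the image is no larger than the tile.
        if size <= tile:
            return [0]
        stride = tile - overlap
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)  # last tile sits flush with the edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


# Example: a 2048x2048 image (e.g. after ESRGAN upscaling).
boxes = tile_boxes(2048, 2048)
```

Each box can be cropped with something like PIL's `Image.crop(box)` before being passed to the img2img pipeline.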

Answer selected by Henry-Bi

Henry-Bi Jul 17, 2025
Author

Thank you very much for your patient and detailed reply.
