Optimize Stable Diffusion Upscaling with Diffusers and PyTorch*


Stable Diffusion is a state-of-the-art model for generating high-quality images from textual descriptions, leveraging the power of latent diffusion models. The Hugging Face diffusers library provides easy-to-use pipelines to deploy and utilize the Stable Diffusion model, including generating, modifying, and upscaling images.

In this article, we will delve into the process of upscaling images generated by Stable Diffusion using the StableDiffusionUpscalePipeline from the diffusers library. We will discuss the reasons behind upscaling and demonstrate how to optimize this process for better performance on Intel® Xeon® Processors using Intel® Extension for PyTorch* (a Python package where Intel releases its newest optimizations and features before upstreaming them into open source PyTorch).

How to Optimize the StableDiffusionUpscalePipeline for Inference

The StableDiffusionUpscalePipeline from the Hugging Face diffusers library is designed to enhance the resolution of input images using the Stable Diffusion model, specifically increasing the resolution by a factor of four. The pipeline combines several components: a Variational Auto-Encoder (VAE) for encoding and decoding images, a frozen CLIP text model for text encoding, a UNet architecture for denoising image latents, and various schedulers to manage the diffusion process during image generation.

This pipeline is particularly useful for applications requiring high-resolution image outputs from lower resolution inputs, making it ideal for enhancing details in generated or real-world images. It allows users to specify various parameters such as the number of denoising steps, the guidance scale to balance fidelity to the input text against image quality, and even supports custom callbacks during the inference process to monitor or modify the generation.
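As a minimal, unoptimized sketch of these parameters, the snippet below loads the pipeline and passes a progress callback. The file name low_res.png is a placeholder, and the callback arguments follow the older diffusers callback API (newer releases use callback_on_step_end instead):

from diffusers import StableDiffusionUpscalePipeline
from PIL import Image
import torch

pipeline = StableDiffusionUpscalePipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler")

def log_progress(step: int, timestep: int, latents: torch.FloatTensor):
    # Invoked every `callback_steps` denoising steps during inference
    print(f"step {step}: latents shape {tuple(latents.shape)}")

low_res = Image.open("low_res.png").convert("RGB")
result = pipeline(
    prompt="a sharp, detailed photograph",
    image=low_res,
    num_inference_steps=25,  # number of denoising steps
    guidance_scale=7.5,      # fidelity to the prompt vs. image quality
    callback=log_progress,
    callback_steps=5,
).images[0]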

For detailed examples of how to use this pipeline and configure its parameters for optimal results, check out the Hugging Face documentation and model hub.

Additionally, to boost the performance of the StableDiffusionUpscalePipeline, we can optimize its components individually before combining them. Intel Extension for PyTorch plays a crucial role in this optimization. The extension enhances PyTorch with advanced optimizations for an additional performance increase on Intel hardware. These enhancements utilize the capabilities of Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Vector Neural Network Instructions (VNNI), and Intel® Advanced Matrix Extensions (Intel® AMX) within Intel CPUs. Intel Extension for PyTorch introduces an accessible Python API, `ipex.optimize()`, which automatically optimizes a module so it can leverage these hardware instructions for greater efficiency.

Code Sample

The code snippet below demonstrates how to upscale an image using the StableDiffusionUpscalePipeline from the diffusers library, optimized for performance with Intel Extension for PyTorch. The UNet, VAE, and text encoder components of the pipeline are each targeted and optimized separately for CPU inference.

1. Setting Up the Environment

It is recommended to create a Conda virtual environment for the installation. Install PyTorch, Intel Extension for PyTorch, transformers, and diffusers:

python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
python -m pip install intel-extension-for-pytorch
python -m pip install oneccl_bind_pt --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/
python -m pip install transformers
python -m pip install diffusers

Check out the Intel Extension for PyTorch installation guide for more information about installation options.

2. Steps to Optimize

First, let’s import all the necessary packages including Intel Extension for PyTorch and load the sample image that we wish to upscale:

from diffusers import StableDiffusionUpscalePipeline 
import torch 
from PIL import Image 

###########################################
import intel_extension_for_pytorch as ipex 
########################################### 

# Load the image to upscale from the filesystem

img = Image.open("./sample.png").convert("RGB")  # ensure a 3-channel RGB input

Next, let’s look at how the upscaling pipeline can be optimized using features from Intel Extension for PyTorch.

# Initialize the upscaling pipeline with a pre-trained model 
pipeline = StableDiffusionUpscalePipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler") 
prompt = "HD, 4k, hyper realistic, extremely detailed, professional, vibrant, not grainy, smooth" 
 
# Convert model to channels last format for performance optimization 
pipeline.unet = pipeline.unet.to(memory_format=torch.channels_last) 
pipeline.vae = pipeline.vae.to(memory_format=torch.channels_last) 
pipeline.text_encoder = pipeline.text_encoder.to(memory_format=torch.channels_last) 
 
# Optimize the model components with IPEX for better CPU performance 
pipeline.unet = ipex.optimize(pipeline.unet.eval(), dtype=torch.bfloat16, inplace=True) 
pipeline.vae = ipex.optimize(pipeline.vae.eval(), dtype=torch.bfloat16, inplace=True) 
pipeline.text_encoder = ipex.optimize(pipeline.text_encoder.eval(), dtype=torch.bfloat16, inplace=True)

Each component of the pipeline is targeted and optimized separately. First, we convert the UNet, VAE, and text encoder to channels last format. Channels last stores tensor data in memory in batch, height, width, channels (NHWC) order, while the logical shape remains NCHW. This layout aligns better with the memory access patterns of many CPU kernels and reduces the data reordering needed between operations, which can significantly speed up convolutional neural networks.
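A quick way to see what channels last changes is to compare tensor strides; this standalone sketch shows that only the memory layout differs, not the logical shape:

import torch

x = torch.randn(1, 3, 8, 8)                  # contiguous NCHW layout
y = x.to(memory_format=torch.channels_last)  # NHWC memory layout

print(x.shape, y.shape)  # both torch.Size([1, 3, 8, 8])
print(x.stride())        # (192, 64, 8, 1): whole channel planes stored contiguously
print(y.stride())        # (192, 1, 24, 3): channels interleaved per pixel
print(y.is_contiguous(memory_format=torch.channels_last))  # True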

Similarly, each component is optimized through `ipex.optimize()` with the data type set to BFloat16. Operations running in BFloat16 precision are accelerated by Intel® AMX, a built-in AI accelerator for lower-precision data types such as BFloat16 and INT8, available on 4th Gen Intel Xeon Scalable processors and later; passing `dtype=torch.bfloat16` to `optimize()` enables these optimizations.
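Before relying on BFloat16 acceleration, it can be useful to confirm that the CPU advertises the relevant instruction sets. This is a Linux-only sketch that inspects /proc/cpuinfo for the AMX and AVX-512 BF16 feature flags:

def cpu_has_flag(flag: str) -> bool:
    # CPU feature flags are listed on the "flags" lines of /proc/cpuinfo
    with open("/proc/cpuinfo") as f:
        return any(flag in line for line in f if line.startswith("flags"))

print("AMX BF16:    ", cpu_has_flag("amx_bf16"))
print("AVX-512 BF16:", cpu_has_flag("avx512_bf16"))

Returning to the pipeline, the upscaling itself runs under mixed precision: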

# Perform the upscaling with autocasting for mixed precision 
with torch.cpu.amp.autocast(): 
    upscaled_image = pipeline(prompt=prompt, image=img, num_inference_steps=20, guidance_scale=0, generator=torch.manual_seed(33)).images[0] 
 
# Save the upscaled image to disk 
filename = 'upscaled_img.png' 
upscaled_image.save(filename)

Finally, we run the upscaling under mixed precision, which combines the computational speed and memory savings of lower-precision arithmetic (like BF16) with the numerical stability of higher precision (like FP32). Wrapping the call in `torch.cpu.amp.autocast()` applies mixed precision to the pipeline automatically. The resulting pipeline is now optimized with Intel Extension for PyTorch and can upscale images with lower latency.
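To see what autocast does at the operator level, here is a standalone sketch: eligible operations such as matrix multiplication are dispatched in BFloat16 even though the inputs are FP32.

import torch

a = torch.randn(4, 4)  # FP32 inputs
b = torch.randn(4, 4)

with torch.cpu.amp.autocast():
    c = a @ b  # matmul is on the CPU autocast BF16 list

print(c.dtype)  # torch.bfloat16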

3. Advanced Environment Setup

This section shows how to get an additional performance boost by setting environment variables and configurations tuned for Intel Xeon processors, especially for parallel computing and memory management. The script env_activate.sh sets a series of environment variables specific to the Intel OpenMP library. It also uses LD_PRELOAD to control which shared libraries are loaded before others; the script builds the paths to these libraries dynamically, ensuring they are loaded at runtime before the application starts.

Steps to set up Advanced Environment for high performance on Intel Xeon processors:

# Install two packages that serve as dependencies to use the script
pip install intel-openmp
conda install -y gperftools -c conda-forge

git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout v2.3.100+cpu

cd examples/cpu/inference/python/llm

# Activate environment variables
source ./tools/env_activate.sh
# Run a script with the code from the previous section
python run_upscaler_pipeline.py
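To confirm that the activation script took effect in the current shell, you can print a few of the variables it is expected to export. The variable names below are typical of Intel's OpenMP and allocator tuning and are shown as an illustration; the exact set may differ between script versions:

import os

for var in ("LD_PRELOAD", "KMP_AFFINITY", "KMP_BLOCKTIME", "OMP_NUM_THREADS"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")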

After running the above steps, your environment is configured with performance flags and ready to run the StableDiffusionUpscalePipeline optimized in the previous section. Combined with the Intel Extension for PyTorch optimizations, this setup can deliver additional performance.

Next Steps

Leverage the optimizations from Intel Extension for PyTorch and utilize the full capabilities of Intel's hardware innovations to enhance the performance of your AI applications. Download and try the AI Tools and Intel Extension for PyTorch for yourself to build various end-to-end AI applications.

We encourage you to also check out and incorporate Intel’s other AI/ML Framework optimizations and tools into your AI workflow and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio to help you prepare, build, deploy, and scale your AI solutions.

For more details about 4th Gen Intel Xeon Scalable processors, visit Intel's AI Solution Platform portal where you can learn how Intel is empowering developers to run end-to-end AI pipelines on these powerful CPUs.

Useful resources


Get the Software

Download Intel Extension for PyTorch as a part of the AI Tools Selector, or you can get its standalone version.