In this blog, we introduce DistriFusion, a training-free algorithm to harness multiple GPUs to accelerate diffusion model inference without sacrificing image quality. It can reduce SDXL latency by up to 6.1× on 8 A100s. Our work has been accepted by CVPR 2024 as a highlight.
The advent of AI-generated content marks a seismic technological leap, with tools like Stable Diffusion, Adobe Firefly, Midjourney, and Sora transforming text prompts into striking high-resolution visuals, thanks to advancements in diffusion models. This revolution unlocks numerous synthesis and editing applications for images and videos, demanding more responsive interaction between users and models to customize outputs precisely. However, as Sora shows us, the future of diffusion models lies in high resolution, which poses significant challenges in computation and latency and presents a tremendous barrier to real-time applications.
Recent efforts to accelerate diffusion model inference mainly focus on reducing sampling steps and optimizing neural network inference. As computational resources grow rapidly, leveraging multiple GPUs to speed up inference is appealing. In the natural language processing domain, tensor parallelism across GPUs significantly cuts down latency. However, it is inefficient for diffusion models: their large activations incur communication costs high enough to negate the benefits of distributed computing. Beyond tensor parallelism, are there alternative strategies for distributing workloads across multiple GPU devices so that single-image generation can also enjoy the free-lunch speedups from multiple devices?
A naïve approach would be to divide the image into several patches, assigning each patch to a different device for generation, as illustrated in figure (a) below. This method allows each device to operate independently and in parallel. However, it produces clearly visible seams at the patch boundaries because the individual patches never interact with each other. Introducing interactions among patches to address this issue would incur excessive synchronization costs, offsetting the benefits of parallel processing.
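For intuition, here is a minimal sketch of this naïve patch baseline in PyTorch-style code (assuming `torch.distributed` is initialized with an NCCL backend; `denoise_fn` is a placeholder for one denoiser forward pass, not part of our released code):

```python
import torch
import torch.distributed as dist

def naive_patch_step(latent, denoise_fn, t, rank, world_size):
    """Naïve patch parallelism: each rank denoises only its own slice of the
    latent, with no communication between patches (hence the visible seams)."""
    # Split the latent along the height dimension into one patch per device.
    my_patch = latent.chunk(world_size, dim=2)[rank]

    # Each device runs the model only on its patch: fast and fully parallel,
    # but the model never sees any context from the neighboring patches.
    my_out = denoise_fn(my_patch, t)

    # Reassemble the full latent only when the complete image is needed.
    gathered = [torch.empty_like(my_out) for _ in range(world_size)]
    dist.all_gather(gathered, my_out)
    return torch.cat(gathered, dim=2)
```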
In this blog, we present DistriFusion, a method that enables running diffusion models across multiple devices in parallel to reduce latency without hurting image quality. As depicted in figure (b) above, our approach is also based on patch parallelism, which divides the image into multiple patches, each assigned to a different device. Our key observation is that the inputs across adjacent denoising steps in diffusion models are similar. Therefore, we adopt synchronous communication solely for the first step. For the subsequent steps, we reuse the pre-computed activations from the previous step to provide global context and patch interactions for the current step. We further co-design an inference framework to implement our algorithm. Specifically, our framework effectively hides the communication overhead within the computation via asynchronous communication. It also sparsely runs the convolutional and attention layers exclusively on the assigned regions, thereby proportionally reducing per-device computation. Our method, distinct from data, tensor, or pipeline parallelism, introduces a new parallelization opportunity: displaced patch parallelism. Please refer to our paper and code for more details.
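To make displaced patch parallelism concrete, below is a simplified, illustrative sketch of a self-attention layer that reuses stale activations, written against `torch.distributed`. The class and buffer names are hypothetical, and the real DistriFusion layers handle convolutions, boundary regions, and scheduling differently; the sketch only captures the reuse-and-overlap pattern described above.

```python
import torch
import torch.distributed as dist
from torch import nn

class DisplacedPatchAttention(nn.Module):
    """Sketch of self-attention under displaced patch parallelism: queries come
    from the fresh local patch, keys/values from a full-resolution token map
    whose non-local regions are one denoising step stale."""

    def __init__(self, dim, num_heads, rank, world_size):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.rank, self.world_size = rank, world_size
        self.stale_tokens = None   # full token map gathered at the previous step
        self.handle, self.buffer = None, None

    def forward(self, local_tokens):                 # (batch, local_len, dim)
        # Finish the all-gather launched during the previous denoising step;
        # it has had an entire step of computation to hide behind.
        if self.handle is not None:
            self.handle.wait()
            self.stale_tokens = torch.cat(self.buffer, dim=1)

        # Launch an async all-gather of this step's fresh local tokens. It is
        # only needed at the *next* step, so it overlaps with computation.
        self.buffer = [torch.empty_like(local_tokens) for _ in range(self.world_size)]
        self.handle = dist.all_gather(self.buffer, local_tokens, async_op=True)

        if self.stale_tokens is None:
            # Very first denoising step: no previous activations exist yet,
            # so synchronize once, as in the paper.
            self.handle.wait()
            self.handle = None
            self.stale_tokens = torch.cat(self.buffer, dim=1)

        # Keys/values: the previous step's full map with our own slice refreshed,
        # so local interactions are exact and remote context is one step stale.
        kv = self.stale_tokens.clone()
        n = local_tokens.shape[1]
        kv[:, self.rank * n:(self.rank + 1) * n] = local_tokens

        # Queries only for the local patch: per-device attention computation
        # shrinks roughly proportionally to the number of devices.
        out, _ = self.attn(local_tokens, kv, kv)
        return out
```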
In the above figure, we show some qualitative visual results of DistriFusion with the 50-step DDIM sampler on SDXL. Latency is measured on NVIDIA A100 GPUs. ParaDiGMS expends considerable computational resources on guessing future denoising steps, resulting in much higher total MACs (computation amounts). Besides, it also suffers from some performance degradation. In contrast, DistriFusion simply distributes workloads across multiple GPUs, maintaining a constant total computation. The Naïve Patch baseline, while lower in total MACs, lacks the crucial inter-patch interaction, leading to fragmented outputs. This limitation significantly impacts image quality, as reflected across all evaluation metrics. Our DistriFusion preserves these interactions well: even when using 8 devices, it achieves FID scores (a quality metric; lower is better) comparable to those of the original model.
In the above figure, we show the total latency of DistriFusion with SDXL using the 50-step DDIM sampler for generating a single image across different resolutions on NVIDIA A100 GPUs. When generating 1024×1024 images, our speedups are limited by the low GPU utilization of SDXL. After scaling the resolution to 2048×2048 and 3840×3840, the GPU devices are better utilized. Specifically, for 3840×3840 images, DistriFusion reduces the latency by 1.8×, 3.4×, and 6.1× with 2, 4, and 8 A100s, respectively. Note that these results are benchmarked with PyTorch. With more advanced compilers, such as TVM and TensorRT, we anticipate even higher GPU utilization and consequently more pronounced speedups from DistriFusion, as observed in SIGE. In practical use, the batch size often doubles due to classifier-free guidance. We can first split the batch and then apply DistriFusion to each batch separately. This approach further improves the total speedups to 3.6× and 6.6× with 4 and 8 A100s for generating a single 3840×3840 image, respectively.
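One possible way to wire up this batch splitting (a hypothetical sketch, not our shipped API) is to form one process group per guidance branch and run patch parallelism inside each group:

```python
import torch.distributed as dist

# Hypothetical setup for combining batch splitting with patch parallelism on
# 8 GPUs: ranks 0-3 handle the conditional branch of classifier-free guidance,
# ranks 4-7 the unconditional branch, and each group of 4 devices then applies
# DistriFusion-style patch parallelism internally.
world_size = dist.get_world_size()          # e.g. 8
rank = dist.get_rank()
half = world_size // 2

# new_group must be called identically on every rank, even for groups
# a given rank does not belong to.
cond_group = dist.new_group(ranks=list(range(half)))
uncond_group = dist.new_group(ranks=list(range(half, world_size)))

my_group = cond_group if rank < half else uncond_group
patch_index = rank % half                   # this device's patch within its branch
```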
In the NLP domain, tensor parallelism (TP) is frequently used to deploy and accelerate large language models (LLMs), which have substantial model sizes but relatively small activations. Conversely, diffusion models, while generally smaller than LLMs, are often bottlenecked by the large activation size due to the spatial dimensions, particularly when generating high-resolution content. This characteristic results in prohibitive communication costs for tensor parallelism, making it an impractical choice for diffusion models.
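As a rough back-of-envelope (with hypothetical but plausible shapes, not measurements from our paper), the size of a single activation map already suggests why reducing activations across devices at every parallelized layer becomes the bottleneck:

```python
# Illustrative only: one mid-level U-Net activation map in fp16 for a
# high-resolution image. Tensor parallelism would AllReduce tensors of this
# scale at every parallelized layer.
batch, channels, height, width = 2, 640, 480, 480   # e.g. a 3840×3840 image with an 8× VAE
bytes_per_elem = 2                                   # fp16
activation_mib = batch * channels * height * width * bytes_per_elem / 2**20
print(f"one activation map ≈ {activation_mib:.0f} MiB")   # ≈ 562 MiB
```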
In the above table, we benchmark our latency against synchronous tensor parallelism (Sync. TP) and synchronous patch parallelism (Sync. PP), and report the corresponding communication amounts. Compared to TP, PP offers better independence between devices, eliminating the need for communication within cross-attention and linear layers. For convolutional layers, communication is only required at the patch boundaries, which represent a minimal portion of the entire tensor. Moreover, PP uses AllGather instead of AllReduce, leading to lower communication demands and no additional use of computing resources. As a result, PP requires 60% less communication and is 1.6~2.1× faster than TP, making it a more efficient approach for deploying diffusion models.
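To make the AllGather-versus-AllReduce distinction concrete, here is a minimal `torch.distributed` comparison with illustrative shapes (not the exact tensors from our benchmark, and assuming an initialized NCCL process group):

```python
import torch
import torch.distributed as dist

# Tensor parallelism: every device holds a *partial sum* of the full-size
# activation, so an AllReduce over the entire tensor is needed at each
# parallelized layer, and the reduction itself consumes SM compute.
full_partial = torch.randn(1, 640, 256, 256, device="cuda")   # illustrative shape
dist.all_reduce(full_partial)                                  # communicates the full tensor

# Patch parallelism: every device holds a *complete* result for its own patch,
# so for a convolution only a thin halo of boundary rows must be exchanged,
# here via an AllGather of a few rows instead of a reduction of the whole map.
halo = full_partial[:, :, :2].contiguous()                     # e.g. 2 boundary rows
gathered = [torch.empty_like(halo) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, halo)
```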
We also include a theoretical PP baseline without any communication (No Comm.) to demonstrate the communication overhead in Sync. PP and DistriFusion. Compared to Sync. PP, DistriFusion further cuts this overhead by over 50%. The remaining overhead mainly comes from our current usage of the NVIDIA Collective Communication Library (NCCL) for asynchronous communication. NCCL kernels use SMs (the computing resources on GPUs), which slows down the overlapped computation. Using remote memory access could bypass this issue and close the performance gap.
In this blog, we introduced DistriFusion, which accelerates diffusion model inference by parallelizing it across multiple GPUs. Our method divides the image into patches, assigning each to a separate GPU, and reuses pre-computed activations from previous steps to maintain patch interactions. On Stable Diffusion XL, our method achieves up to a 6.1× speedup on 8 NVIDIA A100s. This advancement not only enhances the efficiency of AI-generated content creation but also sets a new benchmark for future research in parallel computing for AI applications.