SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Muyang Li*, Yujun Lin*, Zhekai Zhang*, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han
MIT, NVIDIA, CMU, Princeton, UC Berkeley, SJTU, Pika Labs
(* indicates equal contribution)


Abstract

Diffusion models have proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, and conventional post-training quantization methods for large language models, such as smoothing, become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Unlike smoothing, which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to absorb the weight outliers with Singular Value Decomposition (SVD). This process eases quantization on both sides. However, naïvely running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine, Nunchaku, that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without re-quantization. Extensive experiments on SDXL, PixArt-Σ, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage of the 12B FLUX.1 models by 3.5×, achieving a 3.0× speedup over the 4-bit weight-only quantized baseline on a 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced.

Overview

SVDQuant is a post-training quantization technique for 4-bit weights and activations that maintains visual fidelity well. On the 12B FLUX.1-dev, it achieves a 3.6× memory reduction compared to the BF16 model. By eliminating CPU offloading, it offers an 8.7× speedup over the 16-bit model on a 16GB laptop 4090 GPU, running 3× faster than the NF4 W4A16 baseline. On PixArt-Σ, it demonstrates significantly superior visual quality over other W4A4 and even W4A8 baselines. "E2E" denotes the end-to-end latency, including the text encoder and the VAE decoder.
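
For reference, the sketch below shows how the 4-bit FLUX.1-dev checkpoint can be loaded through Hugging Face diffusers with the Nunchaku engine. This is a sketch from memory of the repository's README: the class name NunchakuFluxTransformer2dModel and the checkpoint id mit-han-lab/svdq-int4-flux.1-dev are our best recollection and should be verified against the current release.

import torch
from diffusers import FluxPipeline
from nunchaku import NunchakuFluxTransformer2dModel  # name per the repo README; please verify

# Load the SVDQuant 4-bit transformer and plug it into the standard FLUX pipeline.
transformer = NunchakuFluxTransformer2dModel.from_pretrained("mit-han-lab/svdq-int4-flux.1-dev")
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")
image = pipeline("A cat holding a sign that says hello world",
                 num_inference_steps=50, guidance_scale=3.5).images[0]
image.save("flux.1-dev.png")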

Method

SVDQuant: Absorbing Outliers via Low-Rank Branch

Overview of SVDQuant

Stage 1: Originally, both the activation \( \boldsymbol{X} \) and weights \( \boldsymbol{W} \) contain outliers, making 4-bit quantization challenging. Stage 2: We migrate the outliers from activations to weights, resulting in the updated activation \( \hat{\boldsymbol{X}} \) and weights \( \hat{\boldsymbol{W}} \). While \( \hat{\boldsymbol{X}} \) becomes easier to quantize, \( \hat{\boldsymbol{W}} \) now becomes more difficult. Stage 3: SVDQuant further decomposes \( \hat{\boldsymbol{W}} \) into a low-rank component \( \boldsymbol{L}_1 \boldsymbol{L}_2 \) and a residual \( \hat{\boldsymbol{W}} - \boldsymbol{L}_1 \boldsymbol{L}_2 \) with SVD. Thus, the quantization difficulty is alleviated by the low-rank branch, which runs at 16-bit precision.
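
The three stages can be condensed into a few lines of NumPy. The sketch below is a minimal illustration rather than the paper's exact recipe: the smoothing factor lam, the per-tensor fake quantizer, and the helper names fake_quant and svdquant_linear are our simplified choices (the full method calibrates per-channel smoothing factors and uses group quantization on real INT4 hardware).

import numpy as np

def fake_quant(t, n_bits=4):
    # Symmetric fake quantization: round to n_bits signed integers, then dequantize.
    qmax = 2 ** (n_bits - 1) - 1                      # 7 for 4 bits
    scale = np.abs(t).max() / qmax + 1e-12
    return np.clip(np.round(t / scale), -qmax - 1, qmax) * scale

def svdquant_linear(X, W, rank=32):
    # Approximate Y = X @ W with a 16-bit low-rank branch plus a 4-bit residual branch.
    # Stage 2: migrate activation outliers into the weights via a smoothing factor.
    lam = np.sqrt(np.abs(X).max(axis=0)) + 1e-6       # illustrative per-channel factor
    X_hat, W_hat = X / lam, W * lam[:, None]
    # Stage 3: peel the dominant singular directions off W_hat with SVD.
    U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
    L1, L2 = U[:, :rank] * S[:rank], Vt[:rank]        # low-rank factors, kept in high precision
    R = W_hat - L1 @ L2                               # residual: much easier to quantize
    return X_hat @ L1 @ L2 + fake_quant(X_hat) @ fake_quant(R)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 256)); X[:, 0] *= 50     # channel 0 carries activation outliers
W = rng.standard_normal((256, 128)) / 16
Y = X @ W
print("naive W4A4 error:", np.linalg.norm(fake_quant(X) @ fake_quant(W) - Y))
print("SVDQuant-style error:", np.linalg.norm(svdquant_linear(X, W) - Y))

Because the outlier channel dominates the per-tensor scale, naïve W4A4 quantization suffers, while the smoothed, SVD-corrected version should incur a much smaller error on this toy example.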

Nunchaku: Fusing Low-Rank and Low-Bit Branch Kernels

Nunchaku Engine Design

(a) Naïvely running the low-rank branch with rank 32 introduces a 57% latency overhead, due to the extra read of 16-bit inputs in the Down Projection and the extra write of 16-bit outputs in the Up Projection. Nunchaku eliminates this overhead with kernel fusion. (b) The Down Projection and Quantize kernels consume the same input, while the Up Projection and 4-Bit Compute kernels produce the same output. To reduce data movement, we fuse the first pair and the latter pair of kernels together.
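
Nunchaku's actual kernels are CUDA; the NumPy sketch below (our illustration, with quantize_int4 standing in for the real packed-INT4 kernels) only shows the data flow being fused: which operations share a read of the input and which share a write of the output.

import numpy as np

def quantize_int4(t):
    # Simulated symmetric 4-bit quantization (the real kernel packs signed integers).
    scale = np.abs(t).max() / 7 + 1e-8
    return np.clip(np.round(t / scale), -8, 7), scale

x = np.random.randn(1024, 3072)                  # activation (float stands in for FP16)
w_q, w_scale = quantize_int4(np.random.randn(3072, 3072))
L1, L2 = np.random.randn(3072, 32) / 64, np.random.randn(32, 3072) / 64

# Fused kernel 1: Down Projection + Quantize share a single read of x.
x_low = x @ L1                                   # rank-32 hidden state stays in 16 bits
x_q, x_scale = quantize_int4(x)

# Fused kernel 2: 4-Bit Compute + Up Projection share a single write of the output.
y = (x_q @ w_q) * (x_scale * w_scale)            # INT4 matmul, simulated in float
y += x_low @ L2                                  # low-rank branch added before the store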

Performance

SVDQuant reduces the model size of the 12B FLUX.1 by 3.6×. On top of that, Nunchaku cuts the memory usage of the 16-bit model by 3.5× and delivers a 3.0× speedup over the NF4 W4A16 baseline on both desktop and laptop NVIDIA RTX 4090 GPUs. Remarkably, on the laptop 4090, it achieves a 10.1× total speedup by eliminating CPU offloading.

Integrate with LoRA

SVDQuant seamlessly integrates with off-the-shelf LoRAs without requiring re-quantization. When applying LoRAs, it matches the image quality of the original 16-bit FLUX.1-dev.
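
One way to see why no re-quantization is needed: the packed 4-bit residual never changes, and a LoRA update ΔW = A·B can be folded into the existing 16-bit low-rank branch by concatenating along the rank dimension. The NumPy sketch below is our illustration (ignoring the LoRA scaling factor and the smoothing transform for brevity), checking the identity X·(L1·L2 + A·B) = X·L1_fused·L2_fused.

import numpy as np

d_in, d_out, rank, lora_rank = 3072, 3072, 32, 16
rng = np.random.default_rng(0)
L1, L2 = rng.standard_normal((d_in, rank)), rng.standard_normal((rank, d_out))          # SVD factors
A, B = rng.standard_normal((d_in, lora_rank)), rng.standard_normal((lora_rank, d_out))  # LoRA pair

# Concatenating along the rank dimension computes X (L1 L2 + A B) in one pair of
# projections, so the packed INT4 residual weights are never touched.
L1_fused = np.concatenate([L1, A], axis=1)        # (d_in, rank + lora_rank)
L2_fused = np.concatenate([L2, B], axis=0)        # (rank + lora_rank, d_out)

X = rng.standard_normal((4, d_in))
assert np.allclose(X @ L1_fused @ L2_fused, X @ L1 @ L2 + X @ A @ B)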

Citation

@article{li2024svdquant,
  title={SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models},
  author={Li*, Muyang and Lin*, Yujun and Zhang*, Zhekai and Cai, Tianle and Li, Xiuyu and Guo, Junxian and Xie, Enze and Meng, Chenlin and Zhu, Jun-Yan and Han, Song},
  journal={arXiv preprint arXiv:2411.05007},
  year={2024}
}

Acknowledgment

We thank MIT-IBM Watson AI Lab, MIT and Amazon Science Hub, MIT AI Hardware Program, National Science Foundation, Packard Foundation, Dell, LG, Hyundai, and Samsung for supporting this research. We thank NVIDIA for donating the DGX server.
