A new post-training quantization paradigm for diffusion models that quantizes both the weights and activations of FLUX.1 to 4 bits, achieving a 3.5× memory and 8.7× latency reduction on a 16GB laptop 4090 GPU.
Check out our interactive demo at https://mv-ezproxy-com.ezproxy.canberra.edu.au! Our quantization library is at github.com/mit-han-lab/deepcompressor, and our inference engine is at github.com/mit-han-lab/nunchaku. Our paper is at this link.
SVDQuant is a post-training quantization technique for 4-bit weights and activations that maintains visual fidelity well. On the 12B FLUX.1-dev, it achieves a 3.6× memory reduction compared to the BF16 model. By eliminating CPU offloading, it offers an 8.7× speedup over the 16-bit model on a 16GB laptop 4090 GPU, running 3× faster than the NF4 W4A16 baseline. On PixArt-Σ, it demonstrates significantly superior visual quality over other W4A4 and even W4A8 baselines.
Diffusion models are revolutionizing AI with their ability to generate high-quality images from text prompts. To improve image quality and strengthen the alignment between text and image, researchers keep scaling up these models. As shown in the right figure, while Stable Diffusion 1.4 has 800 million parameters, newer models like AuraFlow and FLUX.1 reach billions of parameters, delivering more refined and detailed outputs. However, scaling brings challenges: these models become computationally heavy, demanding large amounts of memory and longer processing times, which makes them prohibitive for real-time applications.
As Moore's law slows down, hardware vendors are turning to low-precision inference, such as NVIDIA's new 4-bit floating-point (FP4) precision in Blackwell. In large language models (LLMs), quantization has helped shrink model sizes and speed up inference, primarily by reducing the latency of loading model weights. Diffusion models, however, are computationally bound even at a batch size of one, so quantizing weights alone yields limited gains. To achieve measurable speedups, both weights and activations must be quantized to the same bit width; otherwise, the lower-precision operand is upcast during computation, negating any performance benefit.
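To make the W4A16-versus-W4A4 distinction concrete, here is a minimal PyTorch sketch of symmetric 4-bit fake quantization applied to both operands of a linear layer. The names and shapes are illustrative, not the deepcompressor API: in a weight-only (W4A16) scheme, only the weight would be quantized and then upcast back to 16 bits before the matmul, so the computation itself never gets cheaper.

```python
import torch

def quantize_4bit(x: torch.Tensor):
    """Symmetric per-tensor quantization to the signed 4-bit range [-8, 7]."""
    scale = x.abs().max() / 7.0
    q = torch.clamp(torch.round(x / scale), min=-8, max=7)
    return q, scale

def w4a4_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Quantize BOTH operands, so the matmul itself could run on 4-bit hardware.
    xq, sx = quantize_4bit(x)
    wq, sw = quantize_4bit(w)
    # Emulated on standard tensors here; real INT4/FP4 kernels keep the
    # inner product in low precision and rescale the accumulator afterwards.
    return (xq @ wq.T) * (sx * sw)

x = torch.randn(16, 4096)      # activations for a small batch of tokens
w = torch.randn(3072, 4096)    # linear-layer weight, shape (out, in)
y = w4a4_linear(x, w)          # shape (16, 3072)
```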
In this blog, we introduce SVDQuant, which quantizes both the weights and activations of diffusion models to 4 bits. At such an aggressive level, conventional post-training quantization methods fall short. Unlike smoothing, which merely redistributes outliers, SVDQuant absorbs them with a high-precision low-rank branch, largely preserving image quality. See the figure above for visual examples.
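Conceptually, SVDQuant splits each weight matrix into a 16-bit low-rank branch that absorbs the outliers and a residual that is much easier to quantize to 4 bits, i.e., W ≈ L1·L2 + Q(R). The sketch below shows this decomposition only; it omits the smoothing step that first migrates activation outliers into the weights, and the function names are illustrative rather than the deepcompressor API.

```python
import torch

def quantize_4bit(x):
    scale = x.abs().max() / 7.0
    return torch.clamp(torch.round(x / scale), -8, 7), scale

def svdquant_decompose(w: torch.Tensor, rank: int = 32):
    # The top-`rank` singular components form the high-precision low-rank branch.
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    l1 = u[:, :rank] * s[:rank]        # (out, rank), kept in 16 bits
    l2 = vh[:rank, :]                  # (rank, in),  kept in 16 bits
    residual = w - l1 @ l2             # dominant outliers removed -> easier to quantize
    rq, r_scale = quantize_4bit(residual)
    return l1, l2, rq, r_scale

def forward(x, l1, l2, rq, r_scale):
    # 16-bit low-rank branch plus the 4-bit residual branch (dequantized for emulation).
    return x @ (l1 @ l2).T + x @ (rq * r_scale).T
```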
Although the low-rank branch adds only minor computational cost on paper, running it separately incurs significant latency overhead, about 50% of the 4-bit branch's latency, as shown in figure (a). This is because, even though a small rank keeps the computation cheap, the input and output activations are the same size as in the main branch, shifting the bottleneck from computation to memory access.
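A back-of-envelope calculation (with hypothetical layer sizes) makes the bottleneck clear: the low-rank branch adds under 2% extra arithmetic, yet it still reads the full-size input activation and writes a full-size output, so its runtime is governed by memory traffic rather than FLOPs.

```python
batch_tokens, in_ch, out_ch, rank = 4096, 4096, 4096, 32    # hypothetical sizes

main_flops    = 2 * batch_tokens * in_ch * out_ch             # 4-bit main branch
lowrank_flops = 2 * batch_tokens * rank * (in_ch + out_ch)    # down proj + up proj
print(f"extra compute: {lowrank_flops / main_flops:.1%}")     # ~1.6%

bytes_per_act = 2                                             # 16-bit activations
act_traffic = batch_tokens * (in_ch + out_ch) * bytes_per_act # read x, write y
print(f"activation traffic: {act_traffic / 2**20:.0f} MiB")   # same as the main branch
```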
To address this, we co-designed our inference engine, Nunchaku, with the SVDQuant algorithm. Specifically, we noted that the down projection in the low-rank branch uses the same input as the quantization kernel in the low-bit branch, and the up projection shares the same output as the 4-bit computation kernel, as shown in figure (b). By fusing the down projection with the quantization kernel and the up projection with the 4-bit computation kernel, the low-rank branch can now share activations with the low-bit branch. This eliminates extra memory access and cuts the number of kernel calls in half. As a result, the low-rank branch now adds only 5–10% additional latency, making its cost almost negligible.
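The sketch below is a schematic PyTorch emulation of this fusion, using illustrative names rather than Nunchaku's actual CUDA kernels; the point is how many kernel launches there are and how often the full-size activations cross DRAM.

```python
import torch

def quantize_4bit(x):
    scale = x.abs().max() / 7.0
    return torch.clamp(torch.round(x / scale), -8, 7), scale

def forward_unfused(x, wq, w_scale, l1, l2):
    """Four separate kernels: x is read twice and y is written twice."""
    xq, x_scale = quantize_4bit(x)           # kernel 1: quantize, reads x
    y = (xq @ wq.T) * (x_scale * w_scale)    # kernel 2: 4-bit GEMM (emulated)
    h = x @ l2.T                             # kernel 3: down proj, reads x again
    return y + h @ l1.T                      # kernel 4: up proj, writes y again

def forward_fused(x, wq, w_scale, l1, l2):
    """Two fused kernels: the down projection shares x with the quantization
    step, and the up projection shares its output with the 4-bit GEMM."""
    xq, x_scale = quantize_4bit(x)           # fused kernel A:
    h = x @ l2.T                             #   x is loaded from DRAM once
    y = (xq @ wq.T) * (x_scale * w_scale)    # fused kernel B:
    return y + h @ l1.T                      #   y is written to DRAM once
```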
SVDQuant reduces the model size of the 12B FLUX.1 by 3.6×. On top of that, Nunchaku cuts memory usage relative to the 16-bit model by 3.5× and delivers 3.0× speedups over the NF4 W4A16 baseline on both desktop and laptop NVIDIA RTX 4090 GPUs. Remarkably, on the laptop 4090, it achieves a total 10.1× speedup by eliminating CPU offloading.
On FLUX.1 models, our 4-bit models outperform the NF4 W4A16 baselines, demonstrating superior text alignment and closer similarity to the 16-bit models. For instance, NF4 misinterprets "dinosaur style," generating a real dinosaur. On PixArt-Σ and SDXL-Turbo, our 4-bit results show noticeably better visual quality than ViDiT-Q's and MixDQ's W4A8 results.
Traditional quantization methods require fusing the LoRA branch into the main weights and then requantizing the model whenever a LoRA is integrated. SVDQuant, however, avoids redundant memory access, making it possible to attach a separate LoRA branch directly without requantization. The figure above shows visual examples of our INT4 FLUX.1-dev model with LoRAs applied in five distinct styles: Realism, Ghibsky Illustration, Anime, Children Sketch, and Yarn Art. Our INT4 model adapts seamlessly to each style while maintaining the image quality of the original 16-bit version.
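Concretely, this works because a LoRA update B·A is itself low-rank, so it can ride along the existing 16-bit low-rank branch while the 4-bit residual stays untouched. Below is a minimal sketch of this idea with illustrative names, not the Nunchaku API.

```python
import torch

def attach_lora(l1, l2, lora_b, lora_a):
    """l1: (out, r), l2: (r, in); lora_b: (out, r_lora), lora_a: (r_lora, in)."""
    l1_new = torch.cat([l1, lora_b], dim=1)   # (out, r + r_lora)
    l2_new = torch.cat([l2, lora_a], dim=0)   # (r + r_lora, in)
    # l1_new @ l2_new == l1 @ l2 + lora_b @ lora_a, so the adapted weight is
    # W + ΔW ≈ l1_new @ l2_new + (4-bit residual, unchanged -> no requantization).
    return l1_new, l2_new
```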