A new post-training quantization paradigm for diffusion models that quantizes both the weights and activations of FLUX.1 to 4 bits, achieving a 3.5× memory and 8.7× latency reduction on a 16GB laptop 4090 GPU.
Check out our interactive demo at https://mv-ezproxy-com.ezproxy.canberra.edu.au! Our quantization library is at github.com/mit-han-lab/deepcompressor, and our inference engine is at github.com/mit-han-lab/nunchaku. Our paper is at this link.
SVDQuant is a post-training quantization technique for 4-bit weights and activations that maintains visual fidelity well. On the 12B FLUX.1-dev, it achieves a 3.6× memory reduction compared to the BF16 model. By eliminating CPU offloading, it offers an 8.7× speedup over the 16-bit model on a 16GB laptop 4090 GPU, running 3× faster than the NF4 W4A16 baseline. On PixArt-Σ, it demonstrates significantly superior visual quality over other W4A4 and even W4A8 baselines.
Diffusion models are revolutionizing AI with their ability to generate high-quality images from text prompts. To improve image quality and strengthen the alignment between text and image, researchers keep scaling up these models. As shown in the right figure, while Stable Diffusion 1.4 has 800 million parameters, newer models like AuraFlow and FLUX.1 reach billions of parameters, delivering more refined and detailed outputs. However, scaling brings challenges: these models become computationally heavy, demanding large amounts of memory and longer processing times, which makes them prohibitive for real-time applications.
As Moore's law slows down, hardware vendors are turning to low-precision inference, such as NVIDIA's new 4-bit floating-point (FP4) precision in Blackwell. In large language models (LLMs), quantization has helped shrink model sizes and speed up inference, primarily by reducing the latency of loading model weights. Diffusion models, however, are computationally bound even at a batch size of one, so quantizing weights alone yields limited gains. To achieve measurable speedups, both weights and activations must be quantized to the same bit width; otherwise, the lower-precision operand is upcast during computation, negating any performance benefit.
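To make the W4A16-versus-W4A4 distinction concrete, here is a minimal PyTorch sketch of symmetric 4-bit fake quantization applied to both operands of a linear layer. The names and shapes are illustrative, not the deepcompressor API: in a weight-only (W4A16) scheme, only the weight would be quantized and then upcast back to 16 bits before the matmul, so the computation itself never gets cheaper.

```python
import torch

def quantize_4bit(x: torch.Tensor):
    """Symmetric per-tensor quantization to the signed 4-bit range [-8, 7]."""
    scale = x.abs().max() / 7.0
    q = torch.clamp(torch.round(x / scale), min=-8, max=7)
    return q, scale

def w4a4_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Quantize BOTH operands, so the matmul itself could run on 4-bit hardware.
    xq, sx = quantize_4bit(x)
    wq, sw = quantize_4bit(w)
    # Emulated on standard tensors here; real INT4/FP4 kernels keep the
    # inner product in low precision and rescale the accumulator afterwards.
    return (xq @ wq.T) * (sx * sw)

x = torch.randn(16, 4096)      # activations for a small batch of tokens
w = torch.randn(3072, 4096)    # linear-layer weight, shape (out, in)
y = w4a4_linear(x, w)          # shape (16, 3072)
```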
In this blog, we introduce SVDQuant, which quantizes both the weights and activations of diffusion models to 4 bits. At such an aggressive level, conventional post-training quantization methods fall short. Unlike smoothing, which merely redistributes outliers, SVDQuant absorbs them with a high-precision low-rank branch, largely preserving image quality. See the figure above for visual examples.
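Conceptually, SVDQuant splits each weight matrix into a 16-bit low-rank branch that absorbs the outliers and a residual that is much easier to quantize to 4 bits, i.e., W ≈ L1·L2 + Q(R). The sketch below shows this decomposition only; it omits the smoothing step that first migrates activation outliers into the weights, and the function names are illustrative rather than the deepcompressor API.

```python
import torch

def quantize_4bit(x):
    scale = x.abs().max() / 7.0
    return torch.clamp(torch.round(x / scale), -8, 7), scale

def svdquant_decompose(w: torch.Tensor, rank: int = 32):
    # The top-`rank` singular components form the high-precision low-rank branch.
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    l1 = u[:, :rank] * s[:rank]        # (out, rank), kept in 16 bits
    l2 = vh[:rank, :]                  # (rank, in),  kept in 16 bits
    residual = w - l1 @ l2             # dominant outliers removed -> easier to quantize
    rq, r_scale = quantize_4bit(residual)
    return l1, l2, rq, r_scale

def forward(x, l1, l2, rq, r_scale):
    # 16-bit low-rank branch plus the 4-bit residual branch (dequantized for emulation).
    return x @ (l1 @ l2).T + x @ (rq * r_scale).T
```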
Although the low-rank branch adds only minor computational cost on paper, running it separately incurs significant latency overhead, about 50% of the 4-bit branch's latency, as shown in figure (a). This is because, even though a small rank keeps the computation cheap, the input and output activations are the same size as in the main branch, shifting the bottleneck from computation to memory access.
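A back-of-envelope calculation (with hypothetical layer sizes) makes the bottleneck clear: the low-rank branch adds under 2% extra arithmetic, yet it still reads the full-size input activation and writes a full-size output, so its runtime is governed by memory traffic rather than FLOPs.

```python
batch_tokens, in_ch, out_ch, rank = 4096, 4096, 4096, 32    # hypothetical sizes

main_flops    = 2 * batch_tokens * in_ch * out_ch             # 4-bit main branch
lowrank_flops = 2 * batch_tokens * rank * (in_ch + out_ch)    # down proj + up proj
print(f"extra compute: {lowrank_flops / main_flops:.1%}")     # ~1.6%

bytes_per_act = 2                                             # 16-bit activations
act_traffic = batch_tokens * (in_ch + out_ch) * bytes_per_act # read x, write y
print(f"activation traffic: {act_traffic / 2**20:.0f} MiB")   # same as the main branch
```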
To address this, we co-designed our inference engine, Nunchaku, with the SVDQuant algorithm. Specifically, we noted that the down projection in the low-rank branch uses the same input as the quantization kernel in the low-bit branch, and the up projection shares the same output as the 4-bit computation kernel, as shown in figure (b). By fusing the down projection with the quantization kernel and the up projection with the 4-bit computation kernel, the low-rank branch can now share activations with the low-bit branch. This eliminates extra memory access and cuts the number of kernel calls in half. As a result, the low-rank branch now adds only 5–10% additional latency, making its cost almost negligible.
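The sketch below is a schematic PyTorch emulation of this fusion, using illustrative names rather than Nunchaku's actual CUDA kernels; the point is how many kernel launches there are and how often the full-size activations cross DRAM.

```python
import torch

def quantize_4bit(x):
    scale = x.abs().max() / 7.0
    return torch.clamp(torch.round(x / scale), -8, 7), scale

def forward_unfused(x, wq, w_scale, l1, l2):
    """Four separate kernels: x is read twice and y is written twice."""
    xq, x_scale = quantize_4bit(x)           # kernel 1: quantize, reads x
    y = (xq @ wq.T) * (x_scale * w_scale)    # kernel 2: 4-bit GEMM (emulated)
    h = x @ l2.T                             # kernel 3: down proj, reads x again
    return y + h @ l1.T                      # kernel 4: up proj, writes y again

def forward_fused(x, wq, w_scale, l1, l2):
    """Two fused kernels: the down projection shares x with the quantization
    step, and the up projection shares its output with the 4-bit GEMM."""
    xq, x_scale = quantize_4bit(x)           # fused kernel A:
    h = x @ l2.T                             #   x is loaded from DRAM once
    y = (xq @ wq.T) * (x_scale * w_scale)    # fused kernel B:
    return y + h @ l1.T                      #   y is written to DRAM once
```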
SVDQuant reduces the model size of the 12B FLUX.1 by 3.6×. On top of that, Nunchaku cuts memory usage relative to the 16-bit model by 3.5× and delivers 3.0× speedups over the NF4 W4A16 baseline on both desktop and laptop NVIDIA RTX 4090 GPUs. Remarkably, on the laptop 4090, it achieves a total 10.1× speedup by eliminating CPU offloading.
On FLUX.1 models, our 4-bit models outperform the NF4 W4A16 baselines, demonstrating superior text alignment and closer similarity to the 16-bit models. For instance, NF4 misinterprets "dinosaur style," generating a real dinosaur. On PixArt-Σ and SDXL-Turbo, our 4-bit results show noticeably better visual quality than ViDiT-Q's and MixDQ's W4A8 results.
Traditional quantization methods require fusing the LoRA branch into the main weights and then requantizing the model whenever a LoRA is integrated. SVDQuant, however, avoids redundant memory access, making it possible to attach a separate LoRA branch directly without requantization. The figure above shows visual examples of our INT4 FLUX.1-dev model with LoRAs applied in five distinct styles: Realism, Ghibsky Illustration, Anime, Children Sketch, and Yarn Art. Our INT4 model adapts seamlessly to each style while maintaining the image quality of the original 16-bit version.
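Concretely, this works because a LoRA update B·A is itself low-rank, so it can ride along the existing 16-bit low-rank branch while the 4-bit residual stays untouched. Below is a minimal sketch of this idea with illustrative names, not the Nunchaku API.

```python
import torch

def attach_lora(l1, l2, lora_b, lora_a):
    """l1: (out, r), l2: (r, in); lora_b: (out, r_lora), lora_a: (r_lora, in)."""
    l1_new = torch.cat([l1, lora_b], dim=1)   # (out, r + r_lora)
    l2_new = torch.cat([l2, lora_a], dim=0)   # (r + r_lora, in)
    # l1_new @ l2_new == l1 @ l2 + lora_b @ lora_a, so the adapted weight is
    # W + ΔW ≈ l1_new @ l2_new + (4-bit residual, unchanged -> no requantization).
    return l1_new, l2_new
```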