Efficient AI Computing,
Transforming the Future.

Blog Posts

SVDQuant: Accurate 4-Bit Quantization Powers 12B FLUX on a 16GB 4090 Laptop with 3x Speedup

Nov 7, 2024

A new post-training training quantization paradigm for diffusion models, which quantize both the weights and activations of FLUX.1 to 4 bits, achieving 3.5× memory and 8.7× latency reduction on a 16GB laptop 4090 GPU.

Read Article

Block Sparse Attention

Oct 10, 2024

We introduce Block Sparse Attention, a library of sparse attention kernels that supports various sparse patterns, including streaming attention with token granularity, streaming attention with block granularity, and block-sparse attention. By incorporating these patterns, Block Sparse Attention can significantly reduce the computational costs of LLMs, thereby enhancing their efficiency and scalability. We release the implementation of Block Sparse Attention, which is modified base on FlashAttention 2.4.2.

Read Article

Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D

Mar 10, 2024

In this blog, we introduce Patch Conv to reduce memory footprint when generating high-resolution images. PatchConv significantly cuts down the memory usage by over 2.4× compared to existing PyTorch implementation. Code: https://github.com/mit-han-lab/patch_conv

Read Article

TinyChat: Visual Language Models & Edge AI 2.0

Mar 3, 2024

Explore the latest advancement in TinyChat and AWQ – the integration of Visual Language Models (VLM) on the edge! The exciting advancements in VLM allows LLMs to comprehend visual inputs, enabling seamless image understanding tasks like caption generation, question answering, and more. With the latest release, TinyChat now supports leading VLMs such as VILA, which can be easily quantized with AWQ, empowering users with seamless experience for image understanding tasks.

Read Article

Efficient AI Computing,
Transforming the Future.

Blog Posts

SVDQuant: Accurate 4-Bit Quantization Powers 12B FLUX on a 16GB 4090 Laptop with 3x Speedup

Block Sparse Attention

Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D

TinyChat: Visual Language Models & Edge AI 2.0

Latest Posts

Categories

Topics

Techniques

Efficient AI Computing,Transforming the Future.

Blog Posts

SVDQuant: Accurate 4-Bit Quantization Powers 12B FLUX on a 16GB 4090 Laptop with 3x Speedup

Block Sparse Attention

Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D

TinyChat: Visual Language Models & Edge AI 2.0

Latest Posts

Categories

Topics

Techniques

Efficient AI Computing,
Transforming the Future.