Large Language Models (LLMs)

Projects

Efficient Streaming Language Models with Attention Sinks

ICLR 2024

We enable LLMs to work on infinite-length texts without compromising efficiency and performance.

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

MLSys 2024

Low-bit weight-only quantization for LLMs.
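To make "low-bit weight-only quantization" concrete, here is a minimal sketch of group-wise symmetric round-to-nearest quantization to 4 bits. It is an illustration only and does not implement AWQ's activation-aware scaling; the function names, the group size, and the symmetric scheme are assumptions chosen for the example.

```python
def quantize_groups(weights, n_bits=4, group_size=4):
    """Quantize a flat list of weights group by group (symmetric, round-to-nearest).

    Hypothetical helper for illustration; not the AWQ algorithm itself.
    """
    qmax = (1 << (n_bits - 1)) - 1  # 7 for 4-bit signed codes
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # One floating-point scale per group; fall back to 1.0 for an all-zero group.
        scale = max(abs(w) for w in group) / qmax or 1.0
        codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in group]
        groups.append((codes, scale))
    return groups


def dequantize_groups(groups):
    """Reconstruct approximate weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]


if __name__ == "__main__":
    weights = [0.02, -0.01, 0.03, 0.0, 1.5, -0.7, 0.9, 0.1]
    groups = quantize_groups(weights)
    print(groups)
    print(dequantize_groups(groups))
```

Only the weights are stored in low precision here; activations stay in floating point, which is what "weight-only" refers to. Giving each group its own scale keeps the small-magnitude weights in the first group from being swamped by the larger weights in the second.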

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

ICLR 2024

LongLoRA uses shifted sparse attention to greatly reduce the fine-tuning cost of long-context LLMs.

Tiny Machine Learning Projects

NeurIPS 2020/2021/2022, MICRO 2023, ICML 2023, MLSys 2024, IEEE CAS Magazine 2023

This TinyML project aims to enable efficient AI computing on the edge through new model compression techniques and high-performance system design.

PockEngine: Sparse and Efficient Fine-tuning in a Pocket

MICRO 2023

This project introduces PockEngine, a tiny, sparse, and efficient engine that enables fine-tuning on various edge devices. PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model, yielding measured memory savings and latency reductions while maintaining model quality.

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

ICML 2023

We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs.
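The core idea behind making activations easier to quantize can be sketched in a few lines: a per-channel factor is divided out of the activations and multiplied into the weights, so the layer output is unchanged while activation outliers shrink. The sketch below assumes the smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5; the function names and the toy matrices are hypothetical, and the INT8 quantization step itself is omitted.

```python
def smooth(X, W, alpha=0.5):
    """Migrate quantization difficulty from activations X to weights W.

    X is tokens x channels, W is channels x outputs (lists of lists).
    Hypothetical helper for illustration only.
    """
    n_ch = len(W)
    act_max = [max(abs(row[j]) for row in X) for j in range(n_ch)]
    w_max = [max(abs(v) for v in W[j]) for j in range(n_ch)]
    # Per-channel smoothing factor; leave degenerate channels untouched.
    s = [
        (a ** alpha) / (w ** (1 - alpha)) if a > 0 and w > 0 else 1.0
        for a, w in zip(act_max, w_max)
    ]
    X_s = [[row[j] / s[j] for j in range(n_ch)] for row in X]
    W_s = [[v * s[j] for v in W[j]] for j in range(n_ch)]
    return X_s, W_s, s


def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


if __name__ == "__main__":
    X = [[1.0, 40.0], [-2.0, -60.0]]   # channel 1 carries activation outliers
    W = [[0.5, -0.25], [0.1, 0.2]]
    X_s, W_s, s = smooth(X, W)
    print(matmul(X, W))
    print(matmul(X_s, W_s))            # same product, up to rounding
```

Because the scaling is computed offline from calibration statistics and folded into the weights, no retraining is involved, which is consistent with the training-free claim above.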

Blog Posts

Block Sparse Attention

October 10, 2024

We introduce Block Sparse Attention, a library of sparse attention kernels that supports various sparse patterns, including streaming attention with token granularity, streaming attention with block granularity, and block-sparse attention. By incorporating these patterns, Block Sparse Attention can significantly reduce the computational cost of LLMs, improving their efficiency and scalability. We release the implementation of Block Sparse Attention, which is a modified version of FlashAttention 2.4.2.
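As a rough illustration of the streaming pattern, the sketch below builds a boolean causal mask in which each query attends only to a few initial "sink" positions plus a local window of recent positions. This is an assumed reading of the pattern written in plain Python, not the library's API or kernels; the function name and parameters are hypothetical. Passing token counts gives token granularity, and passing block counts gives the same pattern at block granularity.

```python
def streaming_mask(seq_len, n_sink, n_local):
    """Return mask[q][k] = True where query position q may attend to key position k.

    Keeps causal positions that are either among the first n_sink positions
    or within the last n_local positions up to and including q.
    Hypothetical helper for illustration only.
    """
    return [
        [k <= q and (k < n_sink or q - k < n_local) for k in range(seq_len)]
        for q in range(seq_len)
    ]


if __name__ == "__main__":
    mask = streaming_mask(seq_len=8, n_sink=1, n_local=3)
    for row in mask:
        print("".join("x" if keep else "." for keep in row))
```

Each row keeps at most n_sink + n_local entries regardless of sequence length, which is where the reduction in computation comes from.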

TinyChat: Visual Language Models & Edge AI 2.0

March 3, 2024

Explore the latest advancement in TinyChat and AWQ: the integration of Visual Language Models (VLMs) on the edge. Recent advances in VLMs allow LLMs to comprehend visual inputs, enabling image understanding tasks such as caption generation and question answering. With the latest release, TinyChat now supports leading VLMs such as VILA, which can be easily quantized with AWQ, giving users a seamless experience for image understanding tasks.

TinyChat: Large Language Model on the Edge

September 6, 2023

Running large language models (LLMs) on the edge is of great importance. In this blog post, we introduce TinyChat, an efficient and lightweight system for LLM deployment on the edge. It runs Meta's latest LLaMA-2 model at 30 tokens/second on NVIDIA Jetson Orin and can easily support different models and hardware.