EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han
MIT, Zhejiang University, Tsinghua University, MIT-IBM Watson AI Lab

Abstract

High-resolution dense prediction enables many appealing real-world applications, such as computational photography, autonomous driving, etc. However, the vast computational cost makes deploying state-of-the-art high-resolution dense prediction models on hardware devices difficult. This work presents EfficientViT, a new family of high-resolution vision models with novel lightweight multi-scale attention. Unlike prior high-resolution dense prediction models that rely on heavy self-attention, hardware-inefficient large-kernel convolution, or complicated topology structures to obtain good performance, our lightweight multi-scale attention achieves a global receptive field and multi-scale learning (two critical features for high-resolution dense prediction) with only lightweight and hardware-efficient operations. As such, EfficientViT delivers remarkable performance gains over previous state-of-the-art high-resolution dense prediction models with significant speedup on diverse hardware platforms, including mobile CPU, edge GPU, and cloud GPU. Without performance loss on Cityscapes, our EfficientViT provides up to 8.8x and 3.8x GPU latency reduction over SegFormer and SegNeXt, respectively. For super-resolution, EfficientViT provides up to 6.4x speedup over Restormer while providing a 0.11dB gain in PSNR.

News

If you are interested in getting updates, please join our mailing list here.

About EfficientViT Models

EfficientViT is a new family of ViT models for efficient high-resolution dense prediction vision tasks. The core building block of EfficientViT is a lightweight, multi-scale linear attention module that achieves a global receptive field and multi-scale learning with only hardware-efficient operations, making EfficientViT TensorRT-friendly and suitable for GPU deployment.

Challenge: Apply Transformer to High-Resolution Images

  • High resolution is essential for achieving good performance in dense prediction tasks.
  • The Transformer’s computational cost grows quadratically as the input resolution increases (see the note below).
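
For concreteness, the softmax attention behind this quadratic growth can be written (the standard formulation, not anything specific to this work) as

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,
\qquad Q, K, V \in \mathbb{R}^{N \times d},
\]

where N = HW is the number of tokens. Materializing the N x N matrix QK^T costs O(N^2 d), so doubling both the height and width of the input quadruples N and raises the attention cost by roughly 16x.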

EfficientViT: Lightweight Vision Transformer for High-Resolution Vision

  • Break the efficiency bottleneck without losing the global receptive field by replacing heavy softmax attention with ReLU linear attention.
  • Address ReLU linear attention’s limitation in local feature extraction by enhancing it with convolution.
  • Further enhance linear attention with multi-scale aggregation (see the sketch after this list).
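
Below is a minimal PyTorch sketch of these three ideas. The function and module names, kernel sizes, and the way the multi-scale branches are combined are illustrative assumptions for exposition, not the official EfficientViT implementation:

# Minimal sketch of ReLU linear attention with convolutional multi-scale
# aggregation. Names and hyperparameters are illustrative assumptions,
# not the official EfficientViT code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def relu_linear_attention(q, k, v, eps=1e-6):
    """ReLU linear attention over (B, N, d) tensors.

    Replacing softmax(QK^T) with ReLU(Q) @ ReLU(K)^T lets us exploit
    associativity: Q @ (K^T V) costs O(N * d^2) instead of O(N^2 * d),
    while keeping a global receptive field.
    """
    q, k = F.relu(q), F.relu(k)
    kv = torch.einsum("bnd,bne->bde", k, v)          # K^T V: (B, d, d)
    out = torch.einsum("bnd,bde->bne", q, kv)        # Q (K^T V): (B, N, d)
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1))  # row-wise normalizer
    return out / (z.unsqueeze(-1) + eps)


class MultiScaleLinearAttention(nn.Module):
    """Illustrative multi-scale variant: depthwise convolutions aggregate
    nearby Q/K/V tokens into multi-scale tokens (restoring the local
    feature extraction that ReLU linear attention lacks), ReLU linear
    attention runs on each scale, and a 1x1 convolution fuses the branches."""

    def __init__(self, dim, scales=(3, 5)):
        super().__init__()
        self.qkv = nn.Conv2d(dim, 3 * dim, kernel_size=1, bias=False)
        # one depthwise conv per extra scale, applied to Q, K, and V jointly
        self.aggreg = nn.ModuleList(
            nn.Conv2d(3 * dim, 3 * dim, k, padding=k // 2, groups=3 * dim, bias=False)
            for k in scales
        )
        self.proj = nn.Conv2d(dim * (1 + len(scales)), dim, kernel_size=1, bias=False)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        qkv = self.qkv(x)                       # (B, 3C, H, W)
        outs = []
        for branch in [lambda t: t] + list(self.aggreg):
            q, k, v = branch(qkv).chunk(3, dim=1)
            # flatten spatial dims into token sequences of length N = H * W
            q, k, v = (t.flatten(2).transpose(1, 2) for t in (q, k, v))
            o = relu_linear_attention(q, k, v)  # (B, N, C)
            outs.append(o.transpose(1, 2).reshape(b, c, h, w))
        return self.proj(torch.cat(outs, dim=1))

The key design choice is associativity: the N x N similarity matrix ReLU(Q) ReLU(K)^T is never materialized, so the cost stays linear in the number of tokens, while the depthwise convolutions reintroduce local, multi-scale context at negligible extra cost.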

EfficientViT Applications

  • Segment Anything
  • Image Classification
  • Semantic Segmentation
  • Video

Citation

@article{cai2022efficientvit,
  title={Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition},
  author={Cai, Han and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2205.14756},
  year={2022}
}

Acknowledgment

We thank MIT-IBM Watson AI Lab, MIT AI Hardware Program, Amazon and MIT Science Hub, Qualcomm Innovation Fellowship, and the National Science Foundation for supporting this research.
