Sparse Refinement for Efficient High-Resolution Semantic Segmentation

Zhijian Liu*, Zhuoyang Zhang*, Samir Khaki, Shang Yang, Haotian Tang, Chenfeng Xu, Kurt Keutzer, Song Han
MIT, NVIDIA, Tsinghua University, University of Toronto, UC Berkeley
(* indicates equal contribution)

News

Waiting for more news.

Awards

No items found.

Competition Awards

No items found.

Abstract

Semantic segmentation empowers numerous real-world applications, such as autonomous driving and augmented/mixed reality. These applications often operate on high-resolution images (e.g., 8 megapixels) to capture the fine details. However, this comes at the cost of considerable computational complexity, hindering the deployment in latency-sensitive scenarios. In this paper, we introduce SparseRefine, a novel approach that enhances dense low-resolution predictions with sparse high-resolution refinements. Based on coarse low-resolution outputs, SparseRefine first uses an entropy selector to identify a sparse set of pixels with high entropy. It then employs a sparse feature extractor to efficiently generate the refinements for those pixels of interest. Finally, it leverages a gated ensembler to apply these sparse refinements to the initial coarse predictions. SparseRefine can be seamlessly integrated into any existing semantic segmentation model, regardless of CNN- or ViT-based. SparseRefine achieves significant speedup: 1.5 to 3.7 times when applied to HRNet-W48, SegFormer-B5, Mask2Former-T/L and SegNeXt-L on Cityscapes, with negligible to no loss of accuracy. Our "dense+sparse'' paradigm paves the way for efficient high-resolution visual computing.

Method

SparseRefine improves initial dense low-resolution predictions with sparse high-resolution refinements. It first performs the dense low-resolution inference on the downsampled image to obtain the initial prediction. Subsequently, it uses an entropy selector to identify a sparse set of pixels with high entropy, and then employs a sparse feature extractor to efficiently generate refinements for those selected pixels. Afterwards, it applies these sparse refinements to the initial predictions with a gated ensembler.

Result

SparseRefine effectively closes the accuracy gap between low-resolution and high-resolution predictions, achieving a remarkable reduction in computational cost by 1.4 to 3.1 times and inference latency by 1.5 to 3.7 times. In this table, (D) and (S) denote dense and sparse inputs, respectively.

Demo

SparseRefine improves the low-resolution (LR) baseline with substantially better recognition of small, distant objects and finer detail around object boundaries.

Video

Citation

@article{liu2024sparse,
 title={Sparse Refinement for Efficient High-Resolution Semantic Segmentation},
 author={Liu, Zhijian and Zhang, Zhuoyang and Khaki, Samir and Yang, Shang and Tang, Haotian and Xu, Chenfeng and Keutzer, Kurt and Han, Song},
 journal={arXiv preprint arXiv:2407.19014},
 year={2024}
}

Media

No media articles found.

Acknowledgment

This work was supported by MIT-IBM Watson AI Lab, MIT AI Hardware Program, and National Science Foundation.

Team Members