We introduce the Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images, rivaling diffusion models in image generation quality. Existing AR models face limitations due to the poor image reconstruction quality of their discrete tokenizers and the prohibitive training costs associated with generating 1024px images. To address these challenges, we present the hybrid tokenizer, which decomposes the continuous latents from the autoencoder into two components: discrete tokens that capture the big picture and continuous tokens that capture the residual details the discrete tokens cannot represent. The discrete component is modeled by a scalable-resolution discrete AR model, while the continuous component is learned with a lightweight residual diffusion module with only 37M parameters. Compared with the discrete-only VAR tokenizer, our hybrid approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38. HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs.
Online demo: https://hart.mit.edu
HART is among the first autoregressive models to directly generate 1024x1024 images with quality comparable to diffusion models, while offering significantly improved efficiency. It achieves 4.5-7.7x higher throughput, 3.1-5.9x lower latency (measured on A100), and 6.9-13.4x lower MACs compared to state-of-the-art diffusion models.
Currently, AR models for visual generation lag behind diffusion models in two key aspects:
1. Discrete tokenizers in AR models exhibit significantly poorer reconstruction capabilities compared to the continuous tokenizers used by diffusion models. Consequently, AR models have a lower generation upper bound and struggle to accurately model fine image details.
2. Diffusion models excel in high-resolution image synthesis, but no existing AR model can directly and efficiently generate 1024x1024 images.
To address challenge 1, we designed a hybrid tokenizer to improve the generation upper bound of autoregressive (AR) models. As illustrated in the figure below, the discrete tokenizer used by VAR struggles to reconstruct facial details, resulting in a lower generation quality ceiling.
To solve this, we train a hybrid tokenizer capable of decoding both continuous and discrete tokens. The continuous tokens are decomposed into discrete tokens and residual tokens that cannot be represented by the discrete codebook. The discrete tokens capture the big picture, while the residual tokens model fine details.
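To make the decomposition concrete, below is a minimal PyTorch sketch of the idea (not HART's released code): a single-stage nearest-neighbor quantizer splits each continuous latent into its closest codebook entry and the leftover residual. The function name, shapes, and single-stage quantization are illustrative simplifications; the actual tokenizer is more involved (e.g., multi-scale quantization in the style of VAR).

    import torch

    def hybrid_decompose(z_cont: torch.Tensor, codebook: torch.Tensor):
        """Illustrative split of continuous latents into discrete + residual parts.

        z_cont:   (N, D) continuous latents from the autoencoder.
        codebook: (K, D) discrete code vectors.
        """
        # Nearest-neighbor quantization: pick the closest codebook entry per latent.
        dists = torch.cdist(z_cont, codebook)   # (N, K) pairwise distances
        idx = dists.argmin(dim=-1)              # (N,)  discrete token ids
        z_disc = codebook[idx]                  # quantized "big picture" component
        z_res = z_cont - z_disc                 # residual "fine detail" component
        return idx, z_disc, z_res

    # The shared decoder is trained to reconstruct from z_disc alone (discrete path)
    # and from z_disc + z_res (continuous path), so a single tokenizer can decode
    # both discrete and continuous tokens.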
Building upon the strong generation upper bound provided by our hybrid tokenizer, we introduce a hybrid transformer capable of learning both discrete and residual tokens to solve challenge 2. The architecture incorporates two key components: a scalable-resolution AR transformer, in which all position embeddings are relative, to model the discrete tokens, and a lightweight residual diffusion module to learn the residual tokens. The latter employs only 37M parameters and requires just 8 sampling steps, enhancing efficiency without compromising performance.
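As a rough illustration of the second component, the sketch below shows how a lightweight diffusion head could sample residual tokens in 8 denoising steps, conditioned on a per-token feature vector (in HART, the hidden state produced by the AR transformer). The class name, MLP sizes, and the simple DDPM-style noise schedule are assumptions made for illustration, not the released implementation.

    import torch
    import torch.nn as nn

    class ResidualDiffusionHead(nn.Module):
        """Tiny denoiser: predicts noise from (noisy residual, timestep, condition)."""
        def __init__(self, dim: int, cond_dim: int, hidden: int = 512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim + cond_dim + 1, hidden), nn.SiLU(),
                nn.Linear(hidden, hidden), nn.SiLU(),
                nn.Linear(hidden, dim),
            )

        def forward(self, x_t, t, cond):
            t_emb = t.expand(x_t.shape[0], 1)   # broadcast scalar timestep per token
            return self.net(torch.cat([x_t, t_emb, cond], dim=-1))

    @torch.no_grad()
    def sample_residual(head, cond, dim, steps=8):
        """DDPM-style ancestral sampling with a short (8-step) schedule."""
        betas = torch.linspace(1e-4, 0.1, steps)
        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)
        x = torch.randn(cond.shape[0], dim)     # start residuals from pure noise
        for i in reversed(range(steps)):
            t = torch.full((1,), i / steps)
            eps = head(x, t, cond)              # predicted noise
            x = (x - betas[i] / (1 - alpha_bars[i]).sqrt() * eps) / alphas[i].sqrt()
            if i > 0:
                x = x + betas[i].sqrt() * torch.randn_like(x)
        return x                                # sampled residual tokens

At generation time, the AR transformer first predicts the discrete tokens; the diffusion head then samples the residuals conditioned on the transformer's hidden states, e.g. sample_residual(head, cond=hidden_states, dim=residual_dim), where the argument names are placeholders.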
HART achieves 9.3x higher throughput compared to SD3-medium (Esser et al., 2024) at 512x512 resolution. For 1024×1024 generation, HART achieves at least 3.1x lower latency than state-of-the-art diffusion models. Compared to the similarly sized PixArt-Σ (Chen et al., 2024a), our method achieves 3.6x lower latency and 5.6x higher throughput, which closely aligns with the theoretical 5.8x reduction in MACs. Compared to SDXL, HART not only achieves superior quality across all benchmarks, but also demonstrates 3.1x lower latency and 4.5x higher throughput.
@article{tang2024hart,
  title={HART: Efficient Visual Generation with Hybrid Autoregressive Transformer},
  author={Tang, Haotian and Wu, Yecheng and Yang, Shang and Xie, Enze and Chen, Junsong and Chen, Junyu and Zhang, Zhuoyang and Cai, Han and Lu, Yao and Han, Song},
  journal={arXiv preprint},
  year={2024}
}
We thank the MIT-IBM Watson AI Lab, MIT and Amazon Science Hub, MIT AI Hardware Program, and the National Science Foundation for supporting this research. We thank NVIDIA for donating the DGX server. We would also like to thank Tianhong Li from MIT, Lijun Yu from Google DeepMind, Kaiwen Zha from MIT, and Yunhao Fang from UCSD for helpful technical discussions, and Paul Palei, Mike Hobbs, Chris Hill, and Michel Erb from MIT for helping us set up the online demo and maintain the server.