
Block Sparse Attention

TL;DR

We introduce Block Sparse Attention, a library of sparse attention kernels that supports various sparse patterns, including streaming attention with token granularity, streaming attention with block granularity, and block-sparse attention. By incorporating these patterns, Block Sparse Attention can significantly reduce the computational costs of LLMs, thereby enhancing their efficiency and scalability. We release the implementation of Block Sparse Attention, which is modified based on FlashAttention 2.4.2.


Background

Large language models (LLMs) are revolutionizing AI applications, powering sophisticated tasks such as multi-turn dialogue, long-document summarization, and multimodal understanding involving images and videos. These tasks often require handling vast numbers of contextual tokens. For example, summarizing a long document such as the Harry Potter series may involve millions of tokens. In the domain of visual understanding, the challenge is even more pronounced: assuming a single 224×224 image corresponds to 256 tokens, a three-minute video at 24 FPS generates over 1 million tokens.
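As a quick sanity check of that estimate, under the stated assumptions of 256 tokens per 224×224 frame and 24 frames per second:

```python
tokens_per_frame = 256   # assumed tokens for one 224x224 image
fps = 24
duration_s = 3 * 60      # three minutes

total_tokens = tokens_per_frame * fps * duration_s
print(total_tokens)      # 1,105,920 -> over one million tokens
```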

Processing long-context inputs poses a significant challenge to the efficiency of inference systems. Traditional attention requires each token to attend to all previous tokens, resulting in linear growth in decoding latency and quadratic growth in pre-filling time as the sequence length expands. This makes long-sequence processing highly computationally expensive and bottlenecks overall inference speed, especially the Time-To-First-Token (TTFT).

To address this problem, FlashAttention, Big Bird, and ETC introduced the idea of computing attention with a block-sparse pattern, while Duo Attention and MInference 1.0 explore speeding up the computation with hybrid sparse patterns. By leveraging the inherent sparsity in attention patterns, we can greatly reduce the computation and memory costs of attention, effectively increasing LLM inference speed with long contexts. This approach not only enhances the efficiency of LLMs but also enables them to handle longer and more complex prompts without a proportional increase in resource consumption.

To this end, we introduce Block Sparse Attention, a library of sparse attention kernels developed on top of FlashAttention 2.4.2 that supports various sparse patterns, including streaming attention with token granularity, streaming attention with block granularity, and block-sparse attention. By incorporating these patterns, Block Sparse Attention significantly reduces the computational cost of attention, paving the way for more efficient long-context LLM/VLM serving.

Features

Block Sparse Attention supports four attention patterns:

  • dense attention: Calculate the full attention matrix.
  • streaming attention with token granularity: Calculate attention with a fixed number of sink tokens and local tokens. You can refer to StreamingLLM for more details.
  • streaming attention with block granularity (block_size = 128): Calculate attention with a fixed number of sink blocks and local blocks.
  • block-sparse attention (block_size = 128): Take in a block mask and calculate attention only over the unmasked blocks.

Importantly, we support assigning different patterns to different heads.
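To make these pattern definitions concrete, below is a minimal, non-fused PyTorch reference sketch of the block-level streaming mask and of block-sparse attention under an arbitrary block mask. The function and argument names are ours for illustration only; they are not the library's kernel API, which implements these patterns inside fused CUDA kernels and never materializes the skipped blocks.

```python
import torch

BLOCK_SIZE = 128  # block granularity used by the block-level patterns

def streaming_block_mask(seq_len, sink_blocks, local_blocks, block_size=BLOCK_SIZE):
    """Causal block mask keeping `sink_blocks` leading blocks plus
    `local_blocks` blocks up to and including the diagonal block."""
    n_blocks = (seq_len + block_size - 1) // block_size
    q_idx = torch.arange(n_blocks).unsqueeze(1)  # query block index
    k_idx = torch.arange(n_blocks).unsqueeze(0)  # key block index
    causal = k_idx <= q_idx
    sink = k_idx < sink_blocks
    local = (q_idx - k_idx) < local_blocks
    return causal & (sink | local)               # [n_blocks, n_blocks], bool

def block_sparse_attention_ref(q, k, v, block_mask, block_size=BLOCK_SIZE):
    """q, k, v: [seq_len, head_dim]; block_mask: [n_blocks, n_blocks] bool."""
    seq_len, head_dim = q.shape
    # Expand the block mask to a token-level mask, then re-apply causality
    # inside the diagonal blocks.
    token_mask = block_mask.repeat_interleave(block_size, 0) \
                           .repeat_interleave(block_size, 1)[:seq_len, :seq_len]
    token_mask = token_mask & torch.ones(seq_len, seq_len, dtype=torch.bool).tril()
    scores = (q @ k.T) / head_dim ** 0.5
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example: block-level streaming pattern with 1 sink block and 3 local blocks.
seq_len, head_dim = 1024, 128
q, k, v = (torch.randn(seq_len, head_dim) for _ in range(3))
mask = streaming_block_mask(seq_len, sink_blocks=1, local_blocks=3)
out = block_sparse_attention_ref(q, k, v, mask)  # [seq_len, head_dim]
```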

Performance

Block Sparse Speedup

The figures above illustrate the speedup of Block Sparse Attention over dense FlashAttention 2.4.2, measured on an A100 GPU with a head dimension of 128 and 32 attention heads.

Dense & Streaming Hybrid Speedup

Duo Attention introduces a hybrid mask scenario in which half of the attention heads use a dense mask and the other half use a streaming mask. This pattern has also been shown to be an accurate approach for LLM inference.

The graph above demonstrates the performance of our kernel for this specified workload. For token-level streaming masks, we allocate 64 sink tokens and 256 local tokens. For block-level streaming masks, we allocate 1 sink block and 3 local blocks, with each block consisting of 128 tokens. Speedup results were measured on an A100 GPU, using dense FlashAttention2 as the baseline, with a head dimension of 128, 32 attention heads, and a batch size of 1.
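As an illustration of how such a per-head hybrid configuration might be laid out, the sketch below builds a stack of block masks, one per head, with half of the heads dense and half using the block-level streaming setting above (1 sink block and 3 local blocks of 128 tokens). The tensor layout and names are assumptions for exposition, not the kernel's actual input format.

```python
import torch

n_heads, seq_len, block_size = 32, 4096, 128
n_blocks = seq_len // block_size

q_idx = torch.arange(n_blocks).unsqueeze(1)  # query block index
k_idx = torch.arange(n_blocks).unsqueeze(0)  # key block index

# Dense causal block mask (full lower triangle).
dense_mask = k_idx <= q_idx
# Block-level streaming mask: 1 sink block + 3 local blocks (incl. the diagonal).
streaming_mask = dense_mask & ((k_idx < 1) | ((q_idx - k_idx) < 3))

# Hypothetical per-head layout: heads 0-15 dense, heads 16-31 streaming.
head_block_masks = torch.stack(
    [dense_mask] * (n_heads // 2) + [streaming_mask] * (n_heads // 2)
)  # shape: [n_heads, n_blocks, n_blocks]
```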

You can refer to Duo Attention for more details.

Conclusion

In this blog, we present Block Sparse Attention, a library of sparse attention kernels designed to accelerate attention computation by incorporating sparsity. Traditional attention mechanisms, especially in large language models (LLMs), can be computationally expensive due to their quadratic complexity with respect to the sequence length. By leveraging block sparsity, our method effectively reduces the computational cost of attention operations, allowing for more efficient processing of long sequences.

Acknowledgement

- FlashAttention: the codebase we built upon. Thanks to the authors for their wonderful work; the design of block-sparse attention in FlashAttention v1.0 is very inspiring.

- FlashAttention, FlashAttention-2, Big Bird, ETC: where we got the idea of block-sparse attention and how it can be implemented.

- StreamingLLM: where we got the idea of streaming attention.

- Duo Attention, MInference 1.0: where we got the idea of hybrid masks.