Re-evaluating: The True Power of Flash Attention 2

This article revisits the performance difference of vllm when using xformers versus flash attention 2 as the backend attention mechanism.

Previously, I wrote an article comparing the performance of vllm with xformers and flash attention 2 as the backend attention mechanism. After a few days of reconsideration, something felt off, so I ran several more tests and found that flash attention 2 does indeed provide a significant boost. Even with a memory-bandwidth gap of over 60% between the two cards, the acceleration from flash attention 2 can pull the generation throughput level. This means the improvement over xformers is at least 40%, and in some scenarios reaches 70%. Hence, this article presents several conclusions without extensive analysis:

  1. For the 30 series, setting the vllm environment variable VLLM_ATTENTION_BACKEND=XFORMERS may not actually switch to xformers, since the measured speed is identical to fa2. Either way, there is a significant improvement compared to the 2080ti.
  2. On the 2080ti, the lmdeploy turbomind inference engine achieves the same performance boost as fa2.
  3. The lmdeploy results on the 2080ti differ from my previous tests, showing significant improvements for both awq and fp16.
  4. The 3070, however, behaves differently from the above, so the 30 series (and presumably the 40 and 50 series) should use vllm, which seems to have optimizations specific to the 30 series and later.
  5. Depending on the parameter ratios, fa2 provides an improvement of roughly 40% for fp16 and roughly 70% for awq quantization.
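
For reference, the backend switch mentioned in conclusion 1 is just an environment variable that must be set before the vllm process starts; a minimal launch sketch (the serve command varies by vllm version, and all other flags are omitted):

```shell
# Force the xformers backend; use FLASH_ATTN to select flash attention 2.
# The variable is read when vllm initializes its attention layers, so it
# must be exported before the server starts.
export VLLM_ATTENTION_BACKEND=XFORMERS
vllm serve Qwen/Qwen2.5-3B-Instruct-AWQ
```

On older vllm versions, `python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-3B-Instruct-AWQ` is the equivalent entry point.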

The hardware parameters and the single-request generation throughput extracted from the test results below (the throughput columns are tokens/s at concurrency 1, for the awq and fp16 tests respectively):

|  | tflops | bandwidth (GB/s) | vllm (awq) | lmdeploy (awq) | vllm (fp16) | lmdeploy (fp16) |
| --- | --- | --- | --- | --- | --- | --- |
| 3070 laptop | 29.7 | 319 | 100 | 102 | 191 | 154 |
| 2080ti | 52.5 | 526 | 97 | 176 | 228 | 305 |
| ratio (2080ti / 3070) |  | 1.65 | 0.97 | 1.73 | 1.19 | 1.98 |
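
The 40-70% figure in conclusion 5 can be reproduced from the ratio row above: if single-request decode throughput scales roughly with memory bandwidth (a simplifying assumption of mine, not something the tests measure directly), then the 3070 laptop nearly matching the 2080ti implies an fa2 speedup equal to the bandwidth ratio divided by the observed throughput ratio:

```python
# Numbers taken from the summary table; the assumption that decode
# throughput scales with memory bandwidth is a simplification used
# only for this estimate.
bandwidth_ratio = 526 / 319          # 2080ti / 3070 laptop, ~1.65

# Single-request vllm generation-throughput ratios from the table
awq_ratio = 97 / 100                 # 0.97
fp16_ratio = 228 / 191               # ~1.19

# Implied fa2 speedup over xformers on the same hardware
awq_speedup = bandwidth_ratio / awq_ratio    # ~1.70, i.e. ~+70%
fp16_speedup = bandwidth_ratio / fp16_ratio  # ~1.38, i.e. ~+40%

print(f"awq: +{(awq_speedup - 1) * 100:.0f}%, fp16: +{(fp16_speedup - 1) * 100:.0f}%")
```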

Test Results

AWQ Test Results

3070 laptop

vllm xformer
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 100.00 | 951.17 | 0.05 | 0.05 |
| 2 | 196.12 | 885.67 | 0.10 | 0.10 |
| 4 | 378.49 | 1568.36 | 0.10 | 0.12 |
| 8 | 714.94 | 2434.35 | 0.13 | 0.15 |
| 16 | 1211.09 | 2184.59 | 0.17 | 0.33 |
| 32 | 1682.82 | 3098.94 | 0.14 | 0.47 |

Input Tokens: 5638
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 72.75 | 3724.02 | 1.52 | 1.52 |
| 2 | 65.69 | 3461.59 | 2.03 | 3.24 |
| 4 | 118.01 | 3355.38 | 2.01 | 6.72 |
| 8 | 129.15 | 3542.59 | 1.94 | 12.66 |

vllm fa2
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.40 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 101.52 | 1106.82 | 0.04 | 0.04 |
| 2 | 199.29 | 1326.93 | 0.07 | 0.07 |
| 4 | 397.01 | 2322.60 | 0.08 | 0.08 |
| 8 | 744.77 | 2998.94 | 0.04 | 0.12 |
| 16 | 1280.79 | 3332.02 | 0.08 | 0.22 |
| 32 | 1880.32 | 3579.14 | 0.05 | 0.40 |

Input Tokens: 5519
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 3.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 65.94 | 2966.57 | 1.89 | 1.89 |
| 2 | 113.30 | 3612.69 | 1.85 | 3.11 |
| 4 | 58.97 | 3635.87 | 1.85 | 6.16 |
| 8 | 53.07 | 3576.62 | 1.87 | 12.53 |

lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 3.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 102.33 | 1162.42 | 0.04 | 0.04 |
| 2 | 202.70 | 1441.08 | 0.03 | 0.07 |
| 4 | 417.91 | 1724.37 | 0.03 | 0.11 |
| 8 | 793.40 | 2305.94 | 0.03 | 0.16 |
| 16 | 1365.04 | 3255.68 | 0.03 | 0.22 |
| 32 | 1939.21 | 3398.90 | 0.03 | 0.43 |

Input Tokens: 5622
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 26.89 | 4419.59 | 1.29 | 1.29 |
| 2 | 129.84 | 4403.95 | 1.29 | 2.52 |
| 4 | 145.21 | 4466.76 | 1.30 | 5.02 |
| 8 | 172.47 | 4586.10 | 1.33 | 9.78 |

2080ti

vllm xformer
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 97.29 | 716.20 | 0.06 | 0.06 |
| 2 | 189.25 | 787.34 | 0.11 | 0.12 |
| 4 | 368.60 | 1630.50 | 0.11 | 0.11 |
| 8 | 673.36 | 2310.67 | 0.11 | 0.16 |
| 16 | 1132.99 | 2949.96 | 0.12 | 0.25 |
| 32 | 1561.65 | 4154.17 | 0.12 | 0.35 |
| 64 | 1653.97 | 4995.70 | 0.14 | 0.58 |
| 128 | 1795.95 | 5930.27 | 0.13 | 0.97 |

Input Tokens: 5648
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 64.25 | 6055.06 | 0.93 | 0.93 |
| 2 | 49.60 | 6011.82 | 0.99 | 1.84 |
| 4 | 187.93 | 6402.48 | 1.48 | 3.52 |
| 8 | 159.36 | 6402.51 | 5.11 | 7.04 |
| 16 | 178.05 | 6434.66 | 4.73 | 13.92 |

lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 176.51 | 2713.72 | 0.02 | 0.02 |
| 2 | 350.82 | 3556.20 | 0.02 | 0.03 |
| 4 | 686.61 | 3581.65 | 0.02 | 0.05 |
| 8 | 1327.60 | 4906.68 | 0.02 | 0.08 |
| 16 | 2284.32 | 4964.15 | 0.02 | 0.15 |
| 32 | 3307.78 | 6052.02 | 0.02 | 0.24 |
| 64 | 3933.59 | 5430.29 | 0.01 | 0.53 |
| 128 | 4333.86 | 6620.12 | 0.01 | 0.87 |

Input Tokens: 5539
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 61.08 | 5805.73 | 0.97 | 0.97 |
| 2 | 98.14 | 6489.22 | 0.91 | 1.73 |
| 4 | 185.56 | 6924.31 | 0.92 | 3.24 |
| 8 | 195.81 | 7054.24 | 0.96 | 6.37 |
| 16 | 257.38 | 7292.93 | 0.95 | 12.28 |

FP16 Test Results

3070 laptop

vllm xformer
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 4.00 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 191.58 | 396.31 | 0.12 | 0.12 |
| 2 | 188.85 | 1373.13 | 0.07 | 0.07 |
| 4 | 567.70 | 2502.72 | 0.07 | 0.08 |
| 8 | 1079.86 | 4669.36 | 0.07 | 0.08 |
| 16 | 1677.36 | 5520.68 | 0.06 | 0.13 |
| 32 | 2931.42 | 8041.26 | 0.13 | 0.18 |
| 64 | 4236.04 | 9555.88 | 0.11 | 0.31 |

Input Tokens: 5633
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.60 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 103.62 | 23042.82 | 0.25 | 0.25 |
| 2 | 50.94 | 23025.27 | 0.28 | 0.49 |
| 4 | 194.34 | 23268.16 | 0.32 | 0.97 |
| 8 | 369.72 | 23895.54 | 0.32 | 1.88 |
| 16 | 563.25 | 24113.27 | 0.34 | 3.74 |

vllm fa2
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 3.60 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 187.29 | 316.02 | 0.15 | 0.15 |
| 2 | 194.00 | 1993.44 | 0.04 | 0.05 |
| 4 | 613.77 | 2352.95 | 0.03 | 0.08 |
| 8 | 653.82 | 5531.07 | 0.05 | 0.07 |
| 16 | 2119.64 | 5437.47 | 0.03 | 0.14 |
| 32 | 3341.71 | 8709.83 | 0.09 | 0.17 |
| 64 | 4112.55 | 9600.50 | 0.06 | 0.30 |

Input Tokens: 5581
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.60 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 129.81 | 23107.58 | 0.24 | 0.24 |
| 2 | 133.18 | 23550.33 | 0.28 | 0.48 |
| 4 | 124.44 | 24315.44 | 0.28 | 0.93 |
| 8 | 375.86 | 23998.11 | 0.30 | 1.87 |
| 16 | 434.18 | 24096.61 | 0.33 | 3.74 |

lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 154.77 | 2238.66 | 0.02 | 0.02 |
| 2 | 157.46 | 2653.18 | 0.02 | 0.04 |
| 4 | 309.74 | 2791.23 | 0.02 | 0.07 |
| 8 | 578.12 | 6417.18 | 0.02 | 0.06 |
| 16 | 1439.22 | 6484.35 | 0.01 | 0.11 |
| 32 | 2814.97 | 9551.40 | 0.02 | 0.15 |
| 64 | 4251.23 | 10272.75 | 0.02 | 0.28 |

Input Tokens: 5657
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 3.00 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 123.73 | 19862.84 | 0.29 | 0.29 |
| 2 | 107.36 | 21603.39 | 0.29 | 0.53 |
| 4 | 134.85 | 22812.97 | 0.31 | 0.99 |
| 8 | 335.80 | 23905.97 | 0.30 | 1.89 |
| 16 | 288.97 | 24434.42 | 0.34 | 3.68 |

2080ti

vllm xformer
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 228.98 | 1092.87 | 0.04 | 0.04 |
| 2 | 423.88 | 1288.86 | 0.07 | 0.07 |
| 4 | 653.95 | 2368.37 | 0.07 | 0.08 |
| 8 | 1051.37 | 4368.51 | 0.08 | 0.08 |
| 16 | 2127.28 | 6250.58 | 0.07 | 0.12 |
| 32 | 3203.85 | 8748.44 | 0.10 | 0.17 |
| 64 | 3942.46 | 10506.13 | 0.09 | 0.28 |
| 128 | 4603.73 | 18207.47 | 0.09 | 0.32 |

Input Tokens: 5581
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 27.15 | 29227.07 | 0.19 | 0.19 |
| 2 | 171.79 | 31110.00 | 0.23 | 0.36 |
| 4 | 340.59 | 34868.82 | 0.24 | 0.65 |
| 8 | 213.44 | 36036.22 | 0.25 | 1.24 |
| 16 | 377.33 | 36375.68 | 0.32 | 2.48 |

lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 305.94 | 4199.33 | 0.01 | 0.01 |
| 2 | 341.57 | 6731.39 | 0.01 | 0.02 |
| 4 | 911.32 | 7832.42 | 0.01 | 0.02 |
| 8 | 1551.95 | 7321.06 | 0.02 | 0.05 |
| 16 | 3052.74 | 5795.36 | 0.01 | 0.13 |
| 32 | 4666.90 | 13364.55 | 0.01 | 0.11 |
| 64 | 7360.47 | 13220.15 | 0.02 | 0.22 |
| 128 | 9013.05 | 14792.20 | 0.02 | 0.39 |

Input Tokens: 5622
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.40 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 244.81 | 18598.96 | 0.31 | 0.31 |
| 2 | 271.79 | 28980.70 | 0.29 | 0.39 |
| 4 | 158.41 | 30582.23 | 0.26 | 0.73 |
| 8 | 659.52 | 35179.13 | 0.26 | 1.28 |
| 16 | 501.57 | 39903.70 | 0.30 | 2.26 |