
Re-evaluation: The True Strength of Flash Attention 2

This post compares vLLM's performance when using xformers versus flash attention 2 as the backend attention mechanism.

A few days ago I wrote a post comparing vLLM's performance with xformers versus flash attention 2 as the attention backend. Thinking it over afterwards, something seemed off, so I ran a few more rounds of tests. It turns out flash attention 2 really does deliver a clear improvement: even against a card with a memory-bandwidth advantage of over 60%, flash attention 2's speedup can pull generation throughput level. In other words, the improvement over xformers is at least 40%, and in some scenarios reaches 70%. Hence this post; rather than much analysis, I will just list a few conclusions:

  1. On 30-series cards, setting vLLM's environment variable VLLM_ATTENTION_BACKEND=XFORMERS may not actually switch to xformers: the speed is identical to fa2, and the gain over the 2080 Ti is huge.
  2. On the 2080 Ti, the lmdeploy turbomind inference engine achieves the same level of speedup as fa2.
  3. The 2080 Ti results with lmdeploy differ from my earlier tests: both AWQ and FP16 show large gains.
  4. On the 3070, however, the previous point does not hold, so 30-series cards (and presumably the 40-series, even the upcoming 50-series) should use vLLM; vLLM likely has optimizations specific to the 30-series and later.
  5. Judging by the throughput ratios, fa2's improvement is roughly 40% under FP16 and up to 70% under AWQ quantization.
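The environment variable in point 1 is set when launching the vLLM server. A minimal sketch (model name and port are placeholders; the entrypoint is vLLM's OpenAI-compatible API server):

```shell
# Force the xformers attention backend (per vLLM's documented
# VLLM_ATTENTION_BACKEND setting); leave it unset, or set it to
# FLASH_ATTN, to use flash attention 2 where the GPU supports it.
VLLM_ATTENTION_BACKEND=XFORMERS \
  python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-3B-Instruct-AWQ --port 8000
```

As point 1 notes, on 30-series cards this variable did not appear to change the measured speed, so it is worth verifying in the server startup log which backend was actually selected.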

Actual hardware specifications, together with the single-request generation throughput (tokens/s) extracted from the test results below:

| | TFLOPS | Bandwidth (GB/s) | vllm AWQ | lmdeploy AWQ | vllm FP16 | lmdeploy FP16 |
|---|---|---|---|---|---|---|
| 3070 laptop | 29.7 | 319 | 100 | 102 | 191 | 154 |
| 2080 Ti | 52.5 | 526 | 97 | 176 | 228 | 305 |
| ratio (2080 Ti / 3070) | 1.77 | 1.65 | 0.97 | 1.73 | 1.19 | 1.98 |
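The ratio row above is simply the 2080 Ti value over the 3070 laptop value per column. A small sketch reproducing it (the dict keys are my own naming; the values are the measured single-request throughputs from the table):

```python
# Specs and measured single-request generation throughput (tokens/s)
# from the table above.
specs = {
    "3070 laptop": {"tflops": 29.7, "bandwidth": 319,
                    "vllm_awq": 100, "lmdeploy_awq": 102,
                    "vllm_fp16": 191, "lmdeploy_fp16": 154},
    "2080 Ti":     {"tflops": 52.5, "bandwidth": 526,
                    "vllm_awq": 97, "lmdeploy_awq": 176,
                    "vllm_fp16": 228, "lmdeploy_fp16": 305},
}

# Per-column ratio: 2080 Ti over 3070 laptop.
ratios = {k: round(specs["2080 Ti"][k] / specs["3070 laptop"][k], 2)
          for k in specs["2080 Ti"]}
print(ratios)
```

The striking entry is vllm_awq: the 2080 Ti has about 65% more bandwidth, yet single-request AWQ throughput is level (ratio 0.97), which is the "flash attention 2 closes the hardware gap" claim from the introduction.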

Test Results
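For reference, the columns in the tables below can be computed from per-request timing records. This is a sketch of the usual bookkeeping, not necessarily the exact tool used for these runs (field names are my own; in particular, the benchmark tool may measure prompt throughput over prefill time only rather than over total wall time):

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    start: float        # request submitted (s)
    first_token: float  # first output token received (s)
    end: float          # last output token received (s)
    prompt_tokens: int
    output_tokens: int

def summarize(records: list[RequestRecord]) -> dict:
    """Aggregate one concurrency level into the table's columns."""
    wall = max(r.end for r in records) - min(r.start for r in records)
    ttfts = [r.first_token - r.start for r in records]  # time to first token
    return {
        "generation_throughput": sum(r.output_tokens for r in records) / wall,
        "prompt_throughput": sum(r.prompt_tokens for r in records) / wall,
        "min_ttft": min(ttfts),
        "max_ttft": max(ttfts),
    }

# Example: two concurrent requests, 512 output tokens each.
recs = [RequestRecord(0.0, 0.1, 5.00, 45, 512),
        RequestRecord(0.0, 0.2, 5.12, 45, 512)]
m = summarize(recs)
```

With these definitions, generation throughput scales with concurrency until the GPU saturates, which is the pattern visible in every table below.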

AWQ Test Results

3070 laptop

vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 100.00 | 951.17 | 0.05 | 0.05 |
| 2 | 196.12 | 885.67 | 0.10 | 0.10 |
| 4 | 378.49 | 1568.36 | 0.10 | 0.12 |
| 8 | 714.94 | 2434.35 | 0.13 | 0.15 |
| 16 | 1211.09 | 2184.59 | 0.17 | 0.33 |
| 32 | 1682.82 | 3098.94 | 0.14 | 0.47 |
Input Tokens: 5638
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.20 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 72.75 | 3724.02 | 1.52 | 1.52 |
| 2 | 65.69 | 3461.59 | 2.03 | 3.24 |
| 4 | 118.01 | 3355.38 | 2.01 | 6.72 |
| 8 | 129.15 | 3542.59 | 1.94 | 12.66 |
vllm fa2
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.40 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 101.52 | 1106.82 | 0.04 | 0.04 |
| 2 | 199.29 | 1326.93 | 0.07 | 0.07 |
| 4 | 397.01 | 2322.60 | 0.08 | 0.08 |
| 8 | 744.77 | 2998.94 | 0.04 | 0.12 |
| 16 | 1280.79 | 3332.02 | 0.08 | 0.22 |
| 32 | 1880.32 | 3579.14 | 0.05 | 0.40 |
Input Tokens: 5519
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 3.20 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 65.94 | 2966.57 | 1.89 | 1.89 |
| 2 | 113.30 | 3612.69 | 1.85 | 3.11 |
| 4 | 58.97 | 3635.87 | 1.85 | 6.16 |
| 8 | 53.07 | 3576.62 | 1.87 | 12.53 |
lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 3.20 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 102.33 | 1162.42 | 0.04 | 0.04 |
| 2 | 202.70 | 1441.08 | 0.03 | 0.07 |
| 4 | 417.91 | 1724.37 | 0.03 | 0.11 |
| 8 | 793.40 | 2305.94 | 0.03 | 0.16 |
| 16 | 1365.04 | 3255.68 | 0.03 | 0.22 |
| 32 | 1939.21 | 3398.90 | 0.03 | 0.43 |
Input Tokens: 5622
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 26.89 | 4419.59 | 1.29 | 1.29 |
| 2 | 129.84 | 4403.95 | 1.29 | 2.52 |
| 4 | 145.21 | 4466.76 | 1.30 | 5.02 |
| 8 | 172.47 | 4586.10 | 1.33 | 9.78 |

2080ti

vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 97.29 | 716.20 | 0.06 | 0.06 |
| 2 | 189.25 | 787.34 | 0.11 | 0.12 |
| 4 | 368.60 | 1630.50 | 0.11 | 0.11 |
| 8 | 673.36 | 2310.67 | 0.11 | 0.16 |
| 16 | 1132.99 | 2949.96 | 0.12 | 0.25 |
| 32 | 1561.65 | 4154.17 | 0.12 | 0.35 |
| 64 | 1653.97 | 4995.70 | 0.14 | 0.58 |
| 128 | 1795.95 | 5930.27 | 0.13 | 0.97 |
Input Tokens: 5648
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 64.25 | 6055.06 | 0.93 | 0.93 |
| 2 | 49.60 | 6011.82 | 0.99 | 1.84 |
| 4 | 187.93 | 6402.48 | 1.48 | 3.52 |
| 8 | 159.36 | 6402.51 | 5.11 | 7.04 |
| 16 | 178.05 | 6434.66 | 4.73 | 13.92 |
lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 176.51 | 2713.72 | 0.02 | 0.02 |
| 2 | 350.82 | 3556.20 | 0.02 | 0.03 |
| 4 | 686.61 | 3581.65 | 0.02 | 0.05 |
| 8 | 1327.60 | 4906.68 | 0.02 | 0.08 |
| 16 | 2284.32 | 4964.15 | 0.02 | 0.15 |
| 32 | 3307.78 | 6052.02 | 0.02 | 0.24 |
| 64 | 3933.59 | 5430.29 | 0.01 | 0.53 |
| 128 | 4333.86 | 6620.12 | 0.01 | 0.87 |
Input Tokens: 5539
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 61.08 | 5805.73 | 0.97 | 0.97 |
| 2 | 98.14 | 6489.22 | 0.91 | 1.73 |
| 4 | 185.56 | 6924.31 | 0.92 | 3.24 |
| 8 | 195.81 | 7054.24 | 0.96 | 6.37 |
| 16 | 257.38 | 7292.93 | 0.95 | 12.28 |

FP16 Test Results

3070 laptop

vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 4.00 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 191.58 | 396.31 | 0.12 | 0.12 |
| 2 | 188.85 | 1373.13 | 0.07 | 0.07 |
| 4 | 567.70 | 2502.72 | 0.07 | 0.08 |
| 8 | 1079.86 | 4669.36 | 0.07 | 0.08 |
| 16 | 1677.36 | 5520.68 | 0.06 | 0.13 |
| 32 | 2931.42 | 8041.26 | 0.13 | 0.18 |
| 64 | 4236.04 | 9555.88 | 0.11 | 0.31 |
Input Tokens: 5633
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.60 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 103.62 | 23042.82 | 0.25 | 0.25 |
| 2 | 50.94 | 23025.27 | 0.28 | 0.49 |
| 4 | 194.34 | 23268.16 | 0.32 | 0.97 |
| 8 | 369.72 | 23895.54 | 0.32 | 1.88 |
| 16 | 563.25 | 24113.27 | 0.34 | 3.74 |
vllm fa2
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 3.60 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 187.29 | 316.02 | 0.15 | 0.15 |
| 2 | 194.00 | 1993.44 | 0.04 | 0.05 |
| 4 | 613.77 | 2352.95 | 0.03 | 0.08 |
| 8 | 653.82 | 5531.07 | 0.05 | 0.07 |
| 16 | 2119.64 | 5437.47 | 0.03 | 0.14 |
| 32 | 3341.71 | 8709.83 | 0.09 | 0.17 |
| 64 | 4112.55 | 9600.50 | 0.06 | 0.30 |
Input Tokens: 5581
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.60 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 129.81 | 23107.58 | 0.24 | 0.24 |
| 2 | 133.18 | 23550.33 | 0.28 | 0.48 |
| 4 | 124.44 | 24315.44 | 0.28 | 0.93 |
| 8 | 375.86 | 23998.11 | 0.30 | 1.87 |
| 16 | 434.18 | 24096.61 | 0.33 | 3.74 |
lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 154.77 | 2238.66 | 0.02 | 0.02 |
| 2 | 157.46 | 2653.18 | 0.02 | 0.04 |
| 4 | 309.74 | 2791.23 | 0.02 | 0.07 |
| 8 | 578.12 | 6417.18 | 0.02 | 0.06 |
| 16 | 1439.22 | 6484.35 | 0.01 | 0.11 |
| 32 | 2814.97 | 9551.40 | 0.02 | 0.15 |
| 64 | 4251.23 | 10272.75 | 0.02 | 0.28 |
Input Tokens: 5657
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 3.00 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 123.73 | 19862.84 | 0.29 | 0.29 |
| 2 | 107.36 | 21603.39 | 0.29 | 0.53 |
| 4 | 134.85 | 22812.97 | 0.31 | 0.99 |
| 8 | 335.80 | 23905.97 | 0.30 | 1.89 |
| 16 | 288.97 | 24434.42 | 0.34 | 3.68 |

2080ti

vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 228.98 | 1092.87 | 0.04 | 0.04 |
| 2 | 423.88 | 1288.86 | 0.07 | 0.07 |
| 4 | 653.95 | 2368.37 | 0.07 | 0.08 |
| 8 | 1051.37 | 4368.51 | 0.08 | 0.08 |
| 16 | 2127.28 | 6250.58 | 0.07 | 0.12 |
| 32 | 3203.85 | 8748.44 | 0.10 | 0.17 |
| 64 | 3942.46 | 10506.13 | 0.09 | 0.28 |
| 128 | 4603.73 | 18207.47 | 0.09 | 0.32 |
Input Tokens: 5581
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 27.15 | 29227.07 | 0.19 | 0.19 |
| 2 | 171.79 | 31110.00 | 0.23 | 0.36 |
| 4 | 340.59 | 34868.82 | 0.24 | 0.65 |
| 8 | 213.44 | 36036.22 | 0.25 | 1.24 |
| 16 | 377.33 | 36375.68 | 0.32 | 2.48 |
lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 305.94 | 4199.33 | 0.01 | 0.01 |
| 2 | 341.57 | 6731.39 | 0.01 | 0.02 |
| 4 | 911.32 | 7832.42 | 0.01 | 0.02 |
| 8 | 1551.95 | 7321.06 | 0.02 | 0.05 |
| 16 | 3052.74 | 5795.36 | 0.01 | 0.13 |
| 32 | 4666.90 | 13364.55 | 0.01 | 0.11 |
| 64 | 7360.47 | 13220.15 | 0.02 | 0.22 |
| 128 | 9013.05 | 14792.20 | 0.02 | 0.39 |
Input Tokens: 5622
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.40 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 244.81 | 18598.96 | 0.31 | 0.31 |
| 2 | 271.79 | 28980.70 | 0.29 | 0.39 |
| 4 | 158.41 | 30582.23 | 0.26 | 0.73 |
| 8 | 659.52 | 35179.13 | 0.26 | 1.28 |
| 16 | 501.57 | 39903.70 | 0.30 | 2.26 |