This post compares vllm's performance with xformers versus flash attention 2 as the attention backend.

A few days ago I wrote a post comparing vllm's performance with xformers versus flash attention 2 as the backend attention mechanism. Thinking it over afterwards, something felt off, so I ran several more rounds of tests. It turns out flash attention 2 really does deliver a clear improvement: even with a memory-bandwidth gap of more than 60% between the two GPUs, the flash attention 2 speedup is enough to pull generation throughput level, which means the gain over xformers is at least 40%, and up to 70% in some scenarios. Hence this post. Rather than deep analysis, here are just a few conclusions:
- On the 30 series, setting vllm's environment variable VLLM_ATTENTION_BACKEND=XFORMERS may not actually switch to xformers, because the speed is identical to fa2 while being dramatically faster than the 2080ti
- On the 2080ti, the lmdeploy turbomind inference engine achieves the same kind of speedup as fa2
- The 2080ti results with lmdeploy differ from my earlier tests: both awq and fp16 show large gains
- On the 3070, however, the previous point does not hold, so the 30 series (and presumably the 40 and even the upcoming 50 series) should use vllm; vllm appears to have dedicated optimizations for the 30 series and later
- Judging from the measured ratios, fa2's improvement is roughly 40-70% (under fp16 and awq quantization respectively)
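For context, the backend switch being compared is just an environment variable on the vllm side, while lmdeploy uses its turbomind engine by default. A rough sketch of the three launch configurations (model name and ports are illustrative, not the exact commands used in these tests):

```shell
# vllm with the xformers attention backend
VLLM_ATTENTION_BACKEND=XFORMERS vllm serve Qwen/Qwen2.5-3B-Instruct-AWQ --port 8000

# vllm with the flash attention 2 backend (the default where supported)
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve Qwen/Qwen2.5-3B-Instruct-AWQ --port 8000

# lmdeploy with the turbomind engine (its default)
lmdeploy serve api_server Qwen/Qwen2.5-3B-Instruct-AWQ --server-port 23333
```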
Actual hardware specs, together with the single-request generation throughput (tokens/s) extracted from the test results below:

| | TFLOPS | Bandwidth (GB/s) | vllm (AWQ) | lmdeploy (AWQ) | vllm (FP16) | lmdeploy (FP16) |
|---|---|---|---|---|---|---|
| 3070 laptop | 29.7 | 319 | 100 | 102 | 191 | 154 |
| 2080ti | 52.5 | 526 | 97 | 176 | 228 | 305 |
| ratio | | 1.65 | 0.97 | 1.73 | 1.19 | 1.98 |
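The ratio row is simply 2080ti ÷ 3070 laptop. A quick sanity-check sketch over the single-request numbers pulled from the tables below (full-precision values, before rounding):

```python
# Single-request generation throughput (tokens/s) at concurrency 1,
# taken from the short-prompt test tables below, plus memory bandwidth.
laptop_3070 = {
    "bandwidth":     319.0,   # GB/s
    "vllm_awq":      100.00,
    "lmdeploy_awq":  102.33,
    "vllm_fp16":     191.58,
    "lmdeploy_fp16": 154.77,
}
gpu_2080ti = {
    "bandwidth":     526.0,
    "vllm_awq":      97.29,
    "lmdeploy_awq":  176.51,
    "vllm_fp16":     228.98,
    "lmdeploy_fp16": 305.94,
}

# 2080ti relative to 3070 laptop; compare with the "ratio" row above
# (values agree to within rounding).
ratios = {k: gpu_2080ti[k] / laptop_3070[k] for k in laptop_3070}
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

Note how the vllm AWQ ratio (0.97) sits far below the bandwidth ratio (1.65): that roughly 65% gap is what fa2 on the 3070 absorbs.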
## Test Results

### AWQ Test Results

#### 3070 laptop

##### vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 100.00 | 951.17 | 0.05 | 0.05 |
| 2 | 196.12 | 885.67 | 0.10 | 0.10 |
| 4 | 378.49 | 1568.36 | 0.10 | 0.12 |
| 8 | 714.94 | 2434.35 | 0.13 | 0.15 |
| 16 | 1211.09 | 2184.59 | 0.17 | 0.33 |
| 32 | 1682.82 | 3098.94 | 0.14 | 0.47 |
Input Tokens: 5638
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 72.75 | 3724.02 | 1.52 | 1.52 |
| 2 | 65.69 | 3461.59 | 2.03 | 3.24 |
| 4 | 118.01 | 3355.38 | 2.01 | 6.72 |
| 8 | 129.15 | 3542.59 | 1.94 | 12.66 |
##### vllm fa2
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.40 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 101.52 | 1106.82 | 0.04 | 0.04 |
| 2 | 199.29 | 1326.93 | 0.07 | 0.07 |
| 4 | 397.01 | 2322.60 | 0.08 | 0.08 |
| 8 | 744.77 | 2998.94 | 0.04 | 0.12 |
| 16 | 1280.79 | 3332.02 | 0.08 | 0.22 |
| 32 | 1880.32 | 3579.14 | 0.05 | 0.40 |
Input Tokens: 5519
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 3.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 65.94 | 2966.57 | 1.89 | 1.89 |
| 2 | 113.30 | 3612.69 | 1.85 | 3.11 |
| 4 | 58.97 | 3635.87 | 1.85 | 6.16 |
| 8 | 53.07 | 3576.62 | 1.87 | 12.53 |
##### lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 3.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 102.33 | 1162.42 | 0.04 | 0.04 |
| 2 | 202.70 | 1441.08 | 0.03 | 0.07 |
| 4 | 417.91 | 1724.37 | 0.03 | 0.11 |
| 8 | 793.40 | 2305.94 | 0.03 | 0.16 |
| 16 | 1365.04 | 3255.68 | 0.03 | 0.22 |
| 32 | 1939.21 | 3398.90 | 0.03 | 0.43 |
Input Tokens: 5622
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 26.89 | 4419.59 | 1.29 | 1.29 |
| 2 | 129.84 | 4403.95 | 1.29 | 2.52 |
| 4 | 145.21 | 4466.76 | 1.30 | 5.02 |
| 8 | 172.47 | 4586.10 | 1.33 | 9.78 |
#### 2080ti

##### vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 97.29 | 716.20 | 0.06 | 0.06 |
| 2 | 189.25 | 787.34 | 0.11 | 0.12 |
| 4 | 368.60 | 1630.50 | 0.11 | 0.11 |
| 8 | 673.36 | 2310.67 | 0.11 | 0.16 |
| 16 | 1132.99 | 2949.96 | 0.12 | 0.25 |
| 32 | 1561.65 | 4154.17 | 0.12 | 0.35 |
| 64 | 1653.97 | 4995.70 | 0.14 | 0.58 |
| 128 | 1795.95 | 5930.27 | 0.13 | 0.97 |
Input Tokens: 5648
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 64.25 | 6055.06 | 0.93 | 0.93 |
| 2 | 49.60 | 6011.82 | 0.99 | 1.84 |
| 4 | 187.93 | 6402.48 | 1.48 | 3.52 |
| 8 | 159.36 | 6402.51 | 5.11 | 7.04 |
| 16 | 178.05 | 6434.66 | 4.73 | 13.92 |
##### lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 176.51 | 2713.72 | 0.02 | 0.02 |
| 2 | 350.82 | 3556.20 | 0.02 | 0.03 |
| 4 | 686.61 | 3581.65 | 0.02 | 0.05 |
| 8 | 1327.60 | 4906.68 | 0.02 | 0.08 |
| 16 | 2284.32 | 4964.15 | 0.02 | 0.15 |
| 32 | 3307.78 | 6052.02 | 0.02 | 0.24 |
| 64 | 3933.59 | 5430.29 | 0.01 | 0.53 |
| 128 | 4333.86 | 6620.12 | 0.01 | 0.87 |
Input Tokens: 5539
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 61.08 | 5805.73 | 0.97 | 0.97 |
| 2 | 98.14 | 6489.22 | 0.91 | 1.73 |
| 4 | 185.56 | 6924.31 | 0.92 | 3.24 |
| 8 | 195.81 | 7054.24 | 0.96 | 6.37 |
| 16 | 257.38 | 7292.93 | 0.95 | 12.28 |
### FP16 Test Results

#### 3070 laptop

##### vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 4.00 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 191.58 | 396.31 | 0.12 | 0.12 |
| 2 | 188.85 | 1373.13 | 0.07 | 0.07 |
| 4 | 567.70 | 2502.72 | 0.07 | 0.08 |
| 8 | 1079.86 | 4669.36 | 0.07 | 0.08 |
| 16 | 1677.36 | 5520.68 | 0.06 | 0.13 |
| 32 | 2931.42 | 8041.26 | 0.13 | 0.18 |
| 64 | 4236.04 | 9555.88 | 0.11 | 0.31 |
Input Tokens: 5633
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.60 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 103.62 | 23042.82 | 0.25 | 0.25 |
| 2 | 50.94 | 23025.27 | 0.28 | 0.49 |
| 4 | 194.34 | 23268.16 | 0.32 | 0.97 |
| 8 | 369.72 | 23895.54 | 0.32 | 1.88 |
| 16 | 563.25 | 24113.27 | 0.34 | 3.74 |
##### vllm fa2
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 3.60 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 187.29 | 316.02 | 0.15 | 0.15 |
| 2 | 194.00 | 1993.44 | 0.04 | 0.05 |
| 4 | 613.77 | 2352.95 | 0.03 | 0.08 |
| 8 | 653.82 | 5531.07 | 0.05 | 0.07 |
| 16 | 2119.64 | 5437.47 | 0.03 | 0.14 |
| 32 | 3341.71 | 8709.83 | 0.09 | 0.17 |
| 64 | 4112.55 | 9600.50 | 0.06 | 0.30 |
Input Tokens: 5581
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.60 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 129.81 | 23107.58 | 0.24 | 0.24 |
| 2 | 133.18 | 23550.33 | 0.28 | 0.48 |
| 4 | 124.44 | 24315.44 | 0.28 | 0.93 |
| 8 | 375.86 | 23998.11 | 0.30 | 1.87 |
| 16 | 434.18 | 24096.61 | 0.33 | 3.74 |
##### lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 154.77 | 2238.66 | 0.02 | 0.02 |
| 2 | 157.46 | 2653.18 | 0.02 | 0.04 |
| 4 | 309.74 | 2791.23 | 0.02 | 0.07 |
| 8 | 578.12 | 6417.18 | 0.02 | 0.06 |
| 16 | 1439.22 | 6484.35 | 0.01 | 0.11 |
| 32 | 2814.97 | 9551.40 | 0.02 | 0.15 |
| 64 | 4251.23 | 10272.75 | 0.02 | 0.28 |
Input Tokens: 5657
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 3.00 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 123.73 | 19862.84 | 0.29 | 0.29 |
| 2 | 107.36 | 21603.39 | 0.29 | 0.53 |
| 4 | 134.85 | 22812.97 | 0.31 | 0.99 |
| 8 | 335.80 | 23905.97 | 0.30 | 1.89 |
| 16 | 288.97 | 24434.42 | 0.34 | 3.68 |
#### 2080ti

##### vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 228.98 | 1092.87 | 0.04 | 0.04 |
| 2 | 423.88 | 1288.86 | 0.07 | 0.07 |
| 4 | 653.95 | 2368.37 | 0.07 | 0.08 |
| 8 | 1051.37 | 4368.51 | 0.08 | 0.08 |
| 16 | 2127.28 | 6250.58 | 0.07 | 0.12 |
| 32 | 3203.85 | 8748.44 | 0.10 | 0.17 |
| 64 | 3942.46 | 10506.13 | 0.09 | 0.28 |
| 128 | 4603.73 | 18207.47 | 0.09 | 0.32 |
Input Tokens: 5581
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 27.15 | 29227.07 | 0.19 | 0.19 |
| 2 | 171.79 | 31110.00 | 0.23 | 0.36 |
| 4 | 340.59 | 34868.82 | 0.24 | 0.65 |
| 8 | 213.44 | 36036.22 | 0.25 | 1.24 |
| 16 | 377.33 | 36375.68 | 0.32 | 2.48 |
##### lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 305.94 | 4199.33 | 0.01 | 0.01 |
| 2 | 341.57 | 6731.39 | 0.01 | 0.02 |
| 4 | 911.32 | 7832.42 | 0.01 | 0.02 |
| 8 | 1551.95 | 7321.06 | 0.02 | 0.05 |
| 16 | 3052.74 | 5795.36 | 0.01 | 0.13 |
| 32 | 4666.90 | 13364.55 | 0.01 | 0.11 |
| 64 | 7360.47 | 13220.15 | 0.02 | 0.22 |
| 128 | 9013.05 | 14792.20 | 0.02 | 0.39 |
Input Tokens: 5622
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.40 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 244.81 | 18598.96 | 0.31 | 0.31 |
| 2 | 271.79 | 28980.70 | 0.29 | 0.39 |
| 4 | 158.41 | 30582.23 | 0.26 | 0.73 |
| 8 | 659.52 | 35179.13 | 0.26 | 1.28 |
| 16 | 501.57 | 39903.70 | 0.30 | 2.26 |
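A note on methodology: every metric in these tables can be derived from three timestamps and two token counts per request. The helper below is a hypothetical sketch (not the actual benchmark script, which may aggregate across concurrent requests differently) showing how TTFT and the two throughputs fall out of one streamed response:

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    """Timestamps (seconds) and token counts collected for one streamed request."""
    start: float         # request sent
    first_token: float   # first generated token received
    end: float           # stream finished
    prompt_tokens: int
    output_tokens: int

def ttft(t: RequestTiming) -> float:
    """Time to first token: dominated by prompt prefill, so it grows with input length."""
    return t.first_token - t.start

def generation_throughput(t: RequestTiming) -> float:
    """Generated tokens per second of decode time (after the first token)."""
    return t.output_tokens / (t.end - t.first_token)

def prompt_throughput(t: RequestTiming) -> float:
    """Prompt tokens processed per second of prefill time."""
    return t.prompt_tokens / ttft(t)

# Illustrative numbers: a 5638-token prompt prefilled in 1.52 s,
# then 512 tokens decoded over the next 7.0 s.
timing = RequestTiming(start=0.0, first_token=1.52, end=8.52,
                       prompt_tokens=5638, output_tokens=512)
print(f"TTFT {ttft(timing):.2f} s, "
      f"gen {generation_throughput(timing):.1f} tok/s, "
      f"prompt {prompt_throughput(timing):.1f} tok/s")
```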