This post compares vLLM's performance when using xformers versus FlashAttention 2 as the backend attention mechanism.
A few days ago I wrote a post comparing vLLM's performance with xformers versus FlashAttention 2 as the attention backend. Thinking it over afterwards, something felt off, so I ran several more rounds of tests. It turns out FlashAttention 2 really does deliver a clear improvement: even across a memory-bandwidth gap of more than 60%, its speedup can pull single-request generation throughput level, which means the improvement over xformers is at least 40%, and in some scenarios up to 70%. Hence this post. Rather than much analysis, just a few conclusions:
- Setting vLLM's environment variable VLLM_ATTENTION_BACKEND=XFORMERS on a 30-series card may not actually switch to xformers, because the speed is identical to FA2, while the gain over the 2080 Ti is huge
- On the 2080 Ti, the lmdeploy TurboMind inference engine reaches the same improvement as FA2
- The 2080 Ti lmdeploy results differ from my earlier tests: both AWQ and FP16 improve substantially
- The 3070, however, behaves differently from the previous point, so 30-series cards (and presumably the 40-series and even the upcoming 50-series) should use vLLM; vLLM appears to carry dedicated optimizations for the 30-series and later
- Going by the numbers, FA2 improves throughput by roughly 40% (FP16) to 70% (AWQ)
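The backend switch mentioned above is an environment variable that vLLM reads at startup. As a minimal sketch of how the two configurations can be launched (`vllm serve` and `VLLM_ATTENTION_BACKEND` are real vLLM interfaces; the helper function itself is my own, not from the original tests):

```python
import os
import shlex

def serve_command(backend: str, model: str) -> tuple[dict, list]:
    """Build env + argv for an OpenAI-compatible vLLM server using a
    specific attention backend ("XFORMERS" or "FLASH_ATTN")."""
    env = {**os.environ, "VLLM_ATTENTION_BACKEND": backend}
    cmd = shlex.split(f"vllm serve {model} --quantization awq")
    return env, cmd

env, cmd = serve_command("XFORMERS", "Qwen/Qwen2.5-3B-Instruct-AWQ")
# Pass these to subprocess.Popen(cmd, env=env) to start the server.
```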
Actual hardware specs, along with the single-request generation throughput extracted from the test results below:
| GPU | TFLOPS | Bandwidth (GB/s) | vLLM AWQ (tok/s) | lmdeploy AWQ (tok/s) | vLLM FP16 (tok/s) | lmdeploy FP16 (tok/s) |
|---|---|---|---|---|---|---|
| 3070 laptop | 29.7 | 319 | 100 | 102 | 191 | 154 |
| 2080 Ti | 52.5 | 526 | 97 | 176 | 228 | 305 |
| ratio | | 1.65 | 0.97 | 1.73 | 1.19 | 1.98 |
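The ratio row above (2080 Ti value over 3070 laptop value) can be reproduced with a few lines of Python; the dictionary keys are my own labels:

```python
# Per-backend single-request generation throughput (tok/s) plus bandwidth,
# copied from the summary table above.
gpu_3070 = {"bandwidth": 319, "vllm_awq": 100, "lmdeploy_awq": 102,
            "vllm_fp16": 191, "lmdeploy_fp16": 154}
gpu_2080ti = {"bandwidth": 526, "vllm_awq": 97, "lmdeploy_awq": 176,
              "vllm_fp16": 228, "lmdeploy_fp16": 305}

# Each ratio is 2080 Ti divided by 3070 laptop, rounded to 2 decimals.
ratios = {k: round(gpu_2080ti[k] / gpu_3070[k], 2) for k in gpu_3070}
print(ratios)
# {'bandwidth': 1.65, 'vllm_awq': 0.97, 'lmdeploy_awq': 1.73,
#  'vllm_fp16': 1.19, 'lmdeploy_fp16': 1.98}
```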
Test Results
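Each table below reports, per concurrency level, aggregate generation throughput, prompt throughput, and min/max time-to-first-token (TTFT). A minimal sketch of how such metrics can be derived from raw per-request timings (the dataclass and field names are my own, not from any particular benchmark tool):

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float        # wall-clock time the request was sent (s)
    first_token: float  # wall-clock time the first output token arrived (s)
    end: float          # wall-clock time the last output token arrived (s)
    output_tokens: int

def summarize(reqs: list[RequestTiming]) -> dict:
    # Aggregate generation throughput: total output tokens over the
    # wall-clock span of the whole batch; TTFT is per-request.
    span = max(r.end for r in reqs) - min(r.start for r in reqs)
    ttfts = [r.first_token - r.start for r in reqs]
    return {
        "gen_tps": sum(r.output_tokens for r in reqs) / span,
        "min_ttft": min(ttfts),
        "max_ttft": max(ttfts),
    }

# Two concurrent 512-token requests over a 5-second span:
demo = [RequestTiming(0.0, 0.1, 5.0, 512), RequestTiming(0.0, 0.3, 4.8, 512)]
print(summarize(demo))
# {'gen_tps': 204.8, 'min_ttft': 0.1, 'max_ttft': 0.3}
```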
AWQ Test Results
3070 laptop
vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 100.00 | 951.17 | 0.05 | 0.05 |
| 2 | 196.12 | 885.67 | 0.10 | 0.10 |
| 4 | 378.49 | 1568.36 | 0.10 | 0.12 |
| 8 | 714.94 | 2434.35 | 0.13 | 0.15 |
| 16 | 1211.09 | 2184.59 | 0.17 | 0.33 |
| 32 | 1682.82 | 3098.94 | 0.14 | 0.47 |
Input Tokens: 5638
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.20 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 72.75 | 3724.02 | 1.52 | 1.52 |
| 2 | 65.69 | 3461.59 | 2.03 | 3.24 |
| 4 | 118.01 | 3355.38 | 2.01 | 6.72 |
| 8 | 129.15 | 3542.59 | 1.94 | 12.66 |
vllm fa2
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.40 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 101.52 | 1106.82 | 0.04 | 0.04 |
| 2 | 199.29 | 1326.93 | 0.07 | 0.07 |
| 4 | 397.01 | 2322.60 | 0.08 | 0.08 |
| 8 | 744.77 | 2998.94 | 0.04 | 0.12 |
| 16 | 1280.79 | 3332.02 | 0.08 | 0.22 |
| 32 | 1880.32 | 3579.14 | 0.05 | 0.40 |
Input Tokens: 5519
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 3.20 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 65.94 | 2966.57 | 1.89 | 1.89 |
| 2 | 113.30 | 3612.69 | 1.85 | 3.11 |
| 4 | 58.97 | 3635.87 | 1.85 | 6.16 |
| 8 | 53.07 | 3576.62 | 1.87 | 12.53 |
lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 3.20 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 102.33 | 1162.42 | 0.04 | 0.04 |
| 2 | 202.70 | 1441.08 | 0.03 | 0.07 |
| 4 | 417.91 | 1724.37 | 0.03 | 0.11 |
| 8 | 793.40 | 2305.94 | 0.03 | 0.16 |
| 16 | 1365.04 | 3255.68 | 0.03 | 0.22 |
| 32 | 1939.21 | 3398.90 | 0.03 | 0.43 |
Input Tokens: 5622
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 26.89 | 4419.59 | 1.29 | 1.29 |
| 2 | 129.84 | 4403.95 | 1.29 | 2.52 |
| 4 | 145.21 | 4466.76 | 1.30 | 5.02 |
| 8 | 172.47 | 4586.10 | 1.33 | 9.78 |
2080ti
vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 97.29 | 716.20 | 0.06 | 0.06 |
| 2 | 189.25 | 787.34 | 0.11 | 0.12 |
| 4 | 368.60 | 1630.50 | 0.11 | 0.11 |
| 8 | 673.36 | 2310.67 | 0.11 | 0.16 |
| 16 | 1132.99 | 2949.96 | 0.12 | 0.25 |
| 32 | 1561.65 | 4154.17 | 0.12 | 0.35 |
| 64 | 1653.97 | 4995.70 | 0.14 | 0.58 |
| 128 | 1795.95 | 5930.27 | 0.13 | 0.97 |
Input Tokens: 5648
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 64.25 | 6055.06 | 0.93 | 0.93 |
| 2 | 49.60 | 6011.82 | 0.99 | 1.84 |
| 4 | 187.93 | 6402.48 | 1.48 | 3.52 |
| 8 | 159.36 | 6402.51 | 5.11 | 7.04 |
| 16 | 178.05 | 6434.66 | 4.73 | 13.92 |
lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 176.51 | 2713.72 | 0.02 | 0.02 |
| 2 | 350.82 | 3556.20 | 0.02 | 0.03 |
| 4 | 686.61 | 3581.65 | 0.02 | 0.05 |
| 8 | 1327.60 | 4906.68 | 0.02 | 0.08 |
| 16 | 2284.32 | 4964.15 | 0.02 | 0.15 |
| 32 | 3307.78 | 6052.02 | 0.02 | 0.24 |
| 64 | 3933.59 | 5430.29 | 0.01 | 0.53 |
| 128 | 4333.86 | 6620.12 | 0.01 | 0.87 |
Input Tokens: 5539
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 61.08 | 5805.73 | 0.97 | 0.97 |
| 2 | 98.14 | 6489.22 | 0.91 | 1.73 |
| 4 | 185.56 | 6924.31 | 0.92 | 3.24 |
| 8 | 195.81 | 7054.24 | 0.96 | 6.37 |
| 16 | 257.38 | 7292.93 | 0.95 | 12.28 |
FP16 Test Results
3070 laptop
vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 4.00 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 191.58 | 396.31 | 0.12 | 0.12 |
| 2 | 188.85 | 1373.13 | 0.07 | 0.07 |
| 4 | 567.70 | 2502.72 | 0.07 | 0.08 |
| 8 | 1079.86 | 4669.36 | 0.07 | 0.08 |
| 16 | 1677.36 | 5520.68 | 0.06 | 0.13 |
| 32 | 2931.42 | 8041.26 | 0.13 | 0.18 |
| 64 | 4236.04 | 9555.88 | 0.11 | 0.31 |
Input Tokens: 5633
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.60 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 103.62 | 23042.82 | 0.25 | 0.25 |
| 2 | 50.94 | 23025.27 | 0.28 | 0.49 |
| 4 | 194.34 | 23268.16 | 0.32 | 0.97 |
| 8 | 369.72 | 23895.54 | 0.32 | 1.88 |
| 16 | 563.25 | 24113.27 | 0.34 | 3.74 |
vllm fa2
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 3.60 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 187.29 | 316.02 | 0.15 | 0.15 |
| 2 | 194.00 | 1993.44 | 0.04 | 0.05 |
| 4 | 613.77 | 2352.95 | 0.03 | 0.08 |
| 8 | 653.82 | 5531.07 | 0.05 | 0.07 |
| 16 | 2119.64 | 5437.47 | 0.03 | 0.14 |
| 32 | 3341.71 | 8709.83 | 0.09 | 0.17 |
| 64 | 4112.55 | 9600.50 | 0.06 | 0.30 |
Input Tokens: 5581
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.60 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 129.81 | 23107.58 | 0.24 | 0.24 |
| 2 | 133.18 | 23550.33 | 0.28 | 0.48 |
| 4 | 124.44 | 24315.44 | 0.28 | 0.93 |
| 8 | 375.86 | 23998.11 | 0.30 | 1.87 |
| 16 | 434.18 | 24096.61 | 0.33 | 3.74 |
lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 154.77 | 2238.66 | 0.02 | 0.02 |
| 2 | 157.46 | 2653.18 | 0.02 | 0.04 |
| 4 | 309.74 | 2791.23 | 0.02 | 0.07 |
| 8 | 578.12 | 6417.18 | 0.02 | 0.06 |
| 16 | 1439.22 | 6484.35 | 0.01 | 0.11 |
| 32 | 2814.97 | 9551.40 | 0.02 | 0.15 |
| 64 | 4251.23 | 10272.75 | 0.02 | 0.28 |
Input Tokens: 5657
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 3.00 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 123.73 | 19862.84 | 0.29 | 0.29 |
| 2 | 107.36 | 21603.39 | 0.29 | 0.53 |
| 4 | 134.85 | 22812.97 | 0.31 | 0.99 |
| 8 | 335.80 | 23905.97 | 0.30 | 1.89 |
| 16 | 288.97 | 24434.42 | 0.34 | 3.68 |
2080ti
vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 228.98 | 1092.87 | 0.04 | 0.04 |
| 2 | 423.88 | 1288.86 | 0.07 | 0.07 |
| 4 | 653.95 | 2368.37 | 0.07 | 0.08 |
| 8 | 1051.37 | 4368.51 | 0.08 | 0.08 |
| 16 | 2127.28 | 6250.58 | 0.07 | 0.12 |
| 32 | 3203.85 | 8748.44 | 0.10 | 0.17 |
| 64 | 3942.46 | 10506.13 | 0.09 | 0.28 |
| 128 | 4603.73 | 18207.47 | 0.09 | 0.32 |
Input Tokens: 5581
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 27.15 | 29227.07 | 0.19 | 0.19 |
| 2 | 171.79 | 31110.00 | 0.23 | 0.36 |
| 4 | 340.59 | 34868.82 | 0.24 | 0.65 |
| 8 | 213.44 | 36036.22 | 0.25 | 1.24 |
| 16 | 377.33 | 36375.68 | 0.32 | 2.48 |
lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 305.94 | 4199.33 | 0.01 | 0.01 |
| 2 | 341.57 | 6731.39 | 0.01 | 0.02 |
| 4 | 911.32 | 7832.42 | 0.01 | 0.02 |
| 8 | 1551.95 | 7321.06 | 0.02 | 0.05 |
| 16 | 3052.74 | 5795.36 | 0.01 | 0.13 |
| 32 | 4666.90 | 13364.55 | 0.01 | 0.11 |
| 64 | 7360.47 | 13220.15 | 0.02 | 0.22 |
| 128 | 9013.05 | 14792.20 | 0.02 | 0.39 |
Input Tokens: 5622
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.40 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 244.81 | 18598.96 | 0.31 | 0.31 |
| 2 | 271.79 | 28980.70 | 0.29 | 0.39 |
| 4 | 158.41 | 30582.23 | 0.26 | 0.73 |
| 8 | 659.52 | 35179.13 | 0.26 | 1.28 |
| 16 | 501.57 | 39903.70 | 0.30 | 2.26 |