
A Re-evaluation: The True Strength of Flash Attention 2

This post compares vllm performance when using xformers versus flash attention 2 as the backend attention implementation.

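A few days ago I wrote a post comparing vllm performance with xformers versus flash attention 2 as the backend attention implementation. After thinking it over for a few days, something still felt off, so I ran several more sets of tests. It turns out that flash attention 2 does deliver a clear improvement: even when there is a bandwidth gap of more than 60%, the speedup from flash attention 2 is enough to pull generation throughput level. In other words, the gain over xformers is at least 40%, and in some scenarios it reaches 70%. Hence this post. I won't go into much analysis and will just list a few conclusions: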

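  1. On the 30 series, setting the vllm environment variable VLLM_ATTENTION_BACKEND=XFORMERS may not actually switch to xformers, because the speed is the same as with fa2, while the improvement relative to the 2080ti is huge
  2. On the 2080ti, the lmdeploy turbomind inference engine achieves the same improvement as fa2
  3. On the 2080ti, lmdeploy behaves differently from my earlier test results: both awq and fp16 improve substantially
  4. On the 3070, however, the previous point does not hold, so the 30 series (and presumably the 40 series and even the later 50 series) should use vllm; vllm appears to have dedicated optimizations for the 30 series and later
  5. Scaled by the hardware parameter ratios, the fa2 improvement is roughly 40-70% (under fp16 and awq quantization respectively)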
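For reference, here is a minimal sketch of how the attention backend from point 1 can be pinned when launching vllm from Python. The environment variable name comes from the tests above; the helper names, the `FLASH_ATTN` value listed alongside `XFORMERS`, and the Hugging Face model id are assumptions for illustration, and the accepted backend names may differ between vllm versions.

```python
import os

# Backend names assumed to be accepted by VLLM_ATTENTION_BACKEND.
# Check the documentation of your vllm version before relying on them.
KNOWN_BACKENDS = {"XFORMERS", "FLASH_ATTN"}


def env_for_backend(backend, base_env=None):
    """Return a copy of the environment with the attention backend pinned."""
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown attention backend: {backend}")
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_ATTENTION_BACKEND"] = backend
    return env


def launch_command(model):
    """Build the command line for the vllm OpenAI-compatible server."""
    return ["vllm", "serve", model]


# Hypothetical model id; the tests above only name "Qwen2.5-3B-Instruct-AWQ".
env = env_for_backend("XFORMERS", base_env={})
cmd = launch_command("Qwen/Qwen2.5-3B-Instruct-AWQ")

# On a machine with vllm installed, the server could then be started with:
#   subprocess.Popen(cmd, env=env_for_backend("XFORMERS"))
```

Because the variable is only a request, it is worth checking the server's startup log to see which backend was actually selected. If the log still reports a flash attention backend, that would be consistent with the suspicion in point 1.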

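Actual hardware parameters, together with the single-request generation throughput (tokens/s) extracted from the test results below: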

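| | tflops | bandwidth | vllm (AWQ) | lmdeploy (AWQ) | vllm (FP16) | lmdeploy (FP16) |
| --- | --- | --- | --- | --- | --- | --- |
| 3070laptop | 29.7 | 319 | 100 | 102 | 191 | 154 |
| 2080ti | 52.5 | 526 | 97 | 176 | 228 | 305 |
| ratio | | 1.65 | 0.97 | 1.73 | 1.19 | 1.98 |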
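The 40-70% figure can be reproduced from the table with a short calculation. The sketch below is one reading of the argument, under two assumptions that are not measured here: single-request generation throughput scales linearly with memory bandwidth, and the 2080ti runs xformers while the 3070 effectively runs fa2. The function and variable names are illustrative.

```python
# Figures copied from the summary table above.
bandwidth = {"3070laptop": 319, "2080ti": 526}
vllm_tps = {
    "awq": {"3070laptop": 100, "2080ti": 97},
    "fp16": {"3070laptop": 191, "2080ti": 228},
}


def implied_fa2_gain(precision):
    """Estimate the fa2 speedup over xformers implied by the table.

    If throughput scaled only with bandwidth, the 2080ti would be
    bw_ratio times faster than the 3070. The shortfall between that
    expectation and the measured ratio is attributed to fa2 on the 3070.
    """
    bw_ratio = bandwidth["2080ti"] / bandwidth["3070laptop"]
    tps = vllm_tps[precision]
    tps_ratio = tps["2080ti"] / tps["3070laptop"]
    return bw_ratio / tps_ratio - 1.0


for precision in ("fp16", "awq"):
    print(f"{precision}: {implied_fa2_gain(precision):.0%}")
```

With the table's figures this comes out at roughly 38% for fp16 and 70% for awq, in line with the 40-70% range quoted above.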

Test Results

AWQ Test Results

3070 laptop

vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 100.00 | 951.17 | 0.05 | 0.05 |
| 2 | 196.12 | 885.67 | 0.10 | 0.10 |
| 4 | 378.49 | 1568.36 | 0.10 | 0.12 |
| 8 | 714.94 | 2434.35 | 0.13 | 0.15 |
| 16 | 1211.09 | 2184.59 | 0.17 | 0.33 |
| 32 | 1682.82 | 3098.94 | 0.14 | 0.47 |

Input Tokens: 5638
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 72.75 | 3724.02 | 1.52 | 1.52 |
| 2 | 65.69 | 3461.59 | 2.03 | 3.24 |
| 4 | 118.01 | 3355.38 | 2.01 | 6.72 |
| 8 | 129.15 | 3542.59 | 1.94 | 12.66 |

vllm fa2

Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.40 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 101.52 | 1106.82 | 0.04 | 0.04 |
| 2 | 199.29 | 1326.93 | 0.07 | 0.07 |
| 4 | 397.01 | 2322.60 | 0.08 | 0.08 |
| 8 | 744.77 | 2998.94 | 0.04 | 0.12 |
| 16 | 1280.79 | 3332.02 | 0.08 | 0.22 |
| 32 | 1880.32 | 3579.14 | 0.05 | 0.40 |

Input Tokens: 5519
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 3.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 65.94 | 2966.57 | 1.89 | 1.89 |
| 2 | 113.30 | 3612.69 | 1.85 | 3.11 |
| 4 | 58.97 | 3635.87 | 1.85 | 6.16 |
| 8 | 53.07 | 3576.62 | 1.87 | 12.53 |

lmdeploy turbomind

Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 3.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 102.33 | 1162.42 | 0.04 | 0.04 |
| 2 | 202.70 | 1441.08 | 0.03 | 0.07 |
| 4 | 417.91 | 1724.37 | 0.03 | 0.11 |
| 8 | 793.40 | 2305.94 | 0.03 | 0.16 |
| 16 | 1365.04 | 3255.68 | 0.03 | 0.22 |
| 32 | 1939.21 | 3398.90 | 0.03 | 0.43 |

Input Tokens: 5622
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 26.89 | 4419.59 | 1.29 | 1.29 |
| 2 | 129.84 | 4403.95 | 1.29 | 2.52 |
| 4 | 145.21 | 4466.76 | 1.30 | 5.02 |
| 8 | 172.47 | 4586.10 | 1.33 | 9.78 |

2080ti

vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 97.29 | 716.20 | 0.06 | 0.06 |
| 2 | 189.25 | 787.34 | 0.11 | 0.12 |
| 4 | 368.60 | 1630.50 | 0.11 | 0.11 |
| 8 | 673.36 | 2310.67 | 0.11 | 0.16 |
| 16 | 1132.99 | 2949.96 | 0.12 | 0.25 |
| 32 | 1561.65 | 4154.17 | 0.12 | 0.35 |
| 64 | 1653.97 | 4995.70 | 0.14 | 0.58 |
| 128 | 1795.95 | 5930.27 | 0.13 | 0.97 |

Input Tokens: 5648
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 64.25 | 6055.06 | 0.93 | 0.93 |
| 2 | 49.60 | 6011.82 | 0.99 | 1.84 |
| 4 | 187.93 | 6402.48 | 1.48 | 3.52 |
| 8 | 159.36 | 6402.51 | 5.11 | 7.04 |
| 16 | 178.05 | 6434.66 | 4.73 | 13.92 |

lmdeploy turbomind

Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 176.51 | 2713.72 | 0.02 | 0.02 |
| 2 | 350.82 | 3556.20 | 0.02 | 0.03 |
| 4 | 686.61 | 3581.65 | 0.02 | 0.05 |
| 8 | 1327.60 | 4906.68 | 0.02 | 0.08 |
| 16 | 2284.32 | 4964.15 | 0.02 | 0.15 |
| 32 | 3307.78 | 6052.02 | 0.02 | 0.24 |
| 64 | 3933.59 | 5430.29 | 0.01 | 0.53 |
| 128 | 4333.86 | 6620.12 | 0.01 | 0.87 |

Input Tokens: 5539
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 61.08 | 5805.73 | 0.97 | 0.97 |
| 2 | 98.14 | 6489.22 | 0.91 | 1.73 |
| 4 | 185.56 | 6924.31 | 0.92 | 3.24 |
| 8 | 195.81 | 7054.24 | 0.96 | 6.37 |
| 16 | 257.38 | 7292.93 | 0.95 | 12.28 |

FP16 Test Results

3070 laptop

vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 4.00 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 191.58 | 396.31 | 0.12 | 0.12 |
| 2 | 188.85 | 1373.13 | 0.07 | 0.07 |
| 4 | 567.70 | 2502.72 | 0.07 | 0.08 |
| 8 | 1079.86 | 4669.36 | 0.07 | 0.08 |
| 16 | 1677.36 | 5520.68 | 0.06 | 0.13 |
| 32 | 2931.42 | 8041.26 | 0.13 | 0.18 |
| 64 | 4236.04 | 9555.88 | 0.11 | 0.31 |

Input Tokens: 5633
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.60 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 103.62 | 23042.82 | 0.25 | 0.25 |
| 2 | 50.94 | 23025.27 | 0.28 | 0.49 |
| 4 | 194.34 | 23268.16 | 0.32 | 0.97 |
| 8 | 369.72 | 23895.54 | 0.32 | 1.88 |
| 16 | 563.25 | 24113.27 | 0.34 | 3.74 |

vllm fa2

Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 3.60 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 187.29 | 316.02 | 0.15 | 0.15 |
| 2 | 194.00 | 1993.44 | 0.04 | 0.05 |
| 4 | 613.77 | 2352.95 | 0.03 | 0.08 |
| 8 | 653.82 | 5531.07 | 0.05 | 0.07 |
| 16 | 2119.64 | 5437.47 | 0.03 | 0.14 |
| 32 | 3341.71 | 8709.83 | 0.09 | 0.17 |
| 64 | 4112.55 | 9600.50 | 0.06 | 0.30 |

Input Tokens: 5581
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.60 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 129.81 | 23107.58 | 0.24 | 0.24 |
| 2 | 133.18 | 23550.33 | 0.28 | 0.48 |
| 4 | 124.44 | 24315.44 | 0.28 | 0.93 |
| 8 | 375.86 | 23998.11 | 0.30 | 1.87 |
| 16 | 434.18 | 24096.61 | 0.33 | 3.74 |

lmdeploy turbomind

Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 154.77 | 2238.66 | 0.02 | 0.02 |
| 2 | 157.46 | 2653.18 | 0.02 | 0.04 |
| 4 | 309.74 | 2791.23 | 0.02 | 0.07 |
| 8 | 578.12 | 6417.18 | 0.02 | 0.06 |
| 16 | 1439.22 | 6484.35 | 0.01 | 0.11 |
| 32 | 2814.97 | 9551.40 | 0.02 | 0.15 |
| 64 | 4251.23 | 10272.75 | 0.02 | 0.28 |

Input Tokens: 5657
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 3.00 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 123.73 | 19862.84 | 0.29 | 0.29 |
| 2 | 107.36 | 21603.39 | 0.29 | 0.53 |
| 4 | 134.85 | 22812.97 | 0.31 | 0.99 |
| 8 | 335.80 | 23905.97 | 0.30 | 1.89 |
| 16 | 288.97 | 24434.42 | 0.34 | 3.68 |

2080ti

vllm xformers
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 228.98 | 1092.87 | 0.04 | 0.04 |
| 2 | 423.88 | 1288.86 | 0.07 | 0.07 |
| 4 | 653.95 | 2368.37 | 0.07 | 0.08 |
| 8 | 1051.37 | 4368.51 | 0.08 | 0.08 |
| 16 | 2127.28 | 6250.58 | 0.07 | 0.12 |
| 32 | 3203.85 | 8748.44 | 0.10 | 0.17 |
| 64 | 3942.46 | 10506.13 | 0.09 | 0.28 |
| 128 | 4603.73 | 18207.47 | 0.09 | 0.32 |

Input Tokens: 5581
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 27.15 | 29227.07 | 0.19 | 0.19 |
| 2 | 171.79 | 31110.00 | 0.23 | 0.36 |
| 4 | 340.59 | 34868.82 | 0.24 | 0.65 |
| 8 | 213.44 | 36036.22 | 0.25 | 1.24 |
| 16 | 377.33 | 36375.68 | 0.32 | 2.48 |

lmdeploy turbomind

Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 305.94 | 4199.33 | 0.01 | 0.01 |
| 2 | 341.57 | 6731.39 | 0.01 | 0.02 |
| 4 | 911.32 | 7832.42 | 0.01 | 0.02 |
| 8 | 1551.95 | 7321.06 | 0.02 | 0.05 |
| 16 | 3052.74 | 5795.36 | 0.01 | 0.13 |
| 32 | 4666.90 | 13364.55 | 0.01 | 0.11 |
| 64 | 7360.47 | 13220.15 | 0.02 | 0.22 |
| 128 | 9013.05 | 14792.20 | 0.02 | 0.39 |

Input Tokens: 5622
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.40 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 244.81 | 18598.96 | 0.31 | 0.31 |
| 2 | 271.79 | 28980.70 | 0.29 | 0.39 |
| 4 | 158.41 | 30582.23 | 0.26 | 0.73 |
| 8 | 659.52 | 35179.13 | 0.26 | 1.28 |
| 16 | 501.57 | 39903.70 | 0.30 | 2.26 |