Re-evaluating: The True Power of Flash Attention 2

This article revisits the performance difference of vllm when using xformers versus flash attention 2 as the backend attention mechanism.

Previously, I wrote an article comparing the performance of vllm with xformers and flash attention 2 as the backend attention mechanism. After a few days of reconsideration, something felt off, so I ran several more tests and found that flash attention 2 does indeed provide a significant boost. Even with a memory-bandwidth gap of over 60% between the two cards, the acceleration from flash attention 2 can pull the generation throughput level. This means the improvement over xformers is at least 40%, and in some scenarios reaches 70%. Hence, this article presents several conclusions without extensive analysis:

  1. For the 30 series, setting the vllm environment variable VLLM_ATTENTION_BACKEND=XFORMERS may not actually switch to xformers, since the measured speed is identical to fa2. Either way, there is a significant improvement compared to the 2080ti.
  2. On the 2080ti, the lmdeploy turbomind inference engine achieves the same performance boost as fa2.
  3. The lmdeploy results on the 2080ti differ from my previous tests, showing significant improvements for both awq and fp16.
  4. The 3070, however, behaves differently from the above, so the 30 series (and presumably the 40 and 50 series) should use vllm, which seems to have optimizations specific to the 30 series and later.
  5. Depending on the parameter ratios, fa2 provides an improvement of roughly 40% for fp16 and roughly 70% for awq quantization.
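
For reference, the backend switch mentioned in conclusion 1 is just an environment variable that must be set before the vllm process starts; a minimal launch sketch (the serve command varies by vllm version, and all other flags are omitted):

```shell
# Force the xformers backend; use FLASH_ATTN to select flash attention 2.
# The variable is read when vllm initializes its attention layers, so it
# must be exported before the server starts.
export VLLM_ATTENTION_BACKEND=XFORMERS
vllm serve Qwen/Qwen2.5-3B-Instruct-AWQ
```

On older vllm versions, `python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-3B-Instruct-AWQ` is the equivalent entry point.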

The hardware parameters and the single-request generation throughput extracted from the test results below (the throughput columns are tokens/s at concurrency 1, for the awq and fp16 tests respectively):

|  | tflops | bandwidth (GB/s) | vllm (awq) | lmdeploy (awq) | vllm (fp16) | lmdeploy (fp16) |
| --- | --- | --- | --- | --- | --- | --- |
| 3070 laptop | 29.7 | 319 | 100 | 102 | 191 | 154 |
| 2080ti | 52.5 | 526 | 97 | 176 | 228 | 305 |
| ratio (2080ti / 3070) |  | 1.65 | 0.97 | 1.73 | 1.19 | 1.98 |
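
The 40-70% figure in conclusion 5 can be reproduced from the ratio row above: if single-request decode throughput scales roughly with memory bandwidth (a simplifying assumption of mine, not something the tests measure directly), then the 3070 laptop nearly matching the 2080ti implies an fa2 speedup equal to the bandwidth ratio divided by the observed throughput ratio:

```python
# Numbers taken from the summary table; the assumption that decode
# throughput scales with memory bandwidth is a simplification used
# only for this estimate.
bandwidth_ratio = 526 / 319          # 2080ti / 3070 laptop, ~1.65

# Single-request vllm generation-throughput ratios from the table
awq_ratio = 97 / 100                 # 0.97
fp16_ratio = 228 / 191               # ~1.19

# Implied fa2 speedup over xformers on the same hardware
awq_speedup = bandwidth_ratio / awq_ratio    # ~1.70, i.e. ~+70%
fp16_speedup = bandwidth_ratio / fp16_ratio  # ~1.38, i.e. ~+40%

print(f"awq: +{(awq_speedup - 1) * 100:.0f}%, fp16: +{(fp16_speedup - 1) * 100:.0f}%")
```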

Test Results

AWQ Test Results

3070 laptop

vllm xformer
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 100.00 | 951.17 | 0.05 | 0.05 |
| 2 | 196.12 | 885.67 | 0.10 | 0.10 |
| 4 | 378.49 | 1568.36 | 0.10 | 0.12 |
| 8 | 714.94 | 2434.35 | 0.13 | 0.15 |
| 16 | 1211.09 | 2184.59 | 0.17 | 0.33 |
| 32 | 1682.82 | 3098.94 | 0.14 | 0.47 |

Input Tokens: 5638
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 72.75 | 3724.02 | 1.52 | 1.52 |
| 2 | 65.69 | 3461.59 | 2.03 | 3.24 |
| 4 | 118.01 | 3355.38 | 2.01 | 6.72 |
| 8 | 129.15 | 3542.59 | 1.94 | 12.66 |

vllm fa2
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.40 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 101.52 | 1106.82 | 0.04 | 0.04 |
| 2 | 199.29 | 1326.93 | 0.07 | 0.07 |
| 4 | 397.01 | 2322.60 | 0.08 | 0.08 |
| 8 | 744.77 | 2998.94 | 0.04 | 0.12 |
| 16 | 1280.79 | 3332.02 | 0.08 | 0.22 |
| 32 | 1880.32 | 3579.14 | 0.05 | 0.40 |

Input Tokens: 5519
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 3.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 65.94 | 2966.57 | 1.89 | 1.89 |
| 2 | 113.30 | 3612.69 | 1.85 | 3.11 |
| 4 | 58.97 | 3635.87 | 1.85 | 6.16 |
| 8 | 53.07 | 3576.62 | 1.87 | 12.53 |

lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 3.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 102.33 | 1162.42 | 0.04 | 0.04 |
| 2 | 202.70 | 1441.08 | 0.03 | 0.07 |
| 4 | 417.91 | 1724.37 | 0.03 | 0.11 |
| 8 | 793.40 | 2305.94 | 0.03 | 0.16 |
| 16 | 1365.04 | 3255.68 | 0.03 | 0.22 |
| 32 | 1939.21 | 3398.90 | 0.03 | 0.43 |

Input Tokens: 5622
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 2.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 26.89 | 4419.59 | 1.29 | 1.29 |
| 2 | 129.84 | 4403.95 | 1.29 | 2.52 |
| 4 | 145.21 | 4466.76 | 1.30 | 5.02 |
| 8 | 172.47 | 4586.10 | 1.33 | 9.78 |

2080ti

vllm xformer
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 97.29 | 716.20 | 0.06 | 0.06 |
| 2 | 189.25 | 787.34 | 0.11 | 0.12 |
| 4 | 368.60 | 1630.50 | 0.11 | 0.11 |
| 8 | 673.36 | 2310.67 | 0.11 | 0.16 |
| 16 | 1132.99 | 2949.96 | 0.12 | 0.25 |
| 32 | 1561.65 | 4154.17 | 0.12 | 0.35 |
| 64 | 1653.97 | 4995.70 | 0.14 | 0.58 |
| 128 | 1795.95 | 5930.27 | 0.13 | 0.97 |

Input Tokens: 5648
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 64.25 | 6055.06 | 0.93 | 0.93 |
| 2 | 49.60 | 6011.82 | 0.99 | 1.84 |
| 4 | 187.93 | 6402.48 | 1.48 | 3.52 |
| 8 | 159.36 | 6402.51 | 5.11 | 7.04 |
| 16 | 178.05 | 6434.66 | 4.73 | 13.92 |

lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 176.51 | 2713.72 | 0.02 | 0.02 |
| 2 | 350.82 | 3556.20 | 0.02 | 0.03 |
| 4 | 686.61 | 3581.65 | 0.02 | 0.05 |
| 8 | 1327.60 | 4906.68 | 0.02 | 0.08 |
| 16 | 2284.32 | 4964.15 | 0.02 | 0.15 |
| 32 | 3307.78 | 6052.02 | 0.02 | 0.24 |
| 64 | 3933.59 | 5430.29 | 0.01 | 0.53 |
| 128 | 4333.86 | 6620.12 | 0.01 | 0.87 |

Input Tokens: 5539
Output Tokens: 512
Test Model: Qwen2.5-3B-Instruct-AWQ
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 61.08 | 5805.73 | 0.97 | 0.97 |
| 2 | 98.14 | 6489.22 | 0.91 | 1.73 |
| 4 | 185.56 | 6924.31 | 0.92 | 3.24 |
| 8 | 195.81 | 7054.24 | 0.96 | 6.37 |
| 16 | 257.38 | 7292.93 | 0.95 | 12.28 |

FP16 Test Results

3070 laptop

vllm xformer
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 4.00 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 191.58 | 396.31 | 0.12 | 0.12 |
| 2 | 188.85 | 1373.13 | 0.07 | 0.07 |
| 4 | 567.70 | 2502.72 | 0.07 | 0.08 |
| 8 | 1079.86 | 4669.36 | 0.07 | 0.08 |
| 16 | 1677.36 | 5520.68 | 0.06 | 0.13 |
| 32 | 2931.42 | 8041.26 | 0.13 | 0.18 |
| 64 | 4236.04 | 9555.88 | 0.11 | 0.31 |

Input Tokens: 5633
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.60 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 103.62 | 23042.82 | 0.25 | 0.25 |
| 2 | 50.94 | 23025.27 | 0.28 | 0.49 |
| 4 | 194.34 | 23268.16 | 0.32 | 0.97 |
| 8 | 369.72 | 23895.54 | 0.32 | 1.88 |
| 16 | 563.25 | 24113.27 | 0.34 | 3.74 |

vllm fa2
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 3.60 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 187.29 | 316.02 | 0.15 | 0.15 |
| 2 | 194.00 | 1993.44 | 0.04 | 0.05 |
| 4 | 613.77 | 2352.95 | 0.03 | 0.08 |
| 8 | 653.82 | 5531.07 | 0.05 | 0.07 |
| 16 | 2119.64 | 5437.47 | 0.03 | 0.14 |
| 32 | 3341.71 | 8709.83 | 0.09 | 0.17 |
| 64 | 4112.55 | 9600.50 | 0.06 | 0.30 |

Input Tokens: 5581
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.60 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 129.81 | 23107.58 | 0.24 | 0.24 |
| 2 | 133.18 | 23550.33 | 0.28 | 0.48 |
| 4 | 124.44 | 24315.44 | 0.28 | 0.93 |
| 8 | 375.86 | 23998.11 | 0.30 | 1.87 |
| 16 | 434.18 | 24096.61 | 0.33 | 3.74 |

lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 2.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 154.77 | 2238.66 | 0.02 | 0.02 |
| 2 | 157.46 | 2653.18 | 0.02 | 0.04 |
| 4 | 309.74 | 2791.23 | 0.02 | 0.07 |
| 8 | 578.12 | 6417.18 | 0.02 | 0.06 |
| 16 | 1439.22 | 6484.35 | 0.01 | 0.11 |
| 32 | 2814.97 | 9551.40 | 0.02 | 0.15 |
| 64 | 4251.23 | 10272.75 | 0.02 | 0.28 |

Input Tokens: 5657
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 3.00 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 123.73 | 19862.84 | 0.29 | 0.29 |
| 2 | 107.36 | 21603.39 | 0.29 | 0.53 |
| 4 | 134.85 | 22812.97 | 0.31 | 0.99 |
| 8 | 335.80 | 23905.97 | 0.30 | 1.89 |
| 16 | 288.97 | 24434.42 | 0.34 | 3.68 |

2080ti

vllm xformer
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 228.98 | 1092.87 | 0.04 | 0.04 |
| 2 | 423.88 | 1288.86 | 0.07 | 0.07 |
| 4 | 653.95 | 2368.37 | 0.07 | 0.08 |
| 8 | 1051.37 | 4368.51 | 0.08 | 0.08 |
| 16 | 2127.28 | 6250.58 | 0.07 | 0.12 |
| 32 | 3203.85 | 8748.44 | 0.10 | 0.17 |
| 64 | 3942.46 | 10506.13 | 0.09 | 0.28 |
| 128 | 4603.73 | 18207.47 | 0.09 | 0.32 |

Input Tokens: 5581
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 27.15 | 29227.07 | 0.19 | 0.19 |
| 2 | 171.79 | 31110.00 | 0.23 | 0.36 |
| 4 | 340.59 | 34868.82 | 0.24 | 0.65 |
| 8 | 213.44 | 36036.22 | 0.25 | 1.24 |
| 16 | 377.33 | 36375.68 | 0.32 | 2.48 |

lmdeploy turbomind
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.80 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 305.94 | 4199.33 | 0.01 | 0.01 |
| 2 | 341.57 | 6731.39 | 0.01 | 0.02 |
| 4 | 911.32 | 7832.42 | 0.01 | 0.02 |
| 8 | 1551.95 | 7321.06 | 0.02 | 0.05 |
| 16 | 3052.74 | 5795.36 | 0.01 | 0.13 |
| 32 | 4666.90 | 13364.55 | 0.01 | 0.11 |
| 64 | 7360.47 | 13220.15 | 0.02 | 0.22 |
| 128 | 9013.05 | 14792.20 | 0.02 | 0.39 |

Input Tokens: 5622
Output Tokens: 512
Test Model: Qwen2.5-0.5B-Instruct
Latency: 1.40 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
| --- | --- | --- | --- | --- |
| 1 | 244.81 | 18598.96 | 0.31 | 0.31 |
| 2 | 271.79 | 28980.70 | 0.29 | 0.39 |
| 4 | 158.41 | 30582.23 | 0.26 | 0.73 |
| 8 | 659.52 | 35179.13 | 0.26 | 1.28 |
| 16 | 501.57 | 39903.70 | 0.30 | 2.26 |