LLM Inference Engine Throughput Comparison: VLLM | SGLang | LMDeploy

This article compares the throughput of three LLM inference engines, VLLM, SGLang, and LMDeploy, in a short-input, long-output scenario. The unit is output tokens/s; the remaining test parameters are listed after the table.

| Concurrency | VLLM 0.6.1.post2 | VLLM 0.6.3.post1 | LMDeploy 0.6.0a0 | LMDeploy 0.6.2 | SGLang 0.3.4.post2 | SGLang 0.3.4.post2 (--disable-cuda-graph) |
|---|---|---|---|---|---|---|
| 1 | 28.73 | 28.76 | 56.19 | 57.24 | 37.23 | 29.96 |
| 2 | 71.53 | 73.26 | 113.12 | 113.48 | 73.59 | 58.28 |
| 4 | 133.38 | 136.05 | 205.51 | 199.01 | 136.73 | 111.24 |
| 8 | 246.14 | 251.59 | 398.73 | 393.48 | 258.21 | 215.53 |
| 16 | 394.25 | 401.67 | 704.69 | 709.27 | 461.89 | 444.48 |
| 32 | 480.26 | 481.75 | 967.34 | 973.24 | 562.36 | 557.93 |
| 64 | 520.11 | 526.01 | 1119.22 | 1123.07 | 594.03 | 602.36 |
| 128 | 479.02 | 481.63 | 989.14 | 890.44 | 534.69 | 582.97 |
  • Test model: Qwen2.5-14B-Instruct-AWQ
  • Hardware: E5 2680v4 + 2080ti 22G × 1
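The article does not say which benchmark tool produced these numbers, but the measurement itself is simple: fire N parallel request streams at the engine and divide the total output tokens by the wall-clock time. Below is a minimal sketch of such a concurrency sweep; the `generate` function is a stub standing in for a real call to the engine's API, and the token count and sleep duration are placeholder values, not figures from this test.

```python
import concurrent.futures
import time

def generate(prompt: str) -> int:
    """Stub for one engine call; returns the number of output tokens.
    A real benchmark would call the engine's serving API here instead."""
    time.sleep(0.01)   # placeholder for decode time
    return 256         # placeholder output length (short input, long output)

def measure_throughput(concurrency: int, requests_per_worker: int = 4) -> float:
    """Run `concurrency` parallel streams and report aggregate output tokens/s."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(generate, "hi")
                   for _ in range(concurrency * requests_per_worker)]
        total_tokens = sum(f.result() for f in futures)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

if __name__ == "__main__":
    for c in (1, 2, 4, 8):
        print(f"concurrency={c}: {measure_throughput(c):.2f} output tokens/s")
```

With the stub in place the numbers are meaningless; the point is the shape of the loop, which matches the concurrency column in the table above.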

![Throughput comparison chart](/images/Pasted%20image%2020241123103935.png)
![Throughput comparison chart](/images/Pasted%20image%2020241123103943.png)