
Large Language Model Inference Framework Throughput Comparison: VLLM | SGLang | LMDeploy

This article compares the throughput of three large language model inference engines (VLLM, SGLang, and LMDeploy) in a short-input, long-output scenario, measured in output tokens per second. The versions tested are listed in the table below, the test setup follows it, and a sketch of a measurement client appears after the setup.


| Concurrency | VLLM 0.6.1.post2 | VLLM 0.6.3.post1 | LMDeploy 0.6.0a0 | LMDeploy 0.6.2 | SGLang 0.3.4.post2 | SGLang 0.3.4.post2 (--disable-cuda-graph) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 28.73 | 28.76 | 56.19 | 57.24 | 37.23 | 29.96 |
| 2 | 71.53 | 73.26 | 113.12 | 113.48 | 73.59 | 58.28 |
| 4 | 133.38 | 136.05 | 205.51 | 199.01 | 136.73 | 111.24 |
| 8 | 246.14 | 251.59 | 398.73 | 393.48 | 258.21 | 215.53 |
| 16 | 394.25 | 401.67 | 704.69 | 709.27 | 461.89 | 444.48 |
| 32 | 480.26 | 481.75 | 967.34 | 973.24 | 562.36 | 557.93 |
| 64 | 520.11 | 526.01 | 1119.22 | 1123.07 | 594.03 | 602.36 |
| 128 | 479.02 | 481.63 | 989.14 | 890.44 | 534.69 | 582.97 |
  • Test Model: Qwen2.5-14B-Instruct-AWQ
  • Hardware: Intel Xeon E5-2680 v4 + 1× RTX 2080 Ti 22 GB
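
A measurement of this shape can be reproduced with a small client script. The sketch below is an assumption about the harness, not the article's actual tooling: it targets an OpenAI-compatible `/v1/completions` endpoint (all three engines can serve one), and the server address, prompt, `max_tokens`, and concurrency value are illustrative. It fires a fixed number of concurrent requests and reports aggregate output tokens per second.

```python
"""Minimal throughput-benchmark sketch, assuming an OpenAI-compatible
/v1/completions endpoint; addresses and parameters are illustrative."""
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://127.0.0.1:8000"   # assumed server address
MODEL = "Qwen2.5-14B-Instruct-AWQ"   # model from the test above
CONCURRENCY = 16                     # one of the table's concurrency levels
PROMPT = "Write a long story about a robot."  # short input, long output

def one_request(_: int) -> int:
    """Send one completion request and return its output-token count."""
    resp = requests.post(
        f"{BASE_URL}/v1/completions",
        json={"model": MODEL, "prompt": PROMPT, "max_tokens": 1024},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    # Keep CONCURRENCY requests in flight (one batch per worker here).
    token_counts = list(pool.map(one_request, range(CONCURRENCY)))
elapsed = time.perf_counter() - start

print(f"output tokens/s: {sum(token_counts) / elapsed:.2f}")
```

Note that a single batch of requests only approximates steady-state throughput; a real run would issue many rounds per concurrency level and discard warm-up iterations.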

![Throughput comparison chart 1](/ob/static/images/Pasted%20image%2020241123103935.png)
![Throughput comparison chart 2](/ob/static/images/Pasted%20image%2020241123103943.png)