LLM API Performance Evaluation Tool Guide

This tool benchmarks LLM API performance, including prefill speed, decoding speed, Time-to-First-Token (TTFT), and overall latency. The code is open source on GitHub. This article provides a quick-start guide for Linux and Windows, with command examples for downloading, configuring, and running the tool.
Project Repository: https://github.com/Yoosu-L/llmapibenchmark
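As background, the sketch below illustrates one common way these metrics are derived from a streaming, OpenAI-compatible chat completion: TTFT is the wall-clock time until the first content token arrives, and decoding speed is the token count divided by the remaining stream time. This is a simplified illustration only, not the tool's actual implementation; the endpoint, key, and model name are placeholders.

```python
import json
import time
import requests

# Placeholder endpoint/model -- substitute your own values.
BASE_URL = "https://your-api-endpoint.com/v1"
payload = {
    "model": "your-model",
    "messages": [{"role": "user", "content": "Tell me a long story."}],
    "max_tokens": 512,
    "stream": True,
}

start = time.time()
ttft = None
tokens = 0
with requests.post(f"{BASE_URL}/chat/completions", json=payload,
                   headers={"Authorization": "Bearer YOUR_API_KEY"},
                   stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        if json.loads(data)["choices"][0]["delta"].get("content"):
            if ttft is None:
                ttft = time.time() - start  # time to first token
            tokens += 1  # each streamed chunk is roughly one token

total = time.time() - start
# Decoding speed excludes the wait for the first token, which mostly
# reflects prefill and queueing; assumes at least one token arrived.
print(f"TTFT: {ttft:.2f}s, decode speed: {tokens / (total - ttft):.1f} tokens/s")
```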
🚀 Quick Start
🐧 Linux
Download and extract the package
wget https://github.com/Yoosu-L/llmapibenchmark/releases/latest/download/llmapibenchmark_linux_amd64.tar.gz
tar -zxvf llmapibenchmark_linux_amd64.tar.gz
This command primarily tests the maximum achievable decoding speed (generation throughput) in a short-input, long-output scenario. The concurrency levels default to 1,2,4,8,16,32,64,128.
# Replace `base-url` with your API service URL (ending with /v1).
./llmapibenchmark_linux_amd64 --base-url https://your-api-endpoint.com/v1
Next, test the maximum prefill speed in a long-input, long-output scenario. You can adjust the num-words and concurrency levels as needed.
# Replace `base-url` with your API service URL (ending with /v1).
./llmapibenchmark_linux_amd64 --base-url https://your-api-endpoint.com/v1 --num-words 6000 --concurrency 1,2,4,8,16,32
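Before kicking off a long run, it can help to confirm that the endpoint actually speaks the OpenAI-compatible API the tool expects. A minimal check (placeholder URL and key) queries the standard /v1/models listing, which is also where a model can be discovered when --model is omitted (see the parameter table below):

```python
import requests

# Placeholder values -- substitute your endpoint and key.
BASE_URL = "https://your-api-endpoint.com/v1"
resp = requests.get(f"{BASE_URL}/models",
                    headers={"Authorization": "Bearer YOUR_API_KEY"},
                    timeout=10)
resp.raise_for_status()
# Any OpenAI-compatible server lists its available models here.
print([m["id"] for m in resp.json()["data"]])
```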
🪟 Windows
Download the latest version from the releases page.
After extracting, you will get llmapibenchmark_windows_amd64.exe.
This command primarily tests the maximum achievable decoding speed (generation throughput) in a short-input, long-output scenario. The concurrency levels default to 1,2,4,8,16,32,64,128.
# Replace `base-url` with your API service URL (ending with /v1).
llmapibenchmark_windows_amd64.exe --base-url https://your-api-endpoint.com/v1
Next, test the maximum prefill speed in a long-input, long-output scenario. You can adjust the num-words and concurrency levels as needed.
# Replace `base-url` with your API service URL (ending with /v1).
llmapibenchmark_windows_amd64.exe --base-url https://your-api-endpoint.com/v1 --num-words 6000 --concurrency 1,2,4,8,16,32
📊 Example Output
⌨️ Real-time Terminal Output
################################################################################################################
LLM API Throughput Benchmark
https://github.com/Yoosu-L/llmapibenchmark
Time:2024-12-03 03:11:48 UTC+0
################################################################################################################
Input Tokens: 45
Output Tokens: 512
Test Model: qwen2.5:0.5b
Latency: 0.00 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|-------------|----------------------------------|-------------------------------|--------------|--------------|
| 1 | 31.88 | 976.60 | 0.05 | 0.05 |
| 2 | 30.57 | 565.40 | 0.07 | 0.16 |
| 4 | 31.00 | 717.96 | 0.11 | 0.25 |
📄 Markdown File
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-7B-Instruct-AWQ
Latency: 2.20 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 58.49 | 846.81 | 0.05 | 0.05 |
| 2 | 114.09 | 989.94 | 0.08 | 0.09 |
| 4 | 222.62 | 1193.99 | 0.11 | 0.15 |
| 8 | 414.35 | 1479.76 | 0.11 | 0.24 |
| 16 | 752.26 | 1543.29 | 0.13 | 0.47 |
| 32 | 653.94 | 1625.07 | 0.14 | 0.89 |
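Reading the table: the throughput columns appear to be aggregates across all concurrent requests (the near-linear scaling from concurrency 1 to 8 is consistent with this), so per-request decoding speed is roughly the aggregate divided by the concurrency level. A quick back-of-the-envelope from the figures above:

```python
# Per-request decoding speed, assuming the reported Generation Throughput
# is the aggregate across all concurrent requests (inferred from the
# near-linear scaling in the table above).
rows = {1: 58.49, 2: 114.09, 4: 222.62, 8: 414.35, 16: 752.26, 32: 653.94}
for concurrency, aggregate in rows.items():
    print(f"c={concurrency:>2}: ~{aggregate / concurrency:.1f} tokens/s per request")
```

On this server, per-request speed holds up through concurrency 16 (~47 tokens/s) and then drops sharply at 32 (~20 tokens/s), which marks the saturation point.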
📝 JSON (enable with --format json)
{
  "model_name": "Qwen2.5-7B-Instruct-AWQ",
  "input_tokens": 32,
  "output_tokens": 512,
  "latency": 1,
  "results": [
    {
      "concurrency": 1,
      "generation_speed": 118.18,
      "prompt_throughput": 42.69,
      "max_ttft": 0.61,
      "min_ttft": 0.61
    },
    {
      "concurrency": 2,
      "generation_speed": 214.37,
      "prompt_throughput": 48.64,
      "max_ttft": 1.07,
      "min_ttft": 0.42
    }
  ]
}
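The JSON form is convenient for post-processing. A minimal sketch that loads a report and prints a per-concurrency summary (field names follow the example above; the file name is a placeholder, assuming the output has been saved to disk):

```python
import json

# Load a benchmark report produced with --format json (placeholder file name).
with open("benchmark.json") as f:
    report = json.load(f)

print(f"model: {report['model_name']}  "
      f"in/out tokens: {report['input_tokens']}/{report['output_tokens']}")
for r in report["results"]:
    print(f"c={r['concurrency']:>3}  gen={r['generation_speed']:7.2f} tok/s  "
          f"prompt={r['prompt_throughput']:7.2f} tok/s  "
          f"ttft={r['min_ttft']:.2f}-{r['max_ttft']:.2f}s")
```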
🗒️ YAML (enable with --format yaml)
model-name: Qwen2.5-7B-Instruct-AWQ
input-tokens: 32
output-tokens: 512
latency: 1.6
results:
  - concurrency: 1
    generation-speed: 134.28
    prompt-throughput: 59.31
    max-ttft: 0.44
    min-ttft: 0.44
  - concurrency: 2
    generation-speed: 221.23
    prompt-throughput: 51.06
    max-ttft: 1.02
    min-ttft: 0.47
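The YAML form parses just as easily; a minimal sketch using PyYAML (a third-party package, pip install pyyaml; the file name is a placeholder). Note that the YAML keys use dashes (generation-speed) where the JSON uses underscores:

```python
import yaml  # PyYAML, a third-party package

with open("benchmark.yaml") as f:
    report = yaml.safe_load(f)

# Keys use dashes in the YAML output, per the example above.
for r in report["results"]:
    print(f"c={r['concurrency']}: {r['generation-speed']} tok/s")
```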
⚙️ Advanced Parameters
Linux:
./llmapibenchmark_linux_amd64 \
--base-url https://your-api-endpoint.com/v1 \
--api-key YOUR_API_KEY \
--model gpt-3.5-turbo \
--concurrency 1,2,4,8,16 \
--max-tokens 512 \
--num-words 513 \
--prompt "Your custom prompt here" \
--format json
Windows:
llmapibenchmark_windows_amd64.exe ^
--base-url https://your-api-endpoint.com/v1 ^
--api-key YOUR_API_KEY ^
--model gpt-3.5-turbo ^
--concurrency 1,2,4,8,16 ^
--max-tokens 512 ^
--num-words 513 ^
--prompt "Your custom prompt here" ^
--format json
📋 Parameter Description
| Parameter | Short | Description | Default | Required |
|---|---|---|---|---|
| --base-url | -u | Base URL for LLM API endpoint | Empty (MUST be specified) | Yes |
| --api-key | -k | API authentication key | None | No |
| --model | -m | Specific AI model to test | Automatically discovers first available model | No |
| --concurrency | -c | Comma-separated concurrency levels to test | 1,2,4,8,16,32,64,128 | No |
| --max-tokens | -t | Maximum tokens to generate per request | 512 | No |
| --num-words | -n | Number of words for random input prompt | 0 | No |
| --prompt | -p | Text prompt for generating responses | A long story | No |
| --format | -f | Output format (json, yaml) | "" | No |
| --help | -h | Show help message | false | No |
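Finally, these flags compose well with a small wrapper when comparing several providers. A sketch that runs the benchmark back-to-back against a list of endpoints, using only the documented flags (the endpoint list and binary path are placeholders):

```python
import subprocess

# Placeholder endpoints -- substitute the APIs you want to compare.
ENDPOINTS = [
    "https://endpoint-a.example.com/v1",
    "https://endpoint-b.example.com/v1",
]

for url in ENDPOINTS:
    # Flags as documented in the parameter table above; each run
    # produces its own report in the format selected by --format.
    subprocess.run(
        ["./llmapibenchmark_linux_amd64",
         "--base-url", url,
         "--concurrency", "1,2,4,8,16",
         "--max-tokens", "512",
         "--format", "json"],
        check=True,
    )
```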