LLM API Performance Evaluation Tool Guide

This article provides a quick start guide for Linux and Windows platforms, including command examples for downloading, configuring, and running the tool.

This tool benchmarks LLM API performance, including prefill speed, decoding speed, Time-to-First-Token (TTFT), and overall latency. The code is open source on GitHub. This article primarily introduces how to use the tool.

Project Repository: https://github.com/Yoosu-L/llmapibenchmark
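
As a rough mental model (these are the conventional definitions of the metrics, not taken verbatim from the tool's source), the numbers in the reports below correspond to:

Prompt Throughput     ≈ input tokens processed per second during prefill, summed across concurrent requests
Generation Throughput ≈ output tokens produced per second during decoding, summed across concurrent requests
TTFT                  = time from sending a request until its first output token arrives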

🚀 Quick Start

🐧 Linux

Download and extract the package

wget https://github.com/Yoosu-L/llmapibenchmark/releases/latest/download/llmapibenchmark_linux_amd64.tar.gz
tar -zxvf llmapibenchmark_linux_amd64.tar.gz

The following command primarily tests the maximum achievable decoding speed (generation throughput) in a short-input, long-output scenario. Concurrency defaults to the levels 1,2,4,8,16,32,64,128.

# Replace `base-url` with your API service URL (ending with /v1).
./llmapibenchmark_linux_amd64 --base-url https://your-api-endpoint.com/v1

Next, test the maximum prefill speed in a long-input, long-output scenario. You can adjust --num-words and the --concurrency levels as needed.

# Replace `base-url` with your API service URL (ending with /v1).
./llmapibenchmark_linux_amd64 --base-url https://your-api-endpoint.com/v1 --num-words 6000 --concurrency 1,2,4,8,16,32

🪟 Windows

Download the latest version from the releases page (https://github.com/Yoosu-L/llmapibenchmark/releases).
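
Alternatively, from a terminal on Windows 10 or later, something like the following should work (the asset name llmapibenchmark_windows_amd64.zip is an assumption; check the releases page for the actual file name):

# Asset name assumed; verify on the releases page. curl and tar ship with Windows 10+.
curl -LO https://github.com/Yoosu-L/llmapibenchmark/releases/latest/download/llmapibenchmark_windows_amd64.zip
tar -xf llmapibenchmark_windows_amd64.zip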

After extracting, you will get llmapibenchmark_windows_amd64.exe.

The following command primarily tests the maximum achievable decoding speed (generation throughput) in a short-input, long-output scenario. Concurrency defaults to the levels 1,2,4,8,16,32,64,128.

# Replace `base-url` with your API service URL (ending with /v1).
llmapibenchmark_windows_amd64.exe --base-url https://your-api-endpoint.com/v1

Next, test the maximum prefill speed in a long-input, long-output scenario. You can adjust --num-words and the --concurrency levels as needed.

# Replace `base-url` with your API service URL (ending with /v1).
llmapibenchmark_windows_amd64.exe --base-url https://your-api-endpoint.com/v1 --num-words 6000 --concurrency 1,2,4,8,16,32

📊 Example Output

⌨️ Real-time Terminal Output

################################################################################################################
                                          LLM API Throughput Benchmark
                                    https://github.com/Yoosu-L/llmapibenchmark
                                         Time:2024-12-03 03:11:48 UTC+0
################################################################################################################
Input Tokens: 45
Output Tokens: 512
Test Model: qwen2.5:0.5b
Latency: 0.00 ms

| Concurrency | Generation Throughput (tokens/s) |  Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|-------------|----------------------------------|-------------------------------|--------------|--------------|
|           1 |                            31.88 |                        976.60 |         0.05 |         0.05 |
|           2 |                            30.57 |                        565.40 |         0.07 |         0.16 |
|           4 |                            31.00 |                        717.96 |         0.11 |         0.25 |

📄 Markdown File

Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-7B-Instruct-AWQ
Latency: 2.20 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|-------------|----------------------------------|------------------------------|--------------|--------------|
|           1 |                            58.49 |                       846.81 |         0.05 |         0.05 |
|           2 |                           114.09 |                       989.94 |         0.08 |         0.09 |
|           4 |                           222.62 |                      1193.99 |         0.11 |         0.15 |
|           8 |                           414.35 |                      1479.76 |         0.11 |         0.24 |
|          16 |                           752.26 |                      1543.29 |         0.13 |         0.47 |
|          32 |                           653.94 |                      1625.07 |         0.14 |         0.89 |

📝 JSON (enable with --format json)

{
    "model_name": "Qwen2.5-7B-Instruct-AWQ",
    "input_tokens": 32,
    "output_tokens": 512,
    "latency": 1,
    "results": [
        {
            "concurrency": 1,
            "generation_speed": 118.18,
            "prompt_throughput": 42.69,
            "max_ttft": 0.61,
            "min_ttft": 0.61
        },
        {
            "concurrency": 2,
            "generation_speed": 214.37,
            "prompt_throughput": 48.64,
            "max_ttft": 1.07,
            "min_ttft": 0.42
        }
    ]
}
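
The JSON report is handy for scripting. For example, assuming it has been saved to results.json (a file name chosen for this example; jq is not part of the tool), jq can pull out the peak generation speed:

# Highest generation throughput across all tested concurrency levels.
jq '[.results[].generation_speed] | max' results.json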

🗒️ YAML (enable with --format yaml)

model-name: Qwen2.5-7B-Instruct-AWQ
input-tokens: 32
output-tokens: 512
latency: 1.6
results:
    - concurrency: 1
      generation-speed: 134.28
      prompt-throughput: 59.31
      max-ttft: 0.44
      min-ttft: 0.44
    - concurrency: 2
      generation-speed: 221.23
      prompt-throughput: 51.06
      max-ttft: 1.02
      min-ttft: 0.47
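
The YAML report can be queried the same way, e.g. with mikefarah's yq v4 (like jq above, yq and the file name results.yaml are illustrative assumptions, not part of the tool):

# Generation speed of the first result; the hyphenated key must be quoted.
yq '.results[0]."generation-speed"' results.yaml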

⚙️ Advanced Parameters

Linux:

./llmapibenchmark_linux_amd64 \
  --base-url https://your-api-endpoint.com/v1 \
  --api-key YOUR_API_KEY \
  --model gpt-3.5-turbo \
  --concurrency 1,2,4,8,16 \
  --max-tokens 512 \
  --num-words 513 \
  --prompt "Your custom prompt here" \
  --format json

Windows:

llmapibenchmark_windows_amd64.exe ^
  --base-url https://your-api-endpoint.com/v1 ^
  --api-key YOUR_API_KEY ^
  --model gpt-3.5-turbo ^
  --concurrency 1,2,4,8,16 ^
  --max-tokens 512 ^
  --num-words 513 ^
  --prompt "Your custom prompt here" ^
  --format json
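
All parameters except --base-url are optional. To keep the key out of your shell history, you can substitute it from an environment variable (LLM_API_KEY is just a name chosen for this example; the tool itself is not documented to read any environment variable):

# Pass the API key from an environment variable instead of typing it inline.
./llmapibenchmark_linux_amd64 --base-url https://your-api-endpoint.com/v1 --api-key "$LLM_API_KEY"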

📋 Parameter Description

| Parameter     | Short | Description                                | Default                                       | Required |
|---------------|-------|--------------------------------------------|-----------------------------------------------|----------|
| --base-url    | -u    | Base URL for LLM API endpoint              | Empty (must be specified)                     | Yes      |
| --api-key     | -k    | API authentication key                     | None                                          | No       |
| --model       | -m    | Specific AI model to test                  | Automatically discovers first available model | No       |
| --concurrency | -c    | Comma-separated concurrency levels to test | 1,2,4,8,16,32,64,128                          | No       |
| --max-tokens  | -t    | Maximum tokens to generate per request     | 512                                           | No       |
| --num-words   | -n    | Number of words for random input prompt    | 0                                             | No       |
| --prompt      | -p    | Text prompt for generating responses       | A long story                                  | No       |
| --format      | -f    | Output format (json, yaml)                 | ""                                            | No       |
| --help        | -h    | Show help message                          | false                                         | No       |
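
Every flag also has a short form, so a targeted run can be written compactly. For example, to benchmark one model at a single concurrency level and emit JSON (the model name is taken from the sample output above):

./llmapibenchmark_linux_amd64 -u https://your-api-endpoint.com/v1 -m qwen2.5:0.5b -c 8 -f json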