LLM API Performance Evaluation Tool Guide

Contents
This article provides a quick start for Linux and Windows, with command examples for downloading, configuring, and running the tool.
This tool benchmarks LLM API performance, including prefill speed, decoding speed, Time-to-First-Token (TTFT), and overall latency. The code is open-sourced on GitHub. This article primarily explains how to use the tool.
Project Repository: https://github.com/Yoosu-L/llmapibenchmark
Quick Start
Linux
# Download and grant execute permissions
wget https://github.com/Yoosu-L/llmapibenchmark/releases/download/v1.0.2/llmapibenchmark_linux_amd64
chmod +x ./llmapibenchmark_linux_amd64
# Replace base_url with your API service URL (ending with /v1)
# This command primarily tests the maximum achievable decoding speed (generation throughput) in a short-input, long-output scenario. Concurrency defaults to 1,2,4,8,16,32,64,128.
./llmapibenchmark_linux_amd64 -base_url=https://your-api-endpoint.com/v1
# Replace base_url with your API service URL (ending with /v1)
# This command primarily tests the maximum achievable prefill speed in a long-input, long-output scenario. Adjust numWords and concurrency as needed.
./llmapibenchmark_linux_amd64 -base_url=https://your-api-endpoint.com/v1 -numWords=6000 -concurrency=1,2,4,8,16,32
Windows
Download the latest version from the releases page.
# Replace base_url with your API service URL (ending with /v1)
# This command primarily tests the maximum achievable decoding speed (generation throughput) in a short-input, long-output scenario. Concurrency defaults to 1,2,4,8,16,32,64,128.
llmapibenchmark_windows_amd64.exe -base_url=https://your-api-endpoint.com/v1
# Replace base_url with your API service URL (ending with /v1)
# This command primarily tests the maximum achievable prefill speed in a long-input, long-output scenario. Adjust numWords and concurrency as needed.
llmapibenchmark_windows_amd64.exe -base_url=https://your-api-endpoint.com/v1 -numWords=6000 -concurrency=1,2,4,8,16,32
Example Output
Real-time Terminal Output
################################################################################################################
LLM API Throughput Benchmark
https://github.com/Yoosu-L/llmapibenchmark
Time:2024-12-03 03:11:48 UTC+0
################################################################################################################
Input Tokens: 45
Output Tokens: 512
Test Model: qwen2.5:0.5b
Latency: 0.00 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|-------------|----------------------------------|-------------------------------|--------------|--------------|
| 1 | 31.88 | 976.60 | 0.05 | 0.05 |
| 2 | 30.57 | 565.40 | 0.07 | 0.16 |
| 4 | 31.00 | 717.96 | 0.11 | 0.25 |
Markdown File
Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-7B-Instruct-AWQ
Latency: 2.20 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 58.49 | 846.81 | 0.05 | 0.05 |
| 2 | 114.09 | 989.94 | 0.08 | 0.09 |
| 4 | 222.62 | 1193.99 | 0.11 | 0.15 |
| 8 | 414.35 | 1479.76 | 0.11 | 0.24 |
| 16 | 752.26 | 1543.29 | 0.13 | 0.47 |
| 32 | 653.94 | 1625.07 | 0.14 | 0.89 |
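The table columns can be understood in terms of per-request timing data. Below is a minimal sketch of how such metrics might be derived; the formulas (aggregate output tokens over wall-clock time for generation throughput, total input tokens over the slowest TTFT for prompt throughput) are assumptions for illustration, and the tool's internal accounting may differ.

```python
# Sketch: deriving the benchmark's headline metrics from per-request timings.
# The formulas here are assumptions for illustration, not the tool's code.

def summarize(requests, wall_time_s):
    """requests: list of dicts with output_tokens, prompt_tokens, ttft_s."""
    total_out = sum(r["output_tokens"] for r in requests)
    total_in = sum(r["prompt_tokens"] for r in requests)
    ttfts = [r["ttft_s"] for r in requests]
    return {
        # Aggregate generation throughput across all concurrent requests.
        "generation_tps": total_out / wall_time_s,
        # Assumed: prefill throughput = input tokens / slowest time-to-first-token.
        "prompt_tps": total_in / max(ttfts),
        "min_ttft_s": min(ttfts),
        "max_ttft_s": max(ttfts),
    }

# Example: 4 concurrent requests, 45 input / 512 output tokens each,
# finishing in about 9.2 s of wall-clock time.
reqs = [
    {"output_tokens": 512, "prompt_tokens": 45, "ttft_s": t}
    for t in (0.11, 0.14, 0.19, 0.25)
]
print(summarize(reqs, wall_time_s=9.2))
```

With these assumed formulas, the example yields a prompt throughput of 180 / 0.25 = 720 tokens/s, which is in the same range as the concurrency-4 row in the terminal output above.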
Advanced Parameters
Linux:
./llmapibenchmark_linux_amd64 \
-base_url=https://your-api-endpoint.com/v1 \
-apikey=YOUR_API_KEY \
-model=gpt-3.5-turbo \
-concurrency=1,2,4,8,16 \
-max_tokens=512 \
-numWords=513 \
-prompt="Your custom prompt here"
Windows:
llmapibenchmark_windows_amd64.exe ^
-base_url=https://your-api-endpoint.com/v1 ^
-apikey=YOUR_API_KEY ^
-model=gpt-3.5-turbo ^
-concurrency=1,2,4,8,16 ^
-max_tokens=512 ^
-numWords=513 ^
-prompt="Your custom prompt here"
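When sweeping several endpoints or parameter sets, it can help to assemble the command line from a script. The sketch below builds the argument list using the flag names documented in this guide; the `build_cmd` helper itself is hypothetical, not part of the tool.

```python
# Sketch: assemble the benchmark command line programmatically.
# Flag names (-base_url, -concurrency, -max_tokens, ...) come from this
# guide; the build_cmd helper is hypothetical.

def build_cmd(base_url, binary="./llmapibenchmark_linux_amd64", **flags):
    cmd = [binary, f"-base_url={base_url}"]
    for name, value in flags.items():
        if isinstance(value, (list, tuple)):
            # Concurrency levels are passed as a comma-separated list.
            value = ",".join(str(v) for v in value)
        cmd.append(f"-{name}={value}")
    return cmd

cmd = build_cmd(
    "https://your-api-endpoint.com/v1",
    concurrency=[1, 2, 4, 8, 16],
    max_tokens=512,
)
print(" ".join(cmd))
```

The resulting list can then be executed with, for example, `subprocess.run(cmd, check=True)`.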
Parameter Description
| Parameter | Description | Default | Required |
|---|---|---|---|
| `-base_url` | Base URL for LLM API endpoint | Empty (MUST be specified) | Yes |
| `-apikey` | API authentication key | None | No |
| `-model` | Specific AI model to test | Automatically discovers first available model | No |
| `-concurrency` | Comma-separated concurrency levels to test | `1,2,4,8,16,32,64,128` | No |
| `-max_tokens` | Maximum tokens to generate per request | `512` | No |
| `-numWords` | Number of words for input prompt | Not set (optional) | No |
| `-prompt` | Text prompt for generating responses | "Write a long story, no less than 10,000 words, starting from a long, long time ago." | No |
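Since the tool writes its results as a markdown table (see "Example Output" above), post-processing is straightforward. The sketch below parses such a table and reports the concurrency level with the highest generation throughput; the column order is assumed from the example output, and the `best_concurrency` helper is illustrative, not part of the tool.

```python
# Sketch: parse the benchmark's markdown results table and find the
# concurrency level with peak generation throughput. Column order is
# assumed from the "Example Output" section of this guide.

def best_concurrency(markdown: str):
    best = None
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2 or not cells[0].isdigit():
            continue  # skip headers, separator rows, and metadata lines
        conc, gen_tps = int(cells[0]), float(cells[1])
        if best is None or gen_tps > best[1]:
            best = (conc, gen_tps)
    return best

table = """\
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|---|---|---|---|---|
| 1 | 58.49 | 846.81 | 0.05 | 0.05 |
| 16 | 752.26 | 1543.29 | 0.13 | 0.47 |
| 32 | 653.94 | 1625.07 | 0.14 | 0.89 |
"""
print(best_concurrency(table))  # -> (16, 752.26)
```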