LLM API Performance Evaluation Tool Guide

This article provides a quick-start guide for Linux and Windows, including commands for downloading, configuring, and running the tool.

This tool benchmarks LLM API performance, measuring prefill speed, decoding speed, time to first token (TTFT), and overall latency. The code is open source on GitHub; this article focuses on how to use the tool.
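To make these metrics concrete, the sketch below derives TTFT, prompt (prefill) throughput, and generation (decoding) throughput from per-request timestamps. These are the standard definitions of the metrics, not necessarily the tool's exact internal implementation:

```python
def compute_metrics(start, first_token_at, end, prompt_tokens, output_tokens):
    """Derive common LLM API benchmark metrics from request timestamps.

    start          -- time the request was sent (seconds)
    first_token_at -- time the first output token arrived (seconds)
    end            -- time the last output token arrived (seconds)
    """
    ttft = first_token_at - start                     # Time-to-First-Token
    prompt_tp = prompt_tokens / ttft                  # prefill speed (tokens/s)
    gen_tp = output_tokens / (end - first_token_at)   # decoding speed (tokens/s)
    return ttft, prompt_tp, gen_tp

# Example: 500-token prompt, 512 output tokens, first token after 0.5 s,
# last token after 10.5 s.
ttft, prompt_tp, gen_tp = compute_metrics(0.0, 0.5, 10.5, 500, 512)
print(ttft, prompt_tp, gen_tp)  # 0.5 1000.0 51.2
```

Note that prefill throughput is dominated by TTFT: a long prompt that returns its first token quickly implies a fast prefill stage.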

Project Repository: https://github.com/Yoosu-L/llmapibenchmark

Quick Start

Linux

# Download and grant execute permissions
wget https://github.com/Yoosu-L/llmapibenchmark/releases/download/v1.0.2/llmapibenchmark_linux_amd64
chmod +x ./llmapibenchmark_linux_amd64
# Replace base_url with your API service URL (ending with /v1)
# This command primarily tests the maximum achievable decoding speed (generation throughput) in a short-input, long-output scenario. Concurrency defaults to 1,2,4,8,16,32,64,128.
./llmapibenchmark_linux_amd64 -base_url=https://your-api-endpoint.com/v1
# Replace base_url with your API service URL (ending with /v1)
# This command primarily tests the maximum achievable prefill speed in a long-input, long-output scenario. Adjust numWords and concurrency as needed.
./llmapibenchmark_linux_amd64 -base_url=https://your-api-endpoint.com/v1 -numWords=6000 -concurrency=1,2,4,8,16,32

Windows

Download the latest Windows build from the releases page: https://github.com/Yoosu-L/llmapibenchmark/releases

# Replace base_url with your API service URL (ending with /v1)
# This command primarily tests the maximum achievable decoding speed (generation throughput) in a short-input, long-output scenario. Concurrency defaults to 1,2,4,8,16,32,64,128.
llmapibenchmark_windows_amd64.exe -base_url=https://your-api-endpoint.com/v1
# Replace base_url with your API service URL (ending with /v1)
# This command primarily tests the maximum achievable prefill speed in a long-input, long-output scenario. Adjust numWords and concurrency as needed.
llmapibenchmark_windows_amd64.exe -base_url=https://your-api-endpoint.com/v1 -numWords=6000 -concurrency=1,2,4,8,16,32

Example Output

Real-time Terminal Output

################################################################################################################
                                          LLM API Throughput Benchmark
                                    https://github.com/Yoosu-L/llmapibenchmark
                                         Time:2024-12-03 03:11:48 UTC+0
################################################################################################################
Input Tokens: 45
Output Tokens: 512
Test Model: qwen2.5:0.5b
Latency: 0.00 ms

| Concurrency | Generation Throughput (tokens/s) |  Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|-------------|----------------------------------|-------------------------------|--------------|--------------|
|           1 |                            31.88 |                        976.60 |         0.05 |         0.05 |
|           2 |                            30.57 |                        565.40 |         0.07 |         0.16 |
|           4 |                            31.00 |                        717.96 |         0.11 |         0.25 |

Markdown File

Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-7B-Instruct-AWQ
Latency: 2.20 ms

| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|-------------|----------------------------------|------------------------------|--------------|--------------|
| 1           | 58.49                            | 846.81                       | 0.05         | 0.05         |
| 2           | 114.09                           | 989.94                       | 0.08         | 0.09         |
| 4           | 222.62                           | 1193.99                      | 0.11         | 0.15         |
| 8           | 414.35                           | 1479.76                      | 0.11         | 0.24         |
| 16          | 752.26                           | 1543.29                      | 0.13         | 0.47         |
| 32          | 653.94                           | 1625.07                      | 0.14         | 0.89         |
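The generation throughput column appears to be the aggregate across all concurrent requests (it grows with concurrency in the table above), so the speed each individual client sees can be estimated by dividing by the concurrency level. The sketch below applies that reading (my interpretation of the output, not documented by the tool) to the numbers above:

```python
# Aggregate generation throughput (tokens/s) per concurrency level,
# taken from the example table above.
results = {1: 58.49, 2: 114.09, 4: 222.62, 8: 414.35, 16: 752.26, 32: 653.94}

baseline = results[1]
for concurrency, aggregate in results.items():
    per_request = aggregate / concurrency              # estimated speed each client sees
    efficiency = aggregate / (baseline * concurrency)  # scaling vs. perfect linear
    print(f"{concurrency:>2}  {per_request:6.2f} tok/s per request  {efficiency:5.1%} scaling")
```

In this example, scaling is near-linear up to concurrency 16, while the drop in aggregate throughput at 32 (653.94 vs. 752.26) suggests the backend is saturated at that point.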

Advanced Parameters

Linux:

./llmapibenchmark_linux_amd64 \
  -base_url=https://your-api-endpoint.com/v1 \
  -apikey=YOUR_API_KEY \
  -model=gpt-3.5-turbo \
  -concurrency=1,2,4,8,16 \
  -max_tokens=512 \
  -numWords=513 \
  -prompt="Your custom prompt here"

Windows:

llmapibenchmark_windows_amd64.exe ^
  -base_url=https://your-api-endpoint.com/v1 ^
  -apikey=YOUR_API_KEY ^
  -model=gpt-3.5-turbo ^
  -concurrency=1,2,4,8,16 ^
  -max_tokens=512 ^
  -numWords=513 ^
  -prompt="Your custom prompt here"

Parameter Description

| Parameter    | Description                                  | Default                                                                               | Required |
|--------------|----------------------------------------------|---------------------------------------------------------------------------------------|----------|
| -base_url    | Base URL of the LLM API endpoint             | Empty (must be specified)                                                             | Yes      |
| -apikey      | API authentication key                       | None                                                                                  | No       |
| -model       | Specific model to test                       | Automatically discovers the first available model                                     | No       |
| -concurrency | Comma-separated concurrency levels to test   | 1,2,4,8,16,32,64,128                                                                  | No       |
| -max_tokens  | Maximum tokens to generate per request       | 512                                                                                   | No       |
| -numWords    | Number of words in the input prompt          | Not set (optional)                                                                    | No       |
| -prompt      | Text prompt used to generate responses       | "Write a long story, no less than 10,000 words, starting from a long, long time ago." | No       |